AI test case generation: where it helps and where it fails
AI is good at generating the test cases you'd have thought of anyway, and bad at the ones that actually catch bugs. That's not a flaw to fix with a better prompt — it's the shape of the tool. Here's how to use it accordingly.
Feed an LLM a spec and ask for test cases, and you'll get a tidy, comprehensive-looking list in seconds. It's genuinely useful — and genuinely limited, in a way that's easy to miss precisely because the output looks so complete. The honest position isn't "AI test generation is great" or "it's useless"; it's that it's strong at coverage of the expected and weak at the unexpected, which happens to be where the bugs live. Use it for what it's good at, supply what it can't, and it's a real time-saver. This is the test-design companion to a wider view of AI-generated tests.
Where it helps
- The obvious cases, fast. Happy paths, standard validation, the common error cases — the cases you would have written but that take time to enumerate. AI does this well and quickly, clearing the boilerplate off your plate.
- A starting checklist. As a first draft to react to, it's valuable — it lists the predictable so you can spend your energy on the rest. Editing a list is faster than starting from blank.
- Coverage of documented behaviour. Given a spec or OpenAPI doc, it enumerates the stated cases thoroughly — good for making sure you didn't miss an obvious documented path.
- Format and volume. Turning a rough idea into well-structured cases, or generating lots of input variations (boundary values, data permutations) on demand.
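As an illustration of the "input variations" point, here is a minimal sketch of boundary-value generation, the kind of mechanical enumeration that is cheap to automate. The function names and the quantity range of 1 to 100 are assumptions made for the example, not part of any particular tool.

```python
def boundary_values(lo: int, hi: int) -> list[int]:
    """Return classic boundary inputs for an inclusive integer range [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    candidates = [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]
    # Drop duplicates but keep order (narrow ranges produce overlapping values).
    return list(dict.fromkeys(candidates))


def boundary_cases(lo: int, hi: int) -> list[tuple[int, bool]]:
    """Pair each boundary input with the outcome the spec says it should have."""
    return [(value, lo <= value <= hi) for value in boundary_values(lo, hi)]


def is_valid_quantity(qty: int) -> bool:
    """Hypothetical validator under test: accepts quantities from 1 to 100."""
    return 1 <= qty <= 100


# Run every generated case against the validator.
for value, expected in boundary_cases(1, 100):
    assert is_valid_quantity(value) == expected, f"unexpected result for {value}"
```

This covers the predictable edges of a documented range. It says nothing about inputs the spec never mentions, which is the gap the next section describes.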
Where it fails
- The non-obvious bug. The case that catches real defects usually comes from suspicion — "what if this races?", "what if the permission's only checked in the UI?", "what if the network dies mid-request?". That comes from experience and context the model doesn't have. It generates the expected; bugs hide in the unexpected.
- Real-world and domain context. It doesn't know your system's fragile integration, your users' weird-but-common workflow, last quarter's incident, the gotcha everyone on the team carries in their head. The cases that come from knowing this product aren't in a generic spec.
- Plausible filler. It can pad the list with cases that look thorough but are low-value or don't test anything distinct: coverage theatre. More cases doesn't mean more bugs caught.
- Judging importance. It struggles to say which cases matter most — risk-based prioritisation needs an understanding of impact it doesn't have. It'll treat a cosmetic case and a money-path case as equals.
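To make the "permission's only checked in the UI" suspicion concrete, here is a rough sketch of the test a person might add by hand. It calls the handler directly instead of going through the interface. Every name here (`User`, `Forbidden`, `delete_invoice`) is hypothetical and stands in for whatever your system actually exposes.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller lacks permission."""


@dataclass(frozen=True)
class User:
    name: str
    is_admin: bool


def delete_invoice(user: User, invoice_id: int, store: dict[int, str]) -> None:
    """Hypothetical server-side handler; the check must live here, not in the UI."""
    if not user.is_admin:
        raise Forbidden(f"{user.name} may not delete invoices")
    store.pop(invoice_id, None)


def test_non_admin_cannot_delete_via_direct_call() -> None:
    store = {1: "invoice"}
    viewer = User("viewer", is_admin=False)
    try:
        # Bypass the UI entirely: a hidden button is not an access control.
        delete_invoice(viewer, 1, store)
    except Forbidden:
        pass
    else:
        raise AssertionError("handler allowed a non-admin delete")
    assert 1 in store  # the invoice must still exist


test_non_admin_cannot_delete_via_direct_call()
```

A spec that says "only admins can delete invoices" tends to produce the admin-succeeds case and a viewer-sees-no-button case. The direct-call case comes from suspecting where the check actually lives.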
Using AI for test-case generation
- Use it to enumerate the obvious/expected cases fast — treat the output as a first draft
- Add the cases that come from suspicion and experience — races, auth gaps, network failures, the unexpected
- Supply domain and system context it can't have — fragile integrations, known incidents, real user workflows
- Prune the plausible filler; more cases doesn't mean more coverage
- Prioritise by risk yourself; it can't tell the money path from a cosmetic edge case
- Verify each generated case actually tests something distinct and correct
- Net: let it cover the predictable, spend your time on the bugs it would never think of
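The pruning and prioritising steps in this checklist can be sketched roughly as below. The `Case` fields and the impact-times-likelihood score are assumptions chosen for illustration; the scores themselves are meant to be assigned by a person who knows the product, not by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    title: str
    inputs: tuple     # normalised inputs the case exercises
    expected: str     # expected outcome
    impact: int       # 1 (cosmetic) to 5 (money path), assigned by a human
    likelihood: int   # 1 (rare) to 5 (common), assigned by a human


def prune_and_rank(cases: list[Case]) -> list[Case]:
    """Drop cases that repeat the same inputs and outcome, then sort by risk."""
    seen = set()
    distinct = []
    for case in cases:
        key = (case.inputs, case.expected)
        if key in seen:
            continue  # same check under a different title: filler
        seen.add(key)
        distinct.append(case)
    return sorted(distinct, key=lambda c: c.impact * c.likelihood, reverse=True)


generated = [
    Case("Pay with valid card", ("card", "valid"), "charged", 5, 4),
    Case("Footer link colour", ("footer",), "blue", 1, 2),
    Case("Successful card payment", ("card", "valid"), "charged", 5, 4),
    Case("Pay with expired card", ("card", "expired"), "declined", 4, 3),
]

for case in prune_and_rank(generated):
    print(case.impact * case.likelihood, case.title)
```

Matching on inputs and expected outcome only catches exact repeats. Near-duplicates that differ in wording but not in what they test still need a human read.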
My opinion
The framing that makes this tool pay off is division of labour: AI handles the breadth of the predictable, you handle the depth of the unexpected. It's a fast junior generating the obvious cases so your scarce expert attention goes to the suspicious, context-dependent, high-risk ones it structurally cannot produce. The trap, and it's an easy one to fall into because the output looks so complete, is taking a comprehensive-looking list for comprehensive testing. A list that covers every documented path and none of the ways this specific system actually breaks is exactly the list that ships the bug.
So use it, genuinely — the time it saves on the predictable is real, and that time is best reinvested in the testing only a human with context can do. Don't ask it to replace the judgement that finds bugs; ask it to clear the ground so you can apply that judgement where it counts. The cases that catch the bugs that matter will, for now, still mostly come from you.