AI Test Generation — Draft, Review, and Curate

Use an AI tool (or an LLM) to generate test cases from a spec, then do the part that actually matters: review, correct, and curate the output into a trustworthy suite — practising the human-in-the-loop discipline AI testing depends on.

Role

QA engineer

Difficulty

Intermediate

Time limit

~90 min

Scenario

Your team is under deadline pressure and a lead has suggested 'just get the AI to write the tests.' You agree AI can help — but you've seen teams ship AI-generated suites full of plausible-looking tests that don't actually catch anything, and worse, give false confidence. Your task is to demonstrate the RIGHT way to use AI for test generation: let it produce a fast first draft, then apply the review-and-curate discipline that turns that draft into a suite you'd actually trust. The deliverable isn't 'tests an AI wrote' — it's a documented before/after showing what you kept, cut, fixed, and added, and why.

Requirements

1. Pick a small target: a public API endpoint (e.g. Reqres user creation, or a JSONPlaceholder resource) or a short user story you write down with acceptance criteria.
2. Prompt an AI tool/LLM to generate test cases for it, and capture the RAW output verbatim as your 'before'. Note how many cases it produced and roughly how long it took.
3. Review every generated case and classify each as KEEP (correct and useful), CUT (irrelevant, duplicate, or not a real risk), or FIX (right idea, wrong detail), with a one-line reason for each classification.
4. Identify at least three cases the AI MISSED (domain-specific or undocumented-behaviour scenarios it couldn't have known) and add them. This is the human-knowledge contribution AI can't make.
5. Find at least one generated test that is PLAUSIBLE BUT WRONG (asserts something that isn't actually true of the system) and explain how a team that trusted it blindly would have been misled.
6. Turn the curated set (kept + fixed + added) into runnable tests and execute them against the target, confirming they pass or fail for the right reasons.
7. Write a short reflection (at least five sentences) on what the AI was good at (breadth, speed) versus what it couldn't do (judge correctness, know your domain), and where you'd use it again.
8. Produce a before/after summary table with counts of generated, kept, cut, fixed, and human-added cases, so the curation is visible at a glance.
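
One way to keep the classification pass and the summary counts honest is to record each verdict in a small ledger and derive the table from it. The following is a minimal sketch, not a required format: the names `Verdict`, `ReviewedCase`, `summarise`, and `unjustified` are hypothetical, and it assumes the final suite is kept + fixed + human-added, as the requirements describe.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    KEEP = "keep"          # correct and useful as generated
    CUT = "cut"            # irrelevant, duplicate, or not a real risk
    FIX = "fix"            # right idea, wrong detail
    ADDED = "human-added"  # a case the AI missed


@dataclass(frozen=True)
class ReviewedCase:
    title: str
    verdict: Verdict
    reason: str  # the one-line justification for the verdict


def summarise(cases: List[ReviewedCase]) -> Dict[str, int]:
    """Derive the before/after counts from the review ledger."""
    counts = Counter(case.verdict for case in cases)
    kept = counts[Verdict.KEEP]
    cut = counts[Verdict.CUT]
    fixed = counts[Verdict.FIX]
    added = counts[Verdict.ADDED]
    return {
        "generated": kept + cut + fixed,  # everything the AI produced
        "kept": kept,
        "cut": cut,
        "fixed": fixed,
        "human_added": added,
        "final": kept + fixed + added,    # the curated suite
    }


def unjustified(cases: List[ReviewedCase]) -> List[str]:
    """Titles of cases whose verdict has no written reason."""
    return [case.title for case in cases if not case.reason.strip()]
```

Running `unjustified` before submitting is a cheap check that every verdict carries a reason; `summarise` gives the numbers for the summary table directly, so the table cannot drift from the review.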

Starter data

› Public APIs good for this: reqres.in (user create/list, returns predictable data), jsonplaceholder.typicode.com (posts/users).
› A generation prompt shape: 'Generate test cases for [endpoint/story] covering happy path, negative cases, and edge cases. For each, give the input and expected result.'
› Remember: the AI does not know the real behaviour of the system; it predicts plausible tests. Your job is to ground them.
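
The prompt shape above can be filled in and stored alongside the raw output so the 'before' is reproducible. This is a rough sketch under stated assumptions: `build_prompt`, `RawCapture`, and `capture` are hypothetical names, the case count is entered by hand rather than parsed, and no particular AI tool or API is assumed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The prompt shape from the starter data, with the target as a placeholder.
PROMPT_SHAPE = (
    "Generate test cases for {target} covering happy path, negative cases, "
    "and edge cases. For each, give the input and expected result."
)


@dataclass(frozen=True)
class RawCapture:
    prompt: str
    output: str           # pasted verbatim from the tool, never edited
    case_count: int       # counted by hand from the raw output
    seconds_taken: float  # rough wall-clock time for the generation
    captured_at: str      # ISO 8601 timestamp (UTC)


def build_prompt(target: str) -> str:
    """Fill the prompt shape with an endpoint or user story."""
    return PROMPT_SHAPE.format(target=target)


def capture(prompt: str, output: str, case_count: int, seconds_taken: float) -> RawCapture:
    """Freeze the 'before' so later edits cannot blur what the AI produced."""
    stamp = datetime.now(timezone.utc).isoformat()
    return RawCapture(prompt, output, case_count, seconds_taken, stamp)
```

For example, `build_prompt("POST /api/users on reqres.in")` yields the prompt to paste into whichever tool is used, and `capture(...)` records the verbatim reply with its count and timing.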

Expected deliverables

✓ The raw AI-generated output ('before'), with case count and time noted.
✓ The classified review (KEEP/CUT/FIX with reasons) plus the human-added cases.
✓ The identified plausible-but-wrong case with an explanation of the risk.
✓ A runnable curated suite, executed, with results.
✓ The reflection and the before/after summary table.

Evaluation rubric

Dimension	What reviewers look for
Review rigour (not blind acceptance)	Does the candidate critically classify every generated case? A weak submission runs the AI output as-is and calls it done. A strong one shows a real KEEP/CUT/FIX pass with reasons — demonstrating that the value is in the review, not the generation.
Catching plausible-but-wrong	Did the candidate find a generated test that looks right but isn't? A weak answer assumes everything the AI produced is valid. A strong one identifies a confidently-wrong case and articulates exactly how blind trust would mislead the team — the core risk of AI test generation.
Human-knowledge contribution	Are the added cases genuinely things the AI couldn't know? A weak answer adds more generic cases. A strong one adds domain-specific or undocumented-behaviour scenarios, showing where human judgement is irreplaceable.
Tests actually run	Did the curated suite get executed and pass/fail for the right reasons? A weak submission leaves tests as text. A strong one runs them against the real target and confirms the assertions hold.
Honest reflection on AI's role	Does the reflection accurately separate what AI is good at (breadth, speed) from what it can't do (judge correctness, know the domain)? A weak answer is either AI-hype or AI-dismissal. A strong one is calibrated and says where it would and wouldn't use AI again.
Before/after transparency	Is the curation quantified? A weak answer hand-waves. A strong one shows the generated/kept/cut/fixed/added counts so the human contribution to the final suite is visible and honest.

Sample solution outline

› Target: reqres.in POST /api/users (create user). Acceptance: returns 201 with id + createdAt.
› Before: AI generated 12 cases in ~30s (happy path, missing fields, wrong types, very long values, etc.), captured verbatim.
› Review: KEEP 6 (valid happy/negative), CUT 3 (duplicate happy paths, an irrelevant auth case the API doesn't have), FIX 3 (correct idea, wrong expected status / wrong field name).
› Plausible-but-wrong: the AI asserted that POST returns 200, but the API actually returns 201. A team trusting it would have a test that fails for the wrong reason or, once loosened until it passes, masks a real status change.
› Human-added: rate-limit behaviour, the API's known quirk that it echoes any field sent, and a data-cleanup consideration; none is derivable from the spec alone.
› Run: the 12 curated tests (6 kept + 3 fixed + 3 added) in Postman/supertest against reqres; confirm pass/fail for the right reasons.
› Reflection: AI excelled at breadth and speed; failed on the real status code and on domain quirks; would reuse for first-draft breadth, never as final suite.
› Table: generated 12 / kept 6 / cut 3 / fixed 3 / human-added 3 -> final 12 (6 + 3 + 3), but only 6 of the original 12 survived unmodified.
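
Two of the curated checks from the outline above could look roughly like this in code. It is a sketch under assumptions, not the reference solution: the function names, the payload values, and the optional `headers` parameter are hypothetical, the expected 201 / `id` / `createdAt` and the field-echo quirk are taken from the outline rather than re-verified here, and the HTTP call is passed in as `send` so the checks can also be exercised against a stub.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional, Tuple

URL = "https://reqres.in/api/users"
AI_DRAFT_STATUS = 200   # what the generated test asserted (plausible but wrong)
VERIFIED_STATUS = 201   # what the outline says the API actually returns

Transport = Callable[[str, Dict], Tuple[int, Dict]]


def post_json(url: str, payload: Dict,
              headers: Optional[Dict[str, str]] = None) -> Tuple[int, Dict]:
    """POST a JSON body and return (status, parsed body). Uses the live network."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, json.loads(response.read() or b"{}")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; return the status so negative cases can assert on it.
        return error.code, {}


def check_create_user(send: Transport, expected_status: int = VERIFIED_STATUS,
                      url: str = URL) -> List[str]:
    """Fixed case: status and required fields. Returns failure messages."""
    status, body = send(url, {"name": "morpheus", "job": "leader"})
    failures = []
    if status != expected_status:
        failures.append(f"expected status {expected_status}, got {status}")
    for field in ("id", "createdAt"):
        if field not in body:
            failures.append(f"missing field: {field}")
    return failures


def check_echoes_unknown_field(send: Transport, url: str = URL) -> List[str]:
    """Human-added case: the quirk that any field sent is echoed back."""
    _, body = send(url, {"name": "neo", "nickname": "the one"})
    if body.get("nickname") != "the one":
        return ["unknown field was not echoed back"]
    return []
```

Calling `check_create_user(post_json)` runs the fixed case against the live target, and an empty list means it passed. Passing `expected_status=AI_DRAFT_STATUS` reproduces the draft's assertion, which is a quick way to show the plausible-but-wrong case failing for a reason that has nothing to do with a product defect.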

Common mistakes

Running the AI-generated suite as-is and treating green as success — the exact false-confidence failure the assignment exists to prevent.
Asserting on plausibility instead of verifying against the real system — accepting a confidently-wrong test because it reads correctly.
Adding only more generic cases instead of the domain-specific ones the AI genuinely couldn't know.
Never executing the tests, leaving them as text the AI produced.
An uncalibrated reflection — either 'AI writes all our tests now' or 'AI is useless' — instead of a grounded view of where it helps.
Hiding the curation — not showing how much of the AI output was actually kept unmodified.

Submission checklist

Raw AI-generated output captured ('before') with count and time
Every generated case classified KEEP/CUT/FIX with a reason
At least three human-added cases the AI couldn't have known
At least one plausible-but-wrong case identified and explained
Curated suite turned into runnable tests and executed
Reflection (5+ sentences) on AI's strengths vs limits
Before/after summary table quantifying the curation

Extension ideas

+ Repeat with a second AI tool and compare which produced a better first draft and why.
+ Feed the AI your app's actual error-handling code and see whether grounding it in real behaviour reduces the plausible-but-wrong rate.
+ Wire the curated suite into CI and gate on it, proving the human-reviewed tests catch a regression you introduce deliberately.
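
For the CI extension, the gating logic itself can be sketched independently of any CI system: run every curated check, exit non-zero on any failure, and confirm the gate goes red when a regression is introduced on purpose. Everything here is illustrative; `gate`, the two checks, and the `healthy` / `regressed` stubs are hypothetical stand-ins for a real client and a real deliberate regression.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, Dict]
Client = Callable[[Dict], Response]
Check = Callable[[Client], List[str]]


def check_status_is_201(create_user: Client) -> List[str]:
    status, _ = create_user({"name": "morpheus"})
    return [] if status == 201 else [f"expected 201, got {status}"]


def check_has_id_and_created_at(create_user: Client) -> List[str]:
    _, body = create_user({"name": "morpheus"})
    return [f"missing field: {f}" for f in ("id", "createdAt") if f not in body]


CURATED: List[Check] = [check_status_is_201, check_has_id_and_created_at]


def gate(checks: List[Check], create_user: Client) -> int:
    """Return a process exit code: 0 if every check passes, 1 otherwise."""
    failures = [msg for check in checks for msg in check(create_user)]
    for msg in failures:
        print("FAIL:", msg)
    return 1 if failures else 0


def healthy(payload: Dict) -> Response:
    # Stub standing in for the system behaving as specified.
    return 201, {**payload, "id": "1", "createdAt": "stub-timestamp"}


def regressed(payload: Dict) -> Response:
    # Deliberate regression: wrong status and a dropped field.
    return 200, {**payload, "id": "1"}
```

In a pipeline step the script would end with something like `sys.exit(gate(CURATED, real_client))`, where `real_client` is whatever wrapper calls the actual target. Showing `gate(CURATED, healthy)` returning 0 and `gate(CURATED, regressed)` returning 1 is the evidence that the reviewed suite catches the regression rather than merely staying green.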