Code Review of AI-Generated Tests


AI-generated test code looks confident. It compiles. It often runs. It is sometimes wrong in ways that only become visible when the feature it was meant to protect breaks silently. Reviewing AI-generated tests requires a specific lens: not just "does it run" but "would it catch a regression." This lesson gives you the checklist and the failure patterns.

The review checklist

Run every AI-generated test through these questions before merging:

Selectors — Are they stable? Preference order: data-testid > getByRole/getByLabel > text content > CSS class. AI often defaults to CSS selectors or text content that will break on minor copy changes.

Assertions — Are they meaningful? await expect(locator).toBeVisible() tells you the element is rendered and visible. It does not tell you the element contains the right text, the form submitted successfully, or the API returned what you expected. Look for assertions that verify behaviour, not just presence.
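The difference between a presence check and a behaviour check is easiest to see against a concrete regression. The sketch below is framework-free and purely illustrative: Banner, renderBanner, renderBannerBroken and the two check functions are hypothetical names, standing in for a refund confirmation banner and the assertions a test might make about it.

```typescript
// Hypothetical model of what a test can observe about a confirmation banner.
interface Banner {
  visible: boolean;
  text: string;
}

// Intended behaviour: format an amount in cents as dollars.
function renderBanner(amountCents: number): Banner {
  return { visible: true, text: `Refund of $${(amountCents / 100).toFixed(2)} issued` };
}

// The same function with a regression: the division by 100 was dropped.
function renderBannerBroken(amountCents: number): Banner {
  return { visible: true, text: `Refund of $${amountCents.toFixed(2)} issued` };
}

// Presence check: the equivalent of asserting only that the element is visible.
const presenceCheck = (b: Banner): boolean => b.visible;

// Behaviour check: the equivalent of asserting the text the user should read.
const behaviourCheck = (b: Banner): boolean => b.text === "Refund of $12.50 issued";

console.log(presenceCheck(renderBannerBroken(1250)));  // true: the regression slips through
console.log(behaviourCheck(renderBannerBroken(1250))); // false: the regression is caught
console.log(behaviourCheck(renderBanner(1250)));       // true
```

When reviewing a generated test, ask which of the two checks each assertion resembles.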

Test data — Is it isolated? Each test should create its own data or use fixtures designed for isolation. AI tends to hardcode IDs or emails that conflict across parallel test runs.
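One way to avoid hardcoded IDs and emails is a small factory that generates unique values on every call. This is a minimal sketch, assuming Node.js; TestUser and makeTestUser are hypothetical names and the fields should be adapted to your own schema.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shape of a test user; not taken from any particular project.
interface TestUser {
  id: string;
  email: string;
  displayName: string;
}

// Builds a user whose identifiers are unique per call, so two parallel
// workers never compete for the same record.
function makeTestUser(overrides: Partial<TestUser> = {}): TestUser {
  const id = randomUUID();
  return {
    id,
    email: `user-${id}@example.test`,
    displayName: `Test User ${id.slice(0, 8)}`,
    ...overrides, // let a test pin only the fields it actually cares about
  };
}

const buyer = makeTestUser();
const refunder = makeTestUser({ displayName: "Refund Tester" });
console.log(buyer.email !== refunder.email); // true: every call yields a fresh email
```

Overrides keep the test readable: the fields that matter to the scenario are explicit, everything else is generated.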

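Timing — Are waits condition-based? await page.waitForTimeout(3000) is an arbitrary sleep. Replace it with a condition such as await expect(page.getByText('Payment confirmed')).toBeVisible() or await page.waitForResponse('**/api/payment'). When you use waitForResponse, start waiting before the action that triggers the request, or the response can be missed. AI produces timeout-based waits frequently.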
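Inside Playwright the built-in assertions already retry until a condition holds, so you rarely need your own helper. The sketch below only illustrates the idea behind a condition-based wait: poll a condition and stop as soon as it is true, with a timeout as the upper bound rather than the expected duration. waitForCondition, WaitOptions and the payment object are hypothetical names.

```typescript
interface WaitOptions {
  timeoutMs?: number;  // upper bound, not the expected duration
  intervalMs?: number; // how often the condition is re-checked
}

// Resolves as soon as `condition` returns true; rejects once the timeout elapses.
async function waitForCondition(
  condition: () => boolean | Promise<boolean>,
  options: WaitOptions = {},
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 5000;
  const intervalMs = options.intervalMs ?? 50;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated payment that confirms after a short delay (stand-in for a real backend).
const payment = { confirmed: false };
setTimeout(() => {
  payment.confirmed = true;
}, 120);

void waitForCondition(() => payment.confirmed, { timeoutMs: 2000 }).then(() => {
  console.log("payment confirmed");
});
```

A fixed sleep either wastes time when the system is fast or fails when it is slow; the polling version returns shortly after the condition becomes true and fails loudly if it never does.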

Cleanup — Does the test clean up after itself? Tests that create database records, upload files, or modify shared state need afterEach cleanup. AI often generates the action without the teardown.
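A pattern that makes missing teardown harder to overlook is to register the cleanup at the moment the resource is created, then run everything in afterEach. The following is a minimal sketch, not a framework feature: CleanupRegistry, createOrder and the in-memory orders array are hypothetical stand-ins for your own helpers and database.

```typescript
type CleanupFn = () => void | Promise<void>;

class CleanupRegistry {
  private readonly tasks: CleanupFn[] = [];

  // Register teardown next to the action that needs it.
  add(task: CleanupFn): void {
    this.tasks.push(task);
  }

  // Runs teardown in reverse creation order and keeps going after a failure,
  // so one broken cleanup does not leak the remaining resources.
  async runAll(): Promise<Error[]> {
    const errors: Error[] = [];
    for (let task = this.tasks.pop(); task !== undefined; task = this.tasks.pop()) {
      try {
        await task();
      } catch (err) {
        errors.push(err instanceof Error ? err : new Error(String(err)));
      }
    }
    return errors;
  }
}

// Usage sketch with an in-memory stand-in for a database table.
const orders: string[] = [];
const registry = new CleanupRegistry();

function createOrder(id: string): void {
  orders.push(id);
  registry.add(() => {
    orders.splice(orders.indexOf(id), 1);
  });
}

createOrder("order-1");
createOrder("order-2");

// In a real suite this call would live in afterEach.
void registry.runAll().then((errors) => {
  console.log(orders.length, errors.length); // 0 0
});
```

Returning the collected errors lets the afterEach hook decide whether a failed cleanup should fail the test.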

Naming — Does the test name describe the scenario precisely? "should work" is not a useful test name. "should display error message when card number is invalid" is.

Coverage — Does the test cover what it claims? Read the test name, then read the assertions. Do they match? AI test names and AI assertion content sometimes diverge.

Common AI mistakes in test code

Asserting the obvious. Claude sometimes generates expect(page).toBeTruthy() or expect(button).toBeDefined(). A page or locator object is always truthy and defined, so these assertions always pass and test nothing.

Testing implementation over behaviour. Assertions on internal component state or DOM structure rather than user-visible outcomes. The test should reflect what a user would observe.

Catching and swallowing errors.

try {
  await page.click('[data-testid="submit"]');
} catch {
  // ignore
}

This pattern in a generated test hides real failures. The test always passes regardless of whether the click succeeded.
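The effect is easy to reproduce without a browser. In this sketch, clickSubmit is a hypothetical stand-in for a UI action that fails, and runTest is a deliberately tiny runner in which a test passes unless its body throws.

```typescript
// Stand-in for a UI action that fails, e.g. a click on a missing element.
async function clickSubmit(): Promise<void> {
  throw new Error('element [data-testid="submit"] not found');
}

// Minimal runner: a test passes unless its body throws.
async function runTest(body: () => Promise<void>): Promise<"passed" | "failed"> {
  try {
    await body();
    return "passed";
  } catch {
    return "failed";
  }
}

// The generated pattern: the failure never reaches the runner.
const swallowing = async (): Promise<void> => {
  try {
    await clickSubmit();
  } catch {
    // ignore
  }
};

// The fix: no try/catch, so the failure propagates.
const propagating = async (): Promise<void> => {
  await clickSubmit();
};

void (async () => {
  console.log(await runTest(swallowing));  // passed
  console.log(await runTest(propagating)); // failed
})();
```

If a test genuinely expects an error, assert on that error explicitly instead of catching and discarding it.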

Using deprecated APIs. AI training data includes old documentation. Claude sometimes suggests cy.server() in Cypress (deprecated in Cypress 6 and removed in Cypress 12), or Selenium APIs that were deprecated years ago. Verify against the current framework docs.

Missing negative assertions. The happy path is covered; the error paths have no assertions. "User sees an error message" often becomes code that submits the invalid form but asserts nothing about the resulting state.
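Several of these mistakes leave textual fingerprints, so a rough scan can flag them before a human reads the test. The sketch below is a heuristic pre-filter under stated assumptions, not a replacement for review: reviewTestSource, Finding and the regular expressions in RULES are all hypothetical, and they will miss variants and occasionally flag legitimate code.

```typescript
interface Finding {
  line: number;
  rule: string;
}

// Illustrative patterns only; tune them to your own codebase.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "time-based wait", pattern: /waitForTimeout\s*\(/g },
  { rule: "vacuous assertion", pattern: /\.(?:toBeTruthy|toBeDefined)\s*\(\s*\)/g },
  { rule: "swallowed error", pattern: /catch\s*(?:\([^)]*\))?\s*\{(?:\s|\/\/[^\n]*\n)*\}/g },
  { rule: "hard-coded email", pattern: /["'`][\w.+-]+@[\w-]+\.[\w.]+["'`]/g },
];

// Scans test source text and reports the 1-based line where each match starts.
function reviewTestSource(source: string): Finding[] {
  const findings: Finding[] = [];
  for (const { rule, pattern } of RULES) {
    pattern.lastIndex = 0; // global regexes keep state between calls
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      const line = source.slice(0, match.index).split("\n").length;
      findings.push({ line, rule });
    }
  }
  return findings.sort((x, y) => x.line - y.line);
}

const sample = [
  "test('should work', async ({ page }) => {",
  "  await page.waitForTimeout(3000);",
  "  try {",
  "    await page.click('[data-testid=\"submit\"]');",
  "  } catch {",
  "    // ignore",
  "  }",
  "  expect(page).toBeTruthy();",
  "});",
].join("\n");

// Reports a time-based wait on line 2, a swallowed error on line 5,
// and a vacuous assertion on line 8.
console.log(reviewTestSource(sample));
```

A scan like this handles only the mechanical patterns. Implementation-coupled assertions and missing negative assertions still need a reader who knows what the feature is supposed to do.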

Using Claude Code to review its own output

The "review your own work" prompt catches a significant fraction of issues:

> Review tests/checkout/refund.spec.ts — you just generated it.
> Apply our checklist:
> - Are selectors stable? (data-testid preferred)
> - Are assertions meaningful or just presence checks?
> - Is there proper cleanup in afterEach?
> - Are any waits time-based instead of condition-based?
> - Does the test name match what the test actually verifies?
> Suggest specific improvements.

This does not replace your own review: you still read both the original and the critique. But it catches the mechanical issues (missing teardown, timeout-based waits) before you spend your review time on them.

AI-generated test quality — before and after review

Unreviewed AI test

  • Asserts element is visible — passes even when the content is wrong

  • waitForTimeout(3000) — hides timing issues

  • Hardcoded user@test.com — breaks in parallel

  • No afterEach — pollutes subsequent tests

  • Passes in CI. Misses the regression it was written for.

Reviewed and refined

  • Asserts confirmation text content — catches real failures

  • waitForResponse('**/api/payment') — robust wait

  • UserFactory.random() — safe for parallel runs

  • afterEach cleans created records

  • Passes in CI. Catches the regression it was written for.

Building review into the team workflow

For teams adopting Claude Code at scale, encode the checklist in your PR process:

  • Add a pull request template checkbox: "AI-generated tests reviewed against quality checklist"
  • Require at least one human reviewer who was not the one who prompted Claude Code
  • Track metrics for the first 60 days: what fraction of AI-generated tests needed changes before merge? Expect the number to shrink as your CLAUDE.md improves.
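The metric in the last bullet is a simple ratio: tests changed during review divided by tests reviewed. A minimal sketch of the arithmetic follows, with hypothetical names (ReviewRecord, changeRate) and made-up sample records; how you collect the records depends on your PR tooling.

```typescript
// One record per AI-generated test that went through review.
interface ReviewRecord {
  testFile: string;
  changedBeforeMerge: boolean;
}

// Fraction of reviewed tests that needed changes; 0 when nothing was reviewed.
function changeRate(records: ReviewRecord[]): number {
  if (records.length === 0) return 0;
  const changed = records.filter((r) => r.changedBeforeMerge).length;
  return changed / records.length;
}

// Made-up sample data for two consecutive 30-day windows.
const firstWindow: ReviewRecord[] = [
  { testFile: "refund.spec.ts", changedBeforeMerge: true },
  { testFile: "login.spec.ts", changedBeforeMerge: true },
  { testFile: "cart.spec.ts", changedBeforeMerge: true },
  { testFile: "search.spec.ts", changedBeforeMerge: false },
];
const secondWindow: ReviewRecord[] = [
  { testFile: "profile.spec.ts", changedBeforeMerge: true },
  { testFile: "invoice.spec.ts", changedBeforeMerge: false },
  { testFile: "coupon.spec.ts", changedBeforeMerge: false },
  { testFile: "export.spec.ts", changedBeforeMerge: false },
];

console.log(changeRate(firstWindow));  // 0.75
console.log(changeRate(secondWindow)); // 0.25
```

Comparing windows matters more than any single value: a falling rate suggests the shared instructions are improving, and a flat one suggests the checklist findings are not being fed back.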

The right mental model

Claude Code is a capable junior pair programmer. Fast, knowledgeable about framework APIs, tireless for boilerplate. Weak on domain-specific edge cases, business rules, and the things that only your team knows about how your system actually behaves. Reviewing AI output is not a formality — it is the human contribution that makes the pairing productive.

⚠️ Common Mistakes

  • Rubber-stamping tests because they pass. A test that runs successfully and asserts nothing meaningful has negative value — it adds maintenance cost while providing no coverage signal. Running is necessary, not sufficient.
  • Reviewing for style but not for correctness. "This looks like valid TypeScript" is a different check from "if this feature broke, would this assertion catch it?" Both are worth doing; the second is the one that matters.
  • Skipping review under time pressure. The review is cheapest when the code is fresh. A subtle assertion bug discovered three months later — after the test has been green through multiple deploys — is far more expensive to diagnose.

🎯 Practice Task

Apply the review checklist to a generated test. 15 minutes.

  1. Use Claude Code to generate a test for a feature in a project you know well.
  2. Apply the checklist: selectors, assertions, test data, timing, cleanup, naming, coverage.
  3. Find at least one issue — there almost always is one.
  4. Ask Claude Code to fix it using the specific checklist language.
  5. Compare the before and after: what changed?

The next lesson covers the practical side of sustaining Claude Code use — cost management, model selection, and the workflows worth avoiding.
