AI-generated tests are useful — but not for the reason you think

qa.codes · 8 January 2026 · 9 min read

Intermediate

ai, copilot, testing

The pitch: 'AI will write your tests.' The reality: AI writes 80% of a test 80% of the way, and the remaining 20% is exactly the part that makes it a test. Here's where AI actually saves time, where it's a trap, and the distinction that separates the two.

Part of: Testing AI products

The demo always shows the happy path

Every AI testing demo follows the same script. Open a function. Ask the AI to generate tests. Watch it produce a describe block with three nested it blocks in about four seconds. The audience nods. The function is tested.

Look at those generated tests carefully. They test that the function returns a value when given valid input. They test a null case. Maybe they test an empty string. The assertions are: thing exists, thing is truthy, thing equals the obvious expected value.

What they don't test: the second-order effects, the boundary conditions that actually break in production, the interaction with external state, the error message quality, the contract with the caller. The AI doesn't know those things because it can't read your requirements document, doesn't understand your user's mental model, and can't reason about what would hurt if it broke.

The assertion is the test. Everything else — the setup, the teardown, the mock configuration, the selector, the data — is scaffolding. The assertion is the specification: it says "this is what this code promises to do, and a failure means the promise is broken." AI is excellent at scaffolding. AI is mediocre to actively misleading at assertions.

Where AI genuinely saves time

The scaffolding work is real and significant. Writing it manually is tedious, error-prone, and not where the intellectual work of testing happens. AI is good here:

Test file structure from a function signature. Given a function signature and its type definitions, an AI can produce the describe/it/beforeEach/afterEach structure correctly, with imports configured and test runner setup in place. This alone saves 5–10 minutes per file.

Realistic mock data generation. Asking an AI to generate 10 realistic user objects with varied names, emails, dates, and edge-case values (empty strings, international characters, unusual date formats) produces better test data than most hand-written fixtures.

Test parametrisation. If you have a test that currently runs for one case, an AI can expand it into a parametrised test.each table covering a dozen cases in seconds. The cases are predictable — typical values, empty, null, very long, special characters — but they're real cases.

Selector hunting. In browser automation, finding the right locator for an element is mechanical work. An AI given the rendered HTML can suggest appropriate getByRole, getByLabel, or data-testid selectors faster than reading through the DOM manually.

Boilerplate and framework code. Mock setup, spy configuration, async handling patterns, before/after lifecycle hooks — AI writes this correctly and quickly. It's familiar code that follows established patterns.
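The parametrisation item above can be sketched without a test runner as a plain table of cases. `normaliseEmail`, `cases`, and `runCases` are hypothetical names invented for this illustration; in Jest the same table would typically be handed to `test.each`. The input column is the kind of thing an AI generates well. The expected column is still the specification, so it is worth writing or checking by hand.

```javascript
// Hypothetical function under test, invented for this sketch.
function normaliseEmail(input) {
  if (typeof input !== "string") return null;
  const trimmed = input.trim().toLowerCase();
  return trimmed.includes("@") ? trimmed : null;
}

const longEmail = "a".repeat(300) + "@example.com";

// Case table: inputs are cheap to generate, expected values are the spec.
const cases = [
  { name: "typical value", input: "Ada@Example.com", expected: "ada@example.com" },
  { name: "surrounding whitespace", input: "  bob@example.com ", expected: "bob@example.com" },
  { name: "empty string", input: "", expected: null },
  { name: "null", input: null, expected: null },
  { name: "no @ sign", input: "not-an-email", expected: null },
  { name: "very long value", input: longEmail, expected: longEmail },
];

// Run every row and collect the names of the rows that fail.
function runCases(fn, table) {
  return table
    .filter(({ input, expected }) => fn(input) !== expected)
    .map(({ name }) => name);
}

const failures = runCases(normaliseEmail, cases);
console.log(failures.length === 0 ? "all cases pass" : `failing: ${failures.join(", ")}`);
```

Returning the names of failing rows, rather than a single boolean, keeps the failure message specific, which is the same reason `test.each` reports each row as its own test.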

Where AI is actively harmful

The assertion is where AI confabulates. Not always, not obviously — but reliably enough that every AI-generated assertion needs a human check.

The failure mode is subtle: the AI generates an assertion that passes, asserts something real, and looks correct on inspection. But it asserts the obvious, not the important. It checks that user.id is defined, not that user.id matches the seeded value. It checks that the response is truthy, not that the response has the structure the caller expects. It checks that the function doesn't throw on valid input, not that it throws the right error, for the right reason, on invalid input.

// AI-generated assertion (looks fine, tests very little)
expect(result).toBeTruthy();
expect(result.data).toBeDefined();
 
// Human assertion (specifies the contract)
expect(result.data).toEqual({
  id: expect.any(String),
  email: 'test@example.com',
  role: 'admin',
  createdAt: expect.any(String),
});
expect(result.errors).toBeUndefined();

The AI version doesn't fail when the role is wrong, when the email is missing, or when an unexpected errors array appears. It passes. The test suite is green. The bug ships.
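To make that concrete, here is a framework-free sketch that runs both styles of check against a deliberately wrong result. The `buggyResult` object and the two checker functions are invented for illustration. They use plain comparisons that only approximate the Jest matchers shown earlier.

```javascript
// Hypothetical result from a buggy implementation: wrong role,
// missing email, and an unexpected errors array.
const buggyResult = {
  data: { id: "u_123", role: "viewer", createdAt: "2026-01-08T00:00:00Z" },
  errors: ["email is required"],
};

// Approximates the shallow assertions: result is truthy, result.data is defined.
function shallowCheck(result) {
  return Boolean(result) && result.data !== undefined;
}

// Approximates the contract assertions: exact keys, fixed email and role,
// string id and createdAt, and no errors field.
function contractCheck(result) {
  const data = result && result.data;
  if (!data) return false;
  const keys = Object.keys(data).sort().join(",");
  return (
    keys === "createdAt,email,id,role" &&
    typeof data.id === "string" &&
    data.email === "test@example.com" &&
    data.role === "admin" &&
    typeof data.createdAt === "string" &&
    result.errors === undefined
  );
}

console.log("shallow check passes on buggy result:", shallowCheck(buggyResult));
console.log("contract check passes on buggy result:", contractCheck(buggyResult));
```

The shallow check accepts the buggy object because it only asks whether something came back. The contract check rejects it on three separate grounds, any one of which would have been enough.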

The deeper problem: AI-generated assertion logic is hard to audit at speed. Because the code looks like a real test, reviewers treat it like a real test. The false confidence is worse than no test at all — it signals "this path is covered" when it's only superficially exercised.

The pattern that works vs the one that doesn't

Works: human writes the assertion in plain English or pseudocode, AI fills in the framework code around it.

Human: "Test that createUser returns a user object with the provided email 
and a system-generated UUID as the id, and that calling it twice with the 
same email throws DuplicateEmailError."

AI: [writes the describe block, imports, mock setup, and two it blocks with 
the assertions structured as specified]

The human has done the intellectual work — defined what the function promises, identified the error case, named the specific error type. The AI has done the scaffolding work — structured it into a test file that runs. That's a productive division of labour.
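One possible shape for the output of that prompt, written without a test runner so it can be run directly with Node. Everything here is a stand-in: the in-memory `makeCreateUser` factory, the `DuplicateEmailError` class body, the `newId` helper, and the UUID pattern are assumptions made for this sketch, since the prompt only names the function and the error type. In a Jest file the two checks would become two `it` blocks.

```javascript
// Stand-in for the error type named in the human's specification.
class DuplicateEmailError extends Error {
  constructor(email) {
    super(`A user with email ${email} already exists`);
    this.name = "DuplicateEmailError";
  }
}

// Stand-in ID generator that produces a UUID v4-shaped string.
// Assumption: a real system would use a proper UUID library or crypto API.
function newId() {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = Math.floor(Math.random() * 16);
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

// Stand-in implementation backed by an in-memory set of emails.
function makeCreateUser() {
  const seen = new Set();
  return function createUser(email) {
    if (seen.has(email)) throw new DuplicateEmailError(email);
    seen.add(email);
    return { id: newId(), email };
  };
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;

// Assertion 1: the returned user carries the provided email and a UUID id.
function returnsUserWithEmailAndUuid(createUser) {
  const user = createUser("test@example.com");
  return user.email === "test@example.com" && UUID_PATTERN.test(user.id);
}

// Assertion 2: a second call with the same email throws DuplicateEmailError.
function rejectsDuplicateEmail(createUser) {
  createUser("dup@example.com");
  try {
    createUser("dup@example.com");
    return false; // no error at all is a failure
  } catch (err) {
    return err instanceof DuplicateEmailError; // any other error type fails
  }
}

console.log("returns user:", returnsUserWithEmailAndUuid(makeCreateUser()));
console.log("rejects duplicate:", rejectsDuplicateEmail(makeCreateUser()));
```

Each check takes a fresh `createUser` so state from one case cannot leak into the other, which is the job a `beforeEach` hook would do in a test runner.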

Doesn't work: "Write all tests for this file."

The output is quantity over quality: a test file that achieves coverage metrics by testing every function's happy path with assertions so shallow they'd pass against a mock that returned fixed values regardless of input. The test file looks comprehensive. It tests almost nothing.

The useful mental model: AI is a junior developer who writes very fast, makes predictable mistakes, and needs explicit instruction about what matters. The instruction is your job.

The false confidence risk

The most expensive AI testing failure isn't a test that fails incorrectly — that's visible and fixable. It's a test that passes incorrectly: one that gives the team confidence that a path is covered when it isn't.

False confidence compounds. A developer sees the AI-generated test suite is green, ships the feature, and the bug reaches users. The post-mortem asks why the tests didn't catch it. The answer, "the assertion was too shallow to catch this", erodes trust in the entire test suite, not just the AI-generated parts.

There's a question I've started asking about every AI-generated test: "What would make this test fail?" If the answer is "I'm not sure" or requires more than five seconds to arrive at, the test is too vague. A good test has a clear failure mode: it fails when the specific thing it asserts stops being true. If you can't articulate what that specific thing is, the assertion hasn't been written yet; the AI has written the structure around a missing assertion.

Use AI to write the structure faster. Write the assertions yourself. Check the AI's attempts against a simple criterion: if the implementation returned hardcoded fake data, would this test still pass? If yes, the assertion isn't doing the job.
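That criterion can be turned into a quick experiment. The sketch below runs two assertions against a real implementation and against a stub that ignores its input. `getUser`, `hardcodedGetUser`, `survivesHardcodedStub`, and the sample data are hypothetical names and values chosen for illustration.

```javascript
// Hypothetical implementation: looks the user up by id.
const users = {
  u1: { id: "u1", email: "ada@example.com" },
  u2: { id: "u2", email: "bob@example.com" },
};
function getUser(id) {
  return users[id];
}

// Stub that returns the same fake data regardless of input.
function hardcodedGetUser(_id) {
  return { id: "fake", email: "fake@example.com" };
}

// Two candidate assertions, each written as a predicate over the function under test.
const shallowAssertion = (fn) => fn("u2") !== undefined && Boolean(fn("u2").id);
const specificAssertion = (fn) => {
  const user = fn("u2");
  return user !== undefined && user.id === "u2" && user.email === "bob@example.com";
};

// An assertion that still passes against the stub is not constraining behaviour.
function survivesHardcodedStub(assertion) {
  return assertion(hardcodedGetUser);
}

for (const [name, assertion] of Object.entries({ shallowAssertion, specificAssertion })) {
  console.log(
    `${name}: passes on real = ${assertion(getUser)}, survives stub = ${survivesHardcodedStub(assertion)}`
  );
}
```

Both assertions pass against the real function, so a green run cannot tell them apart. Only the stub does. One caveat: a stub hardcoded to the exact expected values would still satisfy a single-case assertion, so checking more than one input makes the experiment more convincing.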

The practical recommendation: integrate AI into the scaffolding phase of test writing, not the specification phase. Let it generate the describe blocks, the mocks, the parametrised data tables. Then write the assertions yourself, or verify each AI-generated assertion against the hardcoded-data criterion above. The time savings from the scaffolding are real; the quality risk from the assertions is also real. Managing that boundary is the skill.
