Using Claude and Copilot for test writing: a practical playbook

qa.codes · 30 December 2025 · 10 min read

Here's the practical playbook for AI-assisted test writing in 2026. The prompts that work, the prompts that don't, and the human-in-the-loop checkpoints that keep AI from writing tests that pass for the wrong reasons.

I've been working through what actually produces useful results with AI-assisted test writing — the patterns that save real time versus the ones that generate plausible-looking tests that don't test much. This is the playbook I keep coming back to.

The four jobs AI is good at

Understanding where AI earns its place matters before talking about prompts. These are the test writing jobs where AI produces reliably useful output:

Scaffolding from function signatures. Given a function's signature, types, and a brief description of what it does, AI can produce the correct test file structure — imports, describe blocks, beforeEach setup, afterEach teardown — fast. This is mechanical work that follows patterns; AI is good at patterns.

Generating realistic mock data. "Give me 15 realistic user records with varied names, valid emails, dates across a 5-year range, and at least 3 edge cases (empty middleName, international characters in name, maximum-length email)" produces much better test data than manually writing fixtures. AI generates variety that hand-written data rarely has.

Parametrising existing tests. If you have a test that runs for one case and you want it to run for a dozen, AI can expand it into a test.each table. Provide the single test and describe the additional cases verbally; AI produces the table. The cases are predictable (typical values, empty, null, boundary numbers) but comprehensive enough to catch real bugs.

Refactoring brittle selectors. Paste the rendered HTML of a component and ask AI to suggest stable locators — getByRole, getByLabel, data-testid — ranked by stability. AI reads HTML competently and knows which selectors are more resistant to DOM changes.

The four jobs to keep human

These are the test writing jobs where AI produces plausible output that requires careful verification:

Defining what the test is for. The test spec — what behaviour does this test exist to verify, and why would it matter if that behaviour changed — is the intellectual core of a test. AI doesn't have context about your requirements, your users' expectations, or the history of bugs in this area. You do.

Writing the core assertion. The assertion specifies what the code promises. AI writes assertions that look correct but often assert the obvious rather than the important. As covered in the AI-generated tests opinion post: a test that asserts expect(result).toBeTruthy() isn't testing anything meaningful. Write the assertion yourself or verify AI's attempt explicitly.

Deciding which edge cases matter. AI generates a predictable set of edge cases: empty input, null, maximum length, special characters. The edge cases that actually break things in production are usually domain-specific — a specific date format your API returns, a user with a null address field from a legacy migration, a price value of exactly zero. AI doesn't know these exist unless you tell it.

Judging test value. Is this test worth maintaining? Does it overlap with another test? Does the coverage it provides justify its complexity? These are judgment calls that require knowing your codebase, your risk profile, and your team's maintenance bandwidth. AI has no context for any of these.
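To make the point about domain-specific edge cases concrete, here is a minimal sketch of the zero-price case mentioned above. The functions `displayPriceBuggy` and `displayPrice`, the fallback text, and the case tables are all hypothetical, invented for illustration. The generic cases an AI tends to produce pass against the buggy version; only the case that comes from knowing the domain exposes the bug.

```typescript
// Hypothetical function with a falsy-zero bug: `||` treats 0 like a missing value.
function displayPriceBuggy(price: number | null): string {
  const value = price || null;
  return value === null ? "Price unavailable" : `$${value.toFixed(2)}`;
}

// Fixed version: only null means "missing"; 0 is a real price.
function displayPrice(price: number | null): string {
  return price === null ? "Price unavailable" : `$${price.toFixed(2)}`;
}

type PriceCase = { name: string; input: number | null; expected: string };

// The predictable cases: typical value, null, large number.
const genericCases: PriceCase[] = [
  { name: "typical value", input: 19.99, expected: "$19.99" },
  { name: "null", input: null, expected: "Price unavailable" },
  { name: "large value", input: 1000000, expected: "$1000000.00" },
];

// The case you only think of if you know free items exist in the catalogue.
const domainCases: PriceCase[] = [
  { name: "free item priced at exactly zero", input: 0, expected: "$0.00" },
];

// Returns the names of the cases where the function's output differs from expected.
function failures(fn: (p: number | null) => string, cases: PriceCase[]): string[] {
  return cases.filter((c) => fn(c.input) !== c.expected).map((c) => c.name);
}
```

Calling `failures(displayPriceBuggy, genericCases)` returns an empty list, while `failures(displayPriceBuggy, domainCases)` reports the zero-price case.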

The prompt patterns that work

Provide a function signature and one example test, then ask for parametrised variants:

Here's the function:
[paste function signature and implementation]

Here's one test I've written for it:
[paste your existing test]

Generate a test.each table with 8 more cases that exercise:
- boundary values for the numeric parameters
- empty string inputs
- null and undefined where the types allow
- values that should produce different output branches
Keep the same assertion structure as my example test.

This pattern works because you're specifying the assertion structure and delegating the data generation. The AI doesn't write the assertion logic — it generates the cases that run through your assertion.
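As a rough sketch of the output this prompt aims for, assume a hypothetical `clamp` function and a human-written equality check. The first row stands in for your original test and the other eight are the kind of rows AI would add. In Jest or Vitest you would typically pass the array to `test.each`; a plain loop is used here only so the sketch runs without a test runner.

```typescript
// Hypothetical function under test.
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

type ClampCase = { name: string; value: number; min: number; max: number; expected: number };

// The human-written assertion: exact equality with the expected output.
function check(c: ClampCase): boolean {
  return clamp(c.value, c.min, c.max) === c.expected;
}

const cases: ClampCase[] = [
  // Row from the original hand-written test.
  { name: "inside range", value: 5, min: 0, max: 10, expected: 5 },
  // Generated rows: boundaries and the different output branches.
  { name: "at lower bound", value: 0, min: 0, max: 10, expected: 0 },
  { name: "at upper bound", value: 10, min: 0, max: 10, expected: 10 },
  { name: "just below lower bound", value: -1, min: 0, max: 10, expected: 0 },
  { name: "just above upper bound", value: 11, min: 0, max: 10, expected: 10 },
  { name: "far below range", value: -1e9, min: 0, max: 10, expected: 0 },
  { name: "zero-width range", value: 3, min: 7, max: 7, expected: 7 },
  { name: "negative range", value: -5, min: -10, max: -1, expected: -5 },
  { name: "fractional value", value: 0.5, min: 0, max: 1, expected: 0.5 },
];

// Names of failing rows; an empty array means every row passed.
const failed: string[] = cases.filter((c) => !check(c)).map((c) => c.name);
```

Every row runs through the same `check`, so reviewing the table means reviewing the data only, not nine separate assertions.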

Provide the rendered HTML and ask for stable locators:

I need to locate these elements in a Playwright test. Provide locators in this order of preference: 
getByRole with accessible name, getByLabel, getByTestId, and as a last resort CSS selector.

[paste relevant HTML]

For each element, explain why you picked that locator over the alternatives.

The explanation request is important — it surfaces reasoning you can check rather than just selectors you have to trust.
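The preference order in that prompt can also be written down as data, which makes it easy to check AI's suggestions against it. This is only a sketch: `LocatorCandidate` and `pickLocator` are hypothetical names, and the candidate strings are illustrative snippets using Playwright's `getByRole`, `getByTestId`, and `locator` calls rather than output from a real page.

```typescript
type Strategy = "role" | "label" | "testId" | "css";

interface LocatorCandidate {
  strategy: Strategy;
  code: string; // the locator expression as it would appear in a test
}

// Most stable first, matching the order requested in the prompt.
const preference: Strategy[] = ["role", "label", "testId", "css"];

// Returns the candidate with the most preferred strategy, or undefined if there are none.
function pickLocator(candidates: LocatorCandidate[]): LocatorCandidate | undefined {
  return [...candidates].sort(
    (a, b) => preference.indexOf(a.strategy) - preference.indexOf(b.strategy)
  )[0];
}

// Hypothetical candidates for a single "Save" button.
const saveButton: LocatorCandidate[] = [
  { strategy: "css", code: "page.locator('form > div:nth-child(3) button')" },
  { strategy: "testId", code: "page.getByTestId('save-btn')" },
  { strategy: "role", code: "page.getByRole('button', { name: 'Save' })" },
];
```

Here `pickLocator(saveButton)` selects the `getByRole` candidate, and the CSS selector is chosen only when nothing else is available.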

Provide a working test and ask for mock data expansion:

This test uses a hardcoded user fixture. Generate 10 alternative user objects that would
exercise different code paths — vary the role field across all four valid values, include 
one user with a null preferredName, include one with an email at exactly 254 characters,
and include two users with createdAt dates in the past 24 hours.

Specific and constrained data generation requests produce usable fixtures. Open-ended "give me test data" requests produce generic data.
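For comparison, this is roughly what a usable answer to that request could look like. The `User` shape, the four role names, and the `buildUsers` helper are assumptions made for this sketch, not taken from a real codebase. Passing `now` in as a parameter keeps the "past 24 hours" users deterministic.

```typescript
type Role = "admin" | "editor" | "viewer" | "guest"; // assumed set of four valid roles

interface User {
  id: number;
  role: Role;
  preferredName: string | null;
  email: string;
  createdAt: Date;
}

// Exactly 254 characters: a 64-character local part, "@", and a 189-character
// domain built from labels of at most 63 characters each.
const longEmail: string =
  "a".repeat(64) + "@" + ["b".repeat(63), "c".repeat(63), "d".repeat(57), "com"].join(".");

function buildUsers(now: Date): User[] {
  const hoursAgo = (h: number): Date => new Date(now.getTime() - h * 60 * 60 * 1000);
  const roles: Role[] = ["admin", "editor", "viewer", "guest"];

  // Ten baseline users cycling through the roles, all created at least 30 days ago.
  const users: User[] = Array.from({ length: 10 }, (_, i) => ({
    id: i + 1,
    role: roles[i % roles.length],
    preferredName: `User ${i + 1}`,
    email: `user${i + 1}@example.com`,
    createdAt: hoursAgo(24 * (30 + i)),
  }));

  // The specific cases the prompt asked for.
  users[2].preferredName = null; // one user with a null preferredName
  users[5].email = longEmail; // one email at exactly 254 characters
  users[8].createdAt = hoursAgo(1); // two users created in the past 24 hours
  users[9].createdAt = hoursAgo(23);
  return users;
}
```

Each requested constraint maps to one visible line, which is what makes the fixture quick to review.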

The prompt patterns that don't work

"Write all tests for this file." The scope is too broad for AI to produce quality output. AI generates something for every function (a happy path, a null check, maybe an error case) and the coverage looks comprehensive in aggregate. But each individual test is shallow. You get volume, not quality. Skimming and accepting weak tests is faster than catching them, so the result is a test suite that provides false confidence.

"Write tests to improve coverage to 80%." Coverage is a means, not an end. AI optimising for a coverage metric produces tests that execute lines of code without asserting meaningful behaviours. The 80% is achieved; the tests are useless.

"What tests should I write for this function?" This question asks AI to do the specification work — decide what the function promises and what edge cases matter. AI will produce a list that looks reasonable but is missing the cases that matter to your specific product and user base. Use AI to implement test ideas you've already had, not to generate the ideas.

Claude vs Copilot for test writing

Both are useful; they fit different parts of the workflow.

Copilot (and GitHub Copilot Chat) excels at inline completions while you're writing. You type test('returns 404 when user does not exist', and Copilot completes the test body based on context it infers from your codebase. This is fast and contextually aware: it sees your existing test patterns and follows them. The limitation is that inference quality drops for complex assertions and novel test structures.

Claude (in a standalone conversation or IDE extension) excels at larger-scope tasks that need more direction: generating a complete test file from a spec, reasoning about test structure, comparing two testing approaches, or reviewing a test file for weak assertions. Providing a long context (the function, the types, the requirements, a few example tests) and asking for a complete output produces better results from Claude than from Copilot's autocomplete model.

The practical workflow: use Copilot for inline completion while writing test boilerplate and data. Use Claude for generating complete test files from scratch, refactoring an existing suite's structure, or generating parametrised data sets for complex functions.

The checkpoint that catches bad tests

After AI generates a test (or a batch of tests), run this check before committing:

"If the implementation returned hardcoded fake data that matched the obvious expected value, would this test still pass?"

If yes, the assertion doesn't constrain the implementation enough to be useful. The test passes because the AI generated an assertion that matches whatever the function happens to return, not because the function is doing something specific.
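The checkpoint can be made mechanical. The sketch below uses a hypothetical `cartTotal` function and a hardcoded fake that returns the obvious expected value; the carts and numbers are invented for illustration. A check is worth keeping only if it passes for the real implementation and fails for the fake.

```typescript
interface Cart {
  items: { price: number; qty: number }[];
}

// Hypothetical real implementation: sums price * quantity over all items.
function cartTotal(cart: Cart): number {
  return cart.items.reduce((sum, item) => sum + item.price * item.qty, 0);
}

// Hardcoded fake: ignores its input and returns the "obvious" expected value.
function cartTotalFake(_cart: Cart): number {
  return 42;
}

const obviousCart: Cart = { items: [{ price: 21, qty: 2 }] }; // real total: 42
const secondCart: Cart = {
  items: [
    { price: 5, qty: 3 },
    { price: 10, qty: 1 },
  ],
}; // real total: 25

// Each check takes an implementation and reports whether the assertion holds.
type TotalCheck = (total: (cart: Cart) => number) => boolean;

const softCheck: TotalCheck = (total) => total(obviousCart) > 0;
const singleExampleCheck: TotalCheck = (total) => total(obviousCart) === 42;
const constrainedCheck: TotalCheck = (total) =>
  total(obviousCart) === 42 && total(secondCart) === 25;

// The checkpoint: a useful check passes for the real code and fails for the fake.
function constrainsImplementation(check: TotalCheck): boolean {
  return check(cartTotal) && !check(cartTotalFake);
}
```

Only `constrainedCheck` survives the checkpoint. The soft check and the single-example check both pass against the fake, so a second input with a different expected value is often the cheapest fix.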

A second check: "What change to the implementation would make this test fail?" If you can't answer in five seconds, the failure condition is unclear and the assertion isn't specific enough.

The most dangerous AI test output is the test that looks thorough (multiple assertions, realistic data, edge cases) but asserts soft conditions throughout: toBeDefined() instead of toBe('specific-expected-value'), toHaveProperty('length') instead of toHaveLength(3), toBeGreaterThan(0) instead of toBe(42). Each individual assertion is technically valid. Together, they specify almost nothing.

Spend 30 seconds per test on the checkpoint. It's faster than debugging a production bug that a meaningless passing test failed to catch.
