Debugging Failed Tests with AI Assistance

A Playwright test fails in CI with "locator getByText('Welcome') not found within 30s." The message is honest but useless on its own: was the login broken, was the page slow, did the welcome text change, or was it just a flake? Playwright MCP turns that one-line failure into a one-prompt investigation. The assistant re-runs the test against the live environment, captures the page state at every step, and writes back a verdict: either "real failure, here's why" or "flake, here's what differed between runs." This lesson covers the prompt patterns for the four most common failure shapes, and the mindset shift that AI is for diagnosing tests, not just authoring them.

The single biggest time-saver in adopting this is how quickly you can settle whether a failure is a flake. A 30-second AI investigation that says "ran 5 times, all green" is worth more than an hour of staring at a trace looking for a pattern that isn't there.

A prompt that turns a CI failure into an answer

This Playwright test failed in CI:
 
import { test, expect } from '@playwright/test';

test('user can log in', async ({ page }) => {
  await page.goto('https://staging.myapp.com/login');
  await page.getByLabel('Email').fill('admin@test.com');
  await page.getByLabel('Password').fill('Admin123!');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Welcome')).toBeVisible();
});
 
Error: Locator getByText('Welcome') not found within 30s.
CI build: 2026-05-08 14:22 UTC, deploy SHA 8a3c9f2.
 
Run this test against https://staging.myapp.com right now and report:
 
1. Did the test fail in the same way? (yes/no/different failure)
2. If yes — investigate the page state at the failure point. What is actually
   on the screen instead of "Welcome"? Login error? Different copy? Server error?
3. If no — was it a flake? Run the test 5 times in a row and report pass/fail
   counts plus anything that differs between runs.
4. Save a Playwright trace for the failure (or one failing run if intermittent).
   Tell me the file path.
5. Propose the fix — either a code change to the test, a code change to the app,
   or a wait/synchronisation adjustment.

Five things matter in that prompt: the test code (verbatim), the failure message (verbatim), the environment (URL + deploy SHA), the four diagnostic questions, and the explicit "propose the fix" close. Without the structure the assistant defaults to vague descriptions; with it, you get a verdict.
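If you send this kind of prompt often, the five parts can be assembled by a small helper so none of them gets dropped. The sketch below is one way to do that in TypeScript. The names `FailureReport` and `buildDiagnosticPrompt` are hypothetical and belong to neither Playwright nor MCP; the wording of the questions is a condensed version of the prompt above.

```typescript
// Hypothetical helper for illustration; these names are not part of any library.
interface FailureReport {
  testCode: string;     // the failing test, pasted verbatim
  errorMessage: string; // the CI failure message, pasted verbatim
  baseUrl: string;      // environment to re-run against
  deploySha: string;    // deploy identifier reported by the CI build
  flakeRuns: number;    // repeat runs to request if the failure does not reproduce
}

// Assembles the prompt: code, error, environment, four questions, fix request.
function buildDiagnosticPrompt(report: FailureReport): string {
  return [
    'This Playwright test failed in CI:',
    '',
    report.testCode.trim(),
    '',
    report.errorMessage.trim(),
    `Deploy SHA: ${report.deploySha}`,
    '',
    `Run this test against ${report.baseUrl} right now and report:`,
    '',
    '1. Did the test fail in the same way? (yes/no/different failure)',
    '2. If yes, what is on the screen at the failure point instead of the expected content?',
    `3. If no, run the test ${report.flakeRuns} times in a row and report pass/fail counts plus anything that differs between runs.`,
    '4. Save a Playwright trace for one failing run and tell me the file path.',
    '5. Propose the fix: a test change, an app change, or a wait/synchronisation adjustment.',
  ].join('\n');
}
```

Keeping the test code and error as plain string fields makes it hard to paraphrase them by accident, which is the point of pasting them verbatim.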

The four failure shapes you'll see most

Different failure modes need different investigation patterns. The assistant adapts, but you can prime the right one in the prompt.

Real broken behaviour. The app changed — copy, flow, validation. The test correctly catches it. The fix is in the test (update the expected text) or in the app (revert the regression). The assistant's verdict will read: "Reproduces deterministically. The post-login heading now reads 'Welcome back' instead of 'Welcome'."

Flake from missing synchronisation. The test passes when the network is fast, fails when it isn't. The fix is to wait for the right signal. "Ran 5 times: 3 pass, 2 fail. The failing runs show the welcome text appearing 1.2–2.4 seconds after login. The 30s assertion timeout is enough; the issue is that getByText('Welcome') first matches a footer link with the same text on the homepage redirect path. Scope the locator to the main heading."

Flake from data assumptions. The test passes against a fresh database, fails against one that's been used. "Reproduces in CI but not locally because the test user already has 47 orders by the time CI runs; the post-login dashboard shows the orders list, not the welcome heading. Either reset state in beforeEach or scope the assertion."

Environment-specific failure. Passes on the developer's machine, fails on staging. "Reproduces on staging only. The /login redirect path differs — staging goes through SSO and the welcome text is delayed by the SSO callback. Add a wait for the URL change before asserting."

The assistant figures out which shape it's looking at; you just have to read the answer and act.

The debug-with-AI loop

CI fails: Single-line failure message, possibly a…
Hand the test + error to Claude: Verbatim code, verbatim error, environme…
AI re-runs against the live env: Drives the flow via MCP, captures page s…
Verdict + artefacts: Real failure / flake / data issue / env-…
Apply the proposed fix: Update the test, the locator, the wait s…
Re-run in CI: Confirm the fix holds across consecutive…

Specialised prompts for specific failure shapes

For a suspected flake, force the question:

Run this test 5 times in a row against staging. Report pass/fail counts.
For any failure, capture the page state at the moment the assertion failed
and tell me what differs from the passing runs.
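Once the counts come back, the judgement is a simple rule on the pass/fail ratio. The sketch below writes that rule down. `RunOutcome`, `RunSummary`, and `summarizeRuns` are hypothetical names for illustration, and the three verdict labels are an assumption about how you might word the result. A clean sample of five only means the failure did not reproduce in those five runs; it does not prove the test is stable.

```typescript
// Hypothetical types for illustration; not part of Playwright.
type RunOutcome = 'passed' | 'failed';
type RunVerdict = 'did not reproduce' | 'reproduces deterministically' | 'intermittent';

interface RunSummary {
  passed: number;
  failed: number;
  verdict: RunVerdict;
}

// Turns a list of repeat-run outcomes into counts plus a coarse verdict.
function summarizeRuns(outcomes: RunOutcome[]): RunSummary {
  if (outcomes.length === 0) {
    throw new Error('Need at least one run to judge');
  }
  const failed = outcomes.filter((outcome) => outcome === 'failed').length;
  const passed = outcomes.length - failed;

  let verdict: RunVerdict;
  if (failed === 0) {
    verdict = 'did not reproduce';
  } else if (passed === 0) {
    verdict = 'reproduces deterministically';
  } else {
    verdict = 'intermittent';
  }
  return { passed, failed, verdict };
}
```

For the "3 pass, 2 fail" example earlier in the lesson, this returns an `intermittent` verdict, which is the cue to compare the failing runs against the passing ones.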

For a suspected race condition, make the network slow:

Reproduce this test, but throttle the network to slow 3G for the login request.
Does the failure reproduce now? If yes, the test is missing synchronisation —
identify which call needs an explicit wait.
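If you want to reproduce the slow-network condition yourself, one option is to hold the matching request for a fixed delay using Playwright's `page.route` and `route.continue`. This is a rough approximation, not true slow-3G throttling. The `PageLike` and `RouteLike` interfaces below are structural stand-ins (an assumption) so the sketch compiles without Playwright installed; a real `Page` should satisfy them. `delayRequests` is a hypothetical helper name.

```typescript
// Structural stand-ins (assumptions) for the parts of Playwright's Page and
// Route that this sketch touches.
interface RouteLike {
  continue(): Promise<void>;
}
interface PageLike {
  route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Holds every request matching urlPattern for delayMs, then lets it through.
async function delayRequests(
  page: PageLike,
  urlPattern: string,
  delayMs: number,
): Promise<void> {
  await page.route(urlPattern, async (route) => {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    await route.continue();
  });
}
```

In a test you would call something like `await delayRequests(page, '**/login', 3000)` before submitting the form; the URL pattern here is a placeholder for whatever request your login actually makes. If the assertion starts failing only with the delay in place, the test is racing that request.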

For an environment skew, run side by side:

Run this test against https://staging.myapp.com and against
https://stg-canary.myapp.com (the canary deploy). Report whether each fails.
If only one fails, capture the deploy SHA and the page state at the
divergence point on each.

The pattern: be specific about the hypothesis, ask the assistant to test it, accept the verdict.

Capturing the artefact you actually want

Always ask for a Playwright trace. The chat narrative is fine for the verdict, but the trace is the artefact you can replay locally:

Reproduce the test, capture a Playwright trace via context.tracing.start /
context.tracing.stop, and save it as failure-trace.zip. Tell me the file path.

Then npx playwright show-trace failure-trace.zip opens the timeline UI — every action, every snapshot, every network call, scrubbable. That's your real debugging surface; the AI's verdict was just the index entry that got you to the right trace fast.
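For reference, the tracing calls named in that prompt look roughly like this when you write them yourself. The sketch wraps any block of test steps so the trace is saved even when an assertion throws. `withTrace`, `ContextLike`, and `TracingLike` are hypothetical names; the interfaces are structural stand-ins (an assumption) for the `tracing.start` and `tracing.stop` methods on a Playwright browser context, declared here so the code compiles on its own.

```typescript
// Structural stand-ins (assumptions) mirroring context.tracing.start / stop.
interface TracingLike {
  start(options: { screenshots?: boolean; snapshots?: boolean; sources?: boolean }): Promise<void>;
  stop(options?: { path?: string }): Promise<void>;
}
interface ContextLike {
  tracing: TracingLike;
}

// Runs body with tracing on and always writes the trace to tracePath,
// including when body throws (which is the case you care about).
async function withTrace<T>(
  context: ContextLike,
  tracePath: string,
  body: () => Promise<T>,
): Promise<T> {
  await context.tracing.start({ screenshots: true, snapshots: true, sources: true });
  try {
    return await body();
  } finally {
    await context.tracing.stop({ path: tracePath });
  }
}
```

The `finally` block is the design choice that matters: a failing assertion rejects the promise, and without it the trace for exactly the run you wanted would never be written.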

The mindset shift

Most teams adopt Playwright MCP for authoring tests. Authoring is a once-per-test cost; debugging is a many-times-per-quarter cost. Over a year, the debugging side adds up to more: every flake and CI failure is a candidate for AI investigation, and each one shrinks from "open the trace, scrub for ten minutes" to "paste the failure, read the verdict, fix."

The qa.codes Playwright with TypeScript course covers the trace viewer in depth. AI-assisted debugging doesn't replace that skill — it pre-filters which traces actually need the deep dive. Most don't.

⚠️ Common mistakes

Asking "why is this failing?" without the test code or the error message. The assistant guesses, you get a generic answer. Always paste the test verbatim and the failure verbatim. Specificity in is specificity out.
Trusting "this looks like a flake" without the run count. The assistant says "appears intermittent" after one run; that's an assumption, not a measurement. Always ask for an explicit run-N-times count, then judge from the pass/fail ratio.
Skipping the trace capture. A verdict without a trace is half a debug. The first time the proposed fix doesn't actually solve it, you'll wish you had a replayable record. Always have the agent save the trace, every time.

🎯 Practice task

Use AI to triage three real CI failures. 30 minutes.

Open your CI dashboard and find three recent failed Playwright tests — ideally a mix of "clearly a real bug," "probably a flake," and "???".
For each, paste the test code, the error message, and the environment URL into a prompt using the structure above. Ask all four diagnostic questions explicitly.
Read the verdicts. For "flake" verdicts, double-check by asking the assistant to run 10 more times and report the wider sample. For "real failure" verdicts, open the captured trace and confirm the failure point matches the description.
Apply the proposed fix on a branch. Push, wait for CI, confirm green. Note for each whether the AI's diagnosis was right, partially right, or wrong — calibrating your own trust takes a handful of runs.
Stretch: wire a small script that, on a CI failure, automatically pastes the test + error into a chat, runs the diagnostic prompt, and posts the verdict and trace path back as a CI comment. The convenience this buys is the difference between "I'll investigate Monday" and "already fixed by lunch."

That closes Chapter 4. The remaining chapters tighten the lens on integration with your existing suite, the cost-and-latency reality, the security envelope, and the capstone project that ties everything together.