Exploratory Testing with AI Agents

9 min read

Exploratory testing — unscripted, hypothesis-driven probing — is the kind of testing that finds the bugs scripted suites miss. It's traditionally manual, time-consuming, and depends heavily on the tester's instincts. Playwright MCP changes the economics: an AI agent can run a charter in minutes, systematically explore the input space a human wouldn't have patience for, and write up findings in a format developers can act on. This lesson covers the prompt structure that produces useful output, where AI exploration genuinely outperforms human testers, where it doesn't, and how to combine both for the best coverage.

The honest framing: AI is good at breadth — long, monotonous probing of edge cases — and weaker at judgement, the "this feels off" heuristic an experienced tester develops. Use it for what it's good at, keep humans in the loop for what they're good at, and the combination outperforms either alone.

The structure of a useful charter

A charter is a focused exploration brief. "Test the site" gives the assistant nowhere to start and no way to stop. A real charter looks like this:

You are an exploratory tester. Spend the next ten minutes finding bugs on
https://demo.myshop.com. Focus on:
 
- Signup flow — try edge cases: empty fields, very long strings, special characters,
  emoji, leading/trailing whitespace, repeated submissions.
- Cart — many items at once, very high quantities, rapid add/remove cycles, mixed
  in-stock and out-of-stock items.
- Checkout — invalid card numbers, mismatched billing/shipping countries, missing
  required fields, special characters in address fields.
 
Constraints:
- Use the test account demo@test.com / demo123. Do not create new real accounts.
- Do not submit any payments. Stop at the final review screen.
- Stay on the demo subdomain.
 
For every issue you find, document:
1. Steps to reproduce (precise, copy-pasteable)
2. Expected behaviour
3. Observed behaviour
4. Severity assessment (Critical, High, Medium, Low) with one-sentence rationale
5. Any console errors or failed network requests captured at the moment of failure
 
Format the final output as a numbered list of findings. If you found no bugs, say so
explicitly and list what you tried.

Five things make this prompt good:

  • A persona. "You are an exploratory tester" sets the lens.
  • A scoped target. Specific URL, specific areas, specific test data.
  • A budget. Ten minutes — bounds session length and cost.
  • Explicit constraints. No real payments, no new accounts, no off-domain wandering. Without these the agent can roam.
  • A reporting format. Severity rationale, console errors, network failures. The output can be triaged, not just read.
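One way to keep all five parts present is to hold the charter as data and render the prompt from it. The sketch below is only an illustration: the `Charter` class and its field names are hypothetical, not part of Playwright MCP, and the example values are taken from the prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class Charter:
    """The parts of a charter prompt, kept as data so none is forgotten."""
    persona: str
    target_url: str
    budget_minutes: int
    focus_areas: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    report_fields: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Assemble the prompt text; refuse to render an unbounded charter."""
        if self.budget_minutes <= 0:
            raise ValueError("a charter needs a positive time budget")
        if not self.focus_areas or not self.constraints or not self.report_fields:
            raise ValueError("focus areas, constraints and report fields are all required")
        lines = [
            f"You are {self.persona}. Spend the next {self.budget_minutes} minutes "
            f"finding bugs on {self.target_url}. Focus on:",
            *[f"- {area}" for area in self.focus_areas],
            "",
            "Constraints:",
            *[f"- {rule}" for rule in self.constraints],
            "",
            "For every issue you find, document:",
            *[f"{i}. {name}" for i, name in enumerate(self.report_fields, start=1)],
        ]
        return "\n".join(lines)


charter = Charter(
    persona="an exploratory tester",
    target_url="https://demo.myshop.com",
    budget_minutes=10,
    focus_areas=["Signup flow: empty fields, very long strings, emoji"],
    constraints=["Do not submit any payments. Stop at the final review screen."],
    report_fields=["Steps to reproduce", "Expected behaviour", "Observed behaviour"],
)
prompt = charter.render()
print(prompt)
```

Because `render()` raises on a missing budget, constraint list, or report format, an open-ended charter fails before it ever reaches the agent.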

What the assistant is actually good at

Two areas where AI exploration genuinely outperforms a human:

  • Systematic input-space coverage. Submitting a form with empty, short, long, very long, emoji, RTL, leading-whitespace, control-character, and SQL-like payloads in every combination, with no fatigue and no skipped permutations. Any tester who has tried to run that grid manually for an hour knows the boredom-driven skipping that follows.
  • Parallel breadth. Run three sessions against three different starting states (logged out, logged in as user, logged in as admin) at the same time. A single tester is single-threaded; AI sessions aren't.

These are the two uses most teams adopt AI exploration for. "Run a 30-minute boundary-value charter on the new signup form" used to be a Tuesday afternoon. It's now a coffee break.
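To make "every combination" concrete, here is a minimal sketch of the grid an agent would walk for a two-field form. The payload classes and the `build_grid` helper are illustrative assumptions; a real charter would list the classes that matter for your own form.

```python
from itertools import product

# Hypothetical payload classes per field; extend to match your own form.
NAME_VALUES = {
    "empty": "",
    "long": "a" * 256,
    "emoji": "Zoë 🚀",
    "whitespace": "  Ada  ",
    "rtl": "مرحبا",
    "sql_like": "Robert'); DROP TABLE users;--",
}
EMAIL_VALUES = {
    "empty": "",
    "valid": "demo@test.com",
    "no_at": "demo.test.com",
    "long_local": "a" * 65 + "@test.com",
}


def build_grid(fields):
    """Return every combination of payloads across the fields as a list of cases."""
    names = list(fields)
    label_sets = [list(fields[name].items()) for name in names]
    cases = []
    for combo in product(*label_sets):
        cases.append({
            # A readable id such as "name=emoji/email=no_at" for the report.
            "id": "/".join(f"{n}={label}" for n, (label, _) in zip(names, combo)),
            "values": {n: value for n, (_, value) in zip(names, combo)},
        })
    return cases


grid = build_grid({"name": NAME_VALUES, "email": EMAIL_VALUES})
print(len(grid))        # 6 name classes x 4 email classes
print(grid[0]["id"])
```

Six name classes and four email classes already give 24 cases. Adding a third field multiplies the count again, which is where manual runs tend to start skipping permutations.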

What the assistant is bad at

  • Subjective UX. "This feels sluggish" / "this layout is confusing" / "the copy here is wrong for this audience": the model has no taste and no domain context. An experienced tester catches these at a glance.
  • Domain risk modelling. "Refund flows are critical because of regulatory exposure" is something the team knows; the model doesn't. Without explicit charter direction, the AI will spend equal time on a frontend banner and a financial transaction.
  • Distinguishing real bugs from intended behaviour. "The form rejected my emoji name" — bug or feature? Without product context, the model defaults to bug, producing false positives. The charter has to specify what counts as a defect.

The honest division of labour: AI for breadth and edge cases; humans for UX, taste, and judgement of significance.
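Domain risk has to be written into the charter by a human. One simple way to do that, sketched here under the assumption that your team can assign rough risk weights to each area, is to split the time budget in proportion to those weights. The weights and area names are hypothetical.

```python
def allocate_minutes(total_minutes, risk_weights):
    """Split a charter's time budget across areas in proportion to risk weight."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not risk_weights or any(w <= 0 for w in risk_weights.values()):
        raise ValueError("every area needs a positive weight")
    total_weight = sum(risk_weights.values())
    return {
        area: round(total_minutes * weight / total_weight, 1)
        for area, weight in risk_weights.items()
    }


# Hypothetical weights chosen by the team, not by the model.
weights = {"refund flow": 5, "checkout": 3, "signup": 1, "homepage banner": 1}
plan = allocate_minutes(30, weights)
for area, minutes in plan.items():
    print(f"- {area}: about {minutes} minutes")
```

The printed lines can be pasted into the "Focus on" part of the charter so the agent spends half the session on the refund flow instead of treating every area equally.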

The AI-plus-human split, visualised

Exploratory Testing: Best Coverage

AI agent strengths:
  • Systematic edge-case enumeration
  • Parallel sessions, no fatigue
  • Captures console + network signals
  • Consistent reporting format

Human tester strengths:
  • Subjective UX judgement
  • Domain risk modelling
  • Distinguishing bug from feature
  • Stakeholder framing of findings

Best done together:
  • Pre-release smoke (AI breadth + human review)
  • Post-incident probing of related flows
  • New-feature charters before scripted tests exist

Keep with humans:
  • Final accept-for-release decisions
  • Anything tied to brand or legal copy
  • Performance and accessibility *feel*

Capturing artefacts so findings survive review

The agent's chat report is fine for reading but evaporates once you close the window. Three things to add to every charter prompt:

  • Save Playwright traces of any reproduction. "For each finding, capture a Playwright trace and tell me the file path." Each trace is a fully replayable record of the failing flow — far more useful in triage than a written description.
  • Export the report as Markdown. "Format the final output as a Markdown file I can paste into Linear/Jira." Headings, code fences for repro steps, severity tags. Ready to triage on first read.
  • Note environment context. "Include the staging deploy SHA, browser version, viewport size, and time of run with each finding." Without this, "can't repro" is a frequent reply when a developer tries the next day against a different deploy.
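As a reference for what a finding with its artefacts attached might look like, here is a small sketch that renders one as Markdown. The `Finding` and `Environment` classes, the field names, and every example value (SHA, browser version, trace path) are hypothetical. The agent produces the real report; this only shows the shape to ask for.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    deploy_sha: str
    browser: str
    viewport: str
    run_at: str  # timestamp of the session, e.g. ISO 8601


@dataclass
class Finding:
    title: str
    severity: str
    steps: list[str]
    expected: str
    observed: str
    trace_path: str


def to_markdown(finding, env):
    """Render one finding, with trace path and environment, as a Markdown section."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(finding.steps, start=1))
    return "\n".join([
        f"## [{finding.severity}] {finding.title}",
        "",
        "### Steps to reproduce",
        steps,
        "",
        f"**Expected:** {finding.expected}",
        f"**Observed:** {finding.observed}",
        f"**Trace:** `{finding.trace_path}`",
        "",
        f"**Environment:** deploy `{env.deploy_sha}`, {env.browser}, "
        f"viewport {env.viewport}, run at {env.run_at}",
    ])


env = Environment("abc1234", "Chromium 126", "1280x720", "2024-01-01T10:00:00Z")
finding = Finding(
    title="Signup accepts whitespace-only name",
    severity="Medium",
    steps=["Open the signup page", "Enter three spaces as the name", "Submit"],
    expected="Validation error on the name field",
    observed="Account review screen is shown",
    trace_path="traces/finding-01.zip",
)
md = to_markdown(finding, env)
print(md)
```

A section like this carries the deploy SHA and trace path with the finding, so a developer checking it the next day can tell whether they are testing the same build.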

Combining with your regression suite

The exploratory session ends with a list of findings. Two outcomes:

  • Real bugs → file as tickets, attach traces, prioritise.
  • Real bugs that should never have escaped to exploration in the first place → write a deterministic Playwright test (Chapter 3) so the next regression catches it pre-release. The exploratory session paid for itself the moment the bug was found; the regression test is the dividend.

That second loop is where AI-augmented QA compounds. Each exploration cycle feeds the regression suite, which shrinks the surface the next exploration has to cover.

⚠️ Common mistakes

  • Running open-ended charters with no time budget. "Find bugs on the site" with no scope or deadline produces hour-long sessions, ballooning costs, and reports padded with non-issues. Always specify a budget (10 minutes is a fine default), an area, and what counts as a finding.
  • Trusting the severity assessments at face value. The model's "Critical" often correlates with "this surprised me" rather than "this loses revenue." Re-grade each finding against your team's real severity rubric before triaging — especially before paging anyone.
  • Skipping the artefact-capture step. A finding without a reproducible trace is half a bug report. The developer can't act on it without re-doing the discovery work, and the AI session has already been billed. Always have the agent save traces and export the report; that's where the value is preserved.
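Re-grading can be as light as a lookup. The sketch below assumes a human has tagged each finding with an impact category and that your team keeps a rubric mapping impact to severity. Both the rubric and the sample findings are hypothetical.

```python
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]

# Hypothetical team rubric: impact category -> severity the team assigns.
RUBRIC = {
    "revenue_loss": "Critical",
    "data_loss": "Critical",
    "blocked_flow": "High",
    "validation_gap": "Medium",
    "cosmetic": "Low",
}


def regrade(findings):
    """Attach the rubric severity to each finding and sort most severe first."""
    regraded = []
    for f in findings:
        team_severity = RUBRIC[f["impact"]]
        regraded.append({
            **f,
            "team_severity": team_severity,
            # Flag disagreements so a reviewer can see where the agent was off.
            "changed": team_severity != f["agent_severity"],
        })
    return sorted(
        regraded,
        key=lambda f: SEVERITY_ORDER.index(f["team_severity"]),
        reverse=True,
    )


findings = [
    {"title": "Emoji in name shows a raw error banner",
     "agent_severity": "Critical", "impact": "cosmetic"},
    {"title": "Cart total ignores quantity above 99",
     "agent_severity": "Medium", "impact": "revenue_loss"},
]
for f in regrade(findings):
    note = " (re-graded)" if f["changed"] else ""
    print(f'{f["team_severity"]}: {f["title"]}{note}')
```

In this example the agent's "Critical" drops to Low and its "Medium" rises to Critical, which is the kind of mismatch to catch before anyone is paged.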

🎯 Practice task

Run a charter on a real area of your app. Allow 30 minutes.

  1. Pick one focused area — signup form, checkout, settings page. Smaller targets produce sharper findings.
  2. Write a charter using the structure above: persona, target, time budget, explicit constraints, reporting format. Spend five minutes on the prompt; it pays back.
  3. Run the session against staging with disposable credentials. Watch the tool calls flow — note when the agent tries something a human tester would also try, and when it tries something a human wouldn't (boundary fuzzing, repeated-submit hammering).
  4. Read the final report. For each finding, ask: real bug? false positive? known limitation? Tag accordingly. Re-grade severity against your team's real rubric.
  5. Convert one finding to a regression test using the prompt patterns from Chapter 3 — the AI-generated reproduction is your seed for a deterministic test. This is the loop that compounds across cycles.
  6. Stretch: run the same charter from a different persona — "You are an attacker probing for input-validation issues" — and compare the findings. Different framings surface different defect classes.

The next lesson zooms in on one specific exploratory use case — bug reproduction — where AI agents shorten triage from hours to minutes.
