AI in Testing

How AI is actually used in software testing today, what each approach is good for, and the one discipline that separates useful AI testing from dangerous AI testing: review the output. This page covers using AI to test software; testing AI systems themselves is a separate topic.

The approaches at a glance

| Approach | What the AI does | Where it helps | The catch |
|---|---|---|---|
| Test generation | Drafts test cases from specs/code/NL | Speed, breadth, coverage of cases you'd miss | Generates plausible-but-wrong tests; must review |
| Self-healing | Repairs locators when the UI changes | Cuts brittle-test maintenance | A "healed" test may no longer assert what you meant |
| Codeless / NL authoring | Turns plain English into tests | Lets non-coders automate | Ambiguous phrasing → misinterpreted tests |
| Visual AI | Compares renderings like a human eye | Suppresses pixel-diff false positives | Still needs baseline approval |
| Agentic | Autonomous agents plan and run tests | Handles open-ended testing tasks | Non-deterministic; output is a draft, not a verdict |

The single thread running through every row: AI changes how fast you produce tests, not whether they're correct. Correctness is still your job.

Test generation — the most common use

AI (usually an LLM) drafts test cases from a user story, an API spec, or existing code. It's genuinely useful for breadth — it'll suggest edge cases a tired human skips.

Prompt: "Given this acceptance criterion, list test scenarios covering
happy path, negative cases, and edge cases: [criterion]"
-> AI returns 15-20 candidate scenarios in seconds.

The workflow that makes it safe:

  1. Use the AI output as a draft checklist, not a finished plan.
  2. Cut scenarios that don't reflect real risk in your app.
  3. Add domain-specific cases the model couldn't know (undocumented business rules).
  4. Only then turn the survivors into actual tests.
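The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's API: the `Scenario` class, the `triage` function, and the sample scenarios are all hypothetical, and the `reflects_real_risk` flag is assumed to be set by a human reviewer rather than by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    title: str
    source: str               # "ai" or "human"
    reflects_real_risk: bool  # decided by a human reviewer (assumption)


def triage(ai_drafts, domain_cases):
    """Keep only AI drafts a reviewer marked as real risk, then add human cases."""
    kept = [s for s in ai_drafts if s.reflects_real_risk]  # step 2: cut
    return kept + list(domain_cases)                       # step 3: add


# Step 1: the AI output is a draft checklist (hypothetical examples).
ai_drafts = [
    Scenario("upload succeeds for a valid file", "ai", True),
    Scenario("upload handles a 10 TB file", "ai", False),
    Scenario("upload fails gracefully when the server is down", "ai", True),
]
# Cases the model could not know about (placeholder title).
domain_cases = [Scenario("undocumented business rule (placeholder)", "human", True)]

# Step 4: only the survivors become real tests.
plan = triage(ai_drafts, domain_cases)
for s in plan:
    print(f"[{s.source}] {s.title}")
```

The point of the design is that the keep/cut decision lives in a field a person fills in, so nothing the model produced reaches the plan without review.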

The failure mode is treating generated tests as correct because they're plausible. AI doesn't know your app's real behaviour: a generated scenario such as "upload fails gracefully when the server is down" reads just as confidently as "upload succeeds," but the model cannot know whether your error handling actually behaves that way. Only checking against the real application tells you.

Self-healing — useful, with a sharp edge

Self-healing tools (Testim, mabl, Functionize) re-identify UI elements when selectors change, so tests don't break on every cosmetic tweak. This genuinely cuts maintenance.

The sharp edge: a healed test can pass while asserting the wrong thing. If the AI re-binds to a different element than intended, the test stays green but no longer tests what you meant. The discipline: review healed steps, and treat a sudden drop in failures with the same suspicion as a sudden rise.
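One way to make that discipline concrete is sketched below. It assumes your tool can export a snapshot of the element a step was bound to before and after healing; the `ElementSnapshot` fields, both function names, and the 50% threshold are illustrative assumptions, not features of Testim, mabl, or Functionize.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementSnapshot:
    tag: str
    text: str
    role: str


def review_priority(before: ElementSnapshot, after: ElementSnapshot) -> str:
    """Every healed step gets reviewed; semantic changes go to the front of the queue."""
    return "high" if before != after else "normal"


def suspicious_drop(prev_failures: int, curr_failures: int, threshold: float = 0.5) -> bool:
    """True when failures fell by at least `threshold` (a fraction) between two runs."""
    if prev_failures == 0:
        return False
    return (prev_failures - curr_failures) / prev_failures >= threshold


# Hypothetical healed step: same tag and role, but a different button.
before = ElementSnapshot(tag="button", text="Submit order", role="button")
after = ElementSnapshot(tag="button", text="Cancel order", role="button")
priority = review_priority(before, after)

# Hypothetical failure counts from two consecutive runs.
big_drop = suspicious_drop(prev_failures=20, curr_failures=4)
small_drop = suspicious_drop(prev_failures=20, curr_failures=18)
print(priority, big_drop, small_drop)
```

A healed step whose snapshot is unchanged still lands in the review queue at normal priority, because an identical tag, text, and role does not prove the step still asserts what you meant.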

Codeless / natural-language authoring

Tools like testRigor and Testsigma let you write tests in plain English. The win is accessibility — manual testers automate without code. The catch is ambiguity: natural language is imprecise, so phrasing that reads fine to you can be misinterpreted by the AI. Keep statements specific and verify the interpreted test does what you intended.
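As a rough illustration of "keep statements specific", the sketch below flags plain-English steps that contain vague words before they reach the tool. The word list and the `vague_terms_in` function are hypothetical and deliberately crude; this is not how testRigor or Testsigma interpret steps, and it does not replace checking the interpreted test.

```python
import re

# Illustrative list only; tune it to the phrasing your team actually uses.
VAGUE_TERMS = {"it", "something", "somewhere", "properly", "correctly", "appropriate", "etc"}


def vague_terms_in(step: str) -> list:
    """Return the vague words found in a plain-English test step, sorted."""
    words = re.findall(r"[a-z]+", step.lower())
    return sorted(set(words) & VAGUE_TERMS)


ambiguous = vague_terms_in("Click it and check it works properly")
specific = vague_terms_in("Click the 'Save draft' button in the toolbar")
print(ambiguous)  # ['it', 'properly']
print(specific)   # []
```

A clean result only means the step avoided a few known-vague words; you still need to read the interpreted test and confirm it targets the element and outcome you intended.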

Agentic testing — the newest, least settled

Agentic tools (TestZeus/Hercules, Agentic QE Framework) use autonomous AI agents that plan and execute testing tasks with minimal scripting. They're powerful for open-ended exploration, but they are non-deterministic — the same task can produce different runs. Treat agent output as a capable draft requiring review, never as an authoritative verdict, and watch for run-to-run variability when triaging failures.
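Run-to-run variability can be measured rather than guessed at. The sketch below assumes you have repeated the same agent task several times and recorded a verdict string for each run; `verdict_stability` and the sample data are hypothetical and not tied to any particular agentic tool.

```python
from collections import Counter


def verdict_stability(verdicts):
    """Return (most common verdict, fraction of runs that agree with it)."""
    if not verdicts:
        raise ValueError("need at least one run")
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / len(verdicts)


# Hypothetical verdicts from five runs of the same agent task.
runs = ["pass", "pass", "fail", "pass", "pass"]
verdict, agreement = verdict_stability(runs)

# Anything short of full agreement goes to a human before it counts as a result.
needs_human_triage = agreement < 1.0
print(verdict, agreement, needs_human_triage)  # pass 0.8 True
```

Full agreement across runs still does not make the verdict authoritative; it only tells you the disagreement you would otherwise have to triage is absent.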

The one rule for all of it

Every AI testing approach is a force multiplier on throughput, not on judgement. It makes you faster at writing, healing, and exploring; it does not make the result correct. So:

  • Review AI-generated tests before trusting them.
  • Confirm self-healed tests still assert the original intent.
  • Verify NL tests were interpreted as you meant.
  • Approve visual baselines deliberately.
  • Treat agentic output as a draft.

An unreviewed AI test that passes proves nothing. The teams that get value from AI testing are the ones that keep a human in the loop on correctness while letting AI absorb the toil.
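The idea that an unreviewed AI test proves nothing can be enforced mechanically. The following is a minimal sketch of a release gate that ignores passing results from AI-generated tests nobody has reviewed; the `Result` fields and the `gate` function are assumptions for illustration, not part of any test runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    passed: bool
    ai_generated: bool
    human_reviewed: bool


def gate(results):
    """Return (ok, ignored_names). Unreviewed AI-generated tests never count."""
    counted = [r for r in results if not r.ai_generated or r.human_reviewed]
    ignored = [r.name for r in results if r.ai_generated and not r.human_reviewed]
    # An empty set of counted tests is not evidence, so it does not pass the gate.
    ok = bool(counted) and all(r.passed for r in counted)
    return ok, ignored


# Hypothetical results from one run.
results = [
    Result("login_ok", passed=True, ai_generated=False, human_reviewed=False),
    Result("upload_edge", passed=True, ai_generated=True, human_reviewed=False),
    Result("checkout_total", passed=True, ai_generated=True, human_reviewed=True),
]
ok, ignored = gate(results)
print(ok, ignored)  # True ['upload_edge']
```

Reporting the ignored names alongside the verdict keeps the review backlog visible instead of letting green unreviewed tests quietly pad the numbers.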

Where AI testing falls short (test these yourself)

  • Undocumented business rules — not in the spec, so not in the AI's output.
  • Real correctness — the AI judges plausibility, not your app's actual behaviour.
  • Non-determinism — agentic/LLM-backed runs can vary; flakiness can hide here.
  • Data sensitivity — some tools send code/requests to an AI service; check policy.

Quick checklist for adopting an AI testing tool

  • Clear on which approach it is (generation / self-healing / codeless / visual / agentic)
  • A human reviews AI-generated tests before they gate anything
  • Self-healed steps are reviewed, not blindly trusted
  • Generated/NL tests are checked against real, domain-specific risk
  • Non-determinism accounted for in agentic/LLM-backed tools
  • Data/privacy policy allows sending code or requests to the AI service
  • The tool augments the team's judgement rather than replacing it