// AI · Non-Deterministic

Testing in the age of AI agents.

Two distinct problems get lumped under "AI testing" and they are not the same problem. Pick the path that matches what you're shipping, or browse by category below.

> reviewed = 2026-05-18 · 37 guides · 8 paths · refresh quarterly


$ what_are_you_trying_to_do --pick-one

path_05 · ai for test docs · I need AI to help with test documentation · 4 guides
path_03 · ai for test data · I need better test data at scale · 5 guides
path_04 · ai in ci/cd · I want AI in my CI/CD pipeline · 5 guides
path_06 · ai for automation scripting · I want AI to write my automation scripts · 5 guides
path_01 · most common · My product ships an AI feature · 5 guides
path_02 · testing with ai agents · I want AI agents to do the testing · 6 guides
path_07 · testing the model itself · I need to test AI models for bias, drift, or fairness · 4 guides
path_08 · governance & compliance · I need to defend AI in front of a regulator · 4 guides

37 guides total · fresh 32 · monitoring 5

// the headline insight

When output is probabilistic, the test pyramid tilts. Unit tests at the base shrink. Evaluation rises to take their place.

// path_05 · ai for test docs · 4 guides · 32m · reviewed May 2026

AI for test documentation

Test plans, acceptance criteria, traceability matrices — the documentation overhead that quietly consumes a third of every sprint. AI is uniquely well-suited here because the output is reviewed by humans before it matters, the inputs are unstructured prose, and the cost of a mediocre draft is low.

AI-generated test plans ● 8m

From acceptance criteria to test plan, then human review.

Requirements → test cases with AI ● 9m

Patterns for turning a user story into a useful set of test cases without the LLM hallucinating tests for features that do not exist.

AI-assisted traceability matrices ● 8m

Mapping requirements to test coverage at scale — and keeping the map fresh when requirements churn.

AI-augmented bug reports ● 7m

From "it's broken" to a reproducible defect report with logs, environment, and likely root cause.

// path_03 · ai for test data · 5 guides · 47m · reviewed May 2026

AI for test data generation

Realistic test data is one of the harder problems in QA: production data is sensitive, and manually crafted data misses edge cases. Synthetic data tooling has matured fast, and AI is now the default approach for generating users, edge cases, adversarial inputs, and PII-safe substitutes that preserve shape without leaking real customer data.

Synthetic test data with LLMs ● 11m

Generate realistic users, edge cases, and adversarial inputs without touching production data.

Edge case discovery with AI ● 9m

Boundary-value generation that catches what manual case design misses.

PII-safe synthetic data ▲ 10m

Synthetic data that preserves shape without leaking real users — for regulated industries.

AI-augmented test data management ● 8m

From flat fixtures to AI-managed data sets — versioning, drift, refresh cadence.

AI for data quality validation ● 9m

Detecting schema drift, anomalies, and contract violations in test data pipelines.

// path_04 · ai in ci/cd · 5 guides · 50m · reviewed May 2026

AI in CI/CD

The five highest-leverage places to add AI to a CI/CD pipeline: predictive test selection, flaky-test classification, failure triage, risk-based run ordering, and AI-generated tests on PR open. Each is a distinct problem with a distinct tooling landscape.

Intelligent test selection ● 10m

Models that pick which tests to run for a given change — coverage, cost, and false negatives.

AI for flaky test detection ● 9m

Classifying genuine failures vs intermittent infrastructure noise — and managing the quarantine debt.

AI-driven failure triage and root-cause analysis ● 11m

From red CI run to actionable diagnosis — what the model actually produces, and how to read it correctly.

Risk-based test prioritisation with AI ▲ 8m

AI models that order test runs by predicted business risk, not alphabetical order.

AI test generation on pull-request ▲ 12m

When the PR opens, the agent writes the tests. What it produces, where it fails, and the review discipline required.

// path_06 · ai for automation scripting · 5 guides · 51m · reviewed May 2026

AI for automation scripting

The most-cited GenAI use case in QA — 63% of practitioners in WQR 2025-26. The question isn't whether to use Copilot, Cursor, or Claude Code to write Playwright. It's where they reliably win, where they reliably fail, and what the prompt patterns look like when you're 6 months into using them daily.

AI-generated automation scripts ▲ 12m

When Claude/Copilot/Cursor author Playwright/Cypress, and where they still need a human in the loop.

Prompt patterns for test authoring ● 10m

Reusable prompt structures that produce maintainable test code, not throwaway scripts.

MCP and Agent Skills for testing workflows ● 9m

Beyond Playwright MCP — connecting Jira, GitHub, TestRail to the coding agent for full-workflow automation.

AI-driven refactoring of test suites ● 9m

Selectors, page objects, fixtures — what coding agents can refactor cleanly and what they break.

Legacy-to-modern migrations with AI ▲ 11m

Selenium → Playwright, Java → TypeScript, etc. What an AI-driven migration pipeline actually looks like and where it fails.

// path_01 · most common · 5 guides · 44m · reviewed May 2026

Testing AI features in your product

The test pyramid changes shape when output is non-deterministic. Exact-match assertions break. The evaluation layer — curated datasets, rubric scoring, LLM-as-judge — rises to fill the gap. Most teams discover this the hard way, months into a project.
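A minimal sketch of why exact-match assertions break and what rubric scoring does instead. Everything here is illustrative: `RubricCheck`, `rubric_score`, the refund-policy rubric, and both sample strings are hypothetical, and the checks are simple string rules rather than an LLM-as-judge call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCheck:
    """One named criterion; `passes` returns True when the output meets it."""
    name: str
    passes: Callable[[str], bool]


def rubric_score(output: str, checks: List[RubricCheck]) -> float:
    """Return the fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    if not checks:
        raise ValueError("rubric needs at least one check")
    passed = sum(1 for check in checks if check.passes(output))
    return passed / len(checks)


# Hypothetical rubric for a refund-policy answer (assumed criteria, not from any guide).
REFUND_RUBRIC = [
    RubricCheck(
        "mentions 30-day window",
        lambda s: re.search(r"\b30[- ]days?\b", s, re.IGNORECASE) is not None,
    ),
    RubricCheck("mentions refund", lambda s: "refund" in s.lower()),
    RubricCheck("stays concise", lambda s: len(s.split()) <= 40),
]

expected = "You can request a refund within 30 days of purchase."
candidate = "Refunds are available for 30 days after you buy."

# The wording differs, so a classic equality assertion rejects a correct answer.
exact_match = candidate == expected

# The rubric scores the same answer on what it says, not how it says it.
score = rubric_score(candidate, REFUND_RUBRIC)

print(f"exact match: {exact_match}, rubric score: {score:.2f}")
```

In practice a suite would assert on a threshold (for example `score >= 0.8`) across a curated dataset instead of on a single string, and some checks would be delegated to a judge model or a human reviewer.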

The new test pyramid ● 8m

Where exact-match assertions break and rubric scoring takes over.

Evaluation methods for AI features ● 10m

Golden datasets · LLM-as-judge · human-in-the-loop. Three methods that work together, not alternatives.

Failure modes you must catch ▲ 12m

Hallucination · jailbreak · prompt injection · data leakage. AI-specific bugs that cost more than typical defects.

RAG, agents, and observability ● 14m

Three production surfaces. Different failure modes. Different instrumentation.

External resources for AI testing → browse

Curated links to canonical guides, papers, frameworks beyond qa.codes.

// path_02 · testing with ai agents · 6 guides · 66m · reviewed May 2026

Using AI agents to test

An AI agent driving a real browser session is doing testing work — it decides what to click, observes what happened, and iterates toward a goal. An AI coding assistant that generates test code is helping you do testing work. These are categorically different architectures.
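The decide, observe, iterate loop can be sketched in a few lines. This is a toy under stated assumptions: `FakePage` stands in for a real browser session, `scripted_policy` stands in for the model call, and none of these names come from any real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser session: a two-step login flow."""
    state: str = "login_form"
    log: List[str] = field(default_factory=list)

    def observe(self) -> str:
        """Return what the agent can currently see."""
        return self.state

    def act(self, action: str) -> None:
        """Apply an action; unknown or out-of-order actions leave the state unchanged."""
        self.log.append(action)
        if self.state == "login_form" and action == "fill_credentials":
            self.state = "credentials_filled"
        elif self.state == "credentials_filled" and action == "click_submit":
            self.state = "dashboard"


def scripted_policy(observation: str) -> Optional[str]:
    """Placeholder for the model: maps an observation to the next action, or None."""
    plan = {
        "login_form": "fill_credentials",
        "credentials_filled": "click_submit",
    }
    return plan.get(observation)


def run_agent(
    page: FakePage,
    policy: Callable[[str], Optional[str]],
    goal: str,
    max_steps: int = 5,
) -> bool:
    """Observe, decide, act, and repeat until the goal state or the step budget is hit."""
    for _ in range(max_steps):
        observation = page.observe()
        if observation == goal:
            return True
        action = policy(observation)
        if action is None:  # the policy has no idea what to do next
            return False
        page.act(action)
    return page.observe() == goal


page = FakePage()
reached = run_agent(page, scripted_policy, goal="dashboard")
print(reached, page.log)
```

The contrast with generated test code is that nothing here is a fixed script: the sequence of actions is chosen at run time from what the agent observes, which is also why step budgets and per-step cost matter.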

Agentic testing: what it is, what it isn't ● 9m

Why agent-driven ≠ AI-generated tests. The 200-test floor and the 30,000-token-per-test ceiling.

Playwright MCP vs Stagehand vs Browser Use vs Computer Use ● 15m

Four production-grade stacks dominate 2026 browser automation for AI agents. Honestly compared, no vendor bias.

MCP vs CLI+SKILLs: when each pattern wins ● 8m

Microsoft now recommends CLI+SKILLs over MCP for coding agents. Here's why, and when MCP is still right.

Braintrust vs Langfuse vs Laminar vs Arize Phoenix ● 13m

The eval-and-observability space fragmented in 2026 around four philosophies. Match the tool to the workflow that hurts.

Cost, latency, and caching for agent-driven tests ▲ 10m

Token budgets, model selection, action caching — the economics that decide whether agentic testing pays off.

Agentic testing in production: case studies ● 11m

What we actually know about teams shipping agentic test suites, read critically.

// path_07 · testing the model itself · 4 guides · 43m · reviewed May 2026

Testing the AI model itself

Distinct from testing AI features in your product — this band is about validating the model as an artefact: accuracy, bias, fairness, drift, robustness, evaluation frameworks. The audience extends beyond traditional QA into ML engineering and red-teaming.

Evaluating AI models ● 11m

Benchmarks, frameworks, and the difference between "model passed eval" and "model is good enough to ship."

Eval platforms and tooling ▲ 12m

What's actually available in 2026 — and the category shift no one's talking about openly.

Red-teaming and adversarial evaluation ● 10m

Where AI models break under deliberate pressure — and what a useful red-team session actually looks like.

Bias and fairness testing ● 10m

The four fairness metrics that practitioners actually need, why you cannot satisfy all of them simultaneously, and the tooling that has stayed stable.

// path_08 · governance & compliance · 4 guides · 40m · reviewed May 2026

AI governance, compliance, and red-teaming

EU AI Act bias-monitoring obligations land 2 August 2026. NIST AI RMF adoption is accelerating. ISO 42001 audit pressure is real. This band covers the QA practitioner side of AI governance — what to test, what to document, what to defend.

NIST AI RMF for QA practitioners ● 10m

The Risk Management Framework's four functions — GOVERN, MAP, MEASURE, MANAGE — and what each means for testing teams.

Audit trails, model cards, and datasheets ● 9m

The paperwork side of AI compliance — what to produce, when, and how to keep it fresh as the model retrains.

Setting up an internal AI red-team ▲ 10m

Beyond external red-teaming — what a small internal QA-led red-team looks like, what cadence, what scope.

AI procurement and supplier audits ● 11m

What to ask of AI suppliers when your product embeds someone else's model — and the framework-first posture that survives regulatory churn.