What does a layered test strategy look like for an AI system?

Short answer

Use four layers: unit tests for deterministic logic; component tests for each LLM call, with the model mocked so that only the surrounding code is under test; integration tests that run the full pipeline against a golden eval set; and production monitoring with property checks on sampled live traffic. Each layer makes a different trade-off between speed, cost, and confidence.

Detail

The classic test pyramid does not map cleanly onto AI systems because the LLM itself is a black box with non-deterministic output. The adapted pyramid:

Layer 1: Unit tests. Test all deterministic code (parsers, formatters, routers, validators) in isolation. These tests are fast, cheap, and fully repeatable.
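A minimal sketch of a Layer 1 test, assuming a hypothetical parser `parse_label` that pulls a label out of raw model text. The function name, the label set, and the `label: ...` format are illustrative assumptions, not part of the original answer; the point is only that this layer never touches a model.

```python
import re

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical label set


def parse_label(raw: str) -> str:
    """Extract a label from raw model text, or raise ValueError."""
    match = re.search(r"label\s*:\s*(\w+)", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no label found in {raw!r}")
    label = match.group(1).lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label {label!r}")
    return label


def test_parse_label_accepts_known_label():
    assert parse_label("Label: Positive") == "positive"


def test_parse_label_rejects_unknown_label():
    try:
        parse_label("label: maybe")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Because the input is a fixed string, these tests give the same result on every run and can sit in the per-commit suite.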

Layer 2: Component tests with a mocked LLM. Test each component that calls the LLM against a pre-recorded or mocked response. This verifies that your prompt template, response parser, and error handling work correctly for known outputs; it says nothing about model quality. Fast and deterministic.
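One way to sketch a Layer 2 test is with `unittest.mock.Mock` from the Python standard library standing in for the model client. The `summarize` component, its prompt wording, and the JSON shape are hypothetical; a real system would substitute its own component and a recorded response.

```python
import json
from unittest.mock import Mock


def summarize(ticket_text: str, llm) -> dict:
    """Hypothetical component: build a prompt, call the model, parse JSON."""
    prompt = (
        "Summarize this support ticket as JSON with keys "
        f"'summary' and 'priority':\n{ticket_text}"
    )
    raw = llm(prompt)
    fallback = {"summary": "", "priority": "unknown"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Error-handling path: degrade gracefully on malformed output.
        return fallback
    if not isinstance(data, dict):
        return fallback
    return {
        "summary": data.get("summary", ""),
        "priority": data.get("priority", "unknown"),
    }


def test_prompt_and_parsing_with_canned_response():
    llm = Mock(return_value='{"summary": "Login fails", "priority": "high"}')
    result = summarize("I cannot log in", llm)
    assert result == {"summary": "Login fails", "priority": "high"}
    # The prompt template must carry the ticket text through to the model.
    assert "I cannot log in" in llm.call_args.args[0]


def test_malformed_response_falls_back():
    llm = Mock(return_value="Sorry, I can't help with that.")
    assert summarize("anything", llm) == {"summary": "", "priority": "unknown"}
```

The mock pins the model's output, so a failure here points at the prompt template, the parser, or the fallback logic rather than at the model.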

Layer 3: Integration / eval tests with a real LLM. Run the golden eval set through the full pipeline with a live model. This layer is slower and incurs API cost, so run it before each release rather than on every PR.
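A Layer 3 run might look roughly like the sketch below. The golden cases, the substring-match criterion, and the 0.9 pass-rate threshold are all assumptions chosen for illustration; `pipeline` is any callable that wraps the real end-to-end system and calls the live model.

```python
from typing import Callable

# Hypothetical golden set: each case pairs an input with text the output must contain.
GOLDEN_SET = [
    {"input": "What is 2 + 2?", "must_contain": "4"},
    {"input": "Name the capital of France.", "must_contain": "Paris"},
]


def run_eval(pipeline: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of golden cases whose output contains the expected text."""
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in pipeline(case["input"]).lower()
    )
    return passed / len(cases)


def release_gate(
    pipeline: Callable[[str], str], cases: list[dict], threshold: float = 0.9
) -> bool:
    """Gate on the aggregate pass rate, since individual outputs vary between runs."""
    return run_eval(pipeline, cases) >= threshold
```

Gating on a pass rate rather than on exact outputs is the design choice that lets this layer tolerate non-deterministic responses.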

Layer 4: Production monitoring. Run property checks on sampled live traffic, score a daily sample with an LLM-as-judge, and alert when the aggregate score drops.
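The monitoring layer can be sketched as three small pieces: sampling, property checks, and an alert rule. The record shape, the specific properties (non-empty, at most 2000 characters), and the 0.05 drop tolerance are hypothetical, and the judge scores are assumed to be produced elsewhere by an LLM-as-judge step that is not shown.

```python
import random
from statistics import mean


def sample_traffic(records: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]


def property_checks(record: dict) -> bool:
    """Cheap checks that need no reference answer (hypothetical properties)."""
    output = record["output"]
    return bool(output.strip()) and len(output) <= 2000


def should_alert(
    daily_scores: list[float], baseline: float, max_drop: float = 0.05
) -> bool:
    """Alert when the day's mean judge score falls more than `max_drop` below baseline."""
    return bool(daily_scores) and baseline - mean(daily_scores) > max_drop
```

Alerting on the mean of a daily sample, not on single responses, keeps one unusual output from triggering an alert.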

See New test pyramid for AI for the full model.

// WHAT INTERVIEWERS LOOK FOR

Naming all four layers: unit, component (mocked LLM), integration (real LLM + eval set), production monitoring. Explaining the speed/cost trade-off per layer. Knowing that mocked-LLM tests cover the surrounding code, not model quality.