Testing AI Systems
Testing an LLM or AI feature is not like testing deterministic software: the same input can produce different outputs, "correct" is a spectrum, and a passing example proves little. This sheet covers how to evaluate AI systems on their own terms. It's the inverse of AI in Testing (using AI to test) — here the AI is the system under test.
Why testing AI is different
- Non-determinism — the same prompt can yield different outputs (non-determinism); a single run tells you almost nothing. Evaluate over a dataset and look at rates.
- No single right answer — quality is graded (relevant/faithful/safe), not pass/fail. You score outputs, not assert equality.
- Hallucination — models state false things confidently (hallucination); plausibility ≠ correctness.
- Prompt sensitivity — small wording changes shift behaviour; guard against prompt regression when you edit prompts.
- New attack surface — prompt injection and jailbreaks are security issues unique to LLM apps.
What to test
| Dimension | Question |
|---|---|
| Quality / correctness | Are answers relevant, accurate, complete? |
| Faithfulness (RAG) | Is the answer grounded in the retrieved context, not invented? |
| Safety | Does it refuse harmful requests and resist jailbreaks? |
| Robustness | Does it hold up across paraphrases, edge cases, adversarial input? |
| Consistency | Similar inputs → similar quality, run to run? |
| Cost / latency | Tokens and response time within budget? |
Evaluation methods
| Method | How | When |
|---|---|---|
| Reference-based | Compare to a known-good answer (exact, similarity, metrics) | You have ground-truth labels |
| LLM-as-judge | A model scores outputs against a rubric | Scaling subjective quality without humans |
| Human evaluation | People rate outputs | Ground truth, calibration, high-stakes |
| Assertion / rule-based | Regex, schema, must-contain/must-not | Format, safety keywords, structure |
Build a labelled eval dataset of representative inputs (incl. edge and adversarial cases), run the system over it, and score with one or more methods. The dataset is your test suite; grow it from real failures.
Testing RAG systems
Retrieval-augmented generation has two failure points — retrieval and generation — so test both:
- Retrieval: are the right documents fetched? (context precision/recall)
- Faithfulness: is the answer supported by the retrieved context, or invented?
- Answer relevance: does it actually address the question?
Tools like Ragas and DeepEval provide these RAG metrics out of the box.
Testing agents
Agents (multi-step, tool-using) add behavioural testing on top of output testing:
- Does it pick the right tool and call it with valid arguments?
- Does it recover from a tool error or bad result?
- Does it terminate (no infinite loops) and stay within step/cost budgets?
- Trace the full run — agent bugs hide in the steps, not just the final answer.
Safety and red-teaming
- Red-team the system with adversarial prompts: jailbreaks, prompt injection, data-exfiltration attempts, harmful-content requests.
- Assert it refuses what it should and doesn't leak its system prompt or tools.
- Treat safety testing as a first-class, regression-guarded suite — not a one-off.
Observability for LLM apps
Production LLM behaviour drifts and surprises, so trace it: capture prompts, completions, tokens, latency, tool calls and user feedback. Agent observability tools (Langfuse, LangSmith, Phoenix, Laminar) let you debug real failures and mine production traffic for new eval cases.
Evals in CI
Treat evals like tests: run the eval dataset on every prompt/model/code change and gate on score thresholds and regression (don't let a prompt edit silently drop faithfulness). Because runs vary, gate on aggregate scores over the dataset, not a single output. Pin model versions so a provider-side model update doesn't quietly change results.
The tool landscape
| Need | Tools |
|---|---|
| LLM eval frameworks | DeepEval, Ragas, PromptFoo, OpenAI Evals, TruLens, Giskard |
| Eval + observability platforms | LangSmith, Langfuse, Arize Phoenix, Laminar, Braintrust |
| App frameworks | LangChain, LlamaIndex |
| ML lifecycle / experiment tracking | MLflow, Weights & Biases, Great Expectations |
Quick AI-testing checklist
- A labelled eval dataset of representative + edge + adversarial inputs
- Evaluation method chosen per dimension (reference / LLM-judge / human / rules)
- Scored over the dataset as rates, not judged from single runs
- RAG: retrieval, faithfulness and answer-relevance tested separately
- Agents: tool choice, error recovery, termination and traces checked
- Safety/red-team suite for jailbreaks and prompt injection
- Tracing/observability capturing prompts, outputs, tokens, feedback
- Evals run in CI, gating on thresholds + regression
- Model versions pinned so provider updates don't silently change results
- Eval set grows from real production failures