Testing AI Systems

Testing an LLM or AI feature is not like testing deterministic software: the same input can produce different outputs, "correct" is a spectrum, and a passing example proves little. This sheet covers how to evaluate AI systems on their own terms. It's the inverse of AI in Testing (using AI to test) — here the AI is the system under test.

Why testing AI is different

Non-determinism — the same prompt can yield different outputs (non-determinism); a single run tells you almost nothing. Evaluate over a dataset and look at rates.
No single right answer — quality is graded (relevant/faithful/safe), not pass/fail. You score outputs, not assert equality.
Hallucination — models state false things confidently (hallucination); plausibility ≠ correctness.
Prompt sensitivity — small wording changes shift behaviour; guard against prompt regression when you edit prompts.
New attack surface — prompt injection and jailbreaks are security issues unique to LLM apps.

What to test

Dimension	Question
Quality / correctness	Are answers relevant, accurate, complete?
Faithfulness (RAG)	Is the answer grounded in the retrieved context, not invented?
Safety	Does it refuse harmful requests and resist jailbreaks?
Robustness	Does it hold up across paraphrases, edge cases, adversarial input?
Consistency	Similar inputs → similar quality, run to run?
Cost / latency	Tokens and response time within budget?

Evaluation methods

Method	How	When
Reference-based	Compare to a known-good answer (exact, similarity, metrics)	You have ground-truth labels
LLM-as-judge	A model scores outputs against a rubric	Scaling subjective quality without humans
Human evaluation	People rate outputs	Ground truth, calibration, high-stakes
Assertion / rule-based	Regex, schema, must-contain/must-not	Format, safety keywords, structure

Build a labelled eval dataset of representative inputs (incl. edge and adversarial cases), run the system over it, and score with one or more methods. The dataset is your test suite; grow it from real failures.

Testing RAG systems

Retrieval-augmented generation has two failure points — retrieval and generation — so test both:

Retrieval: are the right documents fetched? (context precision/recall)
Faithfulness: is the answer supported by the retrieved context, or invented?
Answer relevance: does it actually address the question?

Tools like Ragas and DeepEval provide these RAG metrics out of the box.

Testing agents

Agents (multi-step, tool-using) add behavioural testing on top of output testing:

Does it pick the right tool and call it with valid arguments?
Does it recover from a tool error or bad result?
Does it terminate (no infinite loops) and stay within step/cost budgets?
Trace the full run — agent bugs hide in the steps, not just the final answer.

Safety and red-teaming

Red-team the system with adversarial prompts: jailbreaks, prompt injection, data-exfiltration attempts, harmful-content requests.
Assert it refuses what it should and doesn't leak its system prompt or tools.
Treat safety testing as a first-class, regression-guarded suite — not a one-off.

Observability for LLM apps

Production LLM behaviour drifts and surprises, so trace it: capture prompts, completions, tokens, latency, tool calls and user feedback. Agent observability tools (Langfuse, LangSmith, Phoenix, Laminar) let you debug real failures and mine production traffic for new eval cases.

Evals in CI

Treat evals like tests: run the eval dataset on every prompt/model/code change and gate on score thresholds and regression (don't let a prompt edit silently drop faithfulness). Because runs vary, gate on aggregate scores over the dataset, not a single output. Pin model versions so a provider-side model update doesn't quietly change results.

The tool landscape

Need	Tools
LLM eval frameworks	DeepEval, Ragas, PromptFoo, OpenAI Evals, TruLens, Giskard
Eval + observability platforms	LangSmith, Langfuse, Arize Phoenix, Laminar, Braintrust
App frameworks	LangChain, LlamaIndex
ML lifecycle / experiment tracking	MLflow, Weights & Biases, Great Expectations