Q18 of 21 · Testing AI systems
What does an eval harness look like and how do you build a minimal one?
Short answer
An eval harness loads a golden eval set, runs each input through the system under test, evaluates each output against defined criteria (property checks, LLM-as-judge, or reference comparison), and produces a structured report of pass rates, score distributions, and per-example breakdowns.
Detail
A minimal eval harness has five components.
1. Eval dataset loader: reads the golden set from JSON or CSV (input, optional ground truth, optional human rating).
2. System under test adapter: calls the actual LLM pipeline (not a mock) and captures the full output including metadata (latency, token usage).
3. Evaluators: a set of functions that score each output, covering property checks (fast, deterministic), reference comparison (exact match or embedding similarity), and optionally LLM-as-judge.
4. Results aggregator: computes aggregate metrics (pass rate, mean score, per-category breakdowns) and produces a diff against the stored baseline.
5. Report generator: structured JSON or HTML output with per-example details so failures can be reviewed and root-caused.
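The five components can be sketched in a few dozen lines of Python. This is a minimal illustration, not a reference implementation: every name here (load_golden_set, run_system, evaluate, aggregate, build_report), the golden-set schema, and the baseline format are assumptions made for the example. The adapter in particular is a stand-in that upper-cases its input so the sketch runs offline; in a real harness it would call the actual LLM pipeline.

```python
import json
import time

# Hypothetical golden set: a JSON array of examples with optional ground truth.
GOLDEN_JSON = """[
  {"id": "ex1", "input": "hello", "category": "greeting", "expected": "HELLO"},
  {"id": "ex2", "input": "bye", "category": "greeting", "expected": "Goodbye"},
  {"id": "ex3", "input": "status?", "category": "query"}
]"""

# 1. Eval dataset loader: parse the golden set into a list of dicts.
def load_golden_set(text):
    return json.loads(text)

# 2. System-under-test adapter: returns the output plus metadata (latency).
#    Placeholder only; a real adapter calls the actual pipeline, not a stub.
def run_system(user_input):
    start = time.perf_counter()
    output = user_input.strip().upper()  # stand-in for the real pipeline call
    return {"output": output, "latency_s": time.perf_counter() - start}

# 3. Evaluators: each returns a score in [0, 1], or None when not applicable.
def non_empty(example, result):
    # Property check: the output must contain something other than whitespace.
    return 1.0 if result["output"].strip() else 0.0

def exact_match(example, result):
    # Reference comparison: skipped when the example has no ground truth.
    if "expected" not in example:
        return None
    return 1.0 if result["output"] == example["expected"] else 0.0

EVALUATORS = {"non_empty": non_empty, "exact_match": exact_match}

def evaluate(example):
    result = run_system(example["input"])
    scores = {}
    for name, evaluator in EVALUATORS.items():
        score = evaluator(example, result)
        if score is not None:
            scores[name] = score
    return {
        "id": example["id"],
        "category": example.get("category", "uncategorised"),
        "output": result["output"],
        "latency_s": result["latency_s"],
        "scores": scores,
        # An example passes only if every applicable evaluator scored 1.0.
        "passed": all(score == 1.0 for score in scores.values()),
    }

# 4. Results aggregator: overall and per-category pass rates, plus a diff
#    against a stored baseline when one is supplied.
def aggregate(rows, baseline=None):
    by_category = {}
    for row in rows:
        by_category.setdefault(row["category"], []).append(row["passed"])
    summary = {
        "pass_rate": sum(r["passed"] for r in rows) / len(rows),
        "mean_latency_s": sum(r["latency_s"] for r in rows) / len(rows),
        "by_category": {c: sum(f) / len(f) for c, f in by_category.items()},
    }
    if baseline is not None:
        summary["pass_rate_delta"] = summary["pass_rate"] - baseline["pass_rate"]
    return summary

# 5. Report generator: structured JSON with summary and per-example details.
def build_report(rows, baseline=None):
    report = {"summary": aggregate(rows, baseline), "examples": rows}
    return json.dumps(report, indent=2)

rows = [evaluate(example) for example in load_golden_set(GOLDEN_JSON)]
print(build_report(rows, baseline={"pass_rate": 1.0}))
```

With the stand-in adapter, ex2 fails its exact-match check, so the report shows a pass rate of two out of three and a negative delta against the assumed baseline of 1.0. The per-example rows keep the output and individual scores, which is what makes a failure reviewable. An LLM-as-judge evaluator would slot into EVALUATORS with the same signature.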
Existing tools (promptfoo, LangSmith evaluations, Braintrust) provide this out of the box. Build your own only if you need tight integration with proprietary evaluation criteria or data handling requirements.
Run the harness pre-release and on a daily schedule against production traffic samples. The harness is a first-class software artefact: version-controlled and maintained alongside the application. See "Eval platforms and tooling" and "Eval platform decision".