What does an eval harness look like and how do you build a minimal one?

Accepted Answer

An eval harness loads a golden eval set, runs each input through the system under test, evaluates each output against defined criteria (property checks, LLM-as-judge, or reference comparison), and produces a structured report of pass rates, score distributions, and per-example breakdowns.

A minimal eval harness has five components:

Eval dataset loader: reads the golden set from JSON or CSV, where each record holds an input, optional ground truth, and an optional human rating.

System under test adapter: calls the actual LLM pipeline (not a mock) and captures the full output, including metadata (latency, token usage).

Evaluators: a set of functions that score each output, covering property checks (fast, deterministic), reference comparison (exact match or embedding similarity), and optionally LLM-as-judge.

Results aggregator: computes aggregate metrics (pass rate, mean score, per-category breakdowns) and produces a diff against the stored baseline.

Report generator: structured JSON or HTML output with per-example details so fail
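The five components can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: every name here (`Example`, `load_examples`, `run_system`, `evaluate`, `toy_system`) is hypothetical, the golden set is an inline JSON string rather than a file, and `toy_system` is only a stand-in so the sketch runs on its own. In a real harness the adapter would call the actual pipeline and also record token usage. LLM-as-judge, embedding similarity, and HTML output are left out.

```python
from __future__ import annotations

import json
import time
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class Example:
    id: str
    input: str
    expected: Optional[str] = None  # optional ground truth
    category: str = "default"


def load_examples(json_text: str) -> list[Example]:
    """Eval dataset loader: parse a JSON array of example objects."""
    return [Example(**row) for row in json.loads(json_text)]


def run_system(system: Callable[[str], str], example: Example) -> dict:
    """System-under-test adapter: call the pipeline and capture latency."""
    start = time.perf_counter()
    output = system(example.input)
    return {"output": output, "latency_s": time.perf_counter() - start}


# Evaluators: each returns a score in [0, 1].
def non_empty(example: Example, output: str) -> float:
    """Property check: the output must contain non-whitespace text."""
    return 1.0 if output.strip() else 0.0


def exact_match(example: Example, output: str) -> float:
    """Reference comparison; passes trivially when no ground truth exists."""
    if example.expected is None:
        return 1.0
    return 1.0 if output.strip() == example.expected.strip() else 0.0


EVALUATORS = {"non_empty": non_empty, "exact_match": exact_match}


def evaluate(system, examples, baseline=None) -> dict:
    """Results aggregator and report generator in one step."""
    rows = []
    by_category = defaultdict(list)
    for ex in examples:
        result = run_system(system, ex)
        scores = {name: fn(ex, result["output"]) for name, fn in EVALUATORS.items()}
        passed = all(score == 1.0 for score in scores.values())
        rows.append({**asdict(ex), **result, "scores": scores, "passed": passed})
        by_category[ex.category].append(passed)

    n = len(rows)
    pass_rate = sum(r["passed"] for r in rows) / n if n else 0.0
    report = {
        "pass_rate": pass_rate,
        "mean_scores": {
            name: sum(r["scores"][name] for r in rows) / n for name in EVALUATORS
        } if n else {},
        "per_category": {c: sum(v) / len(v) for c, v in by_category.items()},
        "examples": rows,  # per-example details for debugging
    }
    if baseline is not None:
        # Diff against a stored baseline report (assumed to hold "pass_rate").
        report["pass_rate_delta"] = pass_rate - baseline["pass_rate"]
    return report


def toy_system(prompt: str) -> str:
    """Placeholder for the real pipeline; deliberately wrong on one input."""
    answers = {"2+2": "4", "capital of France": "Lyon"}
    return answers.get(prompt, "")


GOLDEN = """[
  {"id": "a", "input": "2+2", "expected": "4", "category": "math"},
  {"id": "b", "input": "capital of France", "expected": "Paris", "category": "geo"}
]"""

report = evaluate(toy_system, load_examples(GOLDEN), baseline={"pass_rate": 1.0})
summary = {k: v for k, v in report.items() if k != "examples"}
print(json.dumps(summary, indent=2))
```

Keeping evaluators as plain functions in a dictionary means adding a new check is a one-line change, and the report stays JSON-serializable so it can be stored as the next baseline.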

// WHAT INTERVIEWERS LOOK FOR