Eval harness

AI & LLM Testing

// Definition

Software that runs an LLM-backed system against a dataset of inputs, scores the outputs against criteria (exact match, similarity, LLM-as-judge, custom rubric), and tracks how scores change across model versions, prompts, or code changes. Eval harnesses are to AI features what test runners are to deterministic code: the place CI calls into, the place regressions get caught, the place quality is measured rather than asserted. The 2026 ecosystem has fragmented rather than consolidated: Braintrust is eval-first, Langfuse is prompt-first (acquired by ClickHouse in January), Laminar is built for agent debugging, and Arize Phoenix is OpenTelemetry-native. Most teams pick one platform per workflow rather than expecting one tool to cover everything.
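The run, score, and compare loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any of the named platforms work: `fake_system` is a hypothetical stand-in for the LLM-backed system, the dataset and the 0.02 tolerance are invented for the example, and only the exact-match and similarity scorers are shown (an LLM-as-judge or rubric scorer would plug in as another function with the same signature).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable


@dataclass
class Case:
    """One dataset row: an input and the output we expect for it."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected text (ignoring outer whitespace), else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def similarity(output: str, expected: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, output, expected).ratio()


Scorer = Callable[[str, str], float]


def run_eval(
    system: Callable[[str], str],
    dataset: list[Case],
    scorers: dict[str, Scorer],
) -> dict[str, float]:
    """Run the system on every case and return the mean score per scorer."""
    per_scorer: dict[str, list[float]] = {name: [] for name in scorers}
    for case in dataset:
        output = system(case.prompt)
        for name, scorer in scorers.items():
            per_scorer[name].append(scorer(output, case.expected))
    return {name: mean(values) for name, values in per_scorer.items()}


def regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,  # assumed threshold, chosen for illustration
) -> list[str]:
    """Names of scorers whose mean dropped more than `tolerance` below the baseline."""
    return [
        name
        for name, score in current.items()
        if score < baseline.get(name, 0.0) - tolerance
    ]


def fake_system(prompt: str) -> str:
    """Hypothetical stand-in for the LLM-backed system under test."""
    canned = {"2+2": "4", "capital of France": "Paris"}
    return canned.get(prompt, "unknown")


# Invented example data.
DATASET = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("largest planet", "Jupiter"),
]
SCORERS: dict[str, Scorer] = {"exact_match": exact_match, "similarity": similarity}

if __name__ == "__main__":
    scores = run_eval(fake_system, DATASET, SCORERS)
    baseline = {"exact_match": 1.0, "similarity": 1.0}  # e.g. scores from a previous run
    failed = regressions(scores, baseline)
    print(scores)
    # A CI job would typically fail the build when any scorer regresses.
    raise SystemExit(1 if failed else 0)
```

In CI, the baseline would usually come from a stored previous run rather than a hard-coded dictionary, and the non-zero exit code is what lets the pipeline block a change that lowers quality.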

// Related terms