
What is LLM-as-judge and how does it work?

Testing AI systems · Mid · testing-ai-systems, llm-as-judge, evaluation, rubric, ai-evaluation

Short answer

LLM-as-judge uses a second language model to evaluate the output of a first model against a rubric. It scales quality evaluation to volumes that human review cannot reach, at the cost of a calibration step: the judge has to be checked against human judgments to confirm that it is itself reliable.

Detail

The basic setup: you have output from your production model, and you want to evaluate whether it's high quality (helpful, accurate, on-topic, appropriately concise). A human reviewer can do this but not at 10,000 samples per day. An LLM judge can.

Typical implementation: a judge prompt that contains the rubric ("Rate the following response 1–5 for accuracy, helpfulness, and tone. Justify each score."), the system response, optionally a ground-truth reference, and a request for a score. Parse the numeric score from the judge's response and aggregate across the eval set.
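The setup above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `call_judge` is a hypothetical callable standing in for whatever client sends a prompt to your judge model, the sample dictionary keys are assumptions, and the single `SCORE: <n>` line is an assumed output convention added to make parsing reliable (the rubric above could equally return one score per criterion).

```python
import re
from statistics import mean
from typing import Callable, Optional

# Rubric text follows the example above; the trailing "SCORE:" line is an
# assumed convention so the number can be parsed unambiguously.
RUBRIC = (
    "Rate the following response 1-5 for accuracy, helpfulness, and tone. "
    "Justify each score, then end with a line of the form 'SCORE: <n>'."
)

SCORE_RE = re.compile(r"SCORE:\s*([1-5])\b")


def build_judge_prompt(question: str, response: str,
                       reference: Optional[str] = None) -> str:
    """Assemble rubric, system response, and optional ground-truth reference."""
    parts = [RUBRIC, f"Question:\n{question}", f"Response:\n{response}"]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    return "\n\n".join(parts)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'SCORE: n' value in the judge output, or None."""
    matches = SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None


def run_eval(samples: list, call_judge: Callable[[str], str]) -> dict:
    """Score every sample with the judge and aggregate across the eval set."""
    scores = []
    unparsed = 0
    for sample in samples:
        prompt = build_judge_prompt(
            sample["question"], sample["response"], sample.get("reference")
        )
        score = parse_score(call_judge(prompt))
        if score is None:
            unparsed += 1  # track parse failures instead of silently dropping them
        else:
            scores.append(score)
    return {
        "mean_score": mean(scores) if scores else None,
        "n_scored": len(scores),
        "n_unparsed": unparsed,
    }
```

Counting unparsed outputs separately is a deliberate choice here: a judge that often ignores the output format would otherwise skew the aggregate without anyone noticing.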

Key biases to account for:

Positivity bias: LLMs tend to rate responses higher than humans would. Calibrate the judge against a sample of human ratings before trusting it.

Length bias: longer responses often score higher regardless of quality. Control for length explicitly in the rubric.

Self-serving bias: a model from the same family as the judge tends to be scored more generously. Use a different model family as the judge where possible.
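One way to check for the first two biases is to compare judge scores with human scores on a shared sample. The sketch below is illustrative only: the function names, the report keys, and the choice of Pearson correlation are assumptions, and what counts as an acceptable offset or correlation depends on your use case.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def calibration_report(judge_scores, human_scores, response_lengths):
    """Compare judge and human ratings given to the same responses."""
    if not (len(judge_scores) == len(human_scores) == len(response_lengths)):
        raise ValueError("all inputs must describe the same samples")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Positive offset: the judge rates higher than humans on average.
        "mean_offset": mean(diffs),
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        "judge_human_corr": pearson(judge_scores, human_scores),
        # A length correlation much stronger for the judge than for humans
        # hints at length bias rather than genuine quality differences.
        "length_vs_judge_corr": pearson(response_lengths, judge_scores),
        "length_vs_human_corr": pearson(response_lengths, human_scores),
    }
```

Reporting the length correlation for both raters matters because longer answers are sometimes genuinely better; only the gap between judge and human suggests the judge is rewarding length itself.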

See Evaluating AI models and Eval platforms and tooling.

// WHAT INTERVIEWERS LOOK FOR

Correct description of the judge setup. Three biases (positivity, length, self-serving). Calibration against humans as a mandatory step before trusting the judge.