Q4 of 21 · Testing AI systems
What is LLM-as-judge and how does it work?
Short answer
LLM-as-judge uses a second language model to score the output of a first model against a rubric. It scales quality evaluation to volumes that human review cannot reach, but the judge must itself be calibrated against human judgments before its scores can be trusted.
Detail
The basic setup: you have output from your production model, and you want to evaluate whether it is high quality (helpful, accurate, on-topic, appropriately concise). A human reviewer can do this, but not at 10,000 samples per day. An LLM judge can.
Typical implementation: the judge prompt contains three things: the rubric ("Rate the following response 1–5 for accuracy, helpfulness, and tone. Justify each score."), the system response to be evaluated (optionally alongside a ground-truth reference), and a request for a score. Parse the numeric score from the judge's reply, then aggregate across the eval set, for example as a mean score or a pass rate.
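The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `call_judge` is a hypothetical stand-in for whatever client sends a prompt to your judge model and returns its text, and the trailing `SCORE: <n>` line is an assumed output convention added to make parsing reliable. The sketch also collapses the rubric to a single overall score for simplicity.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Rubric from the text, plus an ASSUMED output convention ("SCORE: <n>")
# so the numeric score can be parsed without guessing.
RUBRIC = (
    "Rate the following response 1-5 for accuracy, helpfulness, and tone. "
    "Justify each score, then end with a line of the form 'SCORE: <n>' "
    "giving one overall score."
)

SCORE_RE = re.compile(r"SCORE:\s*([1-5])\b")


def build_judge_prompt(question: str, response: str,
                       reference: Optional[str] = None) -> str:
    """Assemble rubric, response under test, and optional reference."""
    parts = [RUBRIC, f"Question:\n{question}", f"Response:\n{response}"]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    return "\n\n".join(parts)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'SCORE: n' in the judge's reply, or None if absent."""
    matches = SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None


def evaluate(samples: list, call_judge: Callable[[str], str]) -> dict:
    """Score every sample and aggregate. `call_judge` is hypothetical:
    any function that takes a prompt string and returns the judge's text."""
    scores, unparsed = [], 0
    for s in samples:
        prompt = build_judge_prompt(s["question"], s["response"],
                                    s.get("reference"))
        score = parse_score(call_judge(prompt))
        if score is None:
            unparsed += 1  # track failures instead of silently dropping them
        else:
            scores.append(score)
    return {
        "mean_score": mean(scores) if scores else None,
        "n_scored": len(scores),
        "n_unparsed": unparsed,
    }
```

Counting unparsed replies separately matters in practice: if the judge frequently ignores the output format, the mean is computed over a biased subset and should not be trusted.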
Key biases to account for:
Positivity bias: LLMs tend to rate responses higher than humans would. Calibrate the judge against a sample of human ratings before trusting it.
Length bias: longer responses often score higher regardless of quality. Control for length explicitly in the rubric.
Self-serving bias: a model from the same family as the judge tends to be scored more generously. Use a different model family as the judge where possible.
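One way to check for the first two biases is to compare judge scores with human scores on the same sample. The sketch below is illustrative only: the function names and the choice of metrics (mean offset, exact agreement, Pearson correlation) are assumptions, not a standard procedure, and the numbers in the usage example are made up.

```python
from math import sqrt
from statistics import mean
from typing import Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def calibration_report(judge_scores: Sequence[int],
                       human_scores: Sequence[int],
                       response_lengths: Sequence[int]) -> dict:
    """Compare judge and human scores on the same responses."""
    n = len(judge_scores)
    if n == 0 or n != len(human_scores) or n != len(response_lengths):
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # > 0 suggests positivity bias: the judge rates higher than humans.
        "mean_offset": mean(diffs),
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "judge_human_corr": pearson(judge_scores, human_scores),
        # A length/judge correlation well above the length/human one
        # suggests length bias in the judge.
        "length_judge_corr": pearson(response_lengths, judge_scores),
        "length_human_corr": pearson(response_lengths, human_scores),
    }


# Usage with made-up numbers (illustrative, not real data):
report = calibration_report(
    judge_scores=[4, 5, 3, 5],
    human_scores=[3, 5, 2, 4],
    response_lengths=[100, 400, 50, 300],
)
```

Comparing the two length correlations, rather than looking at the judge's alone, helps separate genuine length bias from cases where longer responses really are better by human standards.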