What is LLM-as-judge and how does it work?

Question

Accepted Answer

LLM-as-judge uses a second language model to evaluate the output of a first model against a rubric. It scales quality evaluation to volumes that are impossible for human review, at the cost of requiring calibration against human judgments to verify that the judge is itself reliable.

The basic setup: you have output from your production model, and you want to evaluate whether it is high quality (helpful, accurate, on-topic, appropriately concise). A human reviewer can do this, but not at 10,000 samples per day. An LLM judge can.

Typical implementation: the judge receives a prompt that defines the rubric ("Rate the following response 1–5 for accuracy, helpfulness, and tone. Justify each score."), the system response and optionally a ground-truth reference, and a request for a score. Parse the numeric score from the judge's response and aggregate across the eval set.

Key biases to account for:

Positivity bias: LLMs tend to rate responses higher than humans would. Calibrate the judge against a sample of human ratings befo
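
The implementation described above (rubric prompt, score parsing, aggregation, and a check against human ratings) could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the one-line-per-criterion output format, and `stub_judge` (a stand-in for a real model call) are all assumptions made for the example.

```python
import re
from statistics import mean
from typing import Callable, Optional

CRITERIA = ("accuracy", "helpfulness", "tone")


def build_judge_prompt(response: str, reference: Optional[str] = None) -> str:
    """Assemble the rubric, the system response, and an optional reference."""
    parts = [
        "Rate the following response 1–5 for accuracy, helpfulness, and tone. "
        "Justify each score.",
        # Assumed output format, chosen only to make parsing simple.
        "Answer with one line per criterion, e.g. 'accuracy: 4 - reason'.",
        f"Response:\n{response}",
    ]
    if reference is not None:
        parts.append(f"Reference:\n{reference}")
    return "\n\n".join(parts)


def parse_scores(judge_output: str) -> dict[str, int]:
    """Pull the 1-5 score for each criterion out of the judge's text."""
    scores = {}
    for criterion in CRITERIA:
        match = re.search(
            rf"{criterion}\s*:\s*([1-5])\b", judge_output, re.IGNORECASE
        )
        if match:  # criteria the judge did not score are skipped
            scores[criterion] = int(match.group(1))
    return scores


def evaluate(samples: list[dict], judge: Callable[[str], str]) -> dict[str, float]:
    """Run the judge over the eval set and average each criterion."""
    collected = {c: [] for c in CRITERIA}
    for sample in samples:
        prompt = build_judge_prompt(sample["response"], sample.get("reference"))
        for criterion, score in parse_scores(judge(prompt)).items():
            collected[criterion].append(score)
    return {c: mean(v) for c, v in collected.items() if v}


def mean_offset(judge_scores: list[int], human_scores: list[int]) -> float:
    """Average of (judge - human); a positive value suggests positivity bias."""
    return mean(j - h for j, h in zip(judge_scores, human_scores))


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; always returns the same text."""
    return "accuracy: 4 - mostly right\nhelpfulness: 5 - clear\ntone: 3 - curt"


if __name__ == "__main__":
    eval_set = [{"response": "Paris is the capital of France.", "reference": "Paris"}]
    print(evaluate(eval_set, stub_judge))
    print(mean_offset([4, 5, 4], [3, 4, 4]))
```

In practice `stub_judge` would be replaced by a call to whichever model serves as the judge. The signed offset is only one possible calibration check; it compares judge and human scores on the same samples and says nothing about agreement on individual items.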

Short answer

Detail

// WHAT INTERVIEWERS LOOK FOR