LLM-as-judge

AI & LLM Testing

// Definition

An evaluation pattern in which one language model grades another model's output. The judge model is given the input, the output to evaluate, and a rubric, and returns a score or a pass/fail verdict. It is useful for qualities that are hard to test deterministically: tone, factual accuracy, helpfulness, refusal of unsafe requests. The catch is that judges are themselves LLMs with their own biases and failure modes, such as favouring longer answers or the answer shown first in a pairwise comparison, so measure their agreement with human raters on a labelled sample before you trust them at scale. Good for triage and trend-spotting; not a replacement for human eval on critical paths.
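The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `call_model` is a hypothetical callable standing in for whatever LLM API you use, the prompt template and the PASS/FAIL format are assumptions, and `fake_judge_model` is a stub so the example runs without network access.

```python
from typing import Callable

# Assumed prompt layout: rubric, then the input, then the output being graded.
JUDGE_TEMPLATE = """You are grading a model's answer.
Rubric: {rubric}
User input: {user_input}
Model output: {model_output}
Reply with exactly one word: PASS or FAIL."""


def judge(user_input: str, model_output: str, rubric: str,
          call_model: Callable[[str], str]) -> bool:
    """Return True if the judge model's verdict is PASS."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, user_input=user_input, model_output=model_output
    )
    verdict = call_model(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # Judges can ignore the format; fail loudly rather than guess.
        raise ValueError(f"Unparseable judge verdict: {verdict!r}")
    return verdict == "PASS"


def agreement_rate(judge_verdicts: list[bool],
                   human_verdicts: list[bool]) -> float:
    """Fraction of cases where the judge matches the human rater."""
    if len(judge_verdicts) != len(human_verdicts) or not human_verdicts:
        raise ValueError("Need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


def fake_judge_model(prompt: str) -> str:
    """Hypothetical stub: a real judge would call an LLM here."""
    return "PASS" if "sorry" in prompt.lower() else "FAIL"


# Each case: (user input, model output, human rater's verdict).
cases = [
    ("Where is my order?", "Sorry for the delay, it ships today.", True),
    ("Where is my order?", "No idea.", False),
    ("Cancel my plan", "Done.", True),
]
rubric = "The reply is polite."

judge_verdicts = [judge(q, a, rubric, fake_judge_model) for q, a, _ in cases]
human_verdicts = [human for _, _, human in cases]
print(f"Judge/human agreement: {agreement_rate(judge_verdicts, human_verdicts):.2f}")
```

Raw agreement is the simplest calibration signal. On a real labelled sample you would probably also look at which cases the judge gets wrong, since systematic disagreement matters more than the headline number.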

// Related terms

Golden dataset
A curated set of inputs paired with known-correct outputs, used to evaluate an AI system's performance over time. For an LLM-backed product, a golden dataset might be 100 representative user questions plus the ideal answer for each. You run the system against the dataset on every release and compare current output to the gold answer using exact match, similarity scoring, or LLM-as-judge. Without a golden dataset you have vibes, not evaluation. Building and maintaining one is foundational QA work for AI products.
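As a rough sketch of the release check described above: the snippet below runs a system over a tiny golden dataset and reports exact-match and similarity pass rates. The dataset, the `fake_system` stub, and the 0.8 threshold are all made up for illustration, and `difflib.SequenceMatcher` is only one simple choice of similarity score (character-level, not semantic).

```python
from difflib import SequenceMatcher
from typing import Callable

# Illustrative golden dataset: inputs paired with known-correct answers.
GOLDEN = [
    {"input": "What is the capital of France?", "gold": "Paris"},
    {"input": "What is 2 + 2?", "gold": "4"},
    {"input": "What colour is a clear daytime sky?", "gold": "Blue"},
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def evaluate(system: Callable[[str], str], dataset: list[dict],
             threshold: float = 0.8) -> dict:
    """Run the system on every golden input and summarise the results."""
    if not dataset:
        raise ValueError("Golden dataset is empty")
    exact = similar = 0
    failures = []
    for case in dataset:
        output = system(case["input"])
        if output.strip() == case["gold"].strip():
            exact += 1
        if similarity(output, case["gold"]) >= threshold:
            similar += 1
        else:
            failures.append(case["input"])
    n = len(dataset)
    return {"exact_match": exact / n, "similar": similar / n, "failures": failures}


def fake_system(question: str) -> str:
    """Hypothetical stand-in for the LLM-backed product under test."""
    canned = {
        "What is the capital of France?": "Paris.",  # close, but not exact
        "What is 2 + 2?": "4",
        "What colour is a clear daytime sky?": "Red",  # wrong
    }
    return canned[question]


report = evaluate(fake_system, GOLDEN)
print(report)
```

Tracking both numbers is a deliberate choice here: exact match is strict and cheap, while the similarity rate tolerates harmless differences such as trailing punctuation.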
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (roughly, a word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything in the way a database does: they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.
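To make "predicting plausible continuations from patterns" concrete, here is a deliberately tiny toy: a bigram counter that predicts the word most often seen after a given word. It is not how an LLM is built (real models are neural networks working on tokens with long context), and the corpus and function names are invented for this sketch. It only shows that the prediction comes from observed frequency, not from checking facts.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigrams(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict, word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text for illustration.
corpus = "the test passed the test failed the test passed the build passed"
model = train_bigrams(corpus)

# The toy model says "passed" after "test" because that pairing was most
# common in training, whether or not any particular test actually passed.
print(predict_next(model, "test"))
print(predict_next(model, "deploy"))
```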
Hallucination
When an AI model generates output that is fluent, confident, and wrong. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug that can simply be patched out; they're a consequence of how language models work, predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
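One of the mitigations above, checking AI-produced function names before trusting them, can be partly automated. The sketch below is one possible approach for Python references: it tries to resolve a dotted name such as `json.dumps` against the installed modules. The helper name `reference_exists` is made up for this example, and because importing a module executes its code, it should only be pointed at packages you already trust.

```python
import importlib


def reference_exists(dotted_name: str) -> bool:
    """Return True if a dotted name like 'json.dumps' resolves to a real object."""
    parts = dotted_name.split(".")
    # Try the longest importable module prefix first, then walk attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except (ImportError, ValueError):
            continue  # not a module at this length; try a shorter prefix
        for attr in parts[i:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


# Names as they might appear in LLM-generated code; the second and fourth
# are invented here to play the part of hallucinated references.
suggested = ["json.dumps", "json.to_string", "os.path.join", "not_a_real_module_xyz.run"]
for name in suggested:
    status = "ok" if reference_exists(name) else "NOT FOUND, treat as hallucinated"
    print(f"{name}: {status}")
```

A passing check only shows that the name exists, not that the generated code uses it correctly, so running the code remains the stronger verification.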