RAG Evaluation

AI & LLM Testing · advanced · aka RAG Eval

// Definition

Measuring a Retrieval-Augmented Generation system on two axes that a plain answer-check misses: retrieval quality (did it fetch the right context?) and faithfulness (is the answer grounded in that context, or hallucinated despite it?). A RAG system can retrieve perfectly and still hallucinate, or answer correctly from the wrong source — so both must be scored separately.

// Why it matters

Teams ship RAG and test only the final answer, missing that retrieval is broken or that the model is ignoring its context. RAG evaluation isolates which stage fails, so you fix the retriever or the prompt rather than guessing. It's the difference between "the chatbot is sometimes wrong" and "the retriever misses 30% of relevant docs."
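
One way to turn per-query scores into a stage-level diagnosis is sketched below. It assumes each evaluated query has already been reduced to a hypothetical record `{ recall, grounded, correct }`; the label names and the 0.8 recall bar are illustrative choices, not a standard.

```javascript
// Hypothetical per-query result shape: { recall, grounded, correct }.
// RECALL_THRESHOLD is an assumed pass bar, not a standard value.
const RECALL_THRESHOLD = 0.8

// Attribute a single query's outcome to the stage that most likely failed.
function diagnose({ recall, grounded, correct }) {
  if (recall < RECALL_THRESHOLD) {
    // A correct answer despite missing context probably came from
    // model memory or from the wrong source.
    return correct ? 'right-answer-wrong-source' : 'retrieval-failure'
  }
  // Context was retrieved, but the answer is not supported by it.
  if (!grounded) return 'generation-failure'
  return correct ? 'ok' : 'grounded-but-wrong'
}

// Count how many queries fall under each label.
function summarize(results) {
  const counts = {}
  for (const result of results) {
    const label = diagnose(result)
    counts[label] = (counts[label] ?? 0) + 1
  }
  return counts
}
```

A summary dominated by `retrieval-failure` points at the retriever; one dominated by `generation-failure` points at the prompt or the model.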

// How to test

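// Score retrieval AND faithfulness separately, per query.
// Assumes chai's `expect`; `ragEvalSet`, `ragSystem` and `isGrounded` are your own project helpers.
import { expect } from 'chai'

for (const { query, goldDocs, goldAnswer } of ragEvalSet) {
  const { retrieved, answer } = await ragSystem(query)
  // 1. Retrieval: what fraction of the gold docs did we fetch?
  // A Set dedupes repeated chunks of the same doc, so recall cannot exceed 1.
  const retrievedIds = new Set(retrieved.map((d) => d.id))
  const recall = goldDocs.filter((id) => retrievedIds.has(id)).length / goldDocs.length
  expect(recall, `retrieval recall: ${query}`).to.be.gte(0.8)
  // 2. Faithfulness: is the answer grounded in retrieved context, not invented?
  expect(await isGrounded(answer, retrieved), `faithfulness: ${query}`).to.be.true
}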
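
The snippet above leaves `isGrounded` undefined. The sketch below is a deliberately crude lexical stand-in, not the check this entry has in mind: it passes an answer only if most content words of every sentence also appear in the retrieved text. The name `isGroundedLexical`, the `doc.text` field, the stopword list and the 0.7 support threshold are all assumptions. Real setups more often use an entailment model or an LLM judge, since word overlap cannot catch a negated or reworded claim.

```javascript
// Crude lexical stand-in for a faithfulness check.
// All names and thresholds here are assumptions for illustration.
const STOPWORDS = new Set([
  'the', 'a', 'an', 'is', 'are', 'was', 'were', 'of', 'in', 'on', 'to',
  'and', 'or', 'it', 'that', 'this', 'for', 'with', 'as', 'by', 'at',
])

// Lowercase, split into alphanumeric tokens, drop stopwords.
function contentTokens(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g)?.filter((t) => !STOPWORDS.has(t)) ?? []
}

// Assumes each retrieved doc exposes its text as `doc.text`.
function isGroundedLexical(answer, retrieved, minSupport = 0.7) {
  const contextVocab = new Set(retrieved.flatMap((doc) => contentTokens(doc.text)))
  const sentences = answer.split(/(?<=[.!?])\s+/).filter((s) => s.trim().length > 0)
  // Every sentence must have most of its content words present in the context.
  return sentences.every((sentence) => {
    const tokens = contentTokens(sentence)
    if (tokens.length === 0) return true
    const supported = tokens.filter((t) => contextVocab.has(t)).length
    return supported / tokens.length >= minSupport
  })
}
```

Because `await` accepts a plain boolean, a synchronous function like this can stand in wherever `await isGrounded(answer, retrieved)` is called, until a stronger checker replaces it.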

// Common mistakes

  • Testing only the final answer, blind to whether retrieval or generation failed
  • No faithfulness check — accepting a fluent answer that ignored the context
  • Evaluating on questions whose answers the base model already knows (RAG adds nothing there — test on context-dependent queries)
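
The last point can be enforced when building the eval set. A minimal sketch, assuming two hypothetical callbacks you supply: `answerWithoutContext(query)`, which asks the base model with no retrieved context, and `isCorrect(answer, goldAnswer)`, which grades the result.

```javascript
// Keep only eval items the base model cannot answer without retrieved context.
// `answerWithoutContext` and `isCorrect` are hypothetical callbacks you supply.
async function keepContextDependent(evalSet, answerWithoutContext, isCorrect) {
  const kept = []
  for (const item of evalSet) {
    const closedBookAnswer = await answerWithoutContext(item.query)
    // If the model is already right closed-book, this item cannot
    // show whether retrieval contributed anything.
    if (!(await isCorrect(closedBookAnswer, item.goldAnswer))) kept.push(item)
  }
  return kept
}
```

Items dropped by this filter are not useless for other tests. They simply cannot distinguish a working RAG pipeline from a model answering from memory.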

// Related terms