RAG Testing

RAG (Retrieval-Augmented Generation) answers from retrieved documents instead of the model's memory. The key insight for testing: retrieval and generation fail independently. The system can produce a correct answer despite retrieving the wrong docs, or a wrong answer despite retrieving the right ones, so test the two halves separately. This sheet shows how; see LLM Evaluation and Testing AI Systems for the broader frame (linked below).

The two halves

Half | Question | Metrics
Retrieval | Did we fetch the right context? | Recall@k, precision@k, MRR, hit rate
Generation | Is the answer grounded in that context? | Faithfulness, relevance, completeness

Retrieval checks

  • The relevant chunk(s) appear in the top-k results (recall@k).
  • Irrelevant chunks aren't crowding them out (precision).
  • Chunking/embedding handles synonyms, acronyms, multi-part questions.
  • Out-of-corpus questions return nothing rather than a confident wrong doc.
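
The retrieval metrics named above can be computed from a ranked list of retrieved chunk IDs and a hand-labelled set of relevant IDs. The following is a minimal sketch, not a reference implementation: the function names and the chunk IDs are hypothetical, and precision@k here divides by k (some teams divide by the number of results actually returned).

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a relevant chunk (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved.

    Averaging this over all test queries gives MRR.
    """
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk is in the top-k.

    Averaging this over all test queries gives the hit rate.
    """
    return any(chunk_id in relevant for chunk_id in retrieved[:k])


# Hypothetical example: one query, four ranked results, two labelled relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(precision_at_k(retrieved, relevant, k=3))  # 0.333...
print(reciprocal_rank(retrieved, relevant))      # 0.5
print(hit_at_k(retrieved, relevant, k=3))        # True
```

For out-of-corpus questions the labelled relevant set is empty, so these metrics are uninformative; assert directly that the retriever returns an empty list (or only results below your score threshold).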

Generation checks (given the context)

  • Faithfulness: every claim traces to a retrieved chunk — no invention.
  • Answer relevance: addresses the question, not just the topic.
  • Citations: references point to the chunk that supports the claim.
  • No-context behaviour: when nothing relevant is retrieved, it says "I don't know" instead of hallucinating.
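
Two of these checks, citations and no-context behaviour, can be approximated with plain assertions. The sketch below assumes the system returns an answer text plus a list of cited chunk IDs; the `RagAnswer` type, the helper names, and the refusal phrases are all hypothetical. Faithfulness and answer relevance are not covered here, since they usually need a stronger judge (human review or a model-based grader).

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical refusal phrases; adjust to the wording your system actually uses.
REFUSAL_MARKERS = ("i don't know", "i do not know", "not enough information")


@dataclass
class RagAnswer:
    """Assumed output shape: answer text plus the chunk IDs it cites."""
    text: str
    citations: List[str] = field(default_factory=list)


def is_refusal(answer: RagAnswer) -> bool:
    """Crude string match for an 'I don't know' style response."""
    lowered = answer.text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def dangling_citations(answer: RagAnswer, retrieved_ids: Set[str]) -> List[str]:
    """Cited chunk IDs that were never retrieved, so cannot support any claim."""
    return [cid for cid in answer.citations if cid not in retrieved_ids]


def no_context_ok(answer: RagAnswer, retrieved_ids: Set[str]) -> bool:
    """With nothing retrieved, the only acceptable answer is a refusal."""
    return bool(retrieved_ids) or is_refusal(answer)


# Hypothetical example cases.
grounded = RagAnswer("Refunds take 5 days [c2].", citations=["c2"])
assert dangling_citations(grounded, {"c2", "c7"}) == []

invented = RagAnswer("Refunds take 5 days.", citations=[])
assert not no_context_ok(invented, retrieved_ids=set())
assert no_context_ok(RagAnswer("I don't know."), retrieved_ids=set())
```

Note that `dangling_citations` only confirms a cited chunk was retrieved; whether the chunk actually supports the claim still has to be judged separately.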

Failure isolation

Symptom | Likely half
Answer wrong, right docs retrieved | Generation
Answer wrong, wrong docs retrieved | Retrieval
"I don't know" but answer was in corpus | Retrieval
Confident answer, nothing retrieved | Generation (should refuse)
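
The table can be turned into a small triage helper for labelled test failures. This is one possible encoding, assuming each failing case has already been annotated with the five boolean flags below; the names are hypothetical, and the order in which the rows are checked is a choice made for this sketch. Cases the table does not cover return `None`.

```python
from enum import Enum
from typing import Optional


class Half(Enum):
    RETRIEVAL = "retrieval"
    GENERATION = "generation"


def likely_half(
    *,
    answer_correct: bool,
    refused: bool,
    anything_retrieved: bool,
    relevant_retrieved: bool,
    answer_in_corpus: bool,
) -> Optional[Half]:
    """Map a labelled test case to the half that most likely failed."""
    # Confident answer with nothing retrieved: generation should have refused.
    if not refused and not anything_retrieved:
        return Half.GENERATION
    # "I don't know" although the corpus holds the answer: retrieval missed it.
    if refused and answer_in_corpus and not relevant_retrieved:
        return Half.RETRIEVAL
    # Wrong answer: blame generation if the right docs were there, else retrieval.
    if not refused and not answer_correct:
        return Half.GENERATION if relevant_retrieved else Half.RETRIEVAL
    # Not a failure, or a combination the table does not cover.
    return None


# Hypothetical example: wrong answer even though the right chunk was retrieved.
case = dict(
    answer_correct=False,
    refused=False,
    anything_retrieved=True,
    relevant_retrieved=True,
    answer_in_corpus=True,
)
print(likely_half(**case))  # Half.GENERATION
```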

Common mistakes

  • Testing only the end answer, without checking which half failed.
  • No "unanswerable / out-of-corpus" cases (where RAG should refuse).
  • Ignoring chunking/embedding as a test variable.
  • Treating a fluent answer as correct without checking groundedness.