Reference | Advanced | 5-7 min reference

RAG Testing

RAG (Retrieval-Augmented Generation) answers from retrieved documents instead of the model's memory. The key insight for testing: retrieval and generation fail independently — a perfect answer from wrong docs, or a wrong answer from the right docs — so test them separately. This sheet shows how; see LLM Evaluation and Testing AI Systems for the broader frame (linked below).

The two halves

Half	Question	Metrics
Retrieval	Did we fetch the right context?	Recall@k, precision@k, MRR, hit rate
Generation	Is the answer grounded in that context?	Faithfulness, relevance, completeness
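
The retrieval metrics in the table can be computed from two inputs per query: the ranked list of chunk IDs the retriever returned, and a hand-labelled set of relevant chunk IDs. The following is a minimal sketch under that assumption; the function names and the chunk IDs in the usage note are illustrative, not part of any particular library.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("recall is undefined when no chunk is labelled relevant")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk in retrieved[:k] if chunk in relevant) / k


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk is in the top k."""
    return any(chunk in relevant for chunk in retrieved[:k])


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

For example, with retrieved = ["c3", "c1", "c9", "c4"] and relevant = {"c1", "c4"}, recall@3 is 0.5 (one of two relevant chunks found), precision@3 is 1/3, and the reciprocal rank is 0.5 because the first relevant chunk sits at rank 2. Hit rate is then the share of queries for which hit_at_k is true.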

Retrieval checks

The relevant chunk(s) appear in the top-k results (recall@k).
Irrelevant chunks aren't crowding them out (precision@k).
Chunking/embedding handles synonyms, acronyms, multi-part questions.
Out-of-corpus questions return nothing rather than a confident wrong doc.
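
The checks above can be expressed as a small table of test cases run against the retriever alone, with no model in the loop. The sketch below is illustrative: CORPUS, toy_retrieve, and check_retrieval are hypothetical stand-ins, and the word-overlap scoring only exists to make the example runnable. In practice you would pass your real retriever and whatever score threshold it supports.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical two-chunk corpus, used only to make the sketch runnable.
CORPUS = {
    "refund-policy": "refunds are issued within 14 days of purchase",
    "sso-setup": "configure single sign-on sso with your identity provider",
}


def toy_retrieve(query: str, min_overlap: int = 2) -> List[str]:
    """Rank chunks by shared words; drop chunks below the overlap threshold."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), cid) for cid, text in CORPUS.items()]
    kept = [(score, cid) for score, cid in scored if score >= min_overlap]
    return [cid for score, cid in sorted(kept, key=lambda pair: -pair[0])]


def check_retrieval(
    retrieve: Callable[[str], List[str]],
    cases: Sequence[Tuple[str, Set[str]]],
    k: int = 3,
) -> List[str]:
    """Return the queries that failed.

    A case with expected chunk IDs passes if all of them are in the top k.
    A case with an empty expected set is out-of-corpus and passes only if
    the retriever returns nothing.
    """
    failures = []
    for query, expected in cases:
        got = set(retrieve(query)[:k])
        passed = expected <= got if expected else not got
        if not passed:
            failures.append(query)
    return failures


CASES = [
    ("how do i configure sso", {"sso-setup"}),         # acronym in the query
    ("set up single sign-on", {"sso-setup"}),          # spelled-out form
    ("what is the office wifi password", set()),       # out-of-corpus
]
```

Calling check_retrieval(toy_retrieve, CASES) returns the list of failing queries, so an empty list means every case passed. Keeping the out-of-corpus cases in the same table makes it harder to forget them.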

Generation checks (given the context)

Faithfulness: every claim traces to a retrieved chunk — no invention.
Answer relevance: addresses the question, not just the topic.
Citations: references point to the chunk that supports the claim.
No-context behaviour: when nothing relevant is retrieved, it says "I don't know" instead of hallucinating.
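
Two of these checks are mechanical enough to automate cheaply: whether citations point at chunks that were actually retrieved, and whether the system refuses when the context is empty. The sketch below assumes answers cite chunks inline as [chunk-id] and that refusals contain one of a few fixed phrases; both conventions are assumptions for illustration, as are the function names. It does not verify that a cited chunk supports the claim, which usually needs human review or a separate judging step.

```python
import re
from typing import Dict, List, Set

# Assumed refusal phrasings; adjust to whatever your system prompt asks for.
REFUSAL_MARKERS = ("i don't know", "i do not know")


def is_refusal(answer: str) -> bool:
    """True if the answer contains one of the assumed refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def cited_ids(answer: str) -> Set[str]:
    """Extract chunk IDs cited as [chunk-id] (assumed citation format)."""
    return set(re.findall(r"\[([\w-]+)\]", answer))


def check_generation(answer: str, context: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    context maps retrieved chunk IDs to their text.
    """
    problems = []
    if not context:
        # No-context behaviour: the only acceptable output is a refusal.
        if not is_refusal(answer):
            problems.append("answered without any retrieved context")
        return problems

    cited = cited_ids(answer)
    unknown = cited - set(context)
    if unknown:
        problems.append(f"cites chunks that were not retrieved: {sorted(unknown)}")
    if not cited and not is_refusal(answer):
        problems.append("makes claims without citing any chunk")
    return problems
```

Because check_generation takes the context as an argument, you can feed it hand-picked chunks and test generation without running retrieval at all.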

Failure isolation

Symptom	Likely half
Answer wrong, right docs retrieved	Generation
Answer wrong, wrong docs retrieved	Retrieval
"I don't know" but answer was in corpus	Retrieval
Confident answer, nothing retrieved	Generation (should refuse)
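
The table above can be encoded as a small triage function that labels each failed test case with the half to investigate first. This is a sketch under assumptions: likely_half and its parameters are hypothetical names, each case needs a labelled set of relevant chunk IDs (empty for out-of-corpus questions), and the one branch not covered by the table is marked in a comment.

```python
from typing import Optional, Set


def likely_half(
    answer_correct: bool,
    refused: bool,
    retrieved: Set[str],
    relevant: Set[str],
) -> Optional[str]:
    """Return "retrieval", "generation", or None when nothing looks wrong.

    relevant is the labelled set of chunk IDs that answer the question;
    an empty set means the question is out-of-corpus.
    """
    found_relevant = bool(retrieved & relevant)

    if refused:
        if not relevant:
            return None  # out-of-corpus question: refusing is correct
        if not found_relevant:
            return "retrieval"  # "I don't know" but the answer was in the corpus
        # Not a row in the table: the right docs were there and the model
        # still refused, treated here like "answer wrong, right docs".
        return "generation"

    if not retrieved:
        return "generation"  # confident answer, nothing retrieved: should refuse
    if answer_correct:
        return None
    return "generation" if found_relevant else "retrieval"
```

Running this over every failing case and counting the labels gives a quick view of which half is responsible for most failures.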

Common mistakes

Testing the end answer only, never which half failed.
No "unanswerable / out-of-corpus" cases (where RAG should refuse).
Ignoring chunking/embedding as a test variable.
Treating a fluent answer as correct without checking groundedness.

Related resources