Reference · Advanced · 5-7 min reference
RAG Testing
RAG (Retrieval-Augmented Generation) answers from retrieved documents instead of the model's memory. The key insight for testing: retrieval and generation fail independently. The model can give a correct-looking answer when the wrong docs were retrieved, or a wrong answer when the right docs were retrieved, so test the two halves separately. This sheet shows how; see LLM Evaluation and Testing AI Systems for the broader frame (linked below).
The two halves
| Half | Question | Metrics |
|---|---|---|
| Retrieval | Did we fetch the right context? | Recall@k, precision@k, MRR, hit rate |
| Generation | Is the answer grounded in that context? | Faithfulness, relevance, completeness |
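The retrieval metrics in the table can be computed from ranked chunk IDs and a hand-labelled set of relevant IDs. The following is a minimal sketch under those assumptions; the function names and the `c7`-style IDs are illustrative, not taken from any particular library.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant chunks (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk is in the top-k (hit rate = mean of this)."""
    return any(cid in relevant for cid in retrieved[:k])


def mean_reciprocal_rank(cases: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: average reciprocal rank over (retrieved, relevant) query pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)


# Hypothetical example: two relevant chunks, one of them ranked second.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
print(recall_at_k(retrieved, relevant, k=3))   # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Dividing precision by `k` rather than by the number of results actually returned is one common convention; pick one and keep it fixed across runs so scores stay comparable.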
Retrieval checks
- The relevant chunk(s) appear in the top-k results (recall@k).
- Irrelevant chunks aren't crowding them out (precision@k).
- Chunking/embedding handles synonyms, acronyms, multi-part questions.
- Out-of-corpus questions return nothing rather than a confident wrong doc.
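The out-of-corpus check above can be exercised with a toy retriever. This sketch assumes a bag-of-words cosine score and a `min_score` cutoff of 0.2; both are stand-ins for whatever embedding model and threshold a real system uses, and the two-document corpus is invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def _vec(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(
    query: str, corpus: Dict[str, str], k: int = 3, min_score: float = 0.2
) -> List[Tuple[str, float]]:
    """Return up to k (chunk_id, score) pairs, dropping anything below min_score."""
    q = _vec(query)
    scored = [(cid, _cosine(q, _vec(text))) for cid, text in corpus.items()]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical corpus for the test cases below.
CORPUS = {
    "refund": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

# In-corpus question: the relevant chunk should be returned.
in_corpus = retrieve("How long does standard shipping take?", CORPUS)
# Out-of-corpus question: the weak match on "of" falls below the cutoff.
out_of_corpus = retrieve("What is the capital of France?", CORPUS)
print([cid for cid, _ in in_corpus], out_of_corpus)
```

Without the cutoff, the second query would still return the refund chunk on the strength of a single shared stopword, which is exactly the wrong-doc case the check is meant to catch.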
Generation checks (given the context)
- Faithfulness: every claim traces to a retrieved chunk — no invention.
- Answer relevance: addresses the question, not just the topic.
- Citations: references point to the chunk that supports the claim.
- No-context behaviour: when nothing relevant is retrieved, it says "I don't know" instead of hallucinating.
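Two of the generation checks above, faithfulness and no-context behaviour, can be approximated in a few lines. This is a deliberately crude sketch: it uses word overlap as a proxy for "traces to a retrieved chunk", whereas production setups more often use an entailment model or an LLM judge. The `min_overlap` value and the refusal phrases are assumptions to tune for your own system.

```python
import re
from typing import List, Set

# Hypothetical refusal phrases; adjust to match your system's wording.
REFUSAL_MARKERS = ("i don't know", "i do not know", "not enough information")


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(
    answer: str, chunks: List[str], min_overlap: float = 0.6
) -> List[str]:
    """Flag answer sentences whose words are mostly absent from every chunk."""
    chunk_tokens = [_tokens(c) for c in chunks]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Best share of this sentence's words found in any single chunk.
        best = max((len(words & ct) / len(words) for ct in chunk_tokens), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


def is_refusal(answer: str) -> bool:
    """True if the answer contains one of the known refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


chunks = ["Refunds are issued within 14 days of a return request."]
answer = "Refunds are issued within 14 days. Shipping is always free."
print(unsupported_sentences(answer, chunks))  # ['Shipping is always free.']
```

For the no-context case, pass an empty `chunks` list to the system under test and assert `is_refusal(answer)`; with no chunks, `unsupported_sentences` flags every sentence of a non-refusing answer.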
Failure isolation
| Symptom | Likely half |
|---|---|
| Answer wrong, right docs retrieved | Generation |
| Answer wrong, wrong docs retrieved | Retrieval |
| "I don't know" but answer was in corpus | Retrieval (generation, if the relevant chunk was in fact retrieved) |
| Confident answer, nothing retrieved | Generation (should refuse) |
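The table above can be turned into a small triage function that tags each failing test case with the half to investigate first. This is a sketch only: the `CaseResult` fields are hypothetical labels your test harness would need to record, and the branch for a refusal despite a relevant chunk being retrieved is an added assumption that goes beyond the table.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    answerable: bool          # gold label: is the answer in the corpus?
    retrieved_any: bool       # did retrieval return at least one chunk?
    retrieved_relevant: bool  # did the top-k include a gold chunk?
    refused: bool             # did the model say "I don't know"?
    answer_correct: bool      # did the answer match the expected one?


def likely_half(r: CaseResult) -> str:
    """Map one test outcome to 'retrieval', 'generation', or 'ok'."""
    if r.refused:
        if not r.answerable:
            return "ok"  # refusing an out-of-corpus question is correct
        # Assumption beyond the table: if the gold chunk was retrieved and
        # the model still refused, the generation half is the suspect.
        return "generation" if r.retrieved_relevant else "retrieval"
    if not r.retrieved_any:
        return "generation"  # confident answer with nothing retrieved
    if r.answer_correct:
        return "ok"
    # Wrong answer: right docs point at generation, wrong docs at retrieval.
    return "generation" if r.retrieved_relevant else "retrieval"


# Wrong answer even though the right docs were retrieved.
print(likely_half(CaseResult(True, True, True, False, False)))  # generation
```

Counting the tags across a test run shows which half is responsible for most failures, rather than just how many end answers were wrong.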
Common mistakes
- Testing the end answer only, never which half failed.
- No "unanswerable / out-of-corpus" cases (where RAG should refuse).
- Ignoring chunking/embedding as a test variable.
- Treating a fluent answer as correct without checking groundedness.
// Related resources