Q16 of 21 · Testing AI systems

How do you test an AI feature that uses retrieval-augmented generation (RAG)?

Testing AI systems · Senior · testing-ai-systems · rag · retrieval · groundedness · llm · evaluation

Short answer

Test each layer independently: retrieval quality (does the right context come back for a given query?), groundedness (does the generated response use only facts from the retrieved context?), and end-to-end quality (is the final answer accurate and helpful?). Retrieval failure and hallucination are the two dominant failure modes.

Detail

RAG systems have two sources of failure, retrieval and generation, and each requires its own test strategy.

Retrieval layer: given a test query, does the system retrieve the most relevant documents? Evaluate with recall@K (what fraction of the expected documents appear in the top K results?) and mean reciprocal rank (MRR, the average over queries of 1/rank of the first relevant result). With a fixed index and embedding model, this evaluation is deterministic and can be fully automated against a labelled evaluation set. Poor retrieval is the most common RAG failure: the model cannot answer correctly if it does not receive the right context.
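The two retrieval metrics above can be computed in a few lines. The following is a minimal sketch, not a reference implementation: `FAKE_RANKINGS`, `EVAL_SET`, and the document ids are hypothetical stand-ins for a real retriever and a real labelled set.

```python
from statistics import mean
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the expected documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one expected document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first expected document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    retrieve: Callable[[str], List[str]],
    eval_set: Dict[str, Set[str]],
    k: int,
) -> Dict[str, float]:
    """Average recall@k and MRR over a labelled set of query -> expected ids."""
    recalls, ranks = [], []
    for query, relevant in eval_set.items():
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    return {"recall_at_k": mean(recalls), "mrr": mean(ranks)}


# Hypothetical stand-in for the real retriever: a fixed ranking per query.
FAKE_RANKINGS = {
    "refund policy": ["doc-7", "doc-2", "doc-9"],
    "sso setup": ["doc-4", "doc-1", "doc-5"],
}
# Hypothetical labelled set: query -> ids of the documents that should come back.
EVAL_SET = {
    "refund policy": {"doc-2"},
    "sso setup": {"doc-4", "doc-5"},
}

metrics = evaluate_retrieval(lambda q: FAKE_RANKINGS[q], EVAL_SET, k=2)
print(metrics)
```

In a CI suite, the real retriever would replace the lambda, and the test would assert that both metrics stay above a threshold the team has agreed on.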

Generation layer: given the retrieved context, does the model's answer use that context faithfully? Groundedness checks verify that each factual claim in the answer is supported by the retrieved documents. Also test with deliberately incorrect injected context: the answer should follow the supplied context, which shows the model is not overriding it with its training data.
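A groundedness check and an injected-context test might look like the sketch below. It uses token overlap as a deliberately crude proxy for "the claim appears in the context"; a production check would more likely use an entailment model or an LLM judge. The `generate(question, docs)` signature, the stub models, the stopword list, and the 0.8 threshold are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Set

# Small illustrative stopword list; an assumption, not a standard set.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "in", "and", "or", "for",
             "from", "on", "it", "that", "this", "are", "was", "be"}


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def unsupported_claims(answer: str, context_docs: List[str],
                       threshold: float = 0.8) -> List[str]:
    """Return answer sentences whose content tokens are mostly absent from context."""
    context_vocab = set().union(*[content_tokens(d) for d in context_docs])
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = content_tokens(sentence)
        if not tokens:
            continue
        support = len(tokens & context_vocab) / len(tokens)
        if support < threshold:
            flagged.append(sentence)
    return flagged


def follows_injected_context(generate: Callable[[str, List[str]], str],
                             question: str, injected_context: str,
                             injected_fact: str) -> bool:
    """True if the answer repeats the injected fact and adds no unsupported claim."""
    answer = generate(question, [injected_context])
    return injected_fact in answer and not unsupported_claims(answer, [injected_context])


# Injected context that is deliberately "wrong" relative to what a model may recall.
INJECTED = "The refund window is 45 days from the delivery date."


def grounded_stub(question: str, docs: List[str]) -> str:
    # Hypothetical model that answers from the supplied context.
    return "The refund window is 45 days."


def parametric_stub(question: str, docs: List[str]) -> str:
    # Hypothetical model that ignores the context and answers from memory.
    return "The refund window is 30 days."


question = "How long is the refund window?"
grounded_ok = follows_injected_context(grounded_stub, question, INJECTED, "45 days")
parametric_ok = follows_injected_context(parametric_stub, question, INJECTED, "45 days")
```

Token overlap misses paraphrases and negations, so it is best treated as a cheap first filter before a stronger check.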

End-to-end: does the final answer correctly answer the question? Golden Q&A pairs with expected answers allow accuracy scoring. For open-ended questions, use an LLM-as-judge with a rubric covering groundedness and correctness.
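Both end-to-end approaches can be sketched briefly. The example below scores golden pairs by checking that the normalized expected answer appears as a whole phrase in the system's answer, and wraps a judge call behind a plain callable. The rubric wording, the 1 to 5 scale, the reply format, and the stub functions are assumptions for illustration; no particular judge API is implied.

```python
import re
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def golden_accuracy(answer_fn: Callable[[str], str],
                    golden: List[Dict[str, str]]) -> float:
    """Share of golden questions whose expected phrase appears in the answer."""
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = 0
    for item in golden:
        answer = normalize(answer_fn(item["question"]))
        expected = normalize(item["expected"])
        # Pad with spaces so "45 days" does not match inside "145 days".
        if f" {expected} " in f" {answer} ":
            hits += 1
    return hits / len(golden)


# Hypothetical rubric; the wording and scale are illustrative only.
JUDGE_RUBRIC = """Rate the answer from 1 to 5 on each criterion.
groundedness: every claim is supported by the context.
correctness: the answer resolves the question.

Question: {question}
Context: {context}
Answer: {answer}

Reply as:
groundedness: <1-5>
correctness: <1-5>"""


def judge_answer(judge: Callable[[str], str], question: str,
                 context: str, answer: str) -> Dict[str, int]:
    """Send the rubric to a judge callable and parse the two scores."""
    reply = judge(JUDGE_RUBRIC.format(question=question, context=context, answer=answer))
    scores = {}
    for name in ("groundedness", "correctness"):
        match = re.search(rf"{name}\s*:\s*([1-5])", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"judge reply is missing a score for {name}")
        scores[name] = int(match.group(1))
    return scores


# Hypothetical golden set and stubbed system answers.
GOLDEN = [
    {"question": "How long is the refund window?", "expected": "45 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
]
STUB_ANSWERS = {
    "How long is the refund window?": "The refund window is 45 days.",
    "Which plan includes SSO?": "SSO is included in the Team plan.",
}

accuracy = golden_accuracy(lambda q: STUB_ANSWERS[q], GOLDEN)
scores = judge_answer(lambda prompt: "groundedness: 5\ncorrectness: 4",
                      "How long is the refund window?",
                      "The refund window is 45 days from the delivery date.",
                      "The refund window is 45 days.")
```

Passing the judge in as a callable keeps the parsing logic testable without a live model; the judge's own reliability still needs to be spot-checked against human labels.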

Failure mode testing: empty retrieval (when no relevant documents are found, does the model correctly say "I don't know"?) and conflicting context (when two retrieved documents contradict each other, does the model acknowledge the conflict rather than silently picking one?). See RAG agents and observability.
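These two failure modes can be written as small checks against the generation step. In the sketch below, the `generate(question, docs)` signature, the abstention phrases, and the stub models are hypothetical. The pass criterion for conflicting context (mention both values or abstain) is also an assumption; the right behaviour depends on the product.

```python
from typing import Callable, List

# Illustrative abstention phrases; a real suite would tune these to the product.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not enough information")


def is_abstention(answer: str) -> bool:
    """True if the answer contains one of the known abstention phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)


def check_empty_retrieval(generate: Callable[[str, List[str]], str],
                          question: str) -> bool:
    """With no documents retrieved, the model should abstain."""
    return is_abstention(generate(question, []))


def check_conflicting_context(generate: Callable[[str, List[str]], str],
                              question: str, doc_a: str, doc_b: str,
                              value_a: str, value_b: str) -> bool:
    """With contradictory documents, pass if both values are surfaced or the model abstains."""
    answer = generate(question, [doc_a, doc_b])
    mentions_both = value_a in answer and value_b in answer
    return mentions_both or is_abstention(answer)


def careful_stub(question: str, docs: List[str]) -> str:
    # Hypothetical model that abstains on no context and flags disagreement.
    if not docs:
        return "I don't know based on the available documents."
    if len(docs) > 1:
        return "The sources disagree: one says 30 days, another says 45 days."
    return docs[0]


def overconfident_stub(question: str, docs: List[str]) -> str:
    # Hypothetical model that always answers with a single value.
    return "The refund window is 30 days."


question = "How long is the refund window?"
doc_a = "The refund window is 30 days."
doc_b = "The refund window is 45 days."

careful_results = (
    check_empty_retrieval(careful_stub, question),
    check_conflicting_context(careful_stub, question, doc_a, doc_b, "30 days", "45 days"),
)
overconfident_results = (
    check_empty_retrieval(overconfident_stub, question),
    check_conflicting_context(overconfident_stub, question, doc_a, doc_b, "30 days", "45 days"),
)
```

Phrase matching is brittle, so the same checks could be backed by a judge model once the expected abstention behaviour is agreed.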

// WHAT INTERVIEWERS LOOK FOR

Three layers: retrieval (recall@K), groundedness, end-to-end accuracy. Retrieval as the dominant failure mode. Empty retrieval and conflicting context as specific failure mode tests.