RAG Evaluation
// Definition
Measuring a Retrieval-Augmented Generation system on two axes that a plain answer-check misses: retrieval quality (did it fetch the right context?) and faithfulness (is the answer grounded in that context, or hallucinated despite it?). A RAG system can retrieve perfectly and still hallucinate, or answer correctly from the wrong source — so both must be scored separately.
// Why it matters
Teams often ship a RAG system and test only the final answer, so they miss cases where retrieval is broken or the model is ignoring its context. RAG evaluation isolates which stage fails, so you fix the retriever or the prompt rather than guessing. It's the difference between "the chatbot is sometimes wrong" and "the retriever misses 30% of relevant docs."
// How to test
// Score retrieval AND faithfulness separately, per query.
for (const { query, goldDocs, goldAnswer } of ragEvalSet) {
  const { retrieved, answer } = await ragSystem(query)
  // 1. Retrieval: did we fetch the right context?
  // Count the gold docs that were found, so duplicate chunks cannot push recall above 1.
  const retrievedIds = new Set(retrieved.map((d) => d.id))
  const recall = goldDocs.filter((id) => retrievedIds.has(id)).length / goldDocs.length
  expect(recall, `retrieval recall: ${query}`).to.be.gte(0.8)
  // 2. Faithfulness: is the answer grounded in retrieved context, not invented?
  expect(await isGrounded(answer, retrieved), `faithfulness: ${query}`).to.be.true
}
// Common mistakes
- Testing only the final answer, blind to whether retrieval or generation failed
- No faithfulness check — accepting a fluent answer that ignored the context
- Evaluating on questions whose answers the base model already knows (RAG adds nothing there — test on context-dependent queries)
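The faithfulness check named in the list above can be approximated in many ways. The sketch below is one crude option, assuming each retrieved chunk exposes a `text` field (an assumption, not something the entry specifies): it flags an answer as ungrounded when any sentence shares too few content words with every retrieved chunk. The names `RetrievedChunk`, `supportScore`, and `isGroundedLexical`, and the 0.6 threshold, are illustrative. Word overlap misses paraphrases and contradictions, so in practice an entailment model or an LLM judge would likely stand in for it.

```typescript
// Hypothetical shape of a retrieved chunk; the `text` field is an assumption.
interface RetrievedChunk {
  id: string;
  text: string;
}

// Lowercase, split on non-alphanumerics, and keep only tokens longer than 3 chars.
function contentTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 3);
}

// Fraction of a sentence's content tokens found in its best-matching chunk.
function supportScore(sentence: string, chunks: RetrievedChunk[]): number {
  const sentenceTokens = contentTokens(sentence);
  if (sentenceTokens.length === 0) return 1; // nothing checkable in this sentence
  let best = 0;
  for (const chunk of chunks) {
    const chunkTokens = new Set(contentTokens(chunk.text));
    const hits = sentenceTokens.filter((token) => chunkTokens.has(token)).length;
    best = Math.max(best, hits / sentenceTokens.length);
  }
  return best;
}

// An answer counts as grounded only if every sentence clears the threshold.
function isGroundedLexical(
  answer: string,
  chunks: RetrievedChunk[],
  threshold = 0.6, // illustrative value, tune on labelled examples
): boolean {
  const sentences = answer
    .split(/[.!?]+(?:\s+|$)/)
    .filter((sentence) => sentence.trim().length > 0);
  return sentences.every((sentence) => supportScore(sentence, chunks) >= threshold);
}
```

Scoring per sentence rather than per answer is deliberate: one invented sentence inside an otherwise supported answer should still fail the check.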
// Related terms
Retrieval-Augmented Generation (RAG)
A pattern where an LLM is given relevant context retrieved from an external source (a vector database, a search index, a document store) before being asked to generate an answer. The LLM doesn't 'know' the answer from training — it reads what was retrieved and synthesises a response. RAG is how chatbots answer questions about your company's docs without those docs being baked into the model. From a QA perspective, RAG systems have two failure surfaces: retrieval (did the system find the right context?) and generation (did the LLM use the context faithfully, or did it hallucinate?). Testing must cover both, separately.
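One way to act on the two failure surfaces is to attribute each failed query to a stage. The sketch below assumes per-query results that already carry a retrieval recall and a faithfulness verdict. The `QueryResult` shape, the function names, and the 0.8 recall threshold are all illustrative assumptions.

```typescript
type Stage = "ok" | "retrieval" | "generation";

// Hypothetical per-query record produced by an evaluation run.
interface QueryResult {
  query: string;
  recall: number; // fraction of gold docs retrieved, between 0 and 1
  grounded: boolean; // whether the faithfulness check passed
}

// Blame retrieval first: generation cannot be judged fairly on bad context.
function failingStage(result: QueryResult, minRecall = 0.8): Stage {
  if (result.recall < minRecall) return "retrieval";
  if (!result.grounded) return "generation";
  return "ok";
}

// Count how many queries fall into each stage.
function stageBreakdown(results: QueryResult[], minRecall = 0.8): Record<Stage, number> {
  const counts: Record<Stage, number> = { ok: 0, retrieval: 0, generation: 0 };
  for (const result of results) {
    counts[failingStage(result, minRecall)] += 1;
  }
  return counts;
}
```

A breakdown dominated by `retrieval` points at the index, chunking, or embeddings. One dominated by `generation` points at the prompt or the model.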
Embedding
A numerical vector representation of text (or images, or audio) that captures meaning in a way machines can compare. Two sentences with similar meaning produce embeddings that are close together in vector space. Embeddings power retrieval in RAG systems, semantic search, and clustering. In QA work, knowing about embeddings matters because they determine what gets retrieved in a RAG pipeline — and bad retrieval is one of the most common reasons AI products give wrong answers.
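"Close together in vector space" is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. The `EmbeddedDoc` type and the `topK` helper are hypothetical names.

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical document record holding a precomputed embedding.
interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Return the ids of the k documents most similar to the query vector.
function topK(queryVector: number[], docs: EmbeddedDoc[], k: number): string[] {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.id);
}

// Toy vectors, not real embeddings.
const toyDocs: EmbeddedDoc[] = [
  { id: "refund-policy", vector: [0.9, 0.1, 0.0] },
  { id: "shipping-times", vector: [0.1, 0.9, 0.2] },
  { id: "careers-page", vector: [0.0, 0.1, 1.0] },
];
const toyQuery = [1.0, 0.2, 0.0];
const nearest = topK(toyQuery, toyDocs, 2);
```

Whatever lands in the top k is all the generator sees, which is why a weak embedding model shows up downstream as wrong answers.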
Hallucination
When an AI model generates output that is fluent and confident but factually wrong or unsupported by its sources. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug; they're a consequence of how language models work, predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
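Treating function names as untrusted can be partly automated. The sketch below extracts the method names that generated code calls on a given receiver and reports any that do not exist on the real object. `referencedMethods` and `missingMethods` are hypothetical helpers, the regex is a rough heuristic rather than a parser, and `receiver` is assumed to be a plain identifier.

```typescript
// Collect distinct method names called as `<receiver>.<name>(` in a code string.
function referencedMethods(code: string, receiver: string): string[] {
  const pattern = new RegExp(`\\b${receiver}\\.(\\w+)\\s*\\(`, "g");
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(code)) !== null) {
    if (!names.includes(match[1])) names.push(match[1]);
  }
  return names;
}

// Return the referenced names that are not functions on the real API object.
function missingMethods(api: object, referenced: string[]): string[] {
  const lookup = api as Record<string, unknown>;
  return referenced.filter((name) => typeof lookup[name] !== "function");
}

// Example: generated code that calls a JSON method which does not exist.
const generatedCode = "const data = JSON.parse(raw); JSON.validate(data); JSON.stringify(data);";
const invented = missingMethods(JSON, referencedMethods(generatedCode, "JSON"));
```

Here `invented` contains only `validate`, since `JSON.parse` and `JSON.stringify` are real. A check like this catches invented names only; it says nothing about whether a real method is used correctly, so running the generated code still matters.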
Eval Set
A curated collection of input/expected-output pairs used to measure an LLM system's quality on each change — the AI equivalent of a regression suite. Because model output is non-deterministic, you score the system against the whole set (pass rate, not a single exact match), which turns "did the prompt change help?" into a measurable answer instead of a vibe.
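Scoring against the whole set might look like the sketch below. It assumes a synchronous `system` function and a simple containment scorer to keep the example short; a real RAG system would be asynchronous and would probably need a more forgiving scorer. `EvalCase`, `passRate`, and `containsExpected` are illustrative names.

```typescript
// Hypothetical eval case: an input and the text a good answer should contain.
interface EvalCase {
  input: string;
  expected: string;
}

type Scorer = (output: string, expected: string) => boolean;

// Case-insensitive containment: tolerant of extra wording around the answer.
const containsExpected: Scorer = (output, expected) =>
  output.toLowerCase().includes(expected.toLowerCase());

// Fraction of cases the system passes, between 0 and 1.
function passRate(
  cases: EvalCase[],
  system: (input: string) => string,
  passes: Scorer,
): number {
  if (cases.length === 0) {
    throw new Error("eval set is empty");
  }
  const passed = cases.filter((c) => passes(system(c.input), c.expected)).length;
  return passed / cases.length;
}
```

Running `passRate` on the same cases before and after a prompt change gives two numbers to compare. Because output is non-deterministic, small differences on a small set may be noise, so repeated runs or a larger set give a steadier signal.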