Retrieval-Augmented Generation (RAG)

AI & LLM Testing

// Definition

A pattern where an LLM is given relevant context retrieved from an external source (a vector database, a search index, a document store) before being asked to generate an answer. The LLM doesn't 'know' the answer from training — it reads what was retrieved and synthesises a response. RAG is how chatbots answer questions about your company's docs without those docs being baked into the model. From a QA perspective, RAG systems have two failure surfaces: retrieval (did the system find the right context?) and generation (did the LLM use the context faithfully, or did it hallucinate?). Testing must cover both, separately.
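
The two failure surfaces can be checked independently of each other. The sketch below is a toy illustration, not a real pipeline: the three-document corpus, the keyword-overlap `retrieve` function, and the token-overlap `is_grounded` check are all hypothetical stand-ins. A production system would typically use a vector store for retrieval and a stronger faithfulness metric for generation.

```python
import re

# Hypothetical three-document corpus standing in for a company knowledge base.
CORPUS = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "Hardware is covered by a 2 year limited warranty.",
}


def tokens(text):
    """Lowercase word and number tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, k=1):
    """Toy retriever: rank documents by how many query tokens they share."""
    q = tokens(query)
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(q & tokens(CORPUS[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def is_grounded(answer, context, threshold=0.8):
    """Crude faithfulness check: share of answer tokens found in the context."""
    a = tokens(answer)
    if not a:
        return False
    return len(a & tokens(context)) / len(a) >= threshold


# Retrieval test: did the right document come back? No LLM is involved here.
retrieved = retrieve("How long do refunds take?")
retrieval_ok = retrieved == ["refunds"]

# Generation test: given known-good context, is each answer supported by it?
# The two answers are hand-written stand-ins for LLM output.
context = CORPUS["refunds"]
faithful_ok = is_grounded("Refunds are issued within 14 days.", context)
hallucinated_ok = is_grounded(
    "Refunds are issued within 30 days by bank transfer.", context
)
```

Keeping the two checks separate means a failing test points at one component: if `retrieval_ok` is false the retriever is at fault, and if a grounded-context answer fails `is_grounded` the generation step is.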

// Related terms

Embedding
A numerical vector representation of text (or images, or audio) that captures meaning in a way machines can compare. Two sentences with similar meaning produce embeddings that are close together in vector space. Embeddings power retrieval in RAG systems, semantic search, and clustering. In QA work, knowing about embeddings matters because they determine what gets retrieved in a RAG pipeline — and bad retrieval is one of the most common reasons AI products give wrong answers.
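As a rough illustration of "close together in vector space", the sketch below compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are made up for the example; a real embedding model would produce them, usually with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (an assumption for illustration, not model output).
EMBEDDINGS = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login credentials": [0.8, 0.2, 0.1],
    "What is the refund policy?": [0.0, 0.2, 0.9],
}

query = "How do I reset my password?"

# Score every other sentence against the query and pick the closest one.
scores = {
    text: cosine_similarity(EMBEDDINGS[query], vec)
    for text, vec in EMBEDDINGS.items()
    if text != query
}
nearest = max(scores, key=scores.get)
```

With these vectors the login sentence scores far higher than the refund sentence, which is the behaviour a retrieval test would assert on.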
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (a word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything; they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.
Hallucination
When an AI model generates output that is fluent and confident but factually wrong or unsupported by its input. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug in the usual sense; they're a consequence of how language models work: predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
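
One way to treat generated function names as untrusted is to check them against the real module before running anything. The sketch below is a minimal, assumption-laden example: `find_missing_attributes` is a hypothetical helper, it only handles plain `import module` statements, and it imports each named module in order to inspect it, so it should only be pointed at modules you already trust.

```python
import ast
import importlib


def find_missing_attributes(generated_code):
    """Return 'module.attr' names the code uses that the real module lacks."""
    tree = ast.parse(generated_code)

    # Collect `import x` / `import x as y` statements: alias -> module name.
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name

    missing = []
    for node in ast.walk(tree):
        # Look for `alias.something` where alias is an imported module.
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in imported
        ):
            module_name = imported[node.value.id]
            module = importlib.import_module(module_name)
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return missing


# Hypothetical LLM output: json.loads is real, json.load_string is invented
# here to play the part of a hallucinated function.
generated = """
import json
data = json.loads('{"ok": true}')
other = json.load_string('{"ok": true}')
"""

problems = find_missing_attributes(generated)
```

A check like this only catches names that do not exist. It says nothing about whether a real function is called correctly, so running the generated code against tests is still needed.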