Hallucination

AI & LLM Testing · intermediate

// Definition

When an AI model generates output that is fluent, confident, and wrong or unsupported. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug in the usual sense; they follow from how language models work: predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
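One way to treat generated function names as untrusted is to check them against the real object before running anything. The sketch below is a minimal illustration, not a full static analysis: `findUnknownMethods` is a hypothetical helper, it only matches simple `object.method(` calls with a regular expression, and `JSON.parseSafe` is a made-up method standing in for something a model might invent.

```javascript
// Hypothetical helper: list the methods that generated code calls on
// `objectName` but that do not exist as functions on the real object.
// Assumes `objectName` is a plain identifier (no regex special characters).
function findUnknownMethods(generatedCode, objectName, realObject) {
  const callPattern = new RegExp(`\\b${objectName}\\.([A-Za-z_$][\\w$]*)\\s*\\(`, 'g');
  const unknown = new Set();
  for (const match of generatedCode.matchAll(callPattern)) {
    const method = match[1];
    if (typeof realObject[method] !== 'function') {
      unknown.add(method);
    }
  }
  return [...unknown];
}

// Example input: JSON.parseSafe is invented for this illustration,
// JSON.stringify is real.
const generated = `
  const data = JSON.parseSafe(input);
  const text = JSON.stringify(data);
`;
const unknownMethods = findUnknownMethods(generated, 'JSON', JSON);
console.log(unknownMethods); // [ 'parseSafe' ]
```

A check like this only catches names that don't exist. It says nothing about whether a real method is called correctly, so running the generated code is still needed.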

// Why it matters

Hallucination is among a model's most dangerous failure modes because the output looks right. QA can't assert exact strings against a non-deterministic model, so testing shifts to grounding and evaluation: does the answer cite real sources, stay within the provided context, and pass an eval set rather than match a single golden string?

// How to test

// You can't assert exact text on a probabilistic model, so assert grounding instead.
cy.request({ method: 'POST', url: '/api/ask', body: { q: 'What is our refund window?' } })
  .then((res) => {
    // the answer must cite retrieved policy rather than arrive with no sources
    expect(res.body.sources, 'cited sources').to.have.length.greaterThan(0)
    // shape check only: the answer states a number of days; compare that number
    // against the cited source text to confirm it is actually grounded
    expect(res.body.answer, 'states a refund window').to.match(/\d+\s*days?/i)
  })
// Scale this with an eval set + LLM-as-judge, not one-off assertions.
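As a rough sketch of what an eval set can look like, the snippet below scores a batch of already-recorded responses against per-case checks and compares the pass rate to a threshold. Everything here is an assumption for illustration: the case ids, the `{ answer, sources }` response shape, the `scoreEvalSet` helper, and the 0.9 threshold. The checks are plain functions; an LLM-as-judge would replace or supplement them and is not shown.

```javascript
// Hypothetical eval cases: each has an id and a check on the response.
const evalCases = [
  {
    id: 'refund-window',
    // grounded answer: cites at least one source and states a number of days
    check: (res) => res.sources.length > 0 && /\d+\s*days?/i.test(res.answer),
  },
  {
    id: 'unknown-topic',
    // out-of-scope question: the model should decline rather than invent
    check: (res) => /don't know|not sure|no information/i.test(res.answer),
  },
];

// Score recorded responses (keyed by case id) and compare to a threshold.
function scoreEvalSet(cases, responsesById, threshold) {
  const failures = [];
  for (const c of cases) {
    const res = responsesById[c.id];
    const passed = res !== undefined && Boolean(c.check(res));
    if (!passed) failures.push(c.id);
  }
  const passRate = (cases.length - failures.length) / cases.length;
  return { passRate, failures, ok: passRate >= threshold };
}

// Made-up recorded responses, standing in for real API output.
const recorded = {
  'refund-window': {
    answer: 'You can request a refund within 30 days.',
    sources: [{ text: 'Refunds are accepted within 30 days of purchase.' }],
  },
  'unknown-topic': { answer: 'Our CEO was born in 1970.', sources: [] },
};

const report = scoreEvalSet(evalCases, recorded, 0.9);
console.log(report); // passRate 0.5, failures ['unknown-topic'], ok false
```

Because the model is non-deterministic, a pass-rate threshold over many cases tends to be more stable than requiring every individual case to pass on every run.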

// Common mistakes

  • Asserting exact output strings against a non-deterministic model (flaky by design)
  • No grounding check — accepting a confident answer with zero sources
  • One golden example instead of an eval set across many inputs
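To make the grounding check in the second bullet concrete, one simple heuristic is to require that every number in the answer also appears in the cited source text. This is a minimal sketch under assumptions: the `{ answer, sources: [{ text }] }` response shape and the `findUnsupportedNumbers` helper are hypothetical, and matching numbers is only a proxy for grounding, not proof of it.

```javascript
// Hypothetical helper: return numbers that appear in the answer but in none
// of the cited sources. Assumes each source has a `text` field.
function findUnsupportedNumbers(response) {
  const numberPattern = /\d+(?:\.\d+)?/g;
  const sourceText = response.sources.map((s) => s.text).join(' ');
  const sourceNumbers = new Set(sourceText.match(numberPattern) || []);
  const answerNumbers = response.answer.match(numberPattern) || [];
  return answerNumbers.filter((n) => !sourceNumbers.has(n));
}

// Made-up responses for illustration.
const policy = { text: 'Customers may request a refund within 30 days of purchase.' };
const groundedResponse = { answer: 'Refunds are accepted within 30 days.', sources: [policy] };
const inventedResponse = { answer: 'Refunds are accepted within 90 days.', sources: [policy] };

console.log(findUnsupportedNumbers(groundedResponse)); // []
console.log(findUnsupportedNumbers(inventedResponse)); // [ '90' ]
```

An answer with zero sources fails this check for any number it states, which covers the "confident answer with zero sources" case as well.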

// Related terms