Q1 of 21 · Testing AI systems
Why can't you use exact-match assertions when testing an LLM-powered feature?
Short answer
LLM output is non-deterministic by design: with sampling at a non-zero temperature, the same input can produce different text on each call. Exact-match assertions would fail intermittently, not because the feature is broken but because the phrasing changed.
Detail
An LLM given "summarise this article" might return "The article discusses climate policy." on one run and "This piece covers environmental regulation." on the next. Both can be acceptable summaries of the same article. An assertion like expect(output).toBe("The article discusses climate policy.") would fail on the second run. That failure is meaningless, and meaningless failures train the team to ignore test results.
The fix is to stop thinking about what the output IS and start thinking about what it MUST satisfy:
- Is the output a valid JSON object with the required fields?
- Is the length within acceptable bounds?
- Does it avoid banned content (PII, profanity, competitor brand names)?
- Is it grounded in the source document — no fabricated facts?
These are property checks that hold regardless of which valid phrasing the model chose. This shift — from exact-match to property-based — is the foundational conceptual change when moving from testing deterministic software to testing LLM outputs.
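The four checks above can be sketched as a single validation function. This is a minimal illustration rather than a complete test suite, and it rests on several assumptions that are not in the original: the model is asked to return JSON with `title` and `summary` fields, the 300-character limit is arbitrary, the banned patterns are placeholders (`AcmeCorp` is an invented competitor name), and the grounding check is only a crude proxy that flags numbers missing from the source. All function and constant names are hypothetical.

```typescript
// Assumed output contract: {"title": string, "summary": string}.
const REQUIRED_FIELDS = ["title", "summary"] as const;
const MAX_SUMMARY_CHARS = 300; // arbitrary bound chosen for illustration
const BANNED_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/, // crude email detector, standing in for real PII checks
  /\bAcmeCorp\b/i, // placeholder competitor brand name
];

// Returns a list of violated properties; an empty list means the output passes.
function checkSummaryProperties(raw: string, source: string): string[] {
  // Property 1: the output is a JSON object with the required string fields.
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["output is not a JSON object"];
  }
  const record = parsed as Record<string, unknown>;
  const violations: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof record[field] !== "string") {
      violations.push(`missing or non-string field: ${field}`);
    }
  }
  if (violations.length > 0) return violations;
  const summary = record["summary"] as string;

  // Property 2: the length is within acceptable bounds.
  if (summary.length === 0 || summary.length > MAX_SUMMARY_CHARS) {
    violations.push(`summary length out of bounds: ${summary.length}`);
  }

  // Property 3: no banned content.
  for (const pattern of BANNED_PATTERNS) {
    if (pattern.test(summary)) {
      violations.push(`banned content matched: ${pattern.source}`);
    }
  }

  // Property 4 (crude grounding proxy): every number in the summary
  // must also appear somewhere in the source text.
  const numbers = summary.match(/\d+(?:\.\d+)?/g) ?? [];
  for (const n of numbers) {
    if (!source.includes(n)) {
      violations.push(`number not found in source: ${n}`);
    }
  }
  return violations;
}

// Two different phrasings of the same summary are both checked the same way.
const articleText = "The 2030 climate policy targets a 40% cut in emissions.";
const runA = '{"title":"Climate policy","summary":"The article discusses climate policy."}';
const runB = '{"title":"Regulation","summary":"This piece covers environmental regulation."}';
console.log(checkSummaryProperties(runA, articleText));
console.log(checkSummaryProperties(runB, articleText));
```

In a test runner, the assertion would then be on the violation list (for example, that it is empty) rather than on the exact wording. A real grounding check usually needs more than string matching, such as an entailment model or a second LLM acting as a judge; the number check here only catches one obvious class of fabrication.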