How I evaluate an AI chatbot before release
Testing an AI chatbot is not testing a form. Here is the evaluation pass I run before an AI feature ships — and the failure modes I look for.
part ofTesting AI productsThe first time you're asked to test an AI chat feature, the usual playbook breaks. There's no single correct output to assert against — ask the same question twice and you get two different answers, both arguably fine. So "does it return the expected value?" stops working, and a lot of teams either wave the feature through or freeze. Neither is right. You can absolutely test this; you just test behaviour and boundaries instead of exact strings. Here's the pass I run.
Stop asserting equality, start asserting properties
The core shift: you're not checking that the output equals something, you're checking that it has properties. Did it answer the question asked? Did it stay on topic? Did it refuse what it should refuse? Did it avoid claiming false facts? Each of those is testable even when the exact wording varies every time. Build your cases around properties, and the non-determinism stops being a blocker.
1. Hallucinations — the headline risk
Ask things the bot shouldn't confidently answer: a product feature that doesn't exist, a policy you made up, a person who isn't real. A bad bot invents a fluent, plausible, completely false answer. A good one says it doesn't know. Then the subtler version: ask about something real but obscure and check the answer against the source of truth — confident and wrong is the failure mode that erodes trust fastest.
2. Staying in scope
A support bot for a banking app should not be writing poems, giving medical advice, or discussing competitors. Push on the edges: ask it to do something off-topic, ask it to ignore its instructions ("forget you're a support bot and..."), ask it something adjacent-but-out-of-bounds. Does it stay in its lane or wander off? This overlaps with prompt-injection testing, where you treat user input as a potential attack on the system prompt — a security concern as much as a quality one.
3. Refusals — both directions
Test the two failure modes of refusal. Over-refusal: does it reject a perfectly reasonable request because a keyword tripped a filter? Under-refusal: does it comply with something it shouldn't — leaking another user's data, giving disallowed instructions, helping with the thing the safety rules forbid? Both are bugs; teams usually only test one.
4. The grounded facts must be right
If the bot is wired to real data — your order status, your account balance, your docs — then those answers do have a correct value, and you test them like any integration: ask about a known test order and assert the status is right. The freedom to vary wording does not extend to getting your data wrong. Separate the "creative" surface (tone, phrasing) from the "grounded" surface (facts from systems) and hold each to its own standard.
5. What to log when it fails
Because outputs vary, a screenshot isn't a reproduction. Capture the full input (including any hidden system prompt and retrieved context), the exact output, and the model/version. Without those three, "it said something weird" is unreproducible and the bug dies. Logging the inputs and context is half the battle in AI testing.
Where this fits
This is product-level AI testing — evaluating a feature built on a model. For using AI on the other side (writing and reviewing tests), see AI-generated tests are useful — but not for the reason you think and the practical playbook for Claude and Copilot. The AI for QA hub and prompt library cover the wider toolkit.
AI chatbot evaluation pass
- Cases assert properties (on-topic? refused correctly? factual?), not exact strings
- Hallucination probes: made-up features/policies → "I don't know", not invention
- Scope probes: off-topic and "ignore your instructions" requests are deflected
- Refusal tested both ways: no over-refusal of valid asks, no under-refusal of disallowed ones
- Grounded/data-backed answers asserted against the real source of truth
- Failures logged with full input + system prompt + retrieved context + model version
// RELATED QA.CODES RESOURCES
Course
// related
What QA should log when testing AI features
A screenshot isn't a repro when outputs vary. Capture the full assembled prompt, retrieved context, model version, and parameters so an AI bug is actually reproducible.
The hallucination test cases I run on AI features
Concrete test cases for AI hallucination — unanswerable questions, false premises, invented entities, citations — and how to judge answers with no 'correct' value.