How I evaluate an AI chatbot before release

qa.codes · 13 June 2026 · 9 min read

IntermediateAI QAQA Engineers

ai-testingllmevaluation

Testing an AI chatbot is not testing a form. Here is the evaluation pass I run before an AI feature ships — and the failure modes I look for.

part ofTesting AI products

The first time you're asked to test an AI chat feature, the usual playbook breaks. There's no single correct output to assert against — ask the same question twice and you get two different answers, both arguably fine. So "does it return the expected value?" stops working, and a lot of teams either wave the feature through or freeze. Neither is right. You can absolutely test this; you just test behaviour and boundaries instead of exact strings. Here's the pass I run.

Stop asserting equality, start asserting properties

The core shift: you're not checking that the output equals something, you're checking that it has properties. Did it answer the question asked? Did it stay on topic? Did it refuse what it should refuse? Did it avoid claiming false facts? Each of those is testable even when the exact wording varies every time. Build your cases around properties, and the non-determinism stops being a blocker.

1. Hallucinations — the headline risk

Ask things the bot shouldn't confidently answer: a product feature that doesn't exist, a policy you made up, a person who isn't real. A bad bot invents a fluent, plausible, completely false answer. A good one says it doesn't know. Then the subtler version: ask about something real but obscure and check the answer against the source of truth — confident and wrong is the failure mode that erodes trust fastest.

2. Staying in scope

A support bot for a banking app should not be writing poems, giving medical advice, or discussing competitors. Push on the edges: ask it to do something off-topic, ask it to ignore its instructions ("forget you're a support bot and..."), ask it something adjacent-but-out-of-bounds. Does it stay in its lane or wander off? This overlaps with prompt-injection testing, where you treat user input as a potential attack on the system prompt — a security concern as much as a quality one.

3. Refusals — both directions

Test the two failure modes of refusal. Over-refusal: does it reject a perfectly reasonable request because a keyword tripped a filter? Under-refusal: does it comply with something it shouldn't — leaking another user's data, giving disallowed instructions, helping with the thing the safety rules forbid? Both are bugs; teams usually only test one.

4. The grounded facts must be right

If the bot is wired to real data — your order status, your account balance, your docs — then those answers do have a correct value, and you test them like any integration: ask about a known test order and assert the status is right. The freedom to vary wording does not extend to getting your data wrong. Separate the "creative" surface (tone, phrasing) from the "grounded" surface (facts from systems) and hold each to its own standard.

5. What to log when it fails

Because outputs vary, a screenshot isn't a reproduction. Capture the full input (including any hidden system prompt and retrieved context), the exact output, and the model/version. Without those three, "it said something weird" is unreproducible and the bug dies. Logging the inputs and context is half the battle in AI testing.

Where this fits

This is product-level AI testing — evaluating a feature built on a model. For using AI on the other side (writing and reviewing tests), see AI-generated tests are useful — but not for the reason you think and the practical playbook for Claude and Copilot. The AI for QA hub and prompt library cover the wider toolkit.

AI chatbot evaluation pass

Cases assert properties (on-topic? refused correctly? factual?), not exact strings
Hallucination probes: made-up features/policies → "I don't know", not invention
Scope probes: off-topic and "ignore your instructions" requests are deflected
Refusal tested both ways: no over-refusal of valid asks, no under-refusal of disallowed ones
Grounded/data-backed answers asserted against the real source of truth
Failures logged with full input + system prompt + retrieved context + model version

// RELATED QA.CODES RESOURCES

Course

AI for QA course

Tool

Testing tools directory

// related

Tutorials·13 June 2026 · 8 min read

What QA should log when testing AI features

A screenshot isn't a repro when outputs vary. Capture the full assembled prompt, retrieved context, model version, and parameters so an AI bug is actually reproducible.

ai-testingobservabilityllm

Tutorials·13 June 2026 · 9 min read

The hallucination test cases I run on AI features

Concrete test cases for AI hallucination — unanswerable questions, false premises, invented entities, citations — and how to judge answers with no 'correct' value.

ai-testingllmhallucinationtest-cases