How I evaluate an AI chatbot before release
A practical evaluation pass for AI chat features: hallucinations, refusals, prompt injection, and the cases with no single right answer.
AI writes plausible Playwright tests that pass for the wrong reasons. Here is the review checklist that catches them.
LLMs can't reliably separate instructions from data, so user input can hijack the model. Covers direct and indirect injection, what to check for, and how to report findings safely as a QA engineer.
A screenshot isn't a repro when outputs vary. Capture the full assembled prompt, retrieved context, model version, and parameters so an AI bug is actually reproducible.
Concrete test cases for AI hallucination (unanswerable questions, false premises, invented entities, citations) and how to judge answers with no 'correct' value.
Get the speed of an AI agent on your test repo without the mess: work on a branch, review every change like a junior's PR, and make tests fail first to catch assert-nothing tests.
AI covers the expected cases fast and misses the suspicion-driven ones that catch bugs. Division of labour: let it handle breadth of the predictable; you handle the unexpected.