Testing AI systems

// 21 QUESTIONS · UPDATED MAY 2026

Interview questions on QA for non-deterministic AI features: why exact-match assertions break, output property checks, LLM-as-judge, golden eval sets and regression detection, layered test strategies for AI systems, failure modes (hallucination, prompt injection, drift, bias), testing agentic decision chains, eval harness design, human-in-the-loop decisions, and regulatory obligations for high-risk AI.

Level

Showing 21 of 21 questions

Why can't you use exact-match assertions when testing an LLM-powered feature?Junior
LLMs produce different text on every call even with the same input — temperature and sampling mean output varies by design. Exact-match a…
How would you test a feature powered by an LLM, given the same input can produce different outputs?Mid
Clarify the feature's job first, then layer: deterministic parts get normal unit tests; LLM outputs get property checks (schema, format,…
What are output property checks and how do you use them to test LLM responses?Mid
Property checks test invariants that must hold on every valid output regardless of phrasing: required JSON fields exist, response length…
What is LLM-as-judge and how does it work?Mid
LLM-as-judge uses a second language model to evaluate the output of the first against a rubric. It scales quality evaluation to volumes i…
How do you build a golden eval set for an LLM feature and use it to detect regressions?Mid
Curate 50–200 representative input/quality-expectation pairs spanning happy paths, edge cases, and known failure modes. Baseline the curr…
How do you test the deterministic parts of an LLM-powered system separately?Mid
Isolate the non-LLM layers — input parsing, routing, retrieval, output formatting, error handling — and test them with standard unit and…
What is prompt injection and how do you test for it?Mid
Prompt injection is an attack where malicious input overrides or hijacks the system prompt, causing the model to ignore its instructions…
How do you detect and test for hallucination in an LLM feature?Mid
Provide inputs with known ground truth and check whether the model's output contradicts or fabricates beyond that truth. For RAG features…
What does a layered test strategy look like for an AI system?Mid
Unit tests for deterministic logic, component tests per LLM call (mock the model, test the surrounding code), integration tests for the f…
How do you test an agentic system that makes tool calls and takes multi-step actions?Senior
Test each tool in isolation with unit tests, test the agent's decision logic by providing known state and verifying it selects the right…
How do you test for model or prompt-version drift when the underlying LLM changes?Senior
Re-run your golden eval set against the new model or prompt version and compare aggregate quality scores using a significance test. A sta…
How do you structure a red-teaming exercise for an LLM-powered product?Senior
Red-team with a defined harm scope, a structured attack taxonomy (prompt injection, jailbreaks, bias elicitation, data extraction, misuse…
How do you validate the reliability of an LLM-as-judge setup?Senior
Calibrate the judge against a human-rated sample of 100–200 examples. Measure agreement using Cohen's kappa or Spearman correlation. A ju…
What is statistically significant regression in an eval context and how do you detect it?Senior
A statistically significant regression is a quality score drop unlikely to be explained by sampling noise. Use a binomial or paired t-tes…
How do you test for bias and fairness in an AI feature?Senior
Construct demographically paired inputs that differ only on protected attributes (name, gender, race, nationality) and measure whether th…
How do you test an AI feature that uses retrieval-augmented generation (RAG)?Senior
Test each layer independently: retrieval quality (does the right context come back for a given query?), groundedness (does the generated…
How do you replay and sandbox an agent's decision chain for debugging?Senior
Record the full input state, tool responses, and model decisions at each step as a structured trace. A replay environment loads the trace…
What does an eval harness look like and how do you build a minimal one?Senior
An eval harness loads a golden eval set, runs each input through the system under test, evaluates each output against defined criteria (p…
How do you decide when to use human-in-the-loop for a high-stakes AI feature?Lead
Use human-in-the-loop when the cost of an incorrect AI decision — in financial loss, safety risk, reputational harm, or regulatory exposu…
How do you approach regulatory and compliance testing for a high-risk AI system?Lead
Map the system to applicable frameworks (EU AI Act risk tiers, NIST AI RMF, sector-specific regulation), identify which testing obligatio…
How do you scale eval coverage without re-running every prompt against every model change?Lead
Maintain a tiered eval set: a small fast tier for every change, a medium tier for pre-release, and the full set for major model or archit…

// Continue exploring

Mobile QA

Appium architecture, real devices vs emulators, locator strategies, gestures, CI for mobile.

Accessibility

WCAG, ARIA, screen readers, contrast, keyboard navigation, axe-core, CI integration.

Security

OWASP Top 10 for QA, injection, XSS, access control, session security, SAST/DAST, CI.