// Interview Prep/Industry Questions/AI Product QA

🟦 AI Product QA

8 questions · full model answers. Non-determinism, safety guardrails, and model drift — testing products where there's no single correct output and the model changes under you.

// What they weigh

What a AI Product QA interviewer is actually probing for — beyond generic QA.

  • 01

    Non-determinism and evaluation

    There's no single correct output, so you can't assert equality. Interviewers want eval sets, rubrics, and property-based checks instead of brittle string matches.

  • 02

    Quality and safety guardrails

    Hallucination, prompt injection, and jailbreaks are core risks. They listen for groundedness checks and adversarial testing of the guardrails.

  • 03

    Model and data drift over time

    A frozen test set rots as models, prompts, and data change. Strong candidates test for regression after version bumps and monitor drift, cost, and latency.

// Junior · 1

How do you test a feature whose output isn't deterministic — the same input can produce different text each time?

Junior

Stop asserting equality. Assert properties the output must hold (format, grounding, contains required facts, excludes forbidden content) and evaluate against a graded rubric rather than a fixed string.

// What interviewers look for

The mental shift from exact-match assertions to property- and eval-based checks: you test characteristics of a good answer, not one canonical answer.

Common pitfall

Trying to assert the output equals a fixed expected string, producing a flaky test that fails on every valid rewording.

Model answer

The core shift is that there's no single correct output, so exact-match assertions are wrong by construction — they'd flag every valid rephrasing as a failure. Instead I assert properties a good answer must satisfy: structural ones (valid JSON, required fields, length bounds), content ones (mentions the required facts, stays grounded in the provided context, doesn't include forbidden or unsafe content), and behavioural ones (refuses out-of-scope requests). For subjective quality I use a graded rubric — often an LLM-as-judge or human rating against criteria — and assert a score threshold across an eval set rather than pass/fail on one example. I'd run multiple samples per input to characterise the distribution, since one input has a range of outputs, and watch variance, not just the mean. I'd also pin what should be deterministic (temperature 0 for extraction tasks) so I'm only tolerating non-determinism where it's inherent. The principle is testing the qualities of a correct answer, not a canonical answer.

non-determinismevaluationproperty-basedrubric

// Mid-level · 3

How would you build an evaluation set for an LLM-powered feature?

Mid-level

Curate a representative, versioned set of inputs with graded success criteria covering typical, edge, and adversarial cases — and don't use real user prompts without consent.

// What interviewers look for

That an eval set is the test suite for an AI feature: representative coverage, clear grading, versioning, and responsible data sourcing.

Common pitfall

Building a tiny happy-path set, or pulling real production prompts into the eval without consent — a privacy issue and a skewed sample.

Model answer

An eval set is the AI feature's test suite, so I'd treat curation seriously. Coverage: typical cases that represent real usage, edge cases (ambiguous inputs, long context, mixed languages, empty/garbage), and adversarial cases (injection attempts, requests for unsafe content). Each item has a graded success criterion — a reference answer, a rubric, or a checkable property — so scoring is consistent. I'd version the set and keep it frozen as a baseline so I can detect regression across model/prompt changes, while growing it as new failure modes appear in production. On data sourcing I'm careful: I don't pull real user prompts into the eval without consent, both for privacy and because raw production data can be unrepresentative or sensitive; I prefer synthetic or consented, de-identified examples. I'd balance the set so no single category dominates the aggregate score, and report per-category results, not just an overall number, so a regression in one slice isn't masked. The eval set plus its grading is the artifact everything else measures against.

evaluationeval setcoverageconsent

How do you test for hallucination in a RAG (retrieval-augmented generation) answer?

Mid-level

Assert groundedness: every claim in the answer should be supported by the retrieved context, citations should resolve to real sources, and the model should decline when the context doesn't contain the answer.

// What interviewers look for

Understanding that hallucination is a grounding failure: the answer must be traceable to retrieved context, and 'I don't know' is a correct output when context is absent.

Common pitfall

Checking only whether the answer sounds right, instead of verifying each claim is actually supported by the retrieved documents.

Model answer

Hallucination in RAG is a grounding failure, so I test whether the answer is actually supported by what was retrieved. For each answer I'd verify claim-level groundedness — every factual statement traces to a passage in the retrieved context — often using an automated faithfulness judge plus spot human checks. Citations must resolve to real, relevant sources, not fabricated or mismatched ones. A crucial case: when the retrieved context doesn't contain the answer, the correct behaviour is to decline or say it doesn't know, not to invent something, so I seed questions whose answer is absent from the corpus and assert a graceful refusal. I'd test context the model should ignore (irrelevant retrieved passages) and conflicting sources. I'd also separate the failure modes — was it bad retrieval (the right doc wasn't fetched) or bad generation (the doc was there but ignored) — because the fix differs. The assertion is faithfulness to retrieved context, with honest abstention as a valid answer.

hallucinationraggroundingfaithfulness

How do you test prompt-injection and jailbreak resistance?

Mid-level

Feed adversarial inputs that try to override the system prompt, leak it, or elicit forbidden output, and assert the guardrails hold — including injection delivered through retrieved or user-supplied content.

// What interviewers look for

Adversarial testing of the safety boundary: direct and indirect injection, system-prompt leakage, and jailbreak patterns — treating the guardrail as the thing under test.

Common pitfall

Only testing direct, obvious attacks and missing indirect injection — malicious instructions embedded in a document or webpage the model ingests.

Model answer

I'd treat the safety guardrail as the system under test and attack it like an adversary. Direct injection: inputs that say 'ignore previous instructions', attempts to extract or leak the system prompt, and role-play framings that try to bypass refusals. Indirect injection is the one people miss — malicious instructions hidden inside content the model ingests, like a retrieved document, a webpage, or a user-uploaded file — so I plant instructions in that data and assert the model treats it as data, not commands. I'd test for forbidden-output elicitation (unsafe, harmful, or policy-violating content) and assert consistent refusal, plus that refusals don't over-trigger on benign requests. I'd maintain an adversarial suite that grows as new jailbreak patterns emerge, and run it on every model/prompt change since resistance can regress. I'd also check that a successful injection can't escalate — e.g. trigger an unauthorised tool call or data access. The mindset is adversarial and ongoing, because the attack surface evolves.

prompt injectionjailbreaksafetysecurityadversarial

// Senior · 3

A model version or prompt is updated. How do you catch regressions introduced by the change?

Senior

Run the frozen eval suite against old and new and compare scores per category; assert no significant drop, and watch for drift on cases that previously passed.

// What interviewers look for

Treating a model/prompt bump like a code change that needs regression testing: a versioned baseline, A/B comparison, and per-slice analysis so a localized regression isn't hidden by the average.

Common pitfall

Eyeballing a few outputs and shipping, so a regression in a specific category is masked by an unchanged or improved overall average.

Model answer

I treat a model or prompt change exactly like a code change: it needs regression testing against a baseline. I'd run the frozen, versioned eval suite on both the current and the new version and compare scores per category, not just the aggregate, because a model can improve overall while regressing badly on one slice — and the average hides it. I'd flag any statistically meaningful drop and specifically diff the cases that previously passed and now fail. Because outputs are stochastic, I'd run multiple samples and compare distributions, not single runs, to avoid chasing noise. I'd watch the non-quality dimensions too — latency, token cost, and refusal rate can all regress on a version bump. For higher-stakes changes I'd do an A/B or shadow run in production with online metrics before full rollout. And I'd keep the eval set current so it reflects real failure modes. The discipline is baseline-versioned, per-slice regression with distributional comparison.

regressiondriftevalversioning

How do you test RAG retrieval quality separately from generation quality?

Senior

Evaluate the retriever on its own — precision/recall of fetching the right passages for a query — independently of what the LLM does with them, so you can localise failures.

// What interviewers look for

Decomposing the pipeline: retrieval and generation are separate stages with separate metrics, and isolating them tells you which one to fix.

Common pitfall

Only evaluating the final answer, so a retrieval failure and a generation failure look identical and you can't tell which component is broken.

Model answer

RAG is a two-stage pipeline, and evaluating only the final answer conflates two different failures, so I test the stages separately. For retrieval, I build a set of queries with known relevant documents and measure precision and recall — did the retriever fetch the right passages, in the right rank order, with sufficient recall to contain the answer — independently of generation. That isolates retrieval problems: bad embeddings, chunking, or indexing. For generation, I feed gold/ideal context and evaluate whether the model uses it faithfully, which isolates generation problems like ignoring context or hallucinating despite having the answer. With both, an end-to-end failure becomes diagnosable: if retrieval recall is low, fix retrieval; if context was present but the answer ignored it, fix generation. I'd also test the interaction — good retrieval with distracting passages, and the no-relevant-document case where the system should abstain. Decomposing the metrics is what makes the failures actionable.

ragretrievalprecision recalldecomposition

How do you test cost and latency as quality attributes of an AI feature?

Senior

Treat token cost and response latency as testable budgets: assert p95 latency and per-request cost stay within thresholds, and verify timeout/fallback behaviour under slow or failed model calls.

// What interviewers look for

That for AI products, cost and latency are first-class quality attributes with budgets and fallbacks — not just correctness.

Common pitfall

Focusing only on answer quality and ignoring that an accurate feature can be unshippable if it's too slow or too expensive per call.

Model answer

For AI products, an answer that's correct but too slow or too expensive is still a failed feature, so I test cost and latency as budgets with thresholds. Latency: I'd assert p50 and p95 response times against an SLA, measure how they scale with prompt/context length and output length, and test streaming so perceived latency is acceptable even when total time is high. Cost: I'd assert per-request token usage stays within budget, watch for prompts that balloon with long context or tool-call loops, and track aggregate cost under realistic traffic. Resilience: a slow or failed model call must hit a timeout and a defined fallback — a cached response, a smaller/faster model, or a graceful degraded message — rather than hanging, so I test the timeout and fallback paths explicitly. I'd also test rate-limit handling and retries with backoff. These become monitored SLOs in production, not just pre-release checks. The point is that cost and latency are quality attributes with budgets, fallbacks, and alarms, exactly like correctness.

costlatencyperformancefallback

// Lead · 1

Design a continuous-evaluation strategy for an AI product running in production.

Lead

Move beyond a static eval set: sample live traffic for ongoing grading, monitor quality/safety/cost/latency SLOs, alarm on drift, and keep humans in the loop on a sampled basis — framing it as a product-shipping discipline.

// What interviewers look for

A production-grade, ongoing evaluation system — online eval, drift detection, human-in-the-loop sampling, safety SLAs — owned as a continuous capability, not a one-time test pass.

Common pitfall

Relying solely on a frozen pre-release eval set, which goes stale as inputs, models, and the world change, so production quality silently degrades.

Model answer

A static pre-release eval set is necessary but not sufficient, because real inputs, the model, and the world all drift, so I'd build evaluation as a continuous production capability. I'd sample live traffic and grade it on a rolling basis — automated judges for scale plus human review on a sampled and on flagged cases — and track quality, safety, groundedness, cost, and latency as SLOs with dashboards and alarms. I'd add drift detection: shifts in input distribution, output quality, refusal rate, or user signals (thumbs-down, regenerations, escalations) trigger investigation. Safety gets tighter SLAs and fast alerting because a guardrail regression is high-harm. New production failure modes feed back into the frozen eval set so the offline suite keeps catching them. I'd guard rollouts with A/B or shadow evaluation before full ramp. Human-in-the-loop stays for ambiguous and high-stakes cases. I'd frame this to the team as a product-shipping discipline — owning quality over time — which is the complement to the technique-level depth in our Testing-AI-systems topic. The strategy is ongoing measurement, not a one-off gate.

strategycontinuous evaluationmonitoringdrifthuman-in-the-loop

// Go deeper

These questions pair with the in-depth AI Product QA QA guide — the risk areas, signature bugs, and test strategies the questions are drawn from.