Q2 of 21 · Testing AI systems

How would you test a feature powered by an LLM, given the same input can produce different outputs?

Testing AI systemsMidtesting-ai-systemsllmevaluationproperty-checkstest-strategy

Short answer

Short answer: Clarify the feature's job first, then layer: deterministic parts get normal unit tests; LLM outputs get property checks (schema, format, banned content, groundedness); LLM-as-judge handles quality rubrics; a golden eval set tracks regression. Manage risk statistically, not as single-run pass/fail.

Detail

Clarify first: what is this feature supposed to do? Is there a right answer, or a range? What's the cost of a bad output? A customer-facing summary where hallucinated facts could mislead users is a higher bar than an internal draft reviewed by a human before publishing.

Layer the tests:

  • Deterministic parts (input parsing, output formatting, routing logic) get normal unit tests.
  • Per-call output quality: property checks — required fields present, length within bounds, no PII in the response, claims grounded in the source.
  • Rubric-based quality: LLM-as-judge evaluates helpfulness, accuracy, and tone, sampled and spot-checked against humans.
  • Regression: a golden eval set of representative inputs. Flag statistically significant drops in quality score — not single-run noise.

Adversarial: prompt injection attempts, jailbreak probes, inputs designed to trigger hallucination.

Close: the eval harness is a first-class deliverable. Human-in-the-loop for high-stakes outputs. Manage risk statistically, not as a binary pass/fail. See Evaluation methods for the full taxonomy.

// WHAT INTERVIEWERS LOOK FOR

Clarify-first structure. Four-layer approach (unit, property, rubric, regression). Statistical thinking rather than single-run pass/fail. Adversarial testing. Eval harness as a deliverable.