
How does test design change for an AI/ML model output vs deterministic code?

Test design · Senior · ml-testing · ai · properties · metamorphic · senior

Short answer

Deterministic code: assert exact outputs, choose inputs with equivalence partitioning and boundary value analysis (EP/BVA), and measure branch coverage. ML models: assert *distributions* and *invariants* (the output stays in a valid range, moves monotonically in the expected direction, and is robust to small perturbations), monitor drift in production, and rely on property-based testing more than example-based testing.

Detail

Testing an ML model's output is fundamentally different because the model is a learned approximation, not a hand-written specification. Even when inference itself is repeatable, the right answer for a given input is usually probabilistic.

What changes:

  1. Assertions become invariants, not equalities. Deterministic: assert classify(image) == "cat". ML: assert classify(image).confidence > 0.5 for the obvious case; assert classify(rotated_image).top_class == classify(image).top_class for invariance under rotation.

  2. Test data becomes the test suite. A deterministic suite might have 50 hand-written test cases. An ML test suite has hundreds or thousands of input-output pairs (a labelled dataset), and the metric is aggregate (accuracy, F1, precision/recall by class), not per-case.

  3. EP/BVA become slice-based testing. Instead of "test one value per equivalence class", you test slices of the input distribution: model performance on rare classes, on minority demographic groups (fairness), on out-of-distribution inputs. Each slice has its own metrics.

  4. Property-based and metamorphic testing dominate. Properties: "the output should be monotonic in price." Metamorphic: "if I add a benign augmentation (resize, mild rotation), the output should not change drastically." These are testable invariants without knowing the exact correct answer.

  5. Robustness testing is first-class. Adversarial inputs: tiny perturbations to the input should not flip the output. This concern is largely specific to ML: deterministic code rarely has a comparable property to test.

  6. Production monitoring substitutes for some test coverage. Drift detection: are the input distribution and output distribution today different from training? Performance monitoring: is the model's accuracy in production stable? Many ML failures aren't catchable by pre-deployment tests; they only manifest under data drift.

  7. Test sets need refresh. The world changes; a 3-year-old test set may not reflect current production data.
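The invariants from points 1, 4, and 5 can be sketched as executable checks. This is a minimal illustration, not a definitive harness: `predict_purchase_probability` is a hypothetical stand-in for a trained model, and the step size, perturbation size, and tolerance are assumed values that a real project would tune.

```python
import math
import random

def predict_purchase_probability(price: float, rating: float) -> float:
    """Hypothetical stand-in for a trained model: returns P(purchase)."""
    # Logistic score: a higher rating helps, a higher price hurts.
    score = 1.2 * rating - 0.05 * price
    return 1.0 / (1.0 + math.exp(-score))

def check_range(model, inputs):
    """Invariant: every output is a valid probability."""
    return all(0.0 <= model(p, r) <= 1.0 for p, r in inputs)

def check_monotonic_in_price(model, inputs, step=1.0):
    """Property: raising the price never raises the predicted probability."""
    return all(model(p + step, r) <= model(p, r) for p, r in inputs)

def check_perturbation_stability(model, inputs, eps=0.01, tol=0.05):
    """Metamorphic relation: a tiny input change moves the output by at most tol."""
    return all(
        abs(model(p + eps, r + eps) - model(p, r)) <= tol for p, r in inputs
    )

rng = random.Random(0)  # fixed seed so the sampled inputs are reproducible
samples = [(rng.uniform(1, 200), rng.uniform(1, 5)) for _ in range(500)]

range_ok = check_range(predict_purchase_probability, samples)
monotonic_ok = check_monotonic_in_price(predict_purchase_probability, samples)
stable_ok = check_perturbation_stability(predict_purchase_probability, samples)
```

None of the three checks needs a labelled correct answer for any individual input, which is what makes them usable when the exact expected output is unknown.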

A senior interview answer also acknowledges: uncertainty as a test target (calibrated confidences); latency, cost, and model size as part of the contract; and reproducibility (random seeds, data versioning, pipeline determinism).
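Calibration as a test target can be made concrete with expected calibration error (ECE), which compares mean confidence with observed accuracy inside confidence bins. The sketch below assumes equal-width bins and a hypothetical list of per-prediction confidences and correctness flags; the bin count is an assumed default.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical data: the model says 0.8 and is right 8 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
# Hypothetical data: the model says 0.9 but is right only 5 times out of 10.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

A test could then assert that ECE on a held-out set stays below a project-specific threshold, treating overconfidence as a regression even when accuracy is unchanged.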

// WHAT INTERVIEWERS LOOK FOR

Awareness that exact assertions don't apply, naming property-based and metamorphic testing, and treating production monitoring as part of the test strategy.
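One way to show monitoring as part of the test strategy is a drift check that runs on a schedule rather than in CI. The sketch below uses the population stability index (PSI) over shared bin edges; the edges, the floor for empty bins, and the sample data are all assumptions for illustration, and any alert threshold would need tuning per feature.

```python
import math

def population_stability_index(reference, current, edges):
    """PSI between two samples, using the same bin edges for both."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

edges = [0.25, 0.5, 0.75]                          # assumed bin edges
training = [i / 100 for i in range(100)]           # spread evenly over [0, 1)
production = [0.5 + i / 200 for i in range(100)]   # shifted into [0.5, 1.0)

psi_unchanged = population_stability_index(training, training, edges)
psi_shifted = population_stability_index(training, production, edges)
```

Identical distributions give a PSI of zero, and the value grows as production inputs move away from the training reference, so the check can feed an alert instead of a pass/fail assertion.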

// COMMON PITFALL

Trying to apply EP/BVA mechanically to an ML model — the input space is too high-dimensional and the right answer is statistical.
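The statistical alternative to mechanical EP/BVA is to report metrics per slice and gate on each one. The sketch below is a minimal version under stated assumptions: the slice names, labels, and the 0.8 minimum are hypothetical, and accuracy stands in for whatever metric the project uses.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, predicted_label, true_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, predicted, actual in records:
        totals[slice_name] += 1
        hits[slice_name] += int(predicted == actual)
    return {name: hits[name] / totals[name] for name in totals}

def failing_slices(per_slice, minimum):
    """Names of slices whose accuracy is below the required minimum."""
    return sorted(name for name, acc in per_slice.items() if acc < minimum)

# Hypothetical evaluation records: (slice, prediction, ground truth).
records = (
    [("common", "cat", "cat")] * 95 + [("common", "dog", "cat")] * 5
    + [("rare", "lynx", "lynx")] * 6 + [("rare", "cat", "lynx")] * 4
)

per_slice = accuracy_by_slice(records)
overall = sum(p == a for _, p, a in records) / len(records)
below_minimum = failing_slices(per_slice, minimum=0.8)
```

In this made-up data the overall accuracy looks healthy while the rare slice falls below the minimum, which is the failure that an aggregate-only metric hides.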