Q32 of 38 · Manual & exploratory

Describe a time when production caught a bug your testing missed — what did you change?

Manual & exploratorySeniorpost-incidentlearningescape-ratebehavioural

Short answer

Short answer: Use STAR. The specific bug matters less than what you learned: the gap in your test design, what you added (a test, a process, a coverage area), and how you measure that the gap stays closed. End with 'and we haven't seen anything in that category since.'

Detail

This question probes whether you treat production bugs as systemic learning or one-off fires. A strong answer has four beats:

1. The bug: be specific. "A race condition in our checkout flow caused 0.3% of orders to be charged twice when users double-clicked submit. It surfaced over a long weekend and triggered £45k in refunds."

2. The miss: why didn't tests catch it? Be honest. "Our suite had no concurrency tests on the checkout endpoint — we'd tested 'click submit' as a single deterministic action and never tested two clicks within the debounce window."

3. The change: what did you do? Be specific. "Three things: (a) added concurrency tests for all idempotent endpoints — 11 new tests, deliberately racing with k6; (b) changed our PR template to require explicit consideration of idempotency; (c) added a runbook for double-charge detection in production monitoring."

4. The verification: how do you know the gap is closed? "Same category of bug hasn't happened in 18 months. We've caught 4 similar issues in test instead of production."

What signals seniority: owning the miss without blaming the team ("our tests" not "the developers' code"), systemic thinking (adding one test wouldn't have helped; a whole category was missing), and verification (measuring whether the change worked, not just trusting it).

What signals junior thinking: "we added a test for that specific bug" (brittle — the next bug will be different), or "the developer didn't follow the spec" (blame, not learning).

// WHAT INTERVIEWERS LOOK FOR

Specific story, blameless post-mortem framing, systemic change rather than one-off fix, and verification.

// COMMON PITFALL

A vague 'we improved our process' answer with no specifics. Or blaming the developers — interviewers see that as a culture red flag.