Q24 of 32 · Behavioural
Describe a difficult production incident you investigated. What was your role?
Short answer
Walk through the timeline as it unfolded: what you did at each step, how you coordinated, what was unclear, and what you eventually identified as the cause. Show calm under uncertainty and honest follow-through after the fire was out.
Detail
Incidents are crucible moments. The interviewer wants to see how you behave when the system is on fire and the answer isn't obvious.
STAR walkthrough — sample answer:
Situation: A platform I previously worked on had a 90-minute incident on a customer-facing search feature. Symptom: searches returned empty results for some users while others got results. The first reports came in via support; the team's monitoring hadn't fired because the aggregate search-API success rate looked normal.
Task: I was the on-call QA lead for the area. I joined the incident channel within 5 minutes of the page going up.
Action: Real-time walkthrough:
0-15 min: establish what's happening. The on-call engineer was looking at the search service logs and found nothing obvious. I started reproducing from the user side: I queried with three test accounts and found the bug reproduced intermittently. Critical data point: the failures correlated with users in a specific data-residency region.
15-30 min: narrow the failure pattern. I checked the user accounts that had reported issues; all were in the same region. I then searched recent deploys: a config change to the regional routing layer had gone out 2 hours earlier. I shared this in the channel and tagged the engineer who'd shipped that change.
30-45 min: confirm the cause. The engineer pulled up the change. The new config had a subtle bug: routing rules were case-sensitive on a header value that some clients sent in mixed case. About 30% of users in that region were hitting the broken path. The engineer initiated a rollback.
45-60 min: verify recovery. I monitored from the user side as the rollback rolled out: I re-ran my test queries every minute and watched the regional success rate climb. I called the all-clear once five consecutive minutes were green.
60-90 min: communications and write-up. I helped the on-call engineer draft the customer comms. After the comms went out, I started the post-incident document with the timeline, root cause, what worked, and what didn't.
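If the interviewer asks you to make the root cause concrete, a small sketch helps. The Python below illustrates the class of bug from the 30-45 minute step under stated assumptions: the rule value `eu-west`, the sample client values, and both function names are hypothetical, since the real routing layer and its config format aren't described here. It only shows that an exact string comparison misses mixed-case values, while a normalised comparison does not.

```python
# Hypothetical routing rule value; the real config is not shown in the answer.
RULE_REGION = "eu-west"


def matches_case_sensitive(header_value: str) -> bool:
    """Buggy variant: exact comparison, so "EU-West" does not match."""
    return header_value == RULE_REGION


def matches_case_insensitive(header_value: str) -> bool:
    """Fixed variant: normalise both sides before comparing."""
    return header_value.strip().casefold() == RULE_REGION.casefold()


# Values as different clients might send them (assumed examples).
client_values = ["eu-west", "EU-West", "Eu-West"]

buggy = [matches_case_sensitive(v) for v in client_values]
fixed = [matches_case_insensitive(v) for v in client_values]

print(buggy)
print(fixed)
```

A test that feeds mixed-case variants of each rule value through the matcher is one plausible way to catch this class of bug before a config change ships.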
Result: The outage was resolved in 60 minutes for the affected user segment. The post-incident doc led to two changes: monitoring at the regional cohort level (not just aggregate, which had hidden the failure), and a CI check that catches case-sensitivity bugs in routing config. I drove both follow-ups over the next two weeks.
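To explain why aggregate monitoring hid the failure, a short worked example can help. The sketch below is illustrative only: the region names, the request counts, the assumption that the affected region carries 10% of traffic, and the 0.95 alert threshold are all made up for the example. Only the 30% failure share within the affected region comes from the incident described above.

```python
from collections import defaultdict

ALERT_THRESHOLD = 0.95  # assumed alerting threshold

# (region, succeeded) request records; counts are illustrative, not incident data.
requests = (
    [("region-a", True)] * 9000
    + [("region-b", True)] * 700
    + [("region-b", False)] * 300
)


def success_rates_by_region(records):
    """Return {region: success_rate} from (region, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for region, ok in records:
        totals[region] += 1
        successes[region] += ok  # True counts as 1, False as 0
    return {region: successes[region] / totals[region] for region in totals}


# Aggregate view: one number across all regions.
aggregate = sum(ok for _, ok in requests) / len(requests)

# Cohort view: one number per region, each checked against the threshold.
per_region = success_rates_by_region(requests)
alerting = sorted(r for r, rate in per_region.items() if rate < ALERT_THRESHOLD)

print(f"aggregate success rate: {aggregate:.2f}")
print(f"regions below threshold: {alerting}")
```

With these numbers the aggregate rate stays above the threshold, so an aggregate-only alert stays silent, while the per-region check flags the one region where 30% of requests fail.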
What I learned: incidents are won by narrowing fast, not by being right immediately. The early reproduce-from-the-user-side move was disproportionately valuable — it told us what aggregate metrics were missing. After that, the cause was findable.
Why this works: it gives a detailed timeline, names the candidate's specific contributions (regional pattern recognition, user-side repro, post-incident follow-through), and shows incident behaviours that translate to other situations.
// Related questions
Describe a time when production caught a bug your testing missed — what did you change?
Manual & exploratory
Tell me about a time you missed a bug that reached production. What did you do?
Behavioural
Tell me about a time you found a critical bug — what happened?
Behavioural
A teammate accidentally force-pushed to `main` and lost 3 commits that were never backed up. How do you recover them?
Git