Q24 of 32 · Behavioural

Describe a difficult production incident you investigated. What was your role?

BehaviouralSeniorbehaviouralincidentinvestigationproductionstarsenior

Short answer

Short answer: Walk through the timeline in real time — what you did at each step, how you coordinated, what was unclear, and what you eventually pinned it on. Show calm under uncertainty and honest follow-through after the fire was out.

Detail

Incidents are crucible-moments. The interviewer wants to see how you behave when the system is on fire and the answer isn't obvious.

STAR walkthrough — sample answer:

Situation: A previous platform I worked on had a 90-minute outage of a customer-facing search feature. Symptom: searches were returning empty for some users while other users got results. The first reports came in via support; the team's monitoring hadn't fired because aggregate search-API success rate looked normal.

Task: I was the on-call QA lead for the area. I joined the incident channel within 5 minutes of the page going up.

Action: Real-time walkthrough:

0-15 min — establish what's happening. The on-call engineer was looking at the search service logs; nothing obvious. I started reproducing from the user side — used three test accounts to query, found the bug repro'd intermittently. Critical data point: the failures correlated with users on a specific data-residency region.

15-30 min — narrow the failure pattern. I checked the user accounts that had reported issues — all in the same region. Searched recent deploys: a config change to the regional routing layer had gone out 2 hours before. I shared this in the channel and tagged the engineer who'd shipped that change.

30-45 min — confirm the cause. The engineer pulled up the change. The new config had a subtle bug: routing rules were case-sensitive on a header value that was mixed-case in some clients. About 30% of users in that region were hitting the broken path. The engineer initiated a rollback.

45-60 min — verify recovery. I monitored from the user side as rollback rolled out — re-ran my test queries every minute, watched the regional success rate climb. Cleared the all-clear once five consecutive minutes were green.

60-90 min — communications and write-up. Helped the on-call engineer draft the customer comms. After the comms went out, started the post-incident document with timeline, root cause, what worked, what didn't.

Result: Outage was resolved in 60 minutes for the affected user segment. The post-incident doc led to two changes: monitoring at the regional cohort level (not just aggregate, which had hidden the failure), and a CI check that caught case-sensitivity bugs in routing config. I drove both follow-ups over the next two weeks.

What I learned: incidents are won by narrowing fast, not by being right immediately. The early reproduce-from-the-user-side move was disproportionately valuable — it told us what aggregate metrics were missing. After that, the cause was findable.

Why this works: detailed timeline, named the candidate's specific contributions (regional pattern recognition, user-side repro, post-incident follow-through), and shows incident behaviours that translate to other situations.

// WHAT INTERVIEWERS LOOK FOR

Calm under uncertainty, narrowing the problem fast, clear contribution to coordination, and post-incident follow-through. Bonus signal: catching the gap monitoring had missed and driving the structural fix.

// COMMON PITFALL

Hero narratives where the candidate single-handedly found the bug. Interviewers know real incidents are team efforts; the strong answer credits the team and shows your specific contribution within it.