Root cause analysis

5 Whys, fishbone, timeline of events, contributing factors, action items.


Root Cause Analysis — Incident ID / Title

Severity: SEV-1 / SEV-2 / SEV-3
Author: Name
Review date: YYYY-MM-DD
Status: Draft / In review / Final


1. Incident Summary

Field | Detail
What happened | One or two sentences describing the failure mode
When it started | YYYY-MM-DD HH:MM UTC (actual start, not when it was detected)
When it was detected | YYYY-MM-DD HH:MM UTC
Who detected it | Person or monitoring system
When it was resolved | YYYY-MM-DD HH:MM UTC
Total duration | X h Y min
User impact | Number of users or % affected; what they could not do
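The "Total duration" field, and the gap between start and detection, can be derived from the timestamps in the summary rather than estimated by hand. The following is a minimal sketch, assuming all timestamps use the YYYY-MM-DD HH:MM UTC format shown above; the function names and the sample timestamps are hypothetical and purely illustrative.

```python
from datetime import datetime, timezone

# Timestamp format used in the Incident Summary fields (always UTC).
FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def format_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'X h Y min'."""
    delta = parse_utc(end) - parse_utc(start)
    if delta.total_seconds() < 0:
        raise ValueError("end timestamp precedes start timestamp")
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


# Hypothetical example values, not taken from a real incident.
started = "2024-01-15 09:12"
detected = "2024-01-15 09:30"
resolved = "2024-01-15 11:02"

print("Time to detect:", format_duration(started, detected))  # 0 h 18 min
print("Total duration:", format_duration(started, resolved))  # 1 h 50 min
```

Measuring from the actual start rather than from detection keeps the detection gap visible in the report.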

2. Timeline of Events

Time (UTC) | Event | Source / owner
HH:MM | Incident start: what happened in the system | Monitoring / log
HH:MM | Alert fired | PagerDuty / alerting tool
HH:MM | On-call acknowledged | Name
HH:MM | Severity declared and incident channel opened | Name
HH:MM | First hypothesis formed | Name
HH:MM | Mitigation action attempted (describe) | Name
HH:MM | Mitigation successful / unsuccessful (describe) | Name
HH:MM | Incident resolved | Name
HH:MM | Comms sent to customers / status page updated | Name
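When reviewing a timeline, it can help to express each entry as minutes elapsed since the incident start, which makes gaps such as time-to-acknowledge easy to read. The sketch below assumes the whole timeline falls on a single UTC day (no midnight crossing) and that entries are listed chronologically; the function name and sample entries are hypothetical.

```python
from datetime import datetime


def minutes_since_start(timeline):
    """Return (offset_minutes, event) pairs relative to the first entry.

    `timeline` is a list of ("HH:MM", event) tuples in chronological order.
    Raises ValueError if an entry is earlier than the one before it.
    """
    parsed = [(datetime.strptime(t, "%H:%M"), event) for t, event in timeline]
    start = parsed[0][0]
    previous = start
    offsets = []
    for moment, event in parsed:
        if moment < previous:
            raise ValueError(f"out-of-order timeline entry: {event}")
        offsets.append((int((moment - start).total_seconds() // 60), event))
        previous = moment
    return offsets


# Hypothetical example entries, not taken from a real incident.
timeline = [
    ("09:12", "Incident start"),
    ("09:30", "Alert fired"),
    ("09:34", "On-call acknowledged"),
    ("11:02", "Incident resolved"),
]

for offset, event in minutes_since_start(timeline):
    print(f"+{offset:>3} min  {event}")
```

The out-of-order check is a cheap way to catch transcription mistakes when times are copied from several sources.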

3. Root Cause Analysis

3.1 Five Whys

Problem statement: Describe the observable failure in one sentence.

  1. Why did [the failure] occur? Answer.
  2. Why did [answer 1] happen? Answer.
  3. Why did [answer 2] happen? Answer.
  4. Why did [answer 3] happen? Answer.
  5. Why did [answer 4] happen? Answer. This is typically the root cause; stop earlier or keep asking if the chain calls for it.

Root cause: State the root cause in one clear sentence.

3.2 Contributing Factors

Technical

  • e.g. No monitoring alert for the connection pool
  • e.g. No circuit breaker on the downstream dependency

Process

  • e.g. Configuration change was not peer-reviewed
  • e.g. Runbook did not cover this failure mode

Organisational

  • e.g. On-call rotation had a single point of failure — no trained backup

Human

  • e.g. Engineer was not aware that the setting had a production impact

4. What Worked Well

  • e.g. The alerting fired within 2 minutes of the error rate exceeding the threshold
  • e.g. The team assembled quickly and communication was clear throughout
  • e.g. The rollback procedure worked as documented

5. What Did Not Work Well

  • e.g. Detection relied on a customer complaint rather than an alert — 18-minute gap
  • e.g. The runbook did not describe how to verify recovery, causing confusion
  • e.g. Status page was not updated until 45 minutes after the incident was confirmed

6. Action Items

Action | Owner | Priority | Target date | Status
Fix: describe technical fix | Name | P1 / P2 / P3 | Date | Open
Add monitoring alert for [X] | Name | P1 / P2 / P3 | Date | Open
Update runbook to cover [scenario] | Name | P1 / P2 / P3 | Date | Open
Process improvement: [describe] | Name | P1 / P2 / P3 | Date | Open
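If action items are tracked outside this document, a small script can flag the ones that are still open past their target date. This is only a sketch: the `ActionItem` fields mirror the table columns above, while the class name, the `overdue` helper, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    priority: str  # "P1", "P2" or "P3", as in the Priority column
    target: date
    status: str = "Open"


def overdue(items, today):
    """Return open items whose target date has passed, highest priority first."""
    late = [i for i in items if i.status == "Open" and i.target < today]
    return sorted(late, key=lambda i: (i.priority, i.target))


# Hypothetical example rows, not taken from a real incident.
items = [
    ActionItem("Update runbook", "Name A", "P2", date(2024, 2, 1)),
    ActionItem("Add monitoring alert", "Name B", "P1", date(2024, 2, 10)),
    ActionItem("Technical fix", "Name C", "P1", date(2024, 1, 20), status="Done"),
    ActionItem("Process improvement", "Name D", "P3", date(2024, 3, 1)),
]

for item in overdue(items, today=date(2024, 2, 15)):
    print(item.priority, item.target.isoformat(), item.action, "-", item.owner)
```

Passing `today` explicitly keeps the check reproducible when it is run as part of a scheduled review.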

7. Lessons Learned

  • Generalised lesson that applies beyond this incident — e.g. "Configuration changes to shared infrastructure should always be peer-reviewed and applied to a staging environment first."
  • Lesson 2
  • Lesson 3