Incident response runbook

On-call playbook: severity ladder, triage flow, comms templates, escalation paths.


Incident Response Runbook — Service / Team Name

Version: 1.0
Owner: On-call lead / team name
Last reviewed: YYYY-MM-DD


1. Severity Ladder

| Severity | Definition | User impact | Response time | Comms cadence |
|----------|------------|-------------|---------------|---------------|
| SEV-1 | Total outage or data loss | All or most users affected | Immediate (page primary on-call) | Every 15 min internally; status page updated immediately |
| SEV-2 | Significant degradation of a critical feature | Large subset of users; core workflow impaired | Within 15 min | Every 30 min internally; status page updated within 30 min |
| SEV-3 | Partial degradation or non-critical feature down | Subset of users; workaround exists | Within 2 h during business hours | Hourly internally; status page update optional |
| SEV-4 | Minor bug or cosmetic issue in production | Minimal; no service disruption | Next business day | Tracked in issue tracker; no comms required |
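
If the team wants tooling (a chat bot, a reminder script) to follow the same ladder, the table can be encoded as data. The sketch below is one possible shape, not a prescribed implementation: the names `SeverityPolicy`, `SEVERITY_LADDER`, and `update_cadence` are hypothetical, and `None` is an assumed convention for "no fixed clock".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    # Maximum time to first response; None means no fixed clock
    # (SEV-4 is handled the next business day).
    response_within: Optional[timedelta]
    # Interval between internal updates; None means no scheduled updates.
    internal_update_every: Optional[timedelta]
    # Status page expectation, copied from the ladder as free text.
    status_page: str


# Values mirror the severity ladder. SEV-3's 2 h applies during business
# hours only; that calendar logic is deliberately left out of this sketch.
SEVERITY_LADDER = {
    "SEV-1": SeverityPolicy(timedelta(0), timedelta(minutes=15), "update immediately"),
    "SEV-2": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30), "update within 30 min"),
    "SEV-3": SeverityPolicy(timedelta(hours=2), timedelta(hours=1), "update optional"),
    "SEV-4": SeverityPolicy(None, None, "no comms required"),
}


def update_cadence(severity: str) -> Optional[timedelta]:
    """Return the internal update interval for a severity, or None if unscheduled."""
    return SEVERITY_LADDER[severity].internal_update_every
```

Keeping the ladder in one structure means a change to the table only has to be mirrored in one place.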

2. Initial Triage (First 5 Minutes)

  1. Acknowledge the alert in the alerting tool (PagerDuty / OpsGenie / etc.) to stop escalation.
  2. Assess severity using the table above. When in doubt, start at SEV-2 and downgrade if appropriate.
  3. Open an incident channel: create #inc-YYYY-MM-DD-brief-description in Slack / Teams.
  4. Start a timeline doc: paste the current time and the alert trigger as the first entry.
  5. Notify the team: post the incident severity and channel link in #on-call / #engineering immediately.
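
Steps 3 and 4 are easy to get wrong under pressure, so some teams script them. A minimal sketch follows, assuming the `#inc-YYYY-MM-DD-brief-description` convention above. The helper names and the 80-character cap are assumptions; check the limit against your chat tool.

```python
import re
from datetime import date, datetime, timezone


def incident_channel_name(day: date, description: str, max_len: int = 80) -> str:
    """Build a name such as inc-2024-03-05-checkout-api-5xx-spike (no leading '#')."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # max_len is an assumed limit; trim and avoid ending on a hyphen.
    return name[:max_len].rstrip("-")


def first_timeline_entry(now: datetime, alert_trigger: str) -> str:
    """Format the opening timeline line: current UTC time plus the alert trigger."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    return f"[{now.astimezone(timezone.utc):%H:%M} UTC] Alert fired: {alert_trigger}"
```

For example, `incident_channel_name(date(2024, 3, 5), "Checkout API 5xx spike!")` yields `inc-2024-03-05-checkout-api-5xx-spike`.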

3. Communications

3.1 Internal Updates

Post updates in the incident channel at the cadence defined in the severity ladder. Format:

[HH:MM UTC] Status: <investigating / identified / mitigating / monitoring / resolved>
Current impact: <brief description>
Last action: <what just happened>
Next update: <HH:MM UTC>
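
The format above can be generated so that the "Next update" time always matches the cadence for the current severity. This is a sketch only: `format_update` is a hypothetical helper, and the caller is assumed to pass the cadence taken from the severity ladder.

```python
from datetime import datetime, timedelta, timezone

# The five status values listed in the update format.
VALID_STATUSES = {"investigating", "identified", "mitigating", "monitoring", "resolved"}


def format_update(now: datetime, status: str, impact: str,
                  last_action: str, cadence: timedelta) -> str:
    """Render one internal update; the next update time is now + cadence."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    next_update = now_utc + cadence
    return "\n".join([
        f"[{now_utc:%H:%M} UTC] Status: {status}",
        f"Current impact: {impact}",
        f"Last action: {last_action}",
        f"Next update: {next_update:%H:%M} UTC",
    ])
```

Rejecting unknown status values keeps the channel history consistent, which makes the timeline easier to reconstruct afterwards.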

3.2 Status Page Updates

  • SEV-1: update the status page within 5 minutes of confirming the incident. Use plain language — no jargon.
  • SEV-2: update within 30 minutes.
  • SEV-3 and below: update at your discretion.

3.3 Stakeholder Comms

| Who | When | How | What to say |
|-----|------|-----|-------------|
| Engineering management | On SEV-1 immediately; SEV-2 within 15 min | Slack DM / call | Severity, impact, current status |
| Customer Success / Support | On SEV-1 and SEV-2 within 30 min | #support channel | Talking points for customer queries |
| Legal / DPO | If personal data may be affected | Direct message or email | Nature of potential data exposure |

3.4 Customer Comms Templates

Initial (SEV-1 / SEV-2):

We are aware of an issue affecting [feature]. Our team is investigating and we will provide an update by [time]. We apologise for the disruption.

Update:

We have identified the cause of the [feature] issue and are working on a fix. Current ETA: [time]. We will update you again at [time].

Resolution:

The [feature] issue has been resolved as of [HH:MM UTC]. We apologise for the inconvenience and will share a full post-incident summary shortly.


4. Escalation Paths

| Situation | Escalate to | How | Time threshold |
|-----------|-------------|-----|----------------|
| No progress after 30 min on SEV-1 | Engineering lead / CTO | Phone call | 30 min |
| Database corruption suspected | Database administrator | PagerDuty (DBA on-call) | Immediately on suspicion |
| Security breach suspected | Security lead | Direct call; do not use public channels | Immediately on suspicion |
| Regulatory notification required | Legal / Compliance | Email + call | Within 1 h of confirmation |
| Customer SLA breach imminent | Account manager / Customer Success lead | Slack DM | 30 min before SLA window closes |
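
The two clock-based rules in the table (SEV-1 with no progress for 30 minutes, and an SLA window closing within 30 minutes) are the ones most easily missed mid-incident. The sketch below shows how a reminder could check them. It assumes "no progress" is measured from the last recorded progress in the timeline, and `escalations_due` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List, Optional

THRESHOLD = timedelta(minutes=30)  # both time-based rules use 30 min


def escalations_due(severity: str, last_progress_at: datetime, now: datetime,
                    sla_deadline: Optional[datetime] = None) -> List[str]:
    """Return the time-based escalations that are due at `now`."""
    due = []
    # Rule 1: SEV-1 with no progress for 30 min (assumed: since the last
    # recorded progress, or since incident start if there has been none).
    if severity == "SEV-1" and now - last_progress_at >= THRESHOLD:
        due.append("Engineering lead / CTO: phone call")
    # Rule 2: customer SLA window closes in 30 min or less.
    if sla_deadline is not None and sla_deadline - now <= THRESHOLD:
        due.append("Account manager / Customer Success lead: Slack DM")
    return due
```

The suspicion-triggered rows (database corruption, security breach) are judgment calls and are intentionally not modelled here.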

5. Mitigation Actions

Try these in order before escalating to code changes:

  1. Rollback: revert to the previous deployment if the incident coincides with a recent deploy.
  2. Disable feature flag: if the affected feature is behind a flag, disable it immediately.
  3. Scale up: increase the number of instances / pods if the issue is capacity-related.
  4. Failover: switch to the secondary region or standby database if the primary is down.
  5. Throttle / circuit-break: enable rate limiting or open the circuit breaker to protect downstream services.
  6. Restart: restart the affected service as a last resort if none of the above apply.

Document every action taken and its outcome in the incident timeline.
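
The ordering above amounts to "take the first action whose precondition holds, and fall back to a restart". A minimal sketch of that selection logic follows. The `Signals` fields and action names are hypothetical labels for the conditions in the list, not the output of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Collection, List, Optional, Tuple


@dataclass(frozen=True)
class Signals:
    """What the responder currently believes about the incident."""
    recent_deploy: bool = False
    behind_feature_flag: bool = False
    capacity_related: bool = False
    primary_down: bool = False
    downstream_at_risk: bool = False


# Same order as the list above; restart is handled separately as the fallback.
MITIGATIONS: List[Tuple[str, Callable[[Signals], bool]]] = [
    ("rollback", lambda s: s.recent_deploy),
    ("disable_feature_flag", lambda s: s.behind_feature_flag),
    ("scale_up", lambda s: s.capacity_related),
    ("failover", lambda s: s.primary_down),
    ("throttle_or_circuit_break", lambda s: s.downstream_at_risk),
]


def next_mitigation(signals: Signals,
                    already_tried: Collection[str] = ()) -> Optional[str]:
    """Return the first applicable untried action, 'restart' as last resort, else None."""
    for name, applies in MITIGATIONS:
        if name not in already_tried and applies(signals):
            return name
    return None if "restart" in already_tried else "restart"
```

A `None` result means the list is exhausted, which is the point at which a code change or escalation is the next step.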


6. Recovery

  1. Verify recovery against the alerting conditions that fired: confirm that the affected metrics are back within their normal thresholds.
  2. Monitor for 15 minutes after mitigation before declaring the incident resolved.
  3. Declare resolved: post a resolution message in the incident channel and update the status page.
  4. Leave the incident channel open for 24 hours in case of recurrence.
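
Steps 1 and 2 combine into one check: the full 15-minute window has elapsed since mitigation, and every sample in that window is healthy. The sketch below assumes "healthy" means at or below the alert threshold (for example, an error rate). `can_declare_resolved` and the sample format are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def can_declare_resolved(samples: Iterable[Tuple[datetime, float]],
                         mitigated_at: datetime, now: datetime,
                         threshold: float,
                         window: timedelta = timedelta(minutes=15)) -> bool:
    """True only if the window has elapsed and all post-mitigation samples are healthy."""
    if now - mitigated_at < window:
        return False  # still inside the monitoring window
    post = [value for ts, value in samples if ts >= mitigated_at]
    # No data is not evidence of recovery, so an empty window fails the check.
    return bool(post) and all(value <= threshold for value in post)
```

Treating missing data as a failure is a deliberately conservative choice: a broken metrics pipeline should not be able to close an incident.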

7. Post-incident

  1. Schedule an RCA / post-mortem session within 48 hours (within 24 hours for SEV-1).
  2. Use the Root cause analysis template to structure the write-up.
  3. Ensure action items from the RCA are assigned and tracked in the issue tracker.
  4. Update this runbook if the incident revealed a gap.