Incident response runbook
On-call playbook: severity ladder, triage flow, comms templates, escalation paths.
Incident Response Runbook — Service / Team Name
Version: 1.0 Owner: On-call lead / team name Last reviewed: YYYY-MM-DD
1. Severity Ladder
| Severity | Definition | User impact | Response time | Comms cadence |
|---|---|---|---|---|
| SEV-1 | Total outage or data loss | All / most users affected | Immediate: page primary on-call | Every 15 min internally; status page updated within 5 min of confirming the incident |
| SEV-2 | Significant degradation of a critical feature | Large subset of users; core workflow impaired | Within 15 min | Every 30 min internally; status page updated within 30 min |
| SEV-3 | Partial degradation or non-critical feature down | Subset of users; workaround exists | Within 2 h during business hours | Hourly internally; status page update optional |
| SEV-4 | Minor bug or cosmetic issue in production | Minimal; no service disruption | Next business day | Tracked in issue tracker; no comms required |
2. Initial Triage (First 5 Minutes)
- Acknowledge the alert in the alerting tool (PagerDuty / OpsGenie / etc.) to stop escalation.
- Assess severity using the table above. When in doubt, start at SEV-2 and downgrade if appropriate.
- Open an incident channel: create #inc-YYYY-MM-DD-brief-description in Slack / Teams.
- Start a timeline doc: paste the current time and the alert trigger as the first entry.
- Notify the team: post the incident severity and channel link in #on-call / #engineering immediately.
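The channel naming convention above can be generated rather than typed by hand. This is a minimal sketch: the function name `incident_channel_name` and the 80-character cap are assumptions for illustration, so check the limit for your chat tool.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_len: int = 80) -> str:
    """Build '#inc-YYYY-MM-DD-brief-description' from a free-text description."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # Trim to the (assumed) length cap without leaving a trailing hyphen.
    return "#" + name[:max_len].rstrip("-")


print(incident_channel_name(date(2024, 5, 1), "Checkout errors spiking!"))
# prints: #inc-2024-05-01-checkout-errors-spiking
```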
3. Communications
3.1 Internal Updates
Post updates in the incident channel at the cadence defined in the severity ladder. Format:
[HH:MM UTC] Status: <investigating / identified / mitigating / monitoring / resolved>
Current impact: <brief description>
Last action: <what just happened>
Next update: <HH:MM UTC>
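If updates are posted by a bot or script, the four-line format above can be rendered by a small helper. The sketch below is illustrative only: `format_update` and its parameters are hypothetical names, and it assumes both timestamps are already in UTC.

```python
from datetime import datetime, timezone

# The five status values listed in the update format.
VALID_STATUSES = ("investigating", "identified", "mitigating", "monitoring", "resolved")


def format_update(now: datetime, status: str, impact: str,
                  last_action: str, next_update: datetime) -> str:
    """Render one internal update in the four-line incident-channel format."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return "\n".join([
        f"[{now:%H:%M} UTC] Status: {status}",
        f"Current impact: {impact}",
        f"Last action: {last_action}",
        f"Next update: {next_update:%H:%M} UTC",
    ])


print(format_update(
    datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc),
    "investigating",
    "Elevated checkout errors",
    "Checked recent deploys",
    datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc),
))
```

Rejecting unknown status values keeps the channel history consistent, which makes the timeline easier to reconstruct afterwards.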
3.2 Status Page Updates
- SEV-1: update the status page within 5 minutes of confirming the incident. Use plain language — no jargon.
- SEV-2: update within 30 minutes.
- SEV-3 and below: update at your discretion.
3.3 Stakeholder Comms
| Who | When | How | What to say |
|---|---|---|---|
| Engineering management | On SEV-1 immediately; SEV-2 within 15 min | Slack DM / call | Severity, impact, current status |
| Customer Success / Support | On SEV-1 and SEV-2 within 30 min | #support channel | Talking points for customer queries |
| Legal / DPO | If personal data may be affected | Direct message or email | Nature of potential data exposure |
3.4 Customer Comms Templates
Initial (SEV-1 / SEV-2):
We are aware of an issue affecting [feature]. Our team is investigating and we will provide an update by [time]. We apologise for the disruption.
Update:
We have identified the cause of the [feature] issue and are working on a fix. Current ETA: [time]. We will update you again at [time].
Resolution:
The [feature] issue has been resolved as of [HH:MM UTC]. We apologise for the inconvenience and will share a full post-incident summary shortly.
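The three templates above can be filled in programmatically so that no placeholder is left in a published message. This is a sketch using Python's standard `string.Template`; the placeholder names (`feature`, `next_update`, `eta`, `resolved_at`) and the `render` helper are assumptions standing in for the bracketed fields.

```python
from string import Template

# Bracketed fields from the templates, rewritten as $placeholders.
TEMPLATES = {
    "initial": Template(
        "We are aware of an issue affecting $feature. Our team is investigating "
        "and we will provide an update by $next_update. We apologise for the disruption."
    ),
    "update": Template(
        "We have identified the cause of the $feature issue and are working on a fix. "
        "Current ETA: $eta. We will update you again at $next_update."
    ),
    "resolution": Template(
        "The $feature issue has been resolved as of $resolved_at. We apologise for the "
        "inconvenience and will share a full post-incident summary shortly."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill a template; substitute() raises KeyError if any field is missing."""
    return TEMPLATES[kind].substitute(**fields)


print(render("resolution", feature="checkout", resolved_at="14:45 UTC"))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field fails loudly instead of publishing a literal `$eta` to customers.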
4. Escalation Paths
| Situation | Escalate to | How | Time threshold |
|---|---|---|---|
| No progress after 30 min on SEV-1 | Engineering lead / CTO | Phone call | 30 min |
| Database corruption suspected | Database administrator | PagerDuty — DBA on-call | Immediately on suspicion |
| Security breach suspected | Security lead | Direct call; do not use public channels | Immediately on suspicion |
| Regulatory notification required | Legal / Compliance | Email + call | Within 1 h of confirmation |
| Customer SLA breach imminent | Account manager / Customer Success lead | Slack DM | 30 min before SLA window closes |
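The two time-based thresholds in the table lend themselves to a simple check that a bot or dashboard could run. The sketch below is one way to express them; the function names are hypothetical, and "progress" is assumed to be tracked as the timestamp of the last meaningful timeline entry.

```python
from datetime import datetime, timedelta, timezone

NO_PROGRESS_LIMIT = timedelta(minutes=30)  # "No progress after 30 min on SEV-1"
SLA_WARNING_LEAD = timedelta(minutes=30)   # "30 min before SLA window closes"


def no_progress_escalation_due(severity: str, last_progress_at: datetime,
                               now: datetime) -> bool:
    """True when a SEV-1 has gone 30 minutes or more without progress."""
    return severity == "SEV-1" and now - last_progress_at >= NO_PROGRESS_LIMIT


def sla_escalation_time(sla_deadline: datetime) -> datetime:
    """Latest moment to contact the account manager before the SLA window closes."""
    return sla_deadline - SLA_WARNING_LEAD


start = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
print(no_progress_escalation_due("SEV-1", start, start + timedelta(minutes=31)))
# prints: True
```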
5. Mitigation Actions
Try these in order before escalating to code changes:
- Rollback: revert to the previous deployment if the incident coincides with a recent deploy.
- Disable feature flag: if the affected feature is behind a flag, disable it immediately.
- Scale up: increase the number of instances / pods if the issue is capacity-related.
- Failover: switch to the secondary region or standby database if the primary is down.
- Throttle / circuit-break: enable rate limiting or open the circuit breaker to protect downstream services.
- Restart: restart the affected service as a last resort if none of the above apply.
Document every action taken and its outcome in the incident timeline.
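The ordered list above is effectively a first-match decision table. As a rough sketch, it could be encoded as follows; the signal names (`recent_deploy`, `behind_feature_flag`, and so on) are hypothetical labels for the conditions each bullet describes, not fields from any real tool.

```python
from typing import Dict

# (action, signal that makes it applicable), in the order the runbook lists them.
MITIGATIONS = [
    ("Rollback", "recent_deploy"),
    ("Disable feature flag", "behind_feature_flag"),
    ("Scale up", "capacity_related"),
    ("Failover", "primary_down"),
    ("Throttle / circuit-break", "downstream_at_risk"),
]


def first_applicable_mitigation(signals: Dict[str, bool]) -> str:
    """Return the first mitigation whose condition holds; restart is the last resort."""
    for action, signal in MITIGATIONS:
        if signals.get(signal, False):
            return action
    return "Restart"


print(first_applicable_mitigation({"capacity_related": True, "primary_down": True}))
# prints: Scale up
```

Keeping the order in data rather than in nested `if` statements makes it easy to review the sequence against the runbook when either one changes.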
6. Recovery
- Verify recovery against the alerting conditions that fired: confirm metrics return to normal thresholds.
- Monitor for 15 minutes after mitigation before declaring the incident resolved.
- Declare resolved: post a resolution message in the incident channel and update the status page.
- Leave the incident channel open for 24 hours in case of recurrence.
7. Post-incident
- Schedule an RCA / post-mortem session within 48 hours (within 24 hours for SEV-1).
- Use the Root cause analysis template to structure the write-up.
- Ensure action items from the RCA are assigned and tracked in the issue tracker.
- Update this runbook if the incident revealed a gap.
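The scheduling rule in the first bullet is simple date arithmetic. The sketch below assumes the clock starts when the incident is declared resolved, which the template does not state explicitly; `rca_deadline` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def rca_deadline(severity: str, resolved_at: datetime) -> datetime:
    """Latest time to schedule the RCA: 24 h for SEV-1, 48 h otherwise."""
    hours = 24 if severity == "SEV-1" else 48
    return resolved_at + timedelta(hours=hours)


resolved = datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc)
print(rca_deadline("SEV-1", resolved).isoformat())
# prints: 2024-05-02T15:00:00+00:00
```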
Incident Response Runbook — AcmeCorp Payments Service
Version: 2.1 Owner: Platform On-call Team Last reviewed: 2024-05-01
1. Severity Ladder
| Severity | Definition | User impact | Response time | Comms cadence |
|---|---|---|---|---|
| SEV-1 | Total payment processing outage; no transactions completing | All checkout users affected; revenue at a standstill | Immediate — page primary on-call and backup | Every 15 min internally; status page updated within 5 min |
| SEV-2 | Significant payment degradation: error rate > 1%, or p95 latency > 5 s for > 5 min | Large subset of users experiencing checkout failures or severe slowdowns | Within 15 min | Every 30 min internally; status page updated within 30 min |
| SEV-3 | Elevated error rate ≤ 1%, or a single payment method down (e.g. only Apple Pay affected) | Subset of users; workaround (use alternative payment method) exists | Within 2 h during business hours | Hourly internally; status page update at on-call engineer's discretion |
| SEV-4 | Minor UI glitch on payment confirmation screen; no transaction failures | Cosmetic; no user blocked | Next business day | Logged in Jira; no comms required |
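The numeric thresholds in this ladder can be sketched as a classifier to support the on-call engineer's judgement, not replace it. Several things here are assumptions: the function and parameter names are hypothetical, the 5-minute window is read as applying to the latency condition only, and "elevated" is taken to mean at or above the 0.1% level that the recovery section treats as healthy.

```python
from typing import Optional


def classify_payment_incident(transactions_completing: bool,
                              error_rate_pct: float,
                              p95_seconds: float,
                              p95_breach_minutes: float,
                              payment_method_down: bool = False) -> Optional[str]:
    """Suggest a severity from checkout metrics; None means no threshold is breached."""
    if not transactions_completing:
        return "SEV-1"  # total payment processing outage
    if error_rate_pct > 1.0 or (p95_seconds > 5.0 and p95_breach_minutes > 5.0):
        return "SEV-2"  # significant degradation
    if error_rate_pct >= 0.1 or payment_method_down:
        return "SEV-3"  # elevated errors (assumed floor: 0.1%) or one method down
    return None  # SEV-4 is cosmetic and cannot be detected from these metrics


print(classify_payment_incident(True, 2.3, 1.2, 0))
# prints: SEV-2
```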
2. Initial Triage (First 5 Minutes)
- Acknowledge the alert in PagerDuty to stop escalation to the backup on-call.
- Assess severity using the table above. For payment alerts, default to SEV-2 if you are unsure — it is easier to downgrade than to under-respond.
- Open an incident channel: create `#inc-YYYY-MM-DD-payments-<brief>` in Slack.
- Start a timeline doc: duplicate the template at Notion/On-call/Timeline-Template and paste the PagerDuty alert link as the first entry.
- Notify the team: post in `#platform-oncall` with the severity, channel link, and a one-line description of the alert.
3. Communications
3.1 Internal Updates
Post in the incident channel at the cadence set by the severity ladder (every 15 min for SEV-1, every 30 min for SEV-2). Example for a SEV-2 incident:
[HH:MM UTC] Status: investigating
Current impact: Checkout error rate 2.3% — approximately 140 failed transactions in the last 30 min
Last action: Checked recent deploys — no deploy in last 4 h. Reviewing DB connection metrics now.
Next update: 14:30 UTC
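The "Next update" line follows directly from the internal cadence in the severity ladder. A minimal sketch of that arithmetic is shown here; the helper names are hypothetical, and the 14:00 posting time is an illustrative input only.

```python
from datetime import datetime, timedelta, timezone

# Internal comms cadence from the severity ladder, in minutes.
CADENCE_MINUTES = {"SEV-1": 15, "SEV-2": 30, "SEV-3": 60}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """When the next internal update is owed, given the last one posted."""
    return last_update + timedelta(minutes=CADENCE_MINUTES[severity])


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True once the cadence window has passed without a new update."""
    return now > next_update_due(severity, last_update)


posted = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
print(f"Next update: {next_update_due('SEV-2', posted):%H:%M} UTC")
# prints: Next update: 14:30 UTC
```

SEV-4 is intentionally absent from the table because it requires no comms; looking it up raises `KeyError`, which flags the misuse early.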
3.2 Status Page Updates
- SEV-1: update status.acmecorp.io within 5 minutes of confirming the incident.
- SEV-2: update within 30 minutes.
- Use "Investigating" status initially; move to "Identified" once root cause is known.
3.3 Stakeholder Comms
| Who | When | How | What to say |
|---|---|---|---|
| Engineering Director (Fatima Yusuf) | SEV-1 immediately; SEV-2 within 15 min | Slack DM | Severity, impact in £/min if estimable, current status and ETA |
| Customer Success (Lena Park) | SEV-1 and SEV-2 within 30 min | #customer-success channel | Talking points: "Payment processing is degraded. Customers experiencing errors should try again in X minutes. We are working on it." |
| Legal / DPO (Marcus Webb) | Only if cardholder data may be at risk | Direct call | Nature of potential exposure — do not speculate publicly |
3.4 Customer Comms Templates
Initial (SEV-2):
We are aware that some customers are experiencing difficulties completing checkout. Our team is investigating. We apologise for the disruption and will provide an update by [time].
Update:
We have identified the cause of the checkout issue — a database configuration problem — and are deploying a fix. Current ETA for resolution: [time]. We will update you again at [time].
Resolution:
Checkout is now fully restored as of [HH:MM UTC]. We apologise for the inconvenience. A full post-incident summary will be shared within 48 hours.
4. Escalation Paths
| Situation | Escalate to | How | Time threshold |
|---|---|---|---|
| No progress after 30 min on SEV-2 | Engineering Director (Fatima Yusuf) | Phone call | 30 min |
| DB corruption or data loss suspected | DBA on-call (paged via PagerDuty policy "DBA-oncall") | PagerDuty | Immediately on suspicion |
| Suspected card data exposure (PCI DSS) | Security lead (Tom Nkrumah) + Legal | Direct call; no Slack | Immediately on suspicion |
| Payment provider (Stripe) is the source of errors | Stripe status page + Stripe support case | dashboard.stripe.com → Support | Within 15 min of confirming provider is the source |
| SLA breach imminent for enterprise customer | Account manager (check CRM for enterprise SLAs) | Slack DM | 30 min before SLA window closes |
5. Mitigation Actions
- Rollback: check `#deploys`; if a deploy happened in the last 4 hours, roll back via `acmecorp deploy rollback payments --env prod`.
- Disable feature flag: if the error started after a flag was enabled, toggle it off in LaunchDarkly (flag namespace `payments.*`).
- Scale up: if CPU or memory is the trigger, run `acmecorp scale payments --instances 8 --env prod` (default is 4).
- Failover to secondary DB: if the primary RDS instance is unhealthy, follow the DB failover runbook at Notion/Runbooks/DB-Failover.
- Enable rate limiting: if the API is being hit unusually hard, enable the emergency rate limit profile in Cloudflare under the Payments WAF rule group.
- Circuit-break Stripe: if Stripe is returning errors, enable the `PAYMENTS_USE_FALLBACK_PROVIDER` feature flag to route to the backup provider (PaySafe). Note: PaySafe does not support Apple Pay.
6. Recovery
- Verify recovery: confirm the `CheckoutErrorRate` Datadog monitor returns to green (< 0.1% for 5 consecutive minutes).
- Monitor for 15 minutes after the error rate returns to normal.
- Declare resolved: post resolution in `#inc-*`, update status.acmecorp.io to "All Systems Operational", and DM stakeholders.
- Leave the incident channel open for 24 hours; archive it after that if no recurrence.
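The recovery criteria above (below 0.1% for 5 consecutive minutes, then 15 minutes of monitoring) can be checked against a series of per-minute error-rate samples. This is a sketch under stated assumptions: samples are one per minute with the most recent last, the 15-minute watch is counted from the moment the monitor turns green, and the function names are hypothetical rather than part of any Datadog API.

```python
from typing import Sequence

GREEN_THRESHOLD_PCT = 0.1  # monitor is green below this error rate
GREEN_WINDOW_MIN = 5       # consecutive minutes required to turn green
WATCH_MIN = 15             # extra monitoring before declaring resolved


def trailing_minutes_below(samples_pct: Sequence[float],
                           threshold_pct: float = GREEN_THRESHOLD_PCT) -> int:
    """Count the most recent consecutive samples strictly below the threshold."""
    count = 0
    for sample in reversed(samples_pct):
        if sample >= threshold_pct:
            break
        count += 1
    return count


def monitor_is_green(samples_pct: Sequence[float]) -> bool:
    return trailing_minutes_below(samples_pct) >= GREEN_WINDOW_MIN


def ready_to_declare_resolved(samples_pct: Sequence[float]) -> bool:
    return trailing_minutes_below(samples_pct) >= GREEN_WINDOW_MIN + WATCH_MIN


recent = [2.3, 0.05, 0.04, 0.03, 0.02, 0.01]
print(monitor_is_green(recent), ready_to_declare_resolved(recent))
# prints: True False
```

A single sample at or above the threshold resets the count, so a brief recurrence during the watch period restarts the clock.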
7. Post-incident
- Schedule RCA session within 24 hours for SEV-1; within 48 hours for SEV-2.
- Use the Root cause analysis template — assign the author in the incident channel before signing off.
- Ensure all RCA action items are in Jira under the `INFRA` project with a due date.
- Update this runbook with any gaps discovered during the incident.