Incident response runbook

On-call playbook: severity ladder, triage flow, comms templates, escalation paths.


Incident Response Runbook — Service / Team Name

Version: 1.0
Owner: On-call lead / team name
Last reviewed: YYYY-MM-DD

1. Severity Ladder

Severity	Definition	User impact	Response time	Comms cadence
SEV-1	Total outage or data loss	All / most users affected	Immediate; page primary on-call	Every 15 min internally; status page updated within 5 min of confirming the incident
SEV-2	Significant degradation of a critical feature	Large subset of users; core workflow impaired	Within 15 min	Every 30 min internally; status page updated within 30 min
SEV-3	Partial degradation or non-critical feature down	Subset of users; workaround exists	Within 2 h during business hours	Hourly internally; status page update optional
SEV-4	Minor bug or cosmetic issue in production	Minimal; no service disruption	Next business day	Tracked in issue tracker; no comms required

2. Initial Triage (First 5 Minutes)

Acknowledge the alert in the alerting tool (PagerDuty / OpsGenie / etc.) to stop escalation.
Assess severity using the table above. When in doubt, start at SEV-2 and downgrade if appropriate.
Open an incident channel: create #inc-YYYY-MM-DD-brief-description in Slack / Teams.
Start a timeline doc: paste the current time and the alert trigger as the first entry.
Notify the team: post the incident severity and channel link in #on-call / #engineering immediately.

3. Communications

3.1 Internal Updates

Post updates in the incident channel at the cadence defined in the severity ladder. Format:

[HH:MM UTC] Status: <investigating / identified / mitigating / monitoring / resolved>
Current impact: <brief description>
Last action: <what just happened>
Next update: <HH:MM UTC>
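
The update format above is easy to get slightly wrong under pressure, so some teams generate it. The following is a minimal sketch, not a prescribed tool: the function name `format_update` and the `CADENCE_MIN` mapping are hypothetical, and the cadence values are copied from the severity ladder in section 1.

```python
from datetime import datetime, timedelta, timezone

# Minutes between internal updates, taken from the severity ladder.
CADENCE_MIN = {"SEV-1": 15, "SEV-2": 30, "SEV-3": 60}

# Allowed values for the Status field, as listed in the format above.
STATUSES = {"investigating", "identified", "mitigating", "monitoring", "resolved"}


def format_update(now: datetime, severity: str, status: str,
                  impact: str, last_action: str) -> str:
    """Render one internal update (hypothetical helper)."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    # The next update is due one cadence interval after this one.
    next_update = now + timedelta(minutes=CADENCE_MIN[severity])
    return "\n".join([
        f"[{now:%H:%M} UTC] Status: {status}",
        f"Current impact: {impact}",
        f"Last action: {last_action}",
        f"Next update: {next_update:%H:%M} UTC",
    ])


if __name__ == "__main__":
    now = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
    print(format_update(now, "SEV-2", "investigating",
                        "Checkout error rate elevated",
                        "Checked recent deploys"))
```

Deriving "Next update" from the severity keeps the posted time consistent with the cadence the ladder promises.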

3.2 Status Page Updates

SEV-1: update the status page within 5 minutes of confirming the incident. Use plain language — no jargon.
SEV-2: update within 30 minutes.
SEV-3 and below: update at your discretion.

3.3 Stakeholder Comms

Who	When	How	What to say
Engineering management	On SEV-1 immediately; SEV-2 within 15 min	Slack DM / call	Severity, impact, current status
Customer Success / Support	On SEV-1 and SEV-2 within 30 min	#support channel	Talking points for customer queries
Legal / DPO	If personal data may be affected	Direct message or email	Nature of potential data exposure

3.4 Customer Comms Templates

Initial (SEV-1 / SEV-2):

We are aware of an issue affecting [feature]. Our team is investigating and we will provide an update by [time]. We apologise for the disruption.

Update:

We have identified the cause of the [feature] issue and are working on a fix. Current ETA: [time]. We will update you again at [time].

Resolution:

The [feature] issue has been resolved as of [HH:MM UTC]. We apologise for the inconvenience and will share a full post-incident summary shortly.
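
A common mistake with these templates is sending a message that still contains a literal `[time]` or `[feature]`. The sketch below shows one way to guard against that; `fill_template` is a hypothetical helper, and it assumes every square-bracketed token in a template is a placeholder.

```python
import re

# Matches one [placeholder] token; assumes brackets are used only for placeholders.
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: dict) -> str:
    """Replace each [name] with values[name]; fail loudly if one is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"unfilled placeholder: [{key}]")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    initial = ("We are aware of an issue affecting [feature]. Our team is "
               "investigating and we will provide an update by [time].")
    print(fill_template(initial, {"feature": "checkout", "time": "14:30 UTC"}))
```

Raising on a missing value, rather than leaving the token in place, means an incomplete message cannot be posted by accident.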

4. Escalation Paths

Situation	Escalate to	How	Time threshold
No progress after 30 min on SEV-1	Engineering lead / CTO	Phone call	30 min
Database corruption suspected	Database administrator	PagerDuty — DBA on-call	Immediately on suspicion
Security breach suspected	Security lead	Direct call; do not use public channels	Immediately on suspicion
Regulatory notification required	Legal / Compliance	Email + call	Within 1 h of confirmation
Customer SLA breach imminent	Account manager / Customer Success lead	Slack DM	30 min before SLA window closes

5. Mitigation Actions

Try these in order before escalating to code changes:

Rollback: revert to the previous deployment if the incident coincides with a recent deploy.
Disable feature flag: if the affected feature is behind a flag, disable it immediately.
Scale up: increase the number of instances / pods if the issue is capacity-related.
Failover: switch to the secondary region or standby database if the primary is down.
Throttle / circuit-break: enable rate limiting or open the circuit breaker to protect downstream services.
Restart: restart the affected service as a last resort if none of the above apply.

Document every action taken and its outcome in the incident timeline.
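
The ordering above amounts to "take the first step whose precondition holds, and restart only when none do". A minimal sketch of that decision, assuming the on-call engineer supplies the signals by hand; the `Signals` fields, step names, and `next_mitigation` function are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """What the on-call engineer has observed so far (illustrative fields)."""
    recent_deploy: bool = False
    behind_feature_flag: bool = False
    capacity_related: bool = False
    primary_down: bool = False
    downstream_overloaded: bool = False


# Steps in runbook order, each paired with the condition under which it applies.
MITIGATIONS = [
    ("rollback", lambda s: s.recent_deploy),
    ("disable_feature_flag", lambda s: s.behind_feature_flag),
    ("scale_up", lambda s: s.capacity_related),
    ("failover", lambda s: s.primary_down),
    ("throttle_or_circuit_break", lambda s: s.downstream_overloaded),
]


def next_mitigation(signals: Signals, already_tried: tuple = ()) -> str:
    """Return the first applicable step not yet tried, in runbook order."""
    for name, applies in MITIGATIONS:
        if name not in already_tried and applies(signals):
            return name
    # Restart is the last resort; after that, escalate to code changes.
    return "restart" if "restart" not in already_tried else "escalate_to_code_change"


if __name__ == "__main__":
    observed = Signals(recent_deploy=True, capacity_related=True)
    print(next_mitigation(observed))                 # first applicable step
    print(next_mitigation(observed, ("rollback",)))  # next one after rollback
```

Passing `already_tried` mirrors the instruction to record each action and its outcome: the timeline is what tells you which step comes next.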

6. Recovery

Verify recovery against the alerting conditions that fired: confirm metrics return to normal thresholds.
Monitor for 15 minutes after mitigation before declaring the incident resolved.
Declare resolved: post a resolution message in the incident channel and update the status page.
Leave the incident channel open for 24 hours in case of recurrence.

7. Post-incident

Schedule an RCA / post-mortem session within 48 hours (within 24 hours for SEV-1).
Use the Root cause analysis template to structure the write-up.
Ensure action items from the RCA are assigned and tracked in the issue tracker.
Update this runbook if the incident revealed a gap.

Incident Response Runbook — AcmeCorp Payments Service

Version: 2.1
Owner: Platform On-call Team
Last reviewed: 2024-05-01

1. Severity Ladder

Severity	Definition	User impact	Response time	Comms cadence
SEV-1	Total payment processing outage; no transactions completing	All checkout users affected; revenue at a standstill	Immediate — page primary on-call and backup	Every 15 min internally; status page updated within 5 min
SEV-2	Significant payment degradation: error rate > 1%, or p95 latency > 5 s for > 5 min	Large subset of users experiencing checkout failures or severe slowdowns	Within 15 min	Every 30 min internally; status page updated within 30 min
SEV-3	Elevated error rate of 1% or less, or a single payment method down (e.g. only Apple Pay affected)	Subset of users; workaround (use alternative payment method) exists	Within 2 h during business hours	Hourly internally; status page update at on-call engineer's discretion
SEV-4	Minor UI glitch on payment confirmation screen; no transaction failures	Cosmetic; no user blocked	Next business day	Logged in Jira; no comms required
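
The numeric thresholds in this ladder can be written down as a small function so that boundary cases are decided once rather than during an incident. This is a sketch under stated assumptions, not part of the Payments tooling: the names are hypothetical, the inputs are assumed to already reflect a breach sustained for the required duration, an error rate of exactly 1% is treated as SEV-3, and "elevated" is taken to mean above the 0.1% recovery threshold from section 6. SEV-4 is left to human judgment.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def classify_payment_alert(
    error_rate_pct: float,
    p95_latency_s: float,
    transactions_completing: bool = True,
    single_method_down: bool = False,
) -> Optional[Severity]:
    """Map checkout metrics onto the severity ladder (hypothetical helper).

    Returns None when the metrics alone do not justify SEV-1 to SEV-3.
    """
    if not transactions_completing:
        return Severity.SEV1  # total payment processing outage
    if error_rate_pct > 1.0 or p95_latency_s > 5.0:
        return Severity.SEV2  # significant degradation
    # Assumption: "elevated" means above the 0.1% recovery threshold.
    if error_rate_pct > 0.1 or single_method_down:
        return Severity.SEV3
    return None


if __name__ == "__main__":
    print(classify_payment_alert(error_rate_pct=2.3, p95_latency_s=0.8))
```

The function only suggests a starting severity; the triage rule of defaulting to SEV-2 when unsure still applies.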

2. Initial Triage (First 5 Minutes)

Acknowledge the alert in PagerDuty to stop escalation to the backup on-call.
Assess severity using the table above. For payment alerts, default to SEV-2 if you are unsure — it is easier to downgrade than to under-respond.
Open an incident channel: create #inc-YYYY-MM-DD-payments-<brief> in Slack.
Start a timeline doc: duplicate the template at Notion/On-call/Timeline-Template and paste the PagerDuty alert link as the first entry.
Notify the team: post in #platform-oncall with the severity, channel link, and a one-line description of the alert.
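
Channel names following the `#inc-YYYY-MM-DD-payments-<brief>` pattern can be generated so that they stay lowercase and free of spaces. A minimal sketch; `incident_channel_name` is a hypothetical helper, and the slug rule (lowercase letters, digits, and hyphens only) is an assumption about what the chat tool accepts:

```python
import re
from datetime import date


def incident_channel_name(brief: str, on: date, service: str = "payments") -> str:
    """Build '#inc-YYYY-MM-DD-<service>-<brief>' from a free-text description."""
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", brief.lower()).strip("-")
    return f"#inc-{on.isoformat()}-{service}-{slug}"


if __name__ == "__main__":
    print(incident_channel_name("Checkout errors spike!", date(2024, 5, 1)))
    # prints "#inc-2024-05-01-payments-checkout-errors-spike"
```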

3. Communications

3.1 Internal Updates

Post in the incident channel at the cadence set by the severity ladder (every 15 min for SEV-1, every 30 min for SEV-2). Use this format (SEV-2 example):

[14:00 UTC] Status: investigating
Current impact: Checkout error rate 2.3% — approximately 140 failed transactions in the last 30 min
Last action: Checked recent deploys — no deploy in last 4 h. Reviewing DB connection metrics now.
Next update: 14:30 UTC

3.2 Status Page Updates

SEV-1: update status.acmecorp.io within 5 minutes of confirming the incident.
SEV-2: update within 30 minutes.
Use "Investigating" status initially; move to "Identified" once root cause is known.

3.3 Stakeholder Comms

Who	When	How	What to say
Engineering Director (Fatima Yusuf)	SEV-1 immediately; SEV-2 within 15 min	Slack DM	Severity, impact in £/min if estimable, current status and ETA
Customer Success (Lena Park)	SEV-1 and SEV-2 within 30 min	#customer-success channel	Talking points: "Payment processing is degraded. Customers experiencing errors should try again in X minutes. We are working on it."
Legal / DPO (Marcus Webb)	Only if cardholder data may be at risk	Direct call	Nature of potential exposure — do not speculate publicly

3.4 Customer Comms Templates

Initial (SEV-2):

We are aware that some customers are experiencing difficulties completing checkout. Our team is investigating. We apologise for the disruption and will provide an update by [time].

Update:

We have identified the cause of the checkout issue — a database configuration problem — and are deploying a fix. Current ETA for resolution: [time]. We will update you again at [time].

Resolution:

Checkout is now fully restored as of [HH:MM UTC]. We apologise for the inconvenience. A full post-incident summary will be shared within 48 hours.

4. Escalation Paths

Situation	Escalate to	How	Time threshold
No progress after 30 min on SEV-2	Engineering Director (Fatima Yusuf)	Phone call	30 min
DB corruption or data loss suspected	DBA on-call (paged via PagerDuty policy "DBA-oncall")	PagerDuty	Immediately on suspicion
Suspected card data exposure (PCI DSS)	Security lead (Tom Nkrumah) + Legal	Direct call; no Slack	Immediately on suspicion
Payment provider (Stripe) is the source of errors	Stripe support (check the Stripe status page first)	Open a support case at dashboard.stripe.com → Support	Within 15 min of confirming provider is the source
SLA breach imminent for enterprise customer	Account manager (check CRM for enterprise SLAs)	Slack DM	30 min before SLA window closes
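
The two time-based rows in this table (30 minutes without progress, and 30 minutes before an SLA window closes) are simple to check mechanically. The sketch below is illustrative only: `escalations_due` and the returned labels are hypothetical, and "progress" is assumed to mean the timestamp of the last meaningful entry in the incident timeline.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

NO_PROGRESS_LIMIT = timedelta(minutes=30)  # from the "No progress" row
SLA_WARNING = timedelta(minutes=30)        # from the "SLA breach imminent" row


def escalations_due(
    now: datetime,
    last_progress_at: datetime,
    sla_window_closes_at: Optional[datetime] = None,
) -> List[str]:
    """Return which time-based escalations are due right now."""
    due = []
    if now - last_progress_at >= NO_PROGRESS_LIMIT:
        due.append("engineering_director")
    if sla_window_closes_at is not None and sla_window_closes_at - now <= SLA_WARNING:
        due.append("account_manager")
    return due


if __name__ == "__main__":
    start = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
    print(escalations_due(start + timedelta(minutes=35), start))
```

The suspicion-based rows (data loss, card data exposure) have no timer and should be escalated as soon as the suspicion arises.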

5. Mitigation Actions

Rollback: check #deploys — if a deploy happened in the last 4 hours, roll back via acmecorp deploy rollback payments --env prod.
Disable feature flag: if the error started after a flag was enabled, toggle it off in LaunchDarkly — flag namespace payments.*.
Scale up: if high CPU or memory usage triggered the alert, run acmecorp scale payments --instances 8 --env prod (default is 4).
Failover to secondary DB: if the primary RDS instance is unhealthy, follow the DB failover runbook at Notion/Runbooks/DB-Failover.
Enable rate limiting: if the API is being hit unusually hard, enable the emergency rate limit profile in Cloudflare under the Payments WAF rule group.
Circuit-break Stripe: if Stripe is returning errors, enable the PAYMENTS_USE_FALLBACK_PROVIDER feature flag to route to the backup provider (PaySafe) — note: PaySafe does not support Apple Pay.

6. Recovery

Verify recovery: confirm the CheckoutErrorRate Datadog monitor returns to green (< 0.1% for 5 consecutive minutes).
Monitor for 15 minutes after the error rate returns to normal.
Declare resolved: post a resolution message in the incident channel (#inc-*), update status.acmecorp.io to "All Systems Operational", and DM the stakeholders listed in section 3.3.
Leave the incident channel open for 24 hours — archive it after that if no recurrence.
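
The recovery criteria above can be expressed as a check over per-minute error-rate samples. This is a sketch rather than a description of how the Datadog monitor works: the function name and input shape are hypothetical, and it takes the conservative reading that the 15-minute monitoring period starts only after the 5 consecutive healthy minutes, for 20 healthy minutes in total.

```python
from typing import Sequence


def minutes_until_resolvable(
    per_minute_error_pct: Sequence[float],
    threshold_pct: float = 0.1,
    green_after_min: int = 5,
    monitor_min: int = 15,
) -> int:
    """Minutes of healthy readings still needed before declaring resolved.

    Samples are ordered oldest first, one per minute. Returns 0 when the
    incident can be declared resolved.
    """
    required = green_after_min + monitor_min
    healthy_streak = 0
    # Count the unbroken run of healthy minutes ending at the latest sample.
    for rate in reversed(per_minute_error_pct):
        if rate >= threshold_pct:
            break
        healthy_streak += 1
    return max(0, required - healthy_streak)


if __name__ == "__main__":
    samples = [2.3, 2.1, 1.4] + [0.05] * 12
    print(minutes_until_resolvable(samples))  # 12 healthy minutes so far
```

Any single reading at or above the threshold resets the count, which matches the intent of waiting for a sustained recovery rather than a momentary dip.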

7. Post-incident

Schedule the RCA session within 24 hours for SEV-1, or within 48 hours for SEV-2.
Use the Root cause analysis template — assign the author in the incident channel before signing off.
Ensure all RCA action items are in Jira under the INFRA project with a due date.
Update this runbook with any gaps discovered during the incident.
