Q16 of 38 · Performance
Walk me through how you'd plan capacity testing for a Black Friday spike.
PerformanceSeniorperformancecapacity-planningblack-fridayspikegame-day
Short answer
Short answer: Model expected peak from historical data (e.g. 5x last year's peak), test 2x peak to verify headroom, run scenarios for spike, sustained, and graceful degradation. Confirm autoscaling triggers, downstream rate limits hold, and rollback works. Finish with a cross-team game-day rehearsal.
Detail
Step 1 — Model the load.
- Pull historical Black Friday data: peak RPS, shape (gradual ramp or 9am hammer), duration, transaction mix.
- Multiply by realistic growth: marketing's 30% growth target, the 2-3 retail-trend uplift years often see, and a safety margin.
- Result: target peak (X RPS), test peak (1.5-2X for headroom).
Step 2 — Build the scenarios.
- Sustained peak: hold target RPS for 4 hours (the actual sale window). Catches connection pool exhaustion, log disk fill-up, slow leaks.
- Sudden spike: 0 → 2X in 2 minutes. Simulates an email blast or 9am door-open.
- Graceful degradation: ramp past 2X until something breaks, observe failure mode.
- Recovery: drop traffic abruptly to baseline. Does the system drain queues? Do circuits reset?
Step 3 — Validate the dependencies.
- Autoscaling: does it react fast enough? Many groups have 5-min scale-up — too slow for a 2-min spike. Pre-warm capacity.
- Downstream services: payment, fraud, inventory, shipping APIs. Each has rate limits; confirm they hold or you have fallbacks.
- Caches: Redis/CDN are doing the heavy lifting. What happens on a cold cache spike?
- Database: read replicas, connection limits, slow query catalogue.
Step 4 — Game day.
- Run the test against staging that mirrors prod sizing (or a controlled canary in prod off-peak). Have eng, SRE, product, and customer support in one room (or one Slack channel). Practise the rollback. Practise the comms.
Step 5 — Document the playbook.
- Triggers: latency p95 > X, error rate > Y, queue depth > Z.
- Actions: scale up, disable non-essential features (recommendation engine, analytics ingest), comms templates.
- Owner: one person with authority to call the kill-switch.
Senior bonus: on the day, watch dashboards together as a team. Most P1s during peak events are not technical — they're communication failures.
// WHAT INTERVIEWERS LOOK FOR
End-to-end planning — load model, scenarios, dependency validation, game day, runbook. The signal of seniority is the cross-functional and operational thinking, not just the test design.
// COMMON PITFALL
Testing only in QA against a small staging env that doesn't represent prod. The day-of failure mode is one only prod-sized infra exposes.