What's your approach to soak testing memory leaks over 8+ hour runs?

Performance · Senior · performance · soak-test · memory-leaks · heap-analysis · long-running

Short answer

Run sustained moderate load (30-50% of peak) for 8-24 hours while sampling heap, RSS, file descriptors, DB connection counts, and GC frequency at regular intervals. Plot the trends: flat is healthy, a monotonic rise is a leak. Take periodic heap dumps for diff analysis.

Detail

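Why soak tests find what load tests can't: a leak of 1 MB per 1,000 requests adds under 50 MB in a 30-minute load test, which is easy to miss, but about 1.4 GB over 24 hours, which is enough to kill the process. (Both figures imply a steady rate of roughly 16 requests per second.) A soak run is long enough for the slope to become visible.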
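The arithmetic behind those figures can be sketched as follows. The request rate is an assumption, chosen so that the 24-hour total lands near 1.4 GB; it is not stated in the answer itself.

```python
# Assumed steady request rate (hypothetical; chosen so 24 h lands near 1.4 GB).
REQUESTS_PER_SECOND = 16.2
# Leak size taken from the text: 1 MB per 1000 requests.
LEAK_MB_PER_REQUEST = 1 / 1000


def leaked_mb(duration_hours: float) -> float:
    """Total memory leaked after running at the assumed rate for duration_hours."""
    requests = REQUESTS_PER_SECOND * duration_hours * 3600
    return requests * LEAK_MB_PER_REQUEST


if __name__ == "__main__":
    for hours in (0.5, 8, 24):
        print(f"{hours:>4} h -> {leaked_mb(hours):7.1f} MB leaked")
```

At this rate the leak is about 29 MB after 30 minutes and about 1.4 GB after 24 hours, so only the long run makes the growth hard to miss.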

Test setup:

  • Load: 30-50% of peak. Goal isn't to stress, it's to exercise every code path repeatedly.
  • Duration: 8 hours minimum, 24 hours ideal. Some leaks (DB-backed caches, external session stores) only manifest after a daily cycle.
  • Variety: rotate through the realistic transaction mix. A single endpoint loop won't exercise the leaking path.
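One way to rotate through a transaction mix is a weighted random draw, sketched below. The endpoint names and weights are hypothetical placeholders; in practice they would come from production traffic ratios, and the selected name would be handed to whatever load tool is in use.

```python
from __future__ import annotations

import random
from collections import Counter

# Hypothetical transaction mix: names and weights are placeholders only.
TRANSACTION_MIX = {
    "GET /search": 50,
    "GET /product": 30,
    "POST /cart": 15,
    "POST /checkout": 5,
}


def pick_transactions(n: int, mix: dict[str, int], seed: int | None = None) -> list[str]:
    """Draw n transaction names at random, proportionally to their weights."""
    rng = random.Random(seed)
    names = list(mix)
    weights = list(mix.values())
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    sample = pick_transactions(10_000, TRANSACTION_MIX, seed=1)
    for name, count in Counter(sample).most_common():
        print(f"{name:<15} {count / len(sample):.1%}")
```

Over a multi-hour run, even the low-weight transactions are exercised many times, which is what gives a leak on a rarely used path the chance to show up.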

What to monitor (every minute or finer):

  • Heap used (JVM, Node, .NET). Raw usage sawtooths between collections, so track the post-GC floor. Floor rising steadily? Leak.
  • RSS (resident set size) — OS-level memory. Diverges from heap? Off-heap allocation leak (DirectByteBuffer, native libs).
  • File descriptor count: lsof -p <pid> | wc -l or ls /proc/<pid>/fd | wc -l. Climbing? Unclosed sockets/files.
  • DB connections — pool checkout count. Climbing? Connections not being returned.
  • GC stats — frequency, duration, full-GC count. GC working harder over time = heap pressure.
  • Thread count — leaking threads is rarer but devastating.
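The OS-level metrics in this list can be sampled with a small script; the following is a minimal sketch that assumes Linux and a readable /proc. The function names are illustrative. Heap, GC, and connection-pool numbers are runtime-specific and would come from the runtime's own tooling, so they are not covered here.

```python
from __future__ import annotations

import os
import time


def parse_status(status_text: str) -> dict[str, int]:
    """Extract RSS (kB) and thread count from the text of /proc/<pid>/status."""
    fields: dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "VmRSS":
            fields["rss_kb"] = int(value.split()[0])  # value looks like "  204800 kB"
        elif key == "Threads":
            fields["threads"] = int(value.strip())
    return fields


def sample_process(pid: int) -> dict[str, int]:
    """Take one sample of RSS, thread count, and open fd count (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        sample = parse_status(f.read())
    sample["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    sample["timestamp"] = int(time.time())
    return sample


if __name__ == "__main__":
    # Samples this script's own process as a stand-in for the service under test.
    print(sample_process(os.getpid()))
```

Calling sample_process once a minute and appending each result to a CSV file gives the time series needed for plotting.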

Analysis:

  • Plot each metric vs. time. For a first pass, visual inspection usually spots a slope faster than statistics.
  • Take heap dumps at hour 0, hour 4, hour 8 — diff with Eclipse MAT or VisualVM. Objects that grow disproportionately are suspects.
  • Correlate with logs: which transactions ran during the slope? That's where the leak path lives.
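The heap-dump diff boils down to comparing per-class totals between two snapshots. The sketch below assumes each snapshot has already been exported as a mapping of class name to bytes; the class names and numbers are invented for illustration, and tools such as Eclipse MAT do this comparison for you.

```python
from __future__ import annotations


def top_growers(before: dict[str, int], after: dict[str, int], limit: int = 5):
    """Rank classes by byte growth between two heap snapshots.

    Returns tuples of (class_name, bytes_before, bytes_after, delta),
    largest delta first. Classes that shrank or stayed flat are skipped.
    """
    growth = []
    for cls, after_bytes in after.items():
        before_bytes = before.get(cls, 0)  # class absent earlier counts as 0
        delta = after_bytes - before_bytes
        if delta > 0:
            growth.append((cls, before_bytes, after_bytes, delta))
    growth.sort(key=lambda row: row[3], reverse=True)
    return growth[:limit]


# Invented example data: bytes per class at hour 0 and hour 8.
HOUR_0 = {"byte[]": 40_000_000, "com.example.SessionEntry": 1_000_000, "java.lang.String": 12_000_000}
HOUR_8 = {"byte[]": 42_000_000, "com.example.SessionEntry": 90_000_000, "java.lang.String": 12_500_000}

if __name__ == "__main__":
    for cls, b, a, delta in top_growers(HOUR_0, HOUR_8):
        print(f"{cls:<28} {b:>12,} -> {a:>12,}  (+{delta:,})")
```

In this made-up data the hypothetical SessionEntry class grows far more than anything else, which would make it the first suspect to trace back to its owning collection.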

Handling false positives:

  • Caches that grow to a steady state (LRU) are not leaks — they level off.
  • Connection pools that ramp to max and stay are not leaks.
  • A genuine leak grows without bound.
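A rough way to separate the two cases automatically is to check whether the metric is still climbing at the end of the run. The sketch below compares the final quarter of the samples with the quarter before it; the 2% tolerance is an arbitrary assumption, and a very slow leak could still slip under it on a short run.

```python
from __future__ import annotations

from statistics import mean


def has_levelled_off(samples: list[float], tolerance: float = 0.02) -> bool:
    """Return True if the final quarter of the run is within `tolerance`
    (relative) of the quarter before it, i.e. growth has effectively stopped."""
    if len(samples) < 8:
        raise ValueError("need at least 8 samples to compare two quarters")
    quarter = len(samples) // 4
    previous = mean(samples[-2 * quarter:-quarter])
    final = mean(samples[-quarter:])
    if previous <= 0:
        raise ValueError("samples must be positive (e.g. memory in MB)")
    return (final - previous) / previous <= tolerance


if __name__ == "__main__":
    # Synthetic series: a cache that fills to 500 MB, and a steady leak.
    cache = [min(100 + 50 * i, 500) for i in range(32)]
    leak = [100 + 10 * i for i in range(32)]
    print("cache levelled off:", has_levelled_off(cache))
    print("leak levelled off: ", has_levelled_off(leak))
```

The cache series flattens and passes the check, while the leak series keeps rising and fails it.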

CI integration: run the soak as a weekly job that posts results to a dashboard and alerts when the slope exceeds X MB/hour. Don't gate PRs on soak: it is too slow and too noisy.
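The slope alert could be computed with an ordinary least-squares fit over the sampled metric, as in this sketch. The threshold stays a caller-supplied parameter because the right value depends on the service; the 5 MB/hour used in the demo is a placeholder, not a recommendation.

```python
from __future__ import annotations


def slope_mb_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (elapsed_hours, memory_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def exceeds_threshold(samples: list[tuple[float, float]], threshold_mb_per_hour: float) -> bool:
    """True if the fitted growth rate is above the alert threshold."""
    return slope_mb_per_hour(samples) > threshold_mb_per_hour


if __name__ == "__main__":
    # Synthetic hourly samples growing by 12 MB every hour.
    run = [(float(h), 512.0 + 12.0 * h) for h in range(9)]
    print(f"slope: {slope_mb_per_hour(run):.1f} MB/hour")
    print("alert:", exceeds_threshold(run, threshold_mb_per_hour=5.0))  # placeholder threshold
```

A CI job could turn the boolean into a dashboard annotation or a notification. Fitting over the whole run rather than comparing first and last samples keeps a single GC spike from triggering the alert.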

// WHAT INTERVIEWERS LOOK FOR

Naming soak's specific value, the metrics suite (heap + RSS + fds + connections), heap-dump diffing, and distinguishing real leaks from caches reaching steady state.

// COMMON PITFALL

Stopping the soak when no metric breaches a threshold but never analysing trend. A leak that hasn't OOMed yet is still a leak; the slope tells you.