
How do you detect flaky tests in CI at scale and manage a quarantine process?

CI/CD & DevOps · Mid · ci-cd · flaky-tests · quarantine · test-reliability · pipeline

Short answer

Track each test's pass/fail history across runs. Mark a test flaky when it both fails and passes on the same commit, since the code did not change between those runs. Quarantine it by excluding it from the PR gate while investigation proceeds, but enforce a maximum quarantine period, after which the test is deleted.

Detail

Flakiness detection requires per-test history, not just per-run pass/fail. Tools like Buildkite Test Analytics, Currents.dev, or a self-hosted JUnit XML database give you a failure rate per test over the last N runs.
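As a rough illustration of what per-test history makes possible, the sketch below flags tests that show both a pass and a fail on the same commit and reports their failure rate. The `RunRecord` shape and the `find_flaky_tests` name are hypothetical; a real setup would read these rows from parsed JUnit XML or from whichever analytics tool is in use.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    """One test outcome from one CI run (hypothetical record shape)."""
    test_id: str
    commit: str
    passed: bool


def find_flaky_tests(records: Iterable[RunRecord]) -> dict[str, float]:
    """Return {test_id: failure_rate} for tests with mixed results on one commit."""
    outcomes = defaultdict(set)   # (test_id, commit) -> set of observed outcomes
    runs = defaultdict(int)       # test_id -> total runs seen
    failures = defaultdict(int)   # test_id -> failed runs seen

    for rec in records:
        outcomes[(rec.test_id, rec.commit)].add(rec.passed)
        runs[rec.test_id] += 1
        if not rec.passed:
            failures[rec.test_id] += 1

    # A test is flaky if any single commit saw both a pass and a fail.
    flaky = {test_id for (test_id, _), seen in outcomes.items() if len(seen) == 2}
    return {test_id: failures[test_id] / runs[test_id] for test_id in sorted(flaky)}


history = [
    RunRecord("test_login", "abc123", False),
    RunRecord("test_login", "abc123", True),   # same commit, different result
    RunRecord("test_cart", "abc123", False),
    RunRecord("test_cart", "def456", False),   # consistently failing, not flaky
    RunRecord("test_search", "abc123", True),
]
flaky_report = find_flaky_tests(history)
```

Here only `test_login` is reported: `test_cart` fails every time, which points to a real breakage rather than flakiness.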

A practical quarantine workflow: when a test flips result across two runs on the same commit, automatically open an issue and tag the test @quarantine. The nightly pipeline still runs quarantined tests (to catch ones that are consistently failing), but the PR gate excludes them so they do not block merges.
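The split between the PR gate and the nightly pipeline could be sketched as a simple selection step. The function name `select_tests` and the pipeline labels `"pr"` and `"nightly"` are assumptions for illustration, not part of any specific CI product.

```python
def select_tests(all_tests: list[str], quarantined: set[str], pipeline: str) -> list[str]:
    """Pick which tests a pipeline should run (hypothetical helper)."""
    if pipeline == "nightly":
        # Nightly runs everything, quarantined tests included.
        return list(all_tests)
    if pipeline == "pr":
        # The PR gate skips quarantined tests so they cannot block merges.
        return [t for t in all_tests if t not in quarantined]
    raise ValueError(f"unknown pipeline: {pipeline!r}")


suite = ["test_login", "test_cart", "test_search"]
in_quarantine = {"test_login"}

pr_tests = select_tests(suite, in_quarantine, "pr")
nightly_tests = select_tests(suite, in_quarantine, "nightly")
```

In a pytest-based suite, a similar effect is commonly achieved with a custom marker (for example one named `quarantine`, registered in the project config) and running the PR gate with `pytest -m "not quarantine"`.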

Quarantine must have an SLA: if a test stays quarantined for more than two weeks without a fix, it gets deleted. A quarantine that only ever grows becomes a graveyard that erodes confidence in the suite. Active triage beats passive accumulation.
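One way to enforce the two-week limit is a scheduled job that lists quarantined tests past their deadline. This is a minimal sketch: the `quarantined_on` mapping (test name to quarantine start date) and the helper name are hypothetical, and the 14-day value simply mirrors the window described above.

```python
from datetime import date, timedelta

QUARANTINE_SLA = timedelta(days=14)  # the two-week window; adjust to team policy


def overdue_quarantines(
    quarantined_on: dict[str, date],
    today: date,
    sla: timedelta = QUARANTINE_SLA,
) -> list[str]:
    """Return tests quarantined for longer than the SLA, sorted by name."""
    return sorted(
        test_id
        for test_id, start in quarantined_on.items()
        if today - start > sla
    )


quarantine_log = {
    "test_a": date(2024, 3, 1),    # exactly 14 days: still within the SLA
    "test_b": date(2024, 2, 20),   # 24 days: overdue
    "test_c": date(2024, 2, 29),   # 15 days: overdue
}
overdue = overdue_quarantines(quarantine_log, today=date(2024, 3, 15))
```

The job could then delete the overdue tests or escalate the linked issues, depending on how strictly the team applies the policy.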

// WHAT INTERVIEWERS LOOK FOR

Understanding that flakiness detection requires historical data, not a single rerun. The quarantine-versus-delete trade-off. A maximum quarantine window to avoid the graveyard anti-pattern.