Q16 of 38 · CI/CD & DevOps
How do you debug a CI pipeline that passes locally but fails in CI?
CI/CD & DevOps · Senior · ci-cd · debugging · environment · reproducibility
Short answer
Compare environments: Node/Python version, OS, env vars, file system case-sensitivity, parallel workers, network egress, time zone, locale. Get a shell on the runner if your CI supports it (SSH into a Buildkite agent, the tmate action on GitHub Actions). Reproduce locally in Docker with the runner's image.
Detail
Common causes (in rough frequency order):
- Implicit env vars: local shell has `AWS_PROFILE` set; CI doesn't. Or local has a stale `.env.local` that is not in the repo.
- OS / file system differences: macOS is case-insensitive by default; Linux runners are case-sensitive. An import of `./Foo` resolves locally even when the file on disk is named `foo`, then fails in CI.
- Parallel workers: Jest sizes its worker pool from the core count (half the cores in watch mode, cores minus one in a single run), so a laptop and a CI runner schedule tests differently. Race conditions only show under a different level of parallelism.
- Different tool versions: Node 18 locally, Node 20 in CI. Browser version drift in Playwright.
- Time zone / locale: local is in your own zone (say UTC+2), CI runner is UTC; tests asserting on date strings fail.
- Network egress restrictions: corp proxy locally allows everything, CI runner only sees specific endpoints.
- File handle / process limits: runners and containers often have different `ulimit` values than your machine; tests opening many files or connections hit caps.
- Random seed not pinned: locally, `Math.random()` happens to produce passing values; in CI it doesn't.
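The case-sensitivity cause is cheap to catch before CI does. A minimal sketch, assuming Node and a hypothetical helper named `existsWithExactCase` (not part of any library): it walks a relative path segment by segment and compares each one against the directory listing, which reports the real on-disk casing even on a case-insensitive file system.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns true only if every segment of `target`
// (relative to `root`) exists on disk with exactly the same casing.
// Limitation of this sketch: `..` segments are not resolved and yield false.
function existsWithExactCase(root, target) {
  let current = root;
  const segments = target.split(/[\\/]+/).filter((s) => s && s !== '.');
  for (const segment of segments) {
    let entries;
    try {
      // readdirSync returns names as stored on disk, so the comparison
      // below is case-sensitive even when the file system is not.
      entries = fs.readdirSync(current);
    } catch {
      return false; // not a directory, or unreadable
    }
    if (!entries.includes(segment)) return false;
    current = path.join(current, segment);
  }
  return true;
}

module.exports = { existsWithExactCase };
```

Run against the import specifiers in a repo, a check like this fails on macOS the same way the Linux runner would.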
Debug playbook:
- Read the failure carefully. Don't assume "flaky"; the message is usually informative.
- Reproduce in Docker. `docker run --rm -it -v "$PWD":/app -w /app cimg/node:20.10 npm test` (swap in the image your pipeline actually uses). Forces the same OS, version, and env.
- Get a shell on the runner. GitHub: `mxschmitt/action-tmate`. Buildkite: SSH into the agent. CircleCI: rerun with SSH. Five minutes of poking at the live runner beats five hours of speculative fixes.
- Bisect what's different. Print env vars at the top of the test. Print Node version, OS, locale, time zone. Compare to local.
- Force the suspected condition locally. `TZ=UTC LANG=C node --experimental-vm-modules ...` recreates the runner's environment locally.
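For the bisect step, one way to make the comparison mechanical is to print the same fingerprint locally and in CI and diff the two outputs. A sketch, assuming Node; the function name `environmentFingerprint` and the default list of env var names are illustrative choices, not a standard.

```javascript
const os = require('node:os');

// Collects the facts that most often differ between a laptop and a runner.
// `envKeys` is an assumed allow-list: print only names you choose, so
// secrets in the environment are never dumped into CI logs.
function environmentFingerprint(envKeys = ['CI', 'TZ', 'LANG', 'LC_ALL', 'NODE_ENV']) {
  const env = {};
  for (const key of envKeys) {
    env[key] = process.env[key] ?? '(unset)';
  }
  const intl = Intl.DateTimeFormat().resolvedOptions();
  return {
    node: process.version,
    platform: `${os.platform()} ${os.release()} ${os.arch()}`,
    cpus: os.cpus().length,
    timeZone: intl.timeZone,
    locale: intl.locale,
    // getTimezoneOffset() is minutes *behind* UTC, so negate it.
    utcOffsetMinutes: -new Date().getTimezoneOffset(),
    env,
  };
}

console.log(JSON.stringify(environmentFingerprint(), null, 2));
```

Save the output from both sides and diff the two JSON files; whatever differs is the shortlist of suspects.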
Prevention patterns:
- Pin every tool version (`.nvmrc`, `.python-version`, `engines` in package.json).
- Run a daily local-dev parity check that compares local tool versions against what CI runs.
- Avoid relying on `/tmp`, `HOME`, or other environment-specific paths.
- Tests should set their own time zone (`process.env.TZ = 'UTC'` in setup) rather than inheriting it.
- Use `docker compose` for local dev so the local env matches CI by default.
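Two of these patterns fit in a single setup file. A sketch, assuming a setup module that loads before any test constructs a `Date`: `TEST_SEED` is a hypothetical env var name, and `mulberry32` is a small public-domain seeded generator swapped in here because `Math.random()` cannot be seeded. Test runners differ in when they load setup code, so confirm that the `TZ` assignment takes effect in yours.

```javascript
// Pin the time zone instead of inheriting it from the laptop or the runner.
process.env.TZ = 'UTC';

// mulberry32: a small seeded PRNG. The same seed always yields the same
// sequence of floats in [0, 1), so a failing run can be replayed exactly.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// TEST_SEED is an assumed variable name; the fallback is arbitrary.
const seed = Number(process.env.TEST_SEED ?? 12345);
const random = mulberry32(seed);

// Log the seed so a CI failure can be rerun locally with the same value.
console.log(`test seed: ${seed}`);

module.exports = { mulberry32, random, seed };
```

Tests then draw from `random()` instead of `Math.random()`, and a red CI run is reproduced by exporting the logged seed before running locally.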
Senior signal: structured debugging (don't guess), comfort with SSH-into-runner, and prevention practices that stop the next case before it happens.
// WHAT INTERVIEWERS LOOK FOR
Common-cause list, structured playbook (Docker repro, SSH, bisect), and prevention patterns. Bonus for the SSH-into-runner habit; many engineers don't know it exists.
// COMMON PITFALL
Adding `continue-on-error` or retries to make CI pass while leaving the divergence unfixed. Production deploys eventually inherit the bug.