
How do you debug a CI pipeline that passes locally but fails in CI?

CI/CD & DevOps · Senior · ci-cd · debugging · environment · reproducibility

Short answer

Compare the two environments: Node/Python version, OS, env vars, file system case sensitivity, parallel workers, network egress, time zone, locale. Get a shell on the runner if your CI supports it (SSH into a Buildkite agent, CircleCI's rerun with SSH, the tmate action on GitHub). Otherwise, reproduce the failure locally in Docker using the runner's image.

Detail

Common causes (in rough frequency order):

  1. Implicit env vars: your local shell has AWS_PROFILE set and CI doesn't, or a stale .env.local that is not in the repo is still being loaded locally.
  2. OS / file system differences: macOS file systems are case-insensitive by default; Linux runners are case-sensitive. An import of ./foo resolves to Foo.js locally and fails in CI.
  3. Parallel workers: Jest sizes its worker pool from the available cores (cores minus one for a single run, half of them in watch mode), so a 2-core runner and a 10-core laptop schedule tests differently. Race conditions and order dependencies show up under only one of the two.
  4. Different tool versions: Node 18 locally, Node 20 in CI. Browser version drift in Playwright.
  5. Time zone / locale: local is, say, UTC-5 and the CI runner is UTC; tests asserting on formatted date strings pass in one environment and fail in the other.
  6. Network egress restrictions: your local network or corporate proxy allows everything; the CI runner can reach only allowlisted endpoints.
  7. File handle / process limits: CI runners and containers may have a lower ulimit than your machine; tests that open many files or connections hit the cap.
  8. Random seed not pinned: local Math.random() happens to produce passing values; CI's doesn't.
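
Cause 2 can be checked mechanically. The following is a minimal Node sketch, not part of any tool: hasExactCase and the demo file names are hypothetical. It reports whether a relative path matches the names stored on disk exactly, including letter case, which is what a Linux runner will require.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Returns true only if every segment of relPath matches an on-disk entry
// name exactly, including letter case. readdirSync reports the stored names,
// so the result is the same on case-insensitive file systems.
// Simplification: ".." segments are not supported and yield false.
function hasExactCase(rootDir, relPath) {
  let current = rootDir;
  for (const segment of relPath.split(/[\\/]+/)) {
    if (segment === '' || segment === '.') continue;
    let entries;
    try {
      entries = fs.readdirSync(current);
    } catch {
      return false; // parent is missing or is not a directory
    }
    if (!entries.includes(segment)) return false;
    current = path.join(current, segment);
  }
  return true;
}

// Demo in a throwaway directory containing a single file, Foo.js.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'case-check-'));
fs.writeFileSync(path.join(demoDir, 'Foo.js'), 'module.exports = {};\n');

console.log(hasExactCase(demoDir, './Foo.js')); // true
console.log(hasExactCase(demoDir, './foo.js')); // false, even on macOS
```

Running a check like this over import paths on a macOS machine would flag the mismatch before the Linux runner does.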

Debug playbook:

  1. Read the failure carefully. Don't assume "flaky" — usually the message is informative.
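  2. Reproduce in Docker. docker run --rm -it -v "$PWD":/app -w /app cimg/node:20.10 npm test (substitute the image your pipeline actually uses). This matches the runner's OS and tool versions; CI env vars still have to be passed in explicitly with -e or --env-file.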
  3. Get a shell on the runner. GitHub: mxschmitt/action-tmate. Buildkite: SSH into the agent. CircleCI: rerun with SSH. Five minutes of poking at the live runner beats five hours of speculative fixes.
  4. Bisect what's different. Print env vars at the top of the test. Print Node version, OS, locale, time zone. Compare to local.
  5. Force the suspected condition locally. TZ=UTC LANG=C node --experimental-vm-modules ... — recreate the runner's environment locally.
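
Step 4 is easier when both sides print the same report. The sketch below is one possible shape, assuming Node: envFingerprint and the WATCHED_VARS list are hypothetical names chosen for illustration, and the list should be adapted to what your tests actually read.

```javascript
const os = require('node:os');

// Env var names worth comparing. This list is an assumption; extend it,
// but keep secrets off it because the values are printed to the log.
const WATCHED_VARS = ['CI', 'NODE_ENV', 'TZ', 'LANG', 'LC_ALL', 'AWS_PROFILE'];

// Collects the facts that most often differ between a laptop and a runner.
function envFingerprint(env = process.env) {
  const intl = Intl.DateTimeFormat().resolvedOptions();
  const vars = {};
  for (const name of WATCHED_VARS) {
    vars[name] = env[name] ?? '<unset>';
  }
  return {
    node: process.version,
    platform: `${os.platform()} ${os.release()} ${os.arch()}`,
    cpus: os.cpus().length,
    timeZone: intl.timeZone,
    locale: intl.locale,
    utcOffsetMinutes: 0 - new Date().getTimezoneOffset(),
    vars,
  };
}

// Print as JSON so the local and CI outputs can be diffed line by line.
console.log(JSON.stringify(envFingerprint(), null, 2));
```

Run it once locally and once as the first step of the CI job, then diff the two outputs; each line that differs is a candidate to force locally in step 5.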

Prevention patterns:

  • Pin every tool version (.nvmrc, .python-version, engines in package.json).
  • Run a scheduled (for example, daily) parity check that compares local tool versions and key env vars with what CI uses.
  • Avoid relying on /tmp, HOME, or other environment-specific paths.
  • Tests should set their own time zone (process.env.TZ = 'UTC' in setup) rather than inheriting.
  • Use docker compose for local dev so the local env matches CI by default.
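
Version pins only help if something enforces them. As a rough sketch, assuming a numeric pin in .nvmrc (aliases such as lts/iron are not handled), a pre-test script could compare the pin with the running Node version. The names matchesPin and checkNodePin are hypothetical.

```javascript
const fs = require('node:fs');

// Compares a pin such as "20", "v20.10" or "20.10.0" with a Node version
// string. Only the segments present in the pin are compared, so "20"
// accepts any 20.x.y release.
function matchesPin(pin, actual = process.version) {
  const want = pin.trim().replace(/^v/, '').split('.');
  const have = actual.replace(/^v/, '').split('.');
  return want.every((part, i) => part === have[i]);
}

// Throws when the running Node version does not satisfy the pin file.
function checkNodePin(file = '.nvmrc') {
  if (!fs.existsSync(file)) return; // nothing to enforce
  const pin = fs.readFileSync(file, 'utf8');
  if (!matchesPin(pin)) {
    throw new Error(`Node ${process.version} does not match ${file} (${pin.trim()})`);
  }
}

console.log(matchesPin('20', 'v20.10.0')); // true
console.log(matchesPin('18', 'v20.10.0')); // false
```

Calling checkNodePin() from a script that runs before the tests, both locally and in CI, makes a version mismatch fail fast with a clear message instead of as an obscure test failure.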

Senior signal: structured debugging (don't guess), comfort with SSH-into-runner, and prevention practices that stop the next case before it happens.

// WHAT INTERVIEWERS LOOK FOR

Common-cause list, structured playbook (Docker repro, SSH, bisect), and prevention patterns. Bonus for the SSH-into-runner habit; many engineers don't know it exists.

// COMMON PITFALL

Adding `continue-on-error` or retries to make CI pass while leaving the divergence unfixed. Production deploys eventually inherit the bug.