Q15 of 38 · Test design
How do you design tests for a feature flag system?
Short answer
Test the flag's two states (on/off), all combinations with other flags it interacts with, the rollout mechanism (percentage, user targeting), the off-default fallback, and the cleanup pathway. Don't trust the flag platform: assume it can return wrong values, and make sure the system still degrades gracefully.
Detail
Feature flags are both a force multiplier for testing and a bug magnet. Every flag multiplies the number of states the system can be in, and naive tests miss the failure modes that flag rollouts cause.
Per-flag binary states. Each flag should be tested both on and off. If a flag is added with default off, both states need explicit coverage before promotion.
Flag interactions. Two flags A and B that both control parts of the same flow give 4 combinations. With n interacting flags there are 2^n; switch to pairwise testing once n > 4.
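A minimal sketch of explicit coverage for both states. The function `checkout_total` and the flag name `new_discount` are hypothetical, invented only for illustration; the point is that on, off, and "flag missing" each get their own test.

```python
def checkout_total(subtotal: float, flags: dict[str, bool]) -> float:
    """Hypothetical feature: 'new_discount' takes 10% off orders over 100."""
    # A missing flag is treated as off, which matches a default-off rollout.
    if flags.get("new_discount", False) and subtotal > 100:
        return round(subtotal * 0.9, 2)
    return subtotal


def test_flag_off_keeps_legacy_behaviour():
    assert checkout_total(200.0, {"new_discount": False}) == 200.0


def test_flag_on_applies_discount():
    assert checkout_total(200.0, {"new_discount": True}) == 180.0


def test_missing_flag_behaves_like_off():
    assert checkout_total(200.0, {}) == 200.0
```

The functions follow the `test_` naming convention so a runner such as pytest would collect them, but they are plain functions and can be called directly.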
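To make the 2^n versus pairwise trade-off concrete, here is a rough sketch that builds both sets of combinations. The greedy selection below is one simple way to get pairwise coverage, not an optimal generator, and it enumerates all 2^n candidates internally, so it only suits a small number of flags. All function names are illustrative.

```python
from itertools import combinations, product


def all_combinations(flags):
    """Every on/off assignment: 2^n dicts for n flags."""
    return [dict(zip(flags, values))
            for values in product([False, True], repeat=len(flags))]


def covers(row, pair):
    """True if this assignment exercises the given (flag, value) pair."""
    (a, va), (b, vb) = pair
    return row[a] == va and row[b] == vb


def pairwise_combinations(flags):
    """Greedy subset in which every pair of flags sees all 4 value pairs."""
    if len(flags) < 2:
        return all_combinations(flags)
    candidates = all_combinations(flags)
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(flags, 2)
        for va, vb in product([False, True], repeat=2)
    }
    chosen = []
    while uncovered:
        # Pick the assignment that covers the most still-uncovered pairs.
        best = max(candidates,
                   key=lambda row: sum(covers(row, p) for p in uncovered))
        chosen.append(best)
        uncovered = {p for p in uncovered if not covers(best, p)}
    return chosen
```

Each returned dict can be fed to a parametrized test as the flag state for one run. Pairwise coverage catches two-flag interactions only; a bug that needs three specific flags set together can still slip through.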
Targeting / rollout mechanisms:
- Percentage rollout: if user X falls in the 10% bucket, assert that they get the same value on every request.
- User targeting: specific users / cohorts get the flag; verify the targeting condition.
- Geo / device targeting: behaviour for each segment.
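The stickiness property for percentage rollouts can be checked against a deterministic bucketing function. The sketch below assumes hash-based bucketing on flag key plus user ID, which is a common approach but not necessarily how any particular platform does it; `bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """User is in the rollout if their bucket falls below the percentage."""
    return bucket(flag_key, user_id) < rollout_percent


def test_same_user_gets_same_value():
    results = {is_enabled("new_checkout", "user-42", 10) for _ in range(100)}
    assert len(results) == 1


def test_widening_rollout_never_drops_users():
    users = [f"user-{i}" for i in range(1000)]
    for user in users:
        if is_enabled("new_checkout", user, 10):
            assert is_enabled("new_checkout", user, 50)
```

Including the flag key in the hash means a user's bucket differs from flag to flag, so the same users are not always first into every rollout. The second test covers a property worth asserting in its own right: raising the percentage should only add users, never remove them.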
Default behaviour when the flag service is down. Critical: if your flag platform (LaunchDarkly, Split, in-house) returns an error or times out, what does the system do? The usual recommendation is to fall back to the configured default value; test that path explicitly.
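A sketch of the fallback test, using a stub in place of the real platform. The `get_flag` method and `FlagServiceError` are a made-up client interface, not the API of LaunchDarkly, Split, or any other vendor; the sketch also assumes the client raises on timeout.

```python
class FlagServiceError(Exception):
    """Hypothetical error raised when the flag service fails or times out."""


class StubClient:
    """Test double: returns a canned value, or raises a canned exception."""

    def __init__(self, result):
        self.result = result

    def get_flag(self, key):
        if isinstance(self.result, Exception):
            raise self.result
        return self.result


def evaluate(client, key: str, default: bool) -> bool:
    """Wrapper under test: never lets a platform failure reach the caller."""
    try:
        value = client.get_flag(key)
    except Exception:
        # Deliberately broad: any platform failure means "use the default".
        return default
    # Guard against wrong-typed values as well as outright errors.
    return value if isinstance(value, bool) else default


def test_service_down_uses_default():
    client = StubClient(FlagServiceError("timeout"))
    assert evaluate(client, "new_checkout", default=False) is False


def test_garbage_value_uses_default():
    assert evaluate(StubClient("yes"), "new_checkout", default=False) is False


def test_healthy_service_value_wins():
    assert evaluate(StubClient(True), "new_checkout", default=False) is True
```

The garbage-value case reflects the "don't trust the flag platform" point: a response of the wrong type is handled the same way as an outage.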
Flag transitions:
- Flag goes from off → on while a user is mid-session: does the UI reflect the change immediately? Is there a stale-state risk?
- Flag goes on → off (rollback): same question, plus does any data created under "on" survive cleanup?
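Both transition questions can be pinned down in tests once a policy is chosen. The sketch below assumes one possible policy (flag values are snapshotted at session start) and one possible data rule (records written under "on" stay readable after rollback). Neither is prescribed above, and every name here, including the `rich_notes` flag, is hypothetical.

```python
class InMemoryFlags:
    """Toy flag store that tests can flip at will."""

    def __init__(self, **values):
        self.values = dict(values)

    def is_on(self, key):
        return self.values.get(key, False)

    def set(self, key, value):
        self.values[key] = value


class Session:
    """Assumed policy: pin flag values when the session starts."""

    def __init__(self, flags, keys):
        self.snapshot = {key: flags.is_on(key) for key in keys}

    def is_on(self, key):
        return self.snapshot[key]


def write_note(text, flags):
    """Writes the new record format only while 'rich_notes' is on."""
    if flags.is_on("rich_notes"):
        return {"format": "rich", "body": {"text": text}}
    return {"format": "plain", "body": text}


def read_note(record):
    """Reads both formats regardless of the flag, so rollback is safe."""
    if record.get("format") == "rich":
        return record["body"]["text"]
    return record["body"]


def test_flip_mid_session_does_not_change_active_session():
    flags = InMemoryFlags(rich_notes=False)
    session = Session(flags, ["rich_notes"])
    flags.set("rich_notes", True)  # off -> on while the session is live
    assert session.is_on("rich_notes") is False
    assert Session(flags, ["rich_notes"]).is_on("rich_notes") is True


def test_data_created_under_on_survives_rollback():
    flags = InMemoryFlags(rich_notes=True)
    record = write_note("hello", flags)
    flags.set("rich_notes", False)  # on -> off rollback
    assert read_note(record) == "hello"
```

If the product instead wants flips to take effect immediately, the first test inverts: the assertion becomes that the live session sees the new value on its next evaluation.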
Cleanup / decommission. Old flags accumulate. Test that the codebase has explicit cleanup paths: when the flag is removed, the "on" code path becomes the default and the "off" branch is deleted. Stale flags are a common source of flag bugs.
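One way to keep cleanup from being forgotten is a test that fails once a flag outlives its planned removal date. This is a sketch under the assumption that the team keeps a flag registry in code; the registry shape, owner field, and dates are invented for illustration.

```python
from datetime import date

# Hypothetical registry: every flag declares an owner and a removal deadline.
FLAG_REGISTRY = {
    "new_discount": {"owner": "checkout-team", "remove_by": date(2030, 1, 1)},
    "rich_notes": {"owner": "notes-team", "remove_by": date(2030, 6, 1)},
}


def stale_flags(registry, today):
    """Names of flags whose removal deadline has already passed."""
    return sorted(name for name, meta in registry.items()
                  if meta["remove_by"] < today)


def test_no_flag_is_past_its_removal_date():
    overdue = stale_flags(FLAG_REGISTRY, date.today())
    assert overdue == [], f"Flags overdue for cleanup: {overdue}"
```

This only detects stale flags. Verifying the cleanup itself still means running the suite after the flag is removed and confirming the former "on" behaviour is what remains.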
Test design moves:
- Run the suite twice in CI, once with each value of any "currently rolling out" flag.
- Canary or shadow tests in production verify that the flag's effect on live traffic matches the expectation.
- Audit logs verify that each flag state change is logged with who/when/why.
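Running the suite once per flag value needs some way to force a flag from outside the test code. A minimal sketch, assuming overrides arrive as a comma-separated string (for example via a hypothetical `FLAG_OVERRIDES` environment variable set by each CI job):

```python
def parse_overrides(raw: str) -> dict[str, bool]:
    """Parse 'flag_a=on,flag_b=off' into {'flag_a': True, 'flag_b': False}."""
    overrides = {}
    for item in filter(None, (part.strip() for part in raw.split(","))):
        key, sep, value = item.partition("=")
        state = value.strip().lower()
        if not sep or state not in ("on", "off"):
            # Fail loudly: a typo in CI config should not silently skip a run.
            raise ValueError(f"bad flag override: {item!r}")
        overrides[key.strip()] = state == "on"
    return overrides


def resolve(key: str, platform_value: bool, overrides: dict[str, bool]) -> bool:
    """An override wins; otherwise use whatever the platform returned."""
    return overrides.get(key, platform_value)
```

A CI pipeline could then define two jobs that differ only in that variable, one with `new_checkout=on` and one with `new_checkout=off`, and read it with `parse_overrides(os.environ.get("FLAG_OVERRIDES", ""))` at test start-up.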
The senior signal: treating feature flags as a test design dimension, not as an afterthought.