Q30 of 37 · API testing
Walk through how you'd test eventual consistency in a distributed system.
Short answer
Tests must wait for convergence, not assume it. Poll for the expected state with a sensible timeout instead of asserting immediately after a write. Test the read-after-write SLA explicitly. Add chaos: simulate replication lag and network partitions, then verify the system still converges within bounds.
Detail
In an eventually consistent system, a write to one node isn't immediately visible at all read replicas. Test designs that assume immediate visibility flake under load and break against real production behaviour.
The fundamental shift: replace "assert immediately after write" with "wait for convergence."
// ❌ Wrong: assumes synchronous propagation
test('user appears in list after create', async () => {
  const user = await createUser({ email: 'a@x.com' });
  const list = await listUsers();
  expect(list).toContainEqual(user); // flakes on replication lag
});
// ✅ Right: wait for convergence
// (waitFor is a polling helper that retries until the predicate returns true)
test('user appears in list after create', async () => {
  const user = await createUser({ email: 'a@x.com' });
  await waitFor(async () => {
    const list = await listUsers();
    return list.some((u) => u.id === user.id);
  }, { timeout: 5000 });
});
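The snippet above calls waitFor without defining it. A minimal sketch of such a helper follows, assuming it should retry until the predicate returns a truthy value and fail once the timeout elapses. The name, the `interval` option, and the error text are illustrative, not a specific library's API. Some libraries' waitFor (Testing Library's, for example) retry on thrown errors rather than on a false return value, so the predicate style used here assumes a helper like this one.

```javascript
// Hypothetical polling helper: retries `predicate` until it returns a truthy
// value, or throws once `timeout` milliseconds have passed.
async function waitFor(predicate, { timeout = 5000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  let lastError = null;
  for (;;) {
    try {
      // The predicate may be sync or async; await handles both.
      if (await predicate()) return;
    } catch (err) {
      // A throwing read (e.g. a 404 from a lagging replica) counts as "not yet".
      lastError = err;
    }
    if (Date.now() >= deadline) {
      const detail = lastError ? ` (last error: ${lastError.message})` : '';
      throw new Error(`condition not met within ${timeout} ms${detail}`);
    }
    // Sleep briefly between attempts instead of spinning.
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}
```

The predicate always runs at least once, and the test finishes as soon as convergence is observed, so a fast system is not penalised by a generous timeout.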
The SLA conversation: every eventually consistent system has a real (or claimed) convergence window — milliseconds for in-region replicas, seconds for cross-region, minutes for some search index pipelines. Tests should encode the SLA:
// Read-your-writes SLA: 1 second
await waitFor(predicate, { timeout: 1000 });
A failing test now means "the SLA was violated," not "tests are flaky."
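One way to make that failure message literal is to keep the SLAs in a named table and have the polling helper report which one was missed. The sketch below is an assumption about how that could look: `SLA_MS`, `expectWithinSla`, and the millisecond values are placeholders, not measured numbers or an existing API.

```javascript
// Hypothetical SLA table. The values are placeholders; use the numbers
// the system actually commits to.
const SLA_MS = {
  readYourWrites: 1000,
  crossRegion: 5000,
};

// Polls `predicate` until it returns a truthy value and resolves with the
// elapsed milliseconds. If the named SLA budget runs out first, the error
// message names the SLA that was violated.
async function expectWithinSla(slaName, predicate, { interval = 25 } = {}) {
  const budget = SLA_MS[slaName];
  if (budget === undefined) throw new Error(`unknown SLA: ${slaName}`);
  const start = Date.now();
  for (;;) {
    if (await predicate()) return Date.now() - start;
    if (Date.now() - start >= budget) {
      throw new Error(`SLA violated: ${slaName} did not converge within ${budget} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}
```

Because the helper returns the observed convergence time, a test can log it or compare it against a tighter target without changing the hard timeout.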
What to test:
1. Read-your-writes. After a write, a read from the same client should see it (often guaranteed by routing to the primary). Verify the SLA holds.
2. Cross-replica visibility. Write to one region; read from another — measure how long convergence takes. Bonus: assert it's within target.
3. Index lag. After creating a record, query a secondary index (search, materialised view). May take longer; SLA is wider.
4. Convergence under load. Send 1000 writes; assert that within N seconds, all reads return the full set.
5. Conflict resolution (CRDTs, last-write-wins). Concurrent writes from two regions: which wins? Document and test the resolution policy.
6. Failure cases. Replication lag spikes, replica down — does the system surface stale data with a marker, or refuse the read, or retry?
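Item 4 (convergence under load) can be prototyped without real infrastructure. The sketch below uses an in-memory stand-in for a primary with one asynchronous replica. `LaggyStore` and `convergesWithin` are hypothetical names, and a `setTimeout` delay is only a crude model of replication lag, so treat this as a way to exercise the test logic rather than a substitute for a multi-node environment.

```javascript
// In-memory stand-in for a primary with one asynchronous replica.
// Hypothetical: real replication is not a timer; this only models lag.
class LaggyStore {
  constructor(lagMs) {
    this.lagMs = lagMs;
    this.primary = new Map();
    this.replica = new Map();
  }

  write(id, value) {
    this.primary.set(id, value);
    // The replica applies the write only after the simulated lag.
    setTimeout(() => this.replica.set(id, value), this.lagMs);
  }

  readIdsFromReplica() {
    return [...this.replica.keys()];
  }
}

// Polls the replica until every expected id is visible or the timeout passes.
// Resolves with whether it converged and how many ids were still missing.
async function convergesWithin(store, expectedIds, timeoutMs, intervalMs = 10) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const seen = new Set(store.readIdsFromReplica());
    const missing = expectedIds.filter((id) => !seen.has(id)).length;
    if (missing === 0) return { converged: true, missing: 0 };
    if (Date.now() >= deadline) return { converged: false, missing };
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test, one would write 1000 ids to the store and assert that `(await convergesWithin(store, ids, N)).converged` is true for the agreed N. Reporting the missing count on failure shows whether the system was nearly converged or far behind.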
Tooling:
- Polling helpers (waitFor, eventually, Awaitility on the JVM): the standard pattern.
- Chaos tools (Toxiproxy, Chaos Mesh): inject latency between replicas to test convergence under stress.
- Test environments with replication: many staging environments are single-node, hiding bugs the production multi-node setup will surface. Push for at-least-two-replica staging.
The honest test design:
- Slowest expected convergence drives the timeout.
- The faster path is asserted separately.
- Failures point at SLA violation, not test flake.
Anti-patterns:
- Thread.sleep(2000): too short under load, too slow when convergence is fast.
- Disabling tests during "known replication delays": these are real bugs the test should surface.
- Testing only on a single-node dev environment: production is the multi-node case; bugs hide otherwise.
The senior signal: testing convergence as a property with a measurable SLA, using polling with timeouts, and treating slow convergence as the test target rather than a nuisance to wait through.