Q21 of 38 · Performance
How would you generate realistic test data at scale for a marketplace search load test?
Short answer
Sanitised production query logs are gold: anonymise PII, then replay the actual query distribution. For synthetic generation, model a Zipfian distribution for query terms (long tail), realistic price/category mixes, and variation in personalisation signals. Set volume by replaying at production-typical RPS.
Detail
Marketplace search is a particularly hard case because the query distribution itself drives cache behaviour, query plan selection, and result-set sizes. Use the wrong distribution and the test misleads.
Source 1 — anonymised production logs. Pull a sample of search logs (1-10M queries), strip PII (geo down to city, no IPs, hash user IDs), and replay. Pros: distributions are exactly correct (term frequency, filter combinations, pagination patterns). Cons: legal/privacy review, sometimes complex sanitisation.
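A minimal sketch of the sanitisation step, assuming each log record is a dict. The field names (user_id, ip, city, query, filters, page, ts) and the SECRET_KEY value are hypothetical; adapt them to your log schema. It whitelists fields rather than deleting known-bad ones, so an unexpected PII field is dropped by default. Free-text queries can themselves contain PII (emails, phone numbers), which this sketch does not scrub.

```python
import hashlib
import hmac

# Assumption: a secret key that stays out of the test environment, so hashed
# IDs cannot be reversed by brute-forcing the user ID space.
SECRET_KEY = b"replace-with-a-real-secret"

def sanitise(record: dict) -> dict:
    """Return a replay-safe copy of one search-log record (hypothetical schema)."""
    user_hash = hmac.new(
        SECRET_KEY, str(record["user_id"]).encode(), hashlib.sha256
    ).hexdigest()[:16]
    # Whitelist: anything not copied here (IP, lat/lon, ...) is dropped.
    return {
        "user": user_hash,                      # stable pseudonym per user
        "query": record["query"],
        "filters": record.get("filters", {}),
        "page": record.get("page", 1),
        "city": record.get("city"),             # geo kept at city level only
        "ts": record["ts"],
    }
```

Keyed hashing keeps the same user mapped to the same pseudonym, so per-user patterns (personalisation, repeat searches) survive the sanitisation.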
Source 2 — synthetic with realistic distributions.
- Query terms: Zipfian distribution. Roughly 10% of distinct queries account for 90% of volume ("iphone", "shoes"); the remaining 90% are long-tail (typos, niche products). Generators that pick terms uniformly from a vocabulary produce uniform load, which is wildly unrealistic. Use numpy.random.zipf or a real query-frequency dump.
- Filter combinations: most users apply 0-2 filters; some power users apply 5+. Model the distribution.
- Pagination: 80% don't paginate; 15% go to page 2; 5% deeper. Test only deep pagination if that's your scenario, but match real usage for general load.
- Geographic / personalisation: regional preference and user history change cache hit rate. Vary user IDs across a realistic set.
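The bullets above can be sketched as a small request generator. This is an illustration under stated assumptions, not a calibrated model: the vocabulary, the exponent ZIPF_S, the filter weights, the page range for "deeper", and the user pool size are all made up. It uses the standard library's random.choices with 1/rank^s weights as a bounded stand-in for numpy.random.zipf, which draws unbounded ranks and would need truncating to a finite vocabulary anyway.

```python
import random
from collections import Counter
from itertools import accumulate

rng = random.Random(42)  # fixed seed so a run is reproducible

# Hypothetical vocabulary ordered by popularity: two head terms plus a synthetic tail.
VOCAB = ["iphone", "shoes"] + [f"tail-term-{i}" for i in range(9_998)]
ZIPF_S = 1.1  # assumed exponent; fit it to a real frequency dump if you have one
CUM_WEIGHTS = list(accumulate(1 / rank**ZIPF_S for rank in range(1, len(VOCAB) + 1)))

FILTER_COUNTS = [0, 1, 2, 3, 5]
FILTER_WEIGHTS = [0.45, 0.30, 0.15, 0.07, 0.03]  # illustrative, not measured

def sample_page() -> int:
    """80% stay on page 1, 15% reach page 2, 5% go deeper (pages 3-10 assumed)."""
    roll = rng.random()
    if roll < 0.80:
        return 1
    if roll < 0.95:
        return 2
    return rng.randint(3, 10)

def sample_request() -> dict:
    """Draw one synthetic search request."""
    return {
        "q": rng.choices(VOCAB, cum_weights=CUM_WEIGHTS, k=1)[0],
        "n_filters": rng.choices(FILTER_COUNTS, weights=FILTER_WEIGHTS, k=1)[0],
        "page": sample_page(),
        "user_id": rng.randrange(100_000),  # assumed size of the simulated user pool
    }

requests = [sample_request() for _ in range(10_000)]
top_terms = Counter(r["q"] for r in requests).most_common(3)
```

Inspecting top_terms is a quick sanity check that head terms dominate; if the counts look flat, the weights are wrong.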
Source 3 — a small handcrafted set for known-edge-case correctness. Empty searches, single-character terms, 200-char queries, queries with special characters, queries that match millions of items. Mix into the load test at low frequency; they exercise edge code paths.
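One way to mix the handcrafted set in at low frequency, as a sketch. The edge-case strings, the 1% default rate, and the function name with_edge_cases are assumptions; "the" stands in for whatever term matches millions of items in your catalogue.

```python
import random
from typing import Iterable, Iterator, Optional

# Hypothetical edge cases mirroring the categories above.
EDGE_CASES = [
    "",                                        # empty search
    "a",                                       # single character
    "x" * 200,                                 # 200-char query
    '50% off & <b>"free"</b> shipping; café',  # special characters
    "the",                                     # assumed to match a huge result set
]

def with_edge_cases(
    queries: Iterable[str],
    edge_rate: float = 0.01,
    rng: Optional[random.Random] = None,
) -> Iterator[str]:
    """Yield every query unchanged, occasionally preceded by an edge case."""
    rng = rng or random.Random(0)
    for query in queries:
        if rng.random() < edge_rate:
            yield rng.choice(EDGE_CASES)
        yield query
```

Injecting rather than replacing keeps the main distribution intact, so the edge cases add a small amount of extra traffic instead of distorting the term frequencies.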
Volume:
- Cardinality of search terms should approach production unique count — million-plus terms isn't unusual. A test with 100 terms hits caches every time and hides real cache-miss latency.
- Repeat rate matters: production might see 30% query repetition within 15 minutes. A test that repeats queries less often than production will show a lower cache hit rate; one that repeats more often will flatter it.
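Both volume properties can be measured on a dataset before the test runs. The sketch below assumes events are (unix_timestamp, query) pairs sorted by time, and defines repeat rate as the share of queries whose exact text last appeared within the preceding window; that definition is one reasonable reading, not a standard.

```python
from typing import Iterable, Tuple

def repeat_rate(events: Iterable[Tuple[float, str]], window_s: float = 900) -> float:
    """Share of queries repeated within window_s seconds (default 15 minutes).

    Assumes events are sorted by timestamp.
    """
    last_seen = {}
    repeats = total = 0
    for ts, query in events:
        previous = last_seen.get(query)
        if previous is not None and ts - previous <= window_s:
            repeats += 1
        last_seen[query] = ts
        total += 1
    return repeats / total if total else 0.0

def unique_terms(events: Iterable[Tuple[float, str]]) -> int:
    """Number of distinct query strings in the dataset."""
    return len({query for _, query in events})
```

Run the same two functions over a production sample and over the generated dataset; large gaps in either number point to a distribution problem before any load is sent.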
Tooling: load the dataset via SharedArray (k6) or CSV Data Set Config (JMeter). For very large datasets, stream from a file rather than loading everything into memory.
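For a Python-driven harness, streaming could look like the sketch below. It is not k6 or JMeter code; the function name stream_queries and the CSV-with-header layout are assumptions. Rows are read lazily, so memory use stays flat regardless of file size.

```python
import csv
from typing import Iterator

def stream_queries(path: str, loop: bool = True) -> Iterator[dict]:
    """Yield CSV rows one at a time; optionally wrap around at end of file."""
    while True:
        seen_any = False
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                seen_any = True
                yield row
        # Stop after one pass if not looping, or if the file had no rows
        # (otherwise an empty file would spin forever).
        if not loop or not seen_any:
            return
```

Looping keeps a long test supplied with data, at the cost of repeating the file in the same order; a dataset much longer than the test run avoids that artificial repetition.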
Validation: before you trust the test, compare its cache hit rate to production's. If prod is 80% and the test is 20%, your distribution is wrong — fix the data, not the system.
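The real check is the hit rate measured on the system under test. As a cheap offline proxy, you can replay a query stream through a simulated cache first. The sketch below assumes an LRU cache keyed on the exact query string with a fixed entry capacity; your actual cache key, eviction policy, and size may differ.

```python
from collections import OrderedDict
from typing import Hashable, Iterable

def lru_hit_rate(keys: Iterable[Hashable], capacity: int) -> float:
    """Estimate the hit rate of an LRU cache of `capacity` entries over a key stream."""
    cache = OrderedDict()
    hits = total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / total if total else 0.0
```

Feeding the production sample and the test dataset through the same function with the same capacity gives two comparable numbers; if they diverge badly here, they will diverge on the real cache too.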