Test Data Management
Tests are only as reliable as the data they run on. This guide covers the main sources of test data (static, generated, masked, synthetic, and seeded), plus cleanup, PII safety, and matching data to each environment.
When to use it
Use this when flaky results trace back to data rather than code, when you're setting up data for a new suite or environment, or when handling production-derived data raises privacy questions.
Who uses it
QA engineers who provision and reset the data their tests need, senior QA engineers who design a data strategy, and the data and privacy owners who must sign off on how production data is used.
WHAT IS TEST DATA MANAGEMENT
Test data management is the practice of providing the right data, in the right state, for tests to run reliably — and keeping it consistent, private, and repeatable. It covers where data comes from, how it's refreshed, how it's cleaned up, and how sensitive information is protected.
Done well, it's invisible; done badly, it's a hidden cause of many "flaky" failures. A test that passes on Monday and fails on Tuesday because the data underneath it changed isn't a flaky test; it's a data management problem wearing a test's clothes.
WHY IT MATTERS
Unreliable data makes test results untrustworthy. If a case depends on a record that may or may not exist, or on a state left behind by the last run, then a pass or fail tells you about the data, not the product. Repeatability — same input, same result — is the whole foundation of testing, and it lives or dies on data.
It also carries real risk. Test data drawn from production can expose personal information if it isn't handled carefully, turning a testing convenience into a privacy incident. Good test data management is therefore both a quality concern and a compliance one.
STATIC TEST DATA
Static test data is a fixed, hand-crafted set — known users, known products, known edge cases — checked in alongside the tests. Its strength is predictability: the data never changes, so a failure is always about the code. It's ideal for precise, deterministic cases where you need an exact, known starting point.
The trade-off is maintenance and realism. Static sets drift out of date as the schema evolves and rarely capture the messiness of real data, so they're best for targeted functional cases rather than broad, realistic scenarios.
GENERATED TEST DATA
Generated data is created on the fly by a tool or script — fake names, emails, addresses, orders — to whatever volume and shape you need. It's perfect for breadth: load testing, filling lists, and exercising many variations without hand-writing each record.
The catch is that naive generators produce plausible-but-shallow data that misses real-world edge cases and relationships. Good generation respects your constraints and referential integrity (a generated order points to a generated, valid customer); otherwise the data passes type checks while failing to represent anything real.
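As a rough illustration of generation that keeps referential integrity, the sketch below builds customers first and then draws each order's customer from that list. The function name, field names, and value ranges are assumptions made for the example, not part of any particular tool.

```python
import random


def generate_dataset(n_customers, n_orders, seed=42):
    """Generate customers and orders where every order references a generated customer."""
    rng = random.Random(seed)  # fixed seed: the same data is produced on every run
    customers = [
        {"id": i, "name": f"Customer {i}", "email": f"customer{i}@example.test"}
        for i in range(1, n_customers + 1)
    ]
    customer_ids = [c["id"] for c in customers]
    orders = [
        {
            "id": j,
            "customer_id": rng.choice(customer_ids),  # always an existing customer id
            "total_cents": rng.randint(100, 50_000),  # hypothetical value range
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_dataset(n_customers=5, n_orders=20)
```

Generating parents before children is the simplest way to guarantee valid foreign keys; a fixed seed also keeps the generated set repeatable between runs.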
MASKED PRODUCTION DATA
Masked (or anonymised) production data takes a real dataset and obscures the sensitive fields — names, emails, card numbers — while preserving the shape, volume, and relationships of genuine data. It gives you the realism of production without exposing real people, which is why regulated teams favour it.
Masking has to be done properly to be safe. Reversible or inconsistent masking can leak identities, and masking that breaks referential integrity produces data that no longer behaves like production. The masking must be irreversible, consistent across tables, and verified — not a quick find-and-replace.
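One way to get consistent masking across tables is a keyed one-way hash, sketched below with Python's standard hmac module. The key, table contents, and output format are hypothetical. The mapping is only hard to reverse while the key stays secret and out of the test environment, and a real masking job would still need verification on top of this.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would come from a secrets store
# and never be copied into the test environment.
MASKING_KEY = b"replace-with-a-secret-key"


def mask_email(email: str, key: bytes = MASKING_KEY) -> str:
    """Map an email to a stable pseudonym using a keyed one-way hash."""
    normalised = email.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    return f"user_{digest[:16]}@example.test"


# Two hypothetical tables that refer to the same person.
customers = [{"id": 1, "email": "Ada@Example.com"}]
tickets = [{"id": 10, "customer_email": "ada@example.com"}]

masked_customers = [{**c, "email": mask_email(c["email"])} for c in customers]
masked_tickets = [
    {**t, "customer_email": mask_email(t["customer_email"])} for t in tickets
]
```

Because the same input always yields the same pseudonym, a join between the two masked tables still matches, which is the property a per-row random replacement would lose.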
SYNTHETIC DATA
Synthetic data is artificially produced to statistically resemble real data without being derived from any real record. Because no real person's information is involved, it sidesteps most privacy concerns while still capturing realistic distributions and patterns — a strong option when production data is too sensitive to use even masked.
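A very small sketch of the idea, assuming the only inputs are agreed aggregate statistics rather than individual records. The plan weights and login parameters below are invented for illustration.

```python
import random

# Hypothetical target statistics, e.g. agreed with a data owner as aggregates,
# not copied from individual records.
PLAN_WEIGHTS = {"free": 0.70, "pro": 0.25, "enterprise": 0.05}
LOGINS_MEAN, LOGINS_STDDEV = 12, 5


def synthesize_accounts(n, seed=7):
    """Sample n artificial accounts that follow the target distributions."""
    rng = random.Random(seed)  # fixed seed keeps the dataset repeatable
    plans = list(PLAN_WEIGHTS)
    weights = list(PLAN_WEIGHTS.values())
    return [
        {
            "id": i,
            "plan": rng.choices(plans, weights=weights)[0],
            # Clamp at zero: a normal distribution can produce negative samples.
            "monthly_logins": max(0, round(rng.gauss(LOGINS_MEAN, LOGINS_STDDEV))),
        }
        for i in range(1, n + 1)
    ]


accounts = synthesize_accounts(1000)
```

Independent per-column sampling like this ignores correlations between fields, so it should be read as a starting point rather than a realistic generator.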
Modern approaches increasingly use AI and LLMs to generate synthetic datasets that preserve subtle relationships which simple rule-based generators miss, while keeping the result free of real PII. That AI angle, generating realistic synthetic data and keeping it PII-safe, is covered in the AI resources linked below.
SEEDED DATA
Seeding loads a known baseline dataset into an environment before testing starts, so every run begins from the same defined state. It's what makes results repeatable across runs and across people — the database is reset and re-seeded to a known point, then the tests run against it.
The key is making seeding automated and idempotent: a single command (ideally part of CI or the environment setup) that resets to a clean, known state every time. Seeding by hand, or seeding inconsistently, reintroduces exactly the non-repeatability you were trying to remove.
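A minimal sketch of an idempotent seed step, using an in-memory SQLite database so it runs anywhere. The table, the baseline rows, and the function name are assumptions for the example; a real project would point this at its own schema and database.

```python
import sqlite3

# Hypothetical baseline dataset.
BASELINE_USERS = [(1, "alice"), (2, "bob")]


def reset_and_seed(conn: sqlite3.Connection) -> None:
    """Drop and recreate the table, then load the baseline. Safe to run repeatedly."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DROP TABLE IF EXISTS users")
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO users (id, name) VALUES (?, ?)", BASELINE_USERS
        )


conn = sqlite3.connect(":memory:")
reset_and_seed(conn)
conn.execute("INSERT INTO users (id, name) VALUES (3, 'leftover')")  # stray test data
reset_and_seed(conn)  # running it again returns to the same known state
```

Rebuilding from scratch, rather than inserting only what is missing, is what makes the step idempotent: the end state does not depend on what was there before.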
DATA CLEANUP
Tests that create data must clean up after themselves, or the environment slowly fills with leftover records that interfere with later runs. Cleanup can be per-test (tear down what this test created), per-run (reset the dataset before or after the suite), or scheduled (a periodic reset of the whole environment).
Resetting to a known state is usually more reliable than deleting individual records, because it doesn't depend on every test perfectly tracking what it made. Whatever the approach, cleanup has to be deliberate — "we'll tidy it up later" is how a test environment becomes an unpredictable swamp.
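One common way to reset per test without tracking individual records is to run each test inside a transaction and roll it back afterwards. The sketch below shows this with SQLite; it assumes the code under test uses the same connection and never commits, which will not hold for every system. The helper name and table are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn):
    """Run a test body, then discard every uncommitted row it wrote."""
    try:
        yield conn
    finally:
        conn.rollback()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'baseline')")
conn.commit()  # the baseline is committed, so a rollback cannot remove it

with rolled_back(conn) as db:
    db.execute("INSERT INTO orders VALUES (2, 'created-by-test')")
    # Inside the test, the new row is visible.
    assert db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 2

# After the block, only the committed baseline remains.
remaining = conn.execute("SELECT id FROM orders").fetchall()
```

The test never has to list what it created, so a forgotten teardown step cannot leave records behind.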
PII CONSIDERATIONS
Personal data in test environments is a genuine risk. Test environments are typically less locked down than production, so copying real customer data into them can breach privacy regulations like GDPR and turn a routine test setup into a reportable incident. The safest default is to keep real PII out of test environments entirely.
In practice that means preferring synthetic or properly-masked data over raw production copies, and treating any real data that must be used as carefully as you would in production. Using AI to generate PII-safe synthetic data is one increasingly common way to get realistic data without the risk — see the AI resources linked below.
ENVIRONMENT-SPECIFIC DATA
Different environments need different data. Dev needs a small, fast set; a QA environment needs enough variety to exercise real scenarios; a staging or pre-prod environment needs production-like volume to surface performance and scale issues. One-size-fits-all data either slows dev down or under-tests staging.
Keep the data appropriate and consistent per environment, and make sure tests point at the right source — a case that quietly runs against the wrong environment's data produces results that mean nothing. Confirming this is part of the environment readiness check linked below.
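A guard along these lines can make a wrong data source fail loudly before any test runs. The environment names and hosts are placeholders, and how a suite discovers its current environment and database host is left as an assumption.

```python
# Hypothetical environment names and database hosts; substitute your own.
EXPECTED_DB_HOST = {
    "dev": "db.dev.example.test",
    "qa": "db.qa.example.test",
    "staging": "db.staging.example.test",
}


def check_data_source(env_name: str, db_host: str) -> None:
    """Raise if the configured database host is not the one expected for this environment."""
    if env_name not in EXPECTED_DB_HOST:
        raise ValueError(f"Unknown environment: {env_name!r}")
    expected = EXPECTED_DB_HOST[env_name]
    if db_host != expected:
        raise RuntimeError(
            f"{env_name} tests are pointed at {db_host!r}, expected {expected!r}"
        )


check_data_source("qa", "db.qa.example.test")  # matches, so nothing is raised
```

Calling a check like this once at suite start-up turns a silent misconfiguration into an immediate, explicit failure.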
COMMON MISTAKES
Copying raw production data into test environments.
Mask it irreversibly or use synthetic data. Real PII in a less-secured environment is a privacy incident waiting to happen.
Tests that depend on data state they don't control.
Seed a known baseline before each run. If a test assumes a record exists, it'll fail the moment the data changes.
Never cleaning up created data.
Reset to a known state per run, or tear down what each test creates. Leftover data corrupts later results.
Using the same data everywhere.
Match data to the environment — small for dev, production-like volume for staging — so each does its job.
Treating masking as a quick find-and-replace.
Masking must be irreversible and consistent across tables so that relationships are preserved; otherwise it can leak identities and break realism.