Test Data Management — Files, Databases, Factories

A test suite that worked perfectly last Tuesday fails every test on Wednesday morning. The investigation reveals that someone manually deleted the "standard test user" from the staging database. Every test that assumed that user existed fails. This is the consequence of test data that isn't managed: tests depending on data that exists by coincidence, not by design. Test data management is the practice of ensuring every test has exactly the data it needs, exactly when it needs it, without depending on data left by other tests or humans. This lesson covers the five data strategies — from hardcoded values to API seeding — and the principles that determine which strategy to use when.

The five data strategies

Strategy 1: Hardcoded inline

The simplest option. Data is a literal string inside the test.

@Test
public void testLogin() {
    loginPage.loginAs("alice@test.com", "secret123");
    // ...
}

When it's acceptable: a single test that exists solely to verify the login form accepts a valid email format. The data is the point.

When it breaks down: the moment two tests use the same email, or the user needs to exist in the database, or the suite runs in parallel and creates collisions. Any test that depends on existing data ("this user has 5 past orders") can't work with hardcoded strings alone.

Strategy 2: External files (JSON, CSV, Excel)

Data is stored in a separate file and loaded at test time. Good for stable reference data — valid/invalid input combinations, locale-specific text, product catalogues.

// Java — reading from JSON with Jackson
List<User> users = new ObjectMapper()
    .readValue(new File("testdata/users.json"), new TypeReference<List<User>>() {});
 
// Used as a TestNG DataProvider
@DataProvider(name = "loginUsers")
public Object[][] loginUsers() throws IOException {
    return DataReader.fromJson("testdata/login-cases.json");
}

// testdata/login-cases.json
[
  { "email": "admin@test.com", "password": "Admin123!", "expectedRole": "admin" },
  { "email": "user@test.com",  "password": "User123!",  "expectedRole": "user" },
  { "email": "bad@test.com",   "password": "wrong",     "expectedRole": null }
]

Best use case: parameterised tests with multiple input variations where the data itself is static — form validation tests, localisation tests, permission matrix tests. File-based data is version-controlled, easy to review in PRs, and doesn't require a running system to load.

Limitation: can't represent data that must exist in the database. A JSON file can describe what a user looks like; it can't ensure that user exists in the staging DB before the test runs.
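
As a rough Python counterpart to the Java reader above, the sketch below loads a file shaped like login-cases.json into typed rows. The names LoginCase and load_login_cases are hypothetical, and the field names are assumed to match the sample file; the returned list could then be fed to a parameterised test.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class LoginCase:
    """One row of input data. It describes input shape, not database state."""
    email: str
    password: str
    expected_role: Optional[str]  # None means the login is expected to be rejected


def load_login_cases(path: Path) -> list[LoginCase]:
    """Read a JSON array of cases; a missing field raises KeyError immediately."""
    rows = json.loads(path.read_text(encoding="utf-8"))
    return [
        LoginCase(
            email=row["email"],
            password=row["password"],
            expected_role=row["expectedRole"],
        )
        for row in rows
    ]
```

Failing on a missing key at load time is a deliberate choice here: a malformed data file shows up as one clear error instead of a confusing failure deep inside a test.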

Strategy 3: Factories (programmatic data)

Covered in depth in the Builder and Factory lessons. Factories generate fresh, unique data in code — no file, no database, no external dependency:

# Python — factory generates collision-free test data
from typing import Optional

def build_order(product_count: int = 1, discount_code: Optional[str] = None) -> Order:
    return Order(
        customer=UserFactory.standard(),
        items=[ProductFactory.random() for _ in range(product_count)],
        discount_code=discount_code,
        shipping=AddressFactory.us(),
    )

Best use case: any data your test creates programmatically (new user registrations, new orders, new posts). Unique identifiers (UUID/timestamp in emails) make factories parallel-safe by default. The data exists in the test's memory; it doesn't pre-exist in the database.

Limitation: a factory only builds data in memory; that data becomes real only when the test submits it through the UI or API. A test for "user with 5 past orders" can't use a factory alone, because something has to actually create those 5 orders first.
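
To make the "unique by default" idea concrete, here is a minimal sketch of what a UserFactory might look like. The User fields, the unique_email helper, and the override mechanism are assumptions for illustration, not the implementation from the Builder and Factory lessons.

```python
import uuid
from dataclasses import dataclass, field


def unique_email(prefix: str = "user") -> str:
    """Return an email with a UUID suffix so parallel tests do not collide."""
    return f"{prefix}-{uuid.uuid4().hex}@test.com"


@dataclass
class User:
    # default_factory runs on every construction, so each User gets a fresh email
    email: str = field(default_factory=unique_email)
    role: str = "user"
    password: str = "Secret123!"  # shared value is fine: passwords need not be unique


class UserFactory:
    """Hypothetical factory; keyword overrides let a test pin only what it cares about."""

    @staticmethod
    def standard(**overrides) -> User:
        return User(**overrides)

    @staticmethod
    def admin(**overrides) -> User:
        defaults = {"role": "admin", "email": unique_email("admin")}
        return User(**{**defaults, **overrides})
```

A test that only cares about the role can call UserFactory.standard(role="tester") and leave everything else to the defaults.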

Strategy 4: Database seeding

Before the test, insert the required data directly into the database. After the test, delete it. This is the fastest way to set up complex pre-existing state.

// Java — JDBC seeding in @BeforeMethod
@BeforeMethod
public void seedTestData() {
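    // Unique email per run so parallel tests never collide (requires java.util.UUID)
    testUserId = TestDb.insertUser("alice-" + UUID.randomUUID() + "@test.com", "tester", true);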
    for (int i = 0; i < 5; i++) {
        TestDb.insertOrder(testUserId, "DELIVERED");
    }
}
 
@AfterMethod(alwaysRun = true)
public void cleanupTestData() {
    TestDb.deleteOrdersByUser(testUserId);
    TestDb.deleteUser(testUserId);
}

# Python — pytest fixture with database seeding
import pytest
from uuid import uuid4

@pytest.fixture
def user_with_orders(db):
    user_id = db.insert_user(email=f"user-{uuid4().hex}@test.com")
    try:
        for _ in range(5):
            db.insert_order(user_id=user_id, status="delivered")
        yield user_id
    finally:
        # Runs after a failed test and also if seeding the orders fails midway
        db.delete_orders(user_id=user_id)
        db.delete_user(user_id=user_id)

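Both mechanisms keep cleanup running when things go wrong. alwaysRun = true makes TestNG run the @AfterMethod even when the test or an earlier configuration method such as @BeforeMethod failed, and a yield-based pytest fixture executes the code after yield once its setup has completed, whether or not the test passed. This avoids data pollution between runs.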

Best use case: complex pre-existing state that would take minutes to set up through the UI (50 historical orders, a user with specific permissions granted by an admin). Database seeding is fast and precise.

When to avoid: when you don't have direct database access (SaaS environments, test environments without a DB connection), when schema changes invalidate your seeds constantly, or when the seeding creates state that the application's business logic wouldn't normally allow.
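
The seed-then-delete cycle can be tried without any external database. The sketch below uses an in-memory SQLite database from the Python standard library; the two-table schema and the names create_schema and user_with_orders are assumptions chosen to mirror the examples above, not a real application schema.

```python
import sqlite3
import uuid
from contextlib import contextmanager
from typing import Iterator


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a toy users/orders schema (assumed for illustration)."""
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            status TEXT NOT NULL
        );
        """
    )


@contextmanager
def user_with_orders(conn: sqlite3.Connection, order_count: int = 5) -> Iterator[int]:
    """Seed one user with delivered orders, then delete them no matter what."""
    email = f"user-{uuid.uuid4().hex}@test.com"
    user_id = conn.execute("INSERT INTO users (email) VALUES (?)", (email,)).lastrowid
    try:
        conn.executemany(
            "INSERT INTO orders (user_id, status) VALUES (?, 'delivered')",
            [(user_id,)] * order_count,
        )
        conn.commit()
        yield user_id  # the test body runs here, using the generated ID
    finally:
        # Delete child rows first, then the parent, so foreign keys stay valid
        conn.execute("DELETE FROM orders WHERE user_id = ?", (user_id,))
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
        conn.commit()
```

Used as `with user_with_orders(conn) as user_id:`, the finally block runs even if the body raises, which is the same guarantee the TestNG and pytest examples rely on.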

Strategy 5: API seeding

Create test data through the application's own API before the test, delete it after. Slower than database seeding, but works without a direct DB connection and can't create invalid state.

// TypeScript — Playwright with API seeding
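import { test } from "@playwright/test";
import { randomUUID } from "node:crypto";

// `config` is assumed to come from the project's own settings module
let testUserId: string;

test.beforeEach(async ({ request }) => {
  const user = await request.post("/api/users", {
    data: { email: `u-${randomUUID()}@test.com`, role: "tester" },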
    headers: { Authorization: `Bearer ${config.adminToken}` },
  });
  testUserId = (await user.json()).id;
});
 
test.afterEach(async ({ request }) => {
  await request.delete(`/api/users/${testUserId}`, {
    headers: { Authorization: `Bearer ${config.adminToken}` },
  });
});

Best use case: when you have a well-documented API, no direct DB access, and need data that must pass the application's own validation rules. API seeding is the right choice when your tests run against a SaaS environment or a third-party staging instance.
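
The same create-then-delete pattern can be sketched in Python independently of any particular HTTP library. Here ApiClient is a hypothetical minimal interface (a thin wrapper around requests or httpx could satisfy it), the /api/users path is taken from the Playwright example, and FakeApi is an in-memory stand-in so the sketch runs without a server.

```python
import uuid
from contextlib import contextmanager
from typing import Iterator, Protocol


class ApiClient(Protocol):
    """Hypothetical client interface; adapt it to the HTTP library in use."""

    def post(self, path: str, json: dict) -> dict: ...
    def delete(self, path: str) -> None: ...


@contextmanager
def seeded_user(api: ApiClient, role: str = "tester") -> Iterator[str]:
    """Create a user through the API, yield its ID, and always delete it after."""
    payload = {"email": f"u-{uuid.uuid4().hex}@test.com", "role": role}
    user_id = api.post("/api/users", json=payload)["id"]  # ID returned by the API
    try:
        yield user_id
    finally:
        api.delete(f"/api/users/{user_id}")


class FakeApi:
    """In-memory stand-in for the real API, for demonstration only."""

    def __init__(self) -> None:
        self.users: dict[str, dict] = {}

    def post(self, path: str, json: dict) -> dict:
        user_id = uuid.uuid4().hex
        self.users[user_id] = json
        return {"id": user_id, **json}

    def delete(self, path: str) -> None:
        # The user ID is the last path segment, e.g. /api/users/<id>
        self.users.pop(path.rsplit("/", 1)[-1], None)
```

Because the test takes the ID from the response instead of assuming one, the same code works against a fresh environment or a shared staging instance.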

Data strategies — from simplest to most powerful

Hardcoded: simplest, zero setup required. Parallel collisions (the same email in 3 threads). Fails if the user doesn't exist in the DB. Good only for static validation tests.

Files + Factories: unique data per test via UUID. File data is version-controlled. Parallel-safe by design. Can't pre-populate database state.

API / DB Seeding: full control over pre-existing state. Complex state (50 orders) in milliseconds. Cleanup in @AfterMethod / fixture teardown. Best for tests requiring history or complex state.

The test data principles

Each test owns its data. A test that needs a user creates or seeds that user itself. It does not depend on a user that another test created, or a user that exists "because someone set it up last week." Ownership means the test controls creation and deletion.

Use unique identifiers for parallelism. Every email, username, and ID generated by a factory must be unique across the suite's lifetime. UUID suffixes or long random strings make collisions practically impossible; timestamps, even nanosecond ones, can still collide across parallel threads, so pair them with a random component. A user@test.com shared between 3 parallel threads is a collision waiting to happen.

Clean up after every test. @AfterMethod(alwaysRun = true) and yield-based fixtures ensure cleanup happens even on failure. Without cleanup, failed tests leave debris that causes other tests to fail — and the cause is invisible.

Never use production data. Production data contains real personal information. Using it in test environments violates data protection requirements and privacy expectations. Generate synthetic data or anonymise real data before loading it into a test environment.

Never hardcode user IDs. A test that does driver.get(config.baseUrl() + "/users/42") breaks the moment user 42 doesn't exist. The test that creates the user should use the ID returned by the API or DB seed, not assume a fixed ID.
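
One way to combine the ownership, cleanup, and "use the returned ID" principles is to register each delete immediately after its insert. The sketch below uses contextlib.ExitStack from the Python standard library, which runs callbacks in reverse order; FakeDb and seed_user_with_orders are hypothetical stand-ins for a real database helper.

```python
import uuid
from contextlib import ExitStack


class FakeDb:
    """Hypothetical in-memory database helper, for demonstration only."""

    def __init__(self) -> None:
        self.users: dict[int, str] = {}
        self.orders: dict[int, int] = {}  # order_id -> user_id
        self._next_id = 1

    def _new_id(self) -> int:
        new_id, self._next_id = self._next_id, self._next_id + 1
        return new_id

    def insert_user(self, email: str) -> int:
        user_id = self._new_id()
        self.users[user_id] = email
        return user_id

    def insert_order(self, user_id: int) -> int:
        order_id = self._new_id()
        self.orders[order_id] = user_id
        return order_id

    def delete_order(self, order_id: int) -> None:
        self.orders.pop(order_id, None)

    def delete_user(self, user_id: int) -> None:
        if user_id in self.orders.values():
            raise RuntimeError("user still has orders")
        self.users.pop(user_id, None)


def seed_user_with_orders(db: FakeDb, cleanup: ExitStack, count: int) -> int:
    """Register every delete right after its insert, using the returned IDs."""
    user_id = db.insert_user(f"user-{uuid.uuid4().hex}@test.com")
    cleanup.callback(db.delete_user, user_id)
    for _ in range(count):
        order_id = db.insert_order(user_id)
        cleanup.callback(db.delete_order, order_id)
    return user_id
```

Inside `with ExitStack() as cleanup:`, the orders are deleted before the user because callbacks run last-in, first-out, and anything seeded before a mid-setup failure is still removed.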

⚠️ Common mistakes

Tests that depend on data left by previous tests. Test A creates a user and doesn't clean up. Test B assumes that user exists. On a fresh environment, Test B fails. Running tests in a different order causes Test B to fail. The dependency is invisible until the environment is reset.
Seeding data without a teardown. A @BeforeMethod that inserts 50 rows and no @AfterMethod(alwaysRun = true) that deletes them leaves debris after every failed test. The staging database eventually fills with orphaned test records, slowing queries and confusing other tests.
File-based data for dynamic state. A users.json with hardcoded IDs like {"id": 42, "email": "user@test.com"} assumes those exact records exist in the database. When the database is reset, the file is wrong. File-based data should describe input shapes (email format, field values), not database state (IDs, foreign keys).

🎯 Practice task

Implement the right data strategy for each test type — 40 minutes.

Audit your current data strategy. Scan your test suite for hardcoded email addresses, user IDs, and product names. Count how many places use the same value. Run the suite twice simultaneously — do any fail due to unique-constraint violations?
Factory replacement. Pick the 3 tests with the most hardcoded user data. Replace all hardcoded values with factory-generated data (UserFactory.standard(), UserFactory.admin()). Verify all three tests pass when run simultaneously 3 times.
Add cleanup. Find any test that creates data (via UI or API) and doesn't clean it up. Add an @AfterMethod(alwaysRun = true) or finally block that deletes the created data. Verify the staging database has no orphan records after 5 test runs.
API seeding for complex state. Find a test that currently spends 3+ minutes setting up state through the UI (e.g., adding 10 items to a cart one by one). Rewrite the setup to call the API directly. Measure the time difference.
Stretch: data independence test. Run your suite in a different order, for example by setting preserve-order="false" in testng.xml, which tells TestNG not to follow the order listed in the XML file (the resulting order is unpredictable, not strictly reversed). If any test fails that passes in the original order, you've found a data dependency. Fix it: the failing test must create its own data, not depend on data from a previous test.
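
For the audit in the first step, a small script can do the counting. This is a rough sketch: the regular expression only catches simple quoted email literals, the function names are hypothetical, and the default glob pattern assumes Java sources.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable

# A quoted literal that looks like an email, e.g. "alice@test.com".
# Generated values such as f"user-{uuid4().hex}@test.com" do not match.
EMAIL_LITERAL = re.compile(r"""["']([\w.+-]+@[\w-]+(?:\.[\w-]+)+)["']""")


def count_hardcoded_emails(sources: Iterable[str]) -> Counter:
    """Count how often each email literal appears across the given source texts."""
    counts: Counter = Counter()
    for text in sources:
        counts.update(EMAIL_LITERAL.findall(text))
    return counts


def shared_emails(root: Path, pattern: str = "**/*.java") -> dict[str, int]:
    """Return only the literals used more than once under root."""
    texts = (
        path.read_text(encoding="utf-8", errors="ignore")
        for path in root.glob(pattern)
    )
    return {email: n for email, n in count_hardcoded_emails(texts).items() if n > 1}
```

Any address that shared_emails reports is a candidate for the factory replacement in the second step.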

Chapter 4 is complete. Chapter 5 moves to cross-cutting concerns — the framework features that touch every test: parallel-safe driver management, screenshot capture on failure, retry strategies, and test isolation.