A suite of 50 tests with a 10-minute runtime is a convenience. A suite of 5000 tests with a 45-minute runtime is a blocker. Engineers stop running tests before committing. PRs sit waiting for a slow CI pipeline. Feedback loops stretch from minutes to hours. The framework patterns that work perfectly at 50 tests — sequential execution, a single test runner, one CI job — become bottlenecks at 500 and unsustainable at 5000. Scaling is not an accident. It requires deliberate decisions about how tests are categorised, when they run, how they're distributed across machines, and which tests are kept versus deleted. This lesson covers each of those decisions.
How requirements change at each scale
What matters most at each test suite scale
| Scale | Execution | Infrastructure | Strategy |
|---|---|---|---|
| 50 tests | Sequential OK. One thread, one job. | Local machine or single CI agent | Write good tests. Add page objects. |
| 500 tests | Parallel threads required. 4–8 workers. | Multiple CI agents. Smoke vs full regression. | Tagging system. API setup replaces UI setup. |
| 5000 tests | Distributed grid. Sharding. Selective execution. | Selenium Grid / BrowserStack. Multiple pipelines. | Flakiness tracking. Coverage analytics. Test retirement. |
The scaling levers — applied in sequence
No single lever solves a 45-minute runtime. The approach is layered: apply each lever, measure, then apply the next.
Lever 1: Parallelise within one machine
The cheapest scaling move — more threads, same hardware. thread-count="4" in TestNG XML costs nothing except ensuring tests are isolation-correct:
<suite name="Regression" parallel="methods" thread-count="4" verbose="1">
  <test name="All tests">
    <packages>
      <package name="com.mycompany.tests"/>
    </packages>
  </test>
</suite>

Expected throughput gain: roughly linear up to the CPU/memory ceiling. A 40-minute sequential suite typically reaches 12–15 minutes with 4 properly isolated threads.
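The throughput estimate can be reproduced with a little arithmetic. The sketch below is an illustration, not a measurement: `estimate_parallel_minutes` and its `efficiency` parameter are hypothetical names, and `efficiency` is an assumed fraction of ideal linear speed-up that survives contention. The 12–15 minute range quoted for a 40-minute suite on 4 threads corresponds to an efficiency of roughly 0.67–0.83.

```python
def estimate_parallel_minutes(
    sequential_minutes: float, workers: int, efficiency: float = 0.75
) -> float:
    """Estimate wall-clock minutes for a suite split across `workers` threads.

    `efficiency` (0 < efficiency <= 1) is an assumed value: the share of
    ideal linear speed-up left after CPU, memory, and shared-service contention.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    if workers == 1:
        return sequential_minutes  # nothing to parallelise
    return sequential_minutes / (workers * efficiency)


if __name__ == "__main__":
    # Compare an ideal split with two more pessimistic assumptions.
    for eff in (1.0, 0.85, 0.70):
        minutes = estimate_parallel_minutes(40, 4, eff)
        print(f"efficiency {eff:.2f}: {minutes:.1f} min")
```

Measure the real efficiency on your own suite after enabling parallelism; if it falls well below these values, shared fixtures or an overloaded test environment are the usual suspects.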
In Playwright, parallelism is controlled by the workers setting:
// playwright.config.ts
workers: process.env.CI ? 4 : undefined, // 4 workers in CI; Playwright's default (half the logical CPU cores) locally

pytest-xdist adds parallel worker processes to pytest:
pip install pytest-xdist
pytest -n 4 # 4 parallel workers

Lever 2: Categorise and run subsets
Not every test should run on every trigger. A push to a feature branch shouldn't run 5000 tests — it should run 100 smoke tests in 5 minutes.
TestNG groups:
@Test(groups = {"smoke", "login"})
public void validLoginRedirectsToDashboard() { ... }
@Test(groups = {"regression", "slow", "checkout"})
public void fullCheckoutWithPromoCode() { ... }

<!-- Smoke suite for PR triggers -->
<groups>
  <run>
    <include name="smoke"/>
  </run>
</groups>

pytest marks:
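import pytest

# Register custom marks under "markers" in pytest.ini to avoid unknown-mark warnings
@pytest.mark.smoke
@pytest.mark.login
def test_valid_login_redirects():
    ...

# Run only smoke tests
pytest -m smoke

Playwright tags: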
test("login with valid credentials @smoke @login", async ({ loginPage }) => {
  ...
});

npx playwright test --grep "@smoke"

The standard tagging taxonomy:
| Tag | Size | When it runs | Purpose |
|---|---|---|---|
| smoke | 50–100 tests | Every PR, every merge | Critical path: can we deploy? |
| regression | All tests | Nightly, before releases | Full coverage |
| slow | 5–10% of suite | Nightly only | DB-heavy, multi-step flows |
| Feature tags | By area | When feature area changes | Targeted regression |
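One way to wire the taxonomy into a pipeline is a small helper that turns the CI trigger into a pytest command line. This is a sketch under assumptions: the trigger names (`pull_request`, `merge`, `nightly`, `release`) and the function `pytest_args_for` are hypothetical, and the marker names are the ones from the table. It relies only on pytest's `-m` marker expressions, which accept `and`, `or`, `not`, and parentheses.

```python
from typing import List, Optional

# Hypothetical trigger names; map them to whatever your CI system exposes.
SMOKE_TRIGGERS = {"pull_request", "merge"}
FULL_RUN_TRIGGERS = {"nightly", "release"}


def pytest_args_for(trigger: str, changed_area: Optional[str] = None) -> List[str]:
    """Build the pytest command for a CI trigger.

    `changed_area` is an optional feature tag (for example "checkout") that
    adds targeted regression for the area touched by the change.
    """
    if trigger in FULL_RUN_TRIGGERS:
        # Full regression: no marker filter, so slow tests are included.
        return ["pytest"]
    if trigger in SMOKE_TRIGGERS:
        if changed_area:
            # Smoke plus the touched feature area, still excluding slow tests.
            return ["pytest", "-m", f"(smoke or {changed_area}) and not slow"]
        return ["pytest", "-m", "smoke"]
    raise ValueError(f"unknown trigger: {trigger!r}")


if __name__ == "__main__":
    print(" ".join(pytest_args_for("pull_request", "checkout")))
    print(" ".join(pytest_args_for("nightly")))
```

Keeping the mapping in one function means the definition of "what runs when" lives in version control next to the tests rather than being scattered across pipeline files.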
Lever 3: Distribute across machines
When one machine with 8 threads isn't enough, distribute the suite across multiple machines:
Selenium Grid 4:
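# docker-compose.yml: Grid with 4 Chrome nodes
services:
  hub:
    image: selenium/hub:4.20.0
    ports: ["4442:4442", "4443:4443", "4444:4444"]
  chrome:
    image: selenium/node-chrome:4.20.0
    shm_size: 2gb  # Chrome needs more shared memory than the container default
    depends_on: [hub]
    deploy:
      replicas: 4
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: "4442"
      SE_EVENT_BUS_SUBSCRIBE_PORT: "4443"

GitHub Actions matrix (shard across parallel jobs):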
# Fragment of a job definition under jobs.<job_id>
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - name: Run tests (shard ${{ matrix.shard }}/4)
    run: npx playwright test --shard=${{ matrix.shard }}/4

Playwright's --shard=N/M splits the test files into M groups and runs group N. Four parallel jobs each run about a quarter of the suite, so wall-clock time drops to roughly 25% of a single-job run, plus each job's own checkout and install overhead.
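To make the sharding idea concrete, here is a minimal Python sketch of a deterministic file-level split. It is an illustration of the general technique, not Playwright's actual algorithm; the function name `shard` and the file names are hypothetical.

```python
from typing import List, Sequence


def shard(test_files: Sequence[str], index: int, total: int) -> List[str]:
    """Return the files belonging to shard `index` of `total` (1-based, like N/M).

    Files are sorted first so every CI job computes the same assignment,
    then dealt out round-robin.
    """
    if total < 1 or not 1 <= index <= total:
        raise ValueError("index must satisfy 1 <= index <= total")
    ordered = sorted(test_files)
    return ordered[index - 1 :: total]


if __name__ == "__main__":
    files = [f"tests/test_{i:02d}.py" for i in range(10)]
    for n in range(1, 5):
        print(n, shard(files, n, 4))
```

A split like this balances file counts, not durations. If one shard happens to collect the slowest files, it sets the wall-clock time for the whole run, which is one reason measured speed-up falls short of the ideal quarter.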
Lever 4: Replace slow UI setup with API setup
The biggest individual test performance wins come from replacing UI-based test data setup with API calls. Logging in through the UI for 50 tests that need an authenticated user takes 2–3 seconds per test (100–150 seconds total). Injecting a session cookie or calling the auth API takes 100–200ms:
// Playwright: reuse auth state across tests (created once per run, in global setup)
test.use({ storageState: "playwright/.auth/user.json" });

// In global setup:
await page.goto("/login");
await page.fill("#email", config.userEmail);
await page.fill("#password", config.userPassword);
await page.click("#submit");
// Wait for the post-login page here (e.g. page.waitForURL) so the session cookie exists before saving
await page.context().storageState({ path: "playwright/.auth/user.json" });

The 50 tests that previously logged in through the UI each now start authenticated, removing 100+ seconds of browser interaction from the suite.
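The same log-in-once, reuse-everywhere pattern can be sketched in Python, independent of any browser tool. Everything here is hypothetical: `load_or_create_auth_state` is an illustrative helper, and `fake_login` stands in for a real call to an auth API. The sketch does not handle session expiry, which a real suite would need.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def load_or_create_auth_state(path: Path, login: Callable[[], Dict]) -> Dict:
    """Return cached auth state from `path`, calling `login` only if it is missing."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    state = login()  # the expensive step: runs once, not once per test
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")
    return state


def count_logins_for(tests: int) -> int:
    """Simulate `tests` test starts sharing one cached state; return login calls made."""
    calls = 0

    def fake_login() -> Dict:
        # Stand-in for a real auth API call (an assumption for this sketch).
        nonlocal calls
        calls += 1
        return {"cookies": [{"name": "session", "value": "placeholder"}]}

    with tempfile.TemporaryDirectory() as tmp:
        state_path = Path(tmp) / ".auth" / "user.json"
        for _ in range(tests):
            load_or_create_auth_state(state_path, fake_login)
    return calls


if __name__ == "__main__":
    print(count_logins_for(50))
```

The point of the design is that the cost of authentication becomes constant in the number of tests: adding the 51st authenticated test adds a file read, not another login.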
Lever 5: Retire obsolete tests
Every 6 months, run a coverage analysis. Tests that:
- Duplicate coverage of another test exactly
- Test functionality that was removed from the application
- Have been skipped for more than 3 months
- Retry 50%+ of the time and have never been fixed
...should be deleted. A suite of 4500 well-maintained tests is faster, more reliable, and easier to understand than a suite of 5000 tests where 500 are dead weight.
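The retirement criteria above can be turned into a simple report. The sketch below assumes you can export per-test history into records like `SuiteEntry`; that class, its field names, and the 90-day stand-in for "3 months" are all assumptions for illustration. Whether a flaky test "has never been fixed" cannot be computed from run data, so the output is a list of candidates for human review, not an automatic delete list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuiteEntry:
    """Hypothetical per-test history record for the review window."""
    name: str
    days_skipped: int        # consecutive days the test has been skipped
    runs: int                # executions in the review window
    retried_runs: int        # executions that needed at least one retry
    feature_removed: bool    # the functionality under test no longer exists
    duplicate_of: str = ""   # name of a test with identical coverage, if any


def retirement_reasons(entry: SuiteEntry) -> List[str]:
    """Return the reasons an entry is a retirement candidate (empty if none)."""
    reasons: List[str] = []
    if entry.duplicate_of:
        reasons.append(f"duplicates {entry.duplicate_of}")
    if entry.feature_removed:
        reasons.append("feature removed")
    if entry.days_skipped > 90:  # "more than 3 months", approximated as 90 days
        reasons.append("skipped for more than 90 days")
    if entry.runs > 0 and entry.retried_runs / entry.runs >= 0.5:
        reasons.append("retried in 50%+ of runs")
    return reasons


if __name__ == "__main__":
    suite = [
        SuiteEntry("test_login", 0, 100, 2, False),
        SuiteEntry("test_promo_banner", 0, 10, 5, False),
        SuiteEntry("test_legacy_export", 120, 0, 0, True),
    ]
    for entry in suite:
        print(entry.name, retirement_reasons(entry))
```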
The 1-hour rule
A guiding constraint: full regression should complete in under 1 hour. This is the maximum feedback cycle that allows a nightly run to be reviewed and acted on before the next work day starts. When the suite exceeds 1 hour:
- Profile for the slowest 10% of tests — optimise or remove the worst offenders first.
- Verify parallel thread count hasn't been limited unnecessarily.
- Add a shard to the CI matrix.
- Check whether slow-tagged tests can be moved to a separate, less frequent pipeline.
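Profiling for the slowest 10% can start from a JUnit-style XML report, which most runners can emit (for pytest, `--junitxml=report.xml`). The sketch below reads the `time` attribute of each `testcase` element; the function name `slowest_tests` and the sample report are illustrative assumptions.

```python
import math
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Tiny made-up report used only to demonstrate the function.
SAMPLE_REPORT = """
<testsuite name="regression" tests="3">
  <testcase classname="tests.test_login" name="test_valid_login" time="1.2"/>
  <testcase classname="tests.test_checkout" name="test_full_checkout" time="42.5"/>
  <testcase classname="tests.test_search" name="test_empty_query" time="0.4"/>
</testsuite>
"""


def slowest_tests(junit_xml: str, fraction: float = 0.10) -> List[Tuple[str, float]]:
    """Return the slowest `fraction` of test cases as (test id, seconds), slowest first."""
    root = ET.fromstring(junit_xml)
    timings = [
        (f"{case.get('classname', '')}::{case.get('name', '')}", float(case.get("time", 0)))
        for case in root.iter("testcase")
    ]
    if not timings:
        return []
    timings.sort(key=lambda item: item[1], reverse=True)
    # Round before ceil so float noise (e.g. 3.0000000000000004) does not add an extra test.
    keep = max(1, math.ceil(round(len(timings) * fraction, 9)))
    return timings[:keep]


if __name__ == "__main__":
    for test_id, seconds in slowest_tests(SAMPLE_REPORT):
        print(f"{seconds:6.1f}s  {test_id}")
```

For a quick look without a report file, pytest's built-in `--durations=10` flag prints the ten slowest tests at the end of a run.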
⚠️ Common mistakes
- Parallelising before ensuring test isolation. Enabling thread-count="4" on a suite with shared static state produces race conditions and flaky failures that are far harder to diagnose than the original slow suite. Validate isolation first; parallelise after.
- Running all 5000 tests on every PR. This is both slow and expensive, and engineers end up bypassing the CI check. Define the PR gate as the smoke suite, so that "the PR is green" means "the smoke suite is green"; run full regression nightly.
- Never deleting tests. A test that covers removed functionality still runs, still takes time, and still occasionally breaks on unrelated infrastructure changes. Treat obsolete tests as technical debt — they have a maintenance cost with zero coverage return.
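The first mistake is easier to see with a deterministic stand-in for a race condition: order dependence. In the sketch below, two checks share module-level state and pass or fail depending on which runs first, while their isolated counterparts pass in any order. All names are hypothetical, and the checks are plain functions rather than real test-runner tests. Parallel execution interleaves tests unpredictably, so any suite that fails when reordered will also fail, intermittently, when parallelised.

```python
from typing import Callable, List, Sequence

shared_cart: List[str] = []  # module-level state shared by every check: the anti-pattern


def check_add_item_shared() -> None:
    shared_cart.append("book")
    assert len(shared_cart) == 1


def check_cart_starts_empty_shared() -> None:
    assert shared_cart == []


def check_add_item_isolated() -> None:
    cart: List[str] = []  # each check builds its own state
    cart.append("book")
    assert len(cart) == 1


def check_cart_starts_empty_isolated() -> None:
    cart: List[str] = []
    assert cart == []


def run_in_order(checks: Sequence[Callable[[], None]]) -> bool:
    """Run checks in the given order from a clean start; True if all pass."""
    shared_cart.clear()
    try:
        for check in checks:
            check()
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print(run_in_order([check_cart_starts_empty_shared, check_add_item_shared]))
    print(run_in_order([check_add_item_shared, check_cart_starts_empty_shared]))
    print(run_in_order([check_add_item_isolated, check_cart_starts_empty_isolated]))
```

Running the suite in a shuffled order before raising the thread count is a cheap way to surface these dependencies while failures are still reproducible.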
🎯 Practice task
Implement scaling strategies for your suite — 40 minutes.
- Baseline measurement. Time your full suite with a single thread. Record: total time, tests per minute, slowest 5 tests by duration (add timing to your reporter or use TestNG's built-in execution summary).
- Implement smoke tags. Add a smoke group or mark to the 10 most critical tests in your suite. Create a separate TestNG XML file or pytest invocation (pytest -m smoke) that runs only the smoke tests. Verify these 10 tests run in under 3 minutes.
- Enable parallelism. Set thread-count="3" in your TestNG XML (or workers: 3 in your Playwright config). Run the full suite. Compare wall-clock time to the baseline. Note any new failures: these are most likely isolation violations to fix.
- Profile the slow tail. Identify the 5 slowest tests from your timing data. For each: is the slowness from UI setup that could be API setup? Is there an unnecessary full-page navigation? Fix at least one.
- Stretch — GitHub Actions matrix. If your project is on GitHub: add a 2-shard matrix to your CI workflow file. Verify that half the tests run in each shard and the total wall time drops by roughly half versus a single-job run. Record the before and after times.
Next lesson: framework documentation and onboarding — how to make your framework an asset that survives team turnover rather than a mystery that only its creator understands.