Automation Framework Design Exercise

Design an automation framework architecture from first principles: propose the folder structure, tool choices, CI strategy, and trade-offs for a given team and application context.

Role

Senior automation engineer

Difficulty

Advanced

Time limit

60–90 min

Scenario

You are joining a product team as the first dedicated SDET. The team builds a React single-page application backed by a Node.js REST API and a PostgreSQL database. There are 8 engineers total (2 senior, 4 mid, 2 junior), no existing automated test suite, and a 2-week sprint cadence. Deployments go to staging daily and to production once per sprint. The team has no strong language preference but most engineers are comfortable with JavaScript and TypeScript. Your task is to design a test automation framework that the whole team can understand and contribute to — not just QA. Produce a written design document covering architecture, tool selection, test data strategy, CI integration, and flaky-test governance.

Requirements

1. State context before proposing tools: identify the application type, team constraints, testing scope (unit, API/integration, E2E), and any non-negotiable constraints such as language, cloud provider, or budget
2. Propose a folder structure for the automation repository (or monorepo layer) and explain the responsibility of each top-level directory
3. Justify your choice of test runner, assertion library, and browser automation tool; address at least one alternative per choice and explain why you did not select it
4. Describe your test data strategy: how is test data created, scoped per test, and cleaned up? Explicitly address parallel execution safety
5. Define your CI/CD integration: which test layers run on which trigger (commit, PR, merge, nightly)? How are reports surfaced and how do failures block the pipeline?
6. Explain how the framework handles configuration across environments (local, staging, production) without source code changes
7. Define a flaky-test governance policy: how are flaky tests detected, quarantined, and resolved? Include a concrete SLA for quarantined tests
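
Requirement 6 is often met with a single loader that validates environment variables once at start-up, so test code never branches on the environment. The following is a minimal sketch, not a prescribed answer; the variable names (TEST_ENV, BASE_URL, API_URL) and the function names are hypothetical.

```typescript
// Hypothetical variable names (TEST_ENV, BASE_URL, API_URL); adapt to your project.
type EnvName = "local" | "staging" | "production";

interface TestConfig {
  env: EnvName;
  baseUrl: string;
  apiUrl: string;
}

const ENV_NAMES: readonly string[] = ["local", "staging", "production"];

// Reads a required variable and fails loudly if it is missing or empty.
function required(env: Record<string, string | undefined>, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// Builds the config from a plain key/value map, typically process.env.
function loadConfig(env: Record<string, string | undefined>): TestConfig {
  const name = env["TEST_ENV"] ?? "local";
  if (ENV_NAMES.indexOf(name) === -1) {
    throw new Error(`Unknown TEST_ENV: ${name}`);
  }
  return {
    env: name as EnvName,
    baseUrl: required(env, "BASE_URL"),
    apiUrl: required(env, "API_URL"),
  };
}

// Usage: a real suite would pass process.env instead of a literal object.
const stagingConfig = loadConfig({
  TEST_ENV: "staging",
  BASE_URL: "https://staging.example.test",
  API_URL: "https://staging.example.test/api",
});
console.log(stagingConfig.env, stagingConfig.baseUrl);
```

Failing at load time rather than mid-test keeps a misconfigured pipeline from producing misleading test failures.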

Expected deliverables

✓ A written design document (Markdown or PDF) with sections corresponding to each requirement above
✓ A folder structure diagram or ASCII tree with a one-line description of each directory's responsibility
✓ A tool-selection rationale table: tool category | chosen tool | reason | alternative considered | why alternative was not chosen
✓ A test data strategy section (one to two paragraphs) that explicitly addresses parallel-safe creation and cleanup
✓ A CI/CD trigger map: a table or diagram showing trigger event → test layer → timeout → failure behaviour
✓ A flaky-test governance policy (one page maximum): detection mechanism, quarantine process, SLA, and escalation path
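
One way to keep the CI/CD trigger map reviewable is to draft it as typed data before translating it into pipeline configuration. The sketch below is illustrative only: the type names, layer names, timeouts, and blocking behaviour are placeholder assumptions, not values prescribed by the exercise.

```typescript
// All names and numbers here are placeholders for illustration.
type Trigger = "commit" | "pull_request" | "merge" | "nightly";
type Layer = "unit" | "api" | "e2e-smoke" | "e2e-full";

interface TriggerRule {
  trigger: Trigger;
  layers: Layer[];
  timeoutMinutes: number;
  blocksPipeline: boolean; // false means failures are reported but do not block
}

const triggerMap: TriggerRule[] = [
  { trigger: "commit", layers: ["unit", "api"], timeoutMinutes: 5, blocksPipeline: true },
  { trigger: "pull_request", layers: ["unit", "api", "e2e-smoke"], timeoutMinutes: 15, blocksPipeline: true },
  { trigger: "merge", layers: ["unit", "api", "e2e-full"], timeoutMinutes: 30, blocksPipeline: true },
  { trigger: "nightly", layers: ["e2e-full"], timeoutMinutes: 60, blocksPipeline: false },
];

// Looks up the single rule for a trigger and rejects missing or duplicate entries.
function ruleFor(trigger: Trigger): TriggerRule {
  const matches = triggerMap.filter((rule) => rule.trigger === trigger);
  if (matches.length !== 1) {
    throw new Error(`Expected exactly one rule for ${trigger}, found ${matches.length}`);
  }
  return matches[0];
}

// Render the map as the rows of the deliverable table.
for (const rule of triggerMap) {
  const behaviour = rule.blocksPipeline ? "blocks pipeline" : "report only";
  console.log(`${rule.trigger} -> ${rule.layers.join(", ")} | ${rule.timeoutMinutes} min | ${behaviour}`);
}
```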

Evaluation rubric

Dimension	What reviewers look for
Context-first thinking	Does the candidate state constraints before recommending tools? Are team size, language preference, and deployment cadence reflected in the choices? A generic 'here is how I always do it' answer without addressing the given context scores poorly.
Layered architecture	Is the folder structure clearly layered by responsibility (pages/step-defs vs tests vs fixtures vs utils vs config)? Is the separation of concerns articulated, not just drawn? Could a developer unfamiliar with test automation navigate the repository?
Tool justification	Are tool choices backed by concrete reasons relevant to this context? Is at least one trade-off acknowledged per major choice (e.g. Cypress's same-origin constraint vs Playwright's multi-tab support; Jest's ecosystem maturity vs Vitest's speed)?
Test data strategy	Is the strategy provably safe for parallel execution? Does it cover both creation and cleanup? Does it avoid shared mutable state (e.g. a single login user reused across all tests)? Is environment isolation addressed?
CI/CD integration	Are different test layers triggered at different cadences (fast feedback on commit, broader coverage on merge)? Is reporting actionable (links to report artifacts, not just exit codes)? Is the pipeline designed to fail fast at the cheapest layer first?
Flaky-test governance	Is there a concrete detection mechanism (not just 'we notice when tests fail')? Is there a quarantine path that removes the flaky test from blocking pipelines without deleting it? Is there a defined SLA with an escalation path if the SLA is breached?

Sample solution outline

› Context: React SPA + Node.js API, 8-person team, TypeScript throughout, GitHub for source control, GitHub Actions for CI, no mobile requirement
› Proposed test layers: Vitest (unit, runs in < 30 s), supertest + Jest or Vitest (API/integration, < 2 min), Playwright (E2E, < 15 min for smoke)
› Tool rationale: Playwright chosen over Cypress because the checkout flow opens multiple tabs, which Cypress does not support, and the payment widget runs in a cross-origin iframe, for which Cypress has only limited support
› Folder structure: tests/unit/ (Vitest specs, unless the team prefers to co-locate them with source), tests/api/ (supertest integration tests), tests/e2e/features/ (Playwright specs or Gherkin), tests/e2e/pages/ (POM), tests/e2e/fixtures/ (data factories), tests/e2e/support/ (global setup/teardown), config/ (env-specific config files), .github/workflows/ (CI pipeline definitions)
› Test data strategy: each E2E test calls a /test/setup API endpoint (available in non-production environments) to create a unique user and seed data; the After hook calls /test/teardown with the created resource IDs; UUID v4 suffixes prevent cross-test ID collisions; the setup API is blocked at the network layer in production
› CI triggers: on every commit, unit + API tests (< 3 min, fail fast); on PR, unit + API + E2E smoke tag (< 12 min, Chromium only); on merge to main, full E2E suite (all browsers, 25 min); nightly, full suite + performance assertions against staging
› Environment config: a config/env.ts loader reads from environment variables; each environment (local, staging, production) has a corresponding .env.${ENV} file; CI injects secrets via GitHub Actions environment secrets; no environment-specific logic in test code
› Flaky-test policy: a test is flagged @flaky if it fails at least 2 of 5 consecutive runs in the nightly pipeline; flagged tests are moved to a quarantine suite that runs in CI but does not block the pipeline; a GitHub issue is auto-created with the failure logs; SLA is 5 business days to fix or delete; tests not resolved within 10 business days are automatically deleted from the suite and the issue is escalated to the engineering manager
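
The detection rule in the flaky-test policy above can be expressed as a small pure function over a test's recent run history. This is a sketch under assumptions: the `RunResult` type and `isFlaky` name are hypothetical, and the rule is read as "at least 2 failures in the 5 most recent nightly runs".

```typescript
type RunResult = "pass" | "fail";

// Returns true when the most recent `windowSize` runs contain
// at least `failureThreshold` failures.
function isFlaky(
  history: RunResult[],
  windowSize: number = 5,
  failureThreshold: number = 2
): boolean {
  if (history.length < windowSize) {
    return false; // not enough runs yet to apply the rule
  }
  const recent = history.slice(-windowSize);
  const failures = recent.filter((result) => result === "fail").length;
  return failures >= failureThreshold;
}

// Oldest run first, newest run last.
const nightlyHistory: RunResult[] = ["pass", "fail", "pass", "pass", "fail"];
console.log(isFlaky(nightlyHistory)); // true: 2 failures in the last 5 runs
```

Note that this rule alone does not distinguish an intermittent failure from a test that fails on every run; a real policy would likely treat the latter as a regression rather than a quarantine candidate.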

Common mistakes

Jumping straight to tool selection without stating context — recommending Playwright or Cypress without knowing the app's constraints shows a framework preference, not engineering judgement
Designing a single monolithic E2E layer without faster feedback loops at the unit and API levels — this makes the suite expensive to run and slow to identify the source of failures
Ignoring parallel execution in the test data strategy — proposing that all tests share a single login user or a database row seeded once per run leads to intermittent failures that are hard to diagnose
Treating CI integration as 'just run the tests' — not specifying triggers, timeouts, artifact retention, or failure notification means the pipeline provides no actionable signal
No flaky-test governance — leaving flaky tests in the pipeline without a quarantine mechanism slowly erodes team trust; teams stop looking at failures and start ignoring the suite entirely
Overly prescriptive tool selection for a junior or mixed team — proposing a complex BDD framework with Gherkin, a custom World, and a separate reporting pipeline on day one creates onboarding friction that slows adoption
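
To make the shared-state mistake above concrete, the following sketch builds a unique user for each test so that parallel workers never contend for the same record. `buildTestUser`, the `TestUser` fields, and the example.test domain are hypothetical; `randomUUID` is the built-in UUID v4 generator from Node's `node:crypto` module.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shape; a real project would mirror its own user model.
interface TestUser {
  email: string;
  password: string;
  displayName: string;
}

// Every call returns a user whose identifying fields carry a fresh UUID v4,
// so tests running in parallel cannot collide on email or display name.
function buildTestUser(prefix: string = "e2e"): TestUser {
  const id = randomUUID();
  return {
    email: `${prefix}-${id}@example.test`,
    password: `pw-${id}`,
    displayName: `Test user ${id.slice(0, 8)}`,
  };
}

// Usage: each test builds its own user instead of sharing one login.
const userA = buildTestUser();
const userB = buildTestUser();
console.log(userA.email);
console.log(userB.email);
```

A factory like this only covers creation; each test still needs to record what it created so that teardown can remove exactly those records.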

Submission checklist

Context statement with application type, team size, language constraints, and deployment cadence
Folder structure diagram with a one-line description of each directory
Tool selection rationale covering at least test runner, assertion library, and browser automation tool
At least one alternative considered per major tool choice with a reason for not selecting it
Test data strategy with explicit parallel-safety and cleanup approach
CI/CD trigger map showing at least three distinct trigger events and their corresponding test layers
Flaky-test governance policy with detection mechanism, quarantine path, and concrete SLA

Extension ideas

+ Add a section on contract testing (Pact or similar) between the React SPA and the Node.js API; describe how you would generate and verify a consumer-driven contract in CI
+ Propose a test metrics dashboard: define which metrics you would track (flakiness rate, mean time to green, E2E coverage delta per sprint) and how you would surface them to the team
+ Describe how you would onboard a new engineer to the framework within their first sprint, including which documentation to write and which tests to have them add as a ramp-up task
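
For the metrics dashboard idea, one possible definition of flakiness rate is the share of tests that both passed and failed on the same commit within a reporting window. That definition, the `TestRun` shape, and the `flakinessRate` name are assumptions for illustration; teams define this metric in different ways.

```typescript
// Hypothetical record of a single test execution.
interface TestRun {
  testId: string;
  commitSha: string;
  passed: boolean;
}

// Fraction of distinct tests that showed both outcomes on at least one commit.
function flakinessRate(runs: TestRun[]): number {
  const outcomes = new Map<string, number>(); // bitmask: 1 = passed, 2 = failed
  const allTests = new Set<string>();
  const flakyTests = new Set<string>();

  for (const run of runs) {
    allTests.add(run.testId);
    const key = JSON.stringify([run.testId, run.commitSha]);
    const mask = (outcomes.get(key) ?? 0) | (run.passed ? 1 : 2);
    outcomes.set(key, mask);
    if (mask === 3) {
      flakyTests.add(run.testId); // same code, different results
    }
  }
  return allTests.size === 0 ? 0 : flakyTests.size / allTests.size;
}

const sampleRuns: TestRun[] = [
  { testId: "login", commitSha: "abc", passed: false },
  { testId: "login", commitSha: "abc", passed: true }, // passed on retry
  { testId: "checkout", commitSha: "abc", passed: true },
  { testId: "search", commitSha: "abc", passed: false },
  { testId: "search", commitSha: "abc", passed: false }, // consistently failing
  { testId: "profile", commitSha: "abc", passed: true },
];
console.log(flakinessRate(sampleRuns)); // 0.25: one of four tests was flaky
```

Keying on the commit separates flakiness from genuine regressions: a test that fails after a code change and stays failing is not counted.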