Agile Testing
How testing actually works inside an Agile team — what QA does in each ceremony, how to size effort per story, what "done" means, and how the practice extends from CI into production.
Agile Testing Principles
| Principle | What it looks like in practice |
|---|---|
| Testing is continuous, not a phase | Tests run every commit; QA pairs with devs all sprint, not at the end |
| Quality is the whole team's responsibility | Devs write unit + integration tests; product owns acceptance criteria; QA orchestrates and explores |
| Fast feedback over comprehensive documentation | A bug raised in standup beats a 4-page report a week later |
| Working software over extensive test plans | Run the feature, even half-built, instead of waiting for a "complete" spec to plan against |
| Respond to change over following the plan | Re-prioritise tests when scope shifts mid-sprint; the plan serves the work, not the other way around |
| Prevention over detection | Catch issues at the requirements / design stage — cheaper than catching them in QA, much cheaper than in production |
| Shift-left — QA involved early | Review acceptance criteria, attend design reviews, review PRs, write tests before code |
Agile Testing Quadrants (Brian Marick)
A model for thinking about what kind of testing you're doing and why. Two axes: business vs technology, and supporting the team vs critiquing the product.
                      Business-facing
            ┌──────────────────────┬──────────────────────┐
            │ Q2                   │ Q3                   │
            │ Functional           │ Exploratory          │
            │ User stories         │ Usability / UAT      │
            │ Prototypes           │ Alpha / beta         │
Supporting  ├──────────────────────┼──────────────────────┤  Critiquing
the team    │ Q1                   │ Q4                   │  the product
            │ Unit tests           │ Performance          │
            │ Component tests      │ Security / load      │
            │ Contract tests       │ Soak / chaos         │
            └──────────────────────┴──────────────────────┘
                      Technology-facing
| Quadrant | Test types | Owner | Automation |
|---|---|---|---|
| Q1 — technology-facing, supporting | Unit, component, contract | Devs | Fully automated |
| Q2 — business-facing, supporting | Functional / story tests, prototypes, examples | Devs + QA | Automated where stable, manual for new behaviour |
| Q3 — business-facing, critiquing | Exploratory, usability, UAT, alpha/beta | QA + real users | Manual — judgment-driven |
| Q4 — technology-facing, critiquing | Performance, security, load, soak, chaos | Specialists + QA | Tool-driven, scheduled |
A balanced team invests in all four. A common smell: heavy in Q1 + Q2, no Q3 (no exploratory) and no Q4 (no perf/security). Bugs sneak through the gap.
Test Pyramid
The cost-and-coverage shape of an Agile test suite. Most tests at the bottom (cheap, fast); fewest at the top (slow, expensive).
╱╲
╱ ╲ ~10% E2E slowest, most fragile
╱────╲
╱ ╲ ~20% Integration
╱────────╲
╱ ╲ ~70% Unit fastest, cheapest
╱────────────╲
| Layer | Typical share | Speed | Owner | Strengths |
|---|---|---|---|---|
| Unit | ~70% | ms — runs on save | Devs | Logic errors in pure functions, edge cases, regressions in calculations |
| Integration | ~20% | seconds | Devs + QA | Component interactions, DB queries, API contracts, message handling |
| End-to-end | ~10% | tens of seconds | QA | Real user flows, deploy correctness, browser-specific behaviour |
Anti-patterns
| Anti-pattern | Shape | Why it fails |
|---|---|---|
| Ice-cream cone | Top-heavy: a large E2E layer over a thin unit base | Slow CI, brittle tests, expensive maintenance, flaky signal |
| Hourglass | Many unit + many E2E, almost no integration | Big behavioural gaps — pure-logic units pass, full flows pass, but the seams between modules silently break |
| Cupcake | Decorations on top — manual tests stacked above E2E | Manual regression on every release; release cadence drops below business needs |
The pyramid isn't a law — for some products (libraries, pure-logic services) the right shape is even more bottom-heavy. For others (UI-heavy apps), 60/25/15 is more realistic. The point: be deliberate about the ratio, not accidental.
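One way to make the ratio deliberate is to compute it from the suite itself. The sketch below assumes shares are measured by test count per layer; the `SuiteCounts` type and the thresholds inside `classify_shape` are illustrative assumptions, not standard definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteCounts:
    """Number of automated tests in each layer (hypothetical structure)."""
    unit: int
    integration: int
    e2e: int


def layer_shares(counts: SuiteCounts) -> dict[str, float]:
    """Return each layer's share of the total test count (0.0 to 1.0)."""
    total = counts.unit + counts.integration + counts.e2e
    if total == 0:
        raise ValueError("suite has no tests")
    return {
        "unit": counts.unit / total,
        "integration": counts.integration / total,
        "e2e": counts.e2e / total,
    }


def classify_shape(counts: SuiteCounts) -> str:
    """Label the suite shape. Thresholds are assumptions chosen for illustration."""
    s = layer_shares(counts)
    if s["e2e"] > s["unit"]:
        return "ice-cream cone"   # more E2E than unit tests
    if s["integration"] < 0.05 and s["e2e"] >= 0.20:
        return "hourglass"        # almost no integration layer
    if s["unit"] >= s["integration"] >= s["e2e"]:
        return "pyramid"
    return "unclassified"


print(classify_shape(SuiteCounts(unit=700, integration=200, e2e=100)))  # pyramid
print(classify_shape(SuiteCounts(unit=100, integration=100, e2e=800)))  # ice-cream cone
```

Tracking the output per sprint shows drift in the ratio before it turns into slow CI.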
QA in Scrum Ceremonies
| Ceremony | What QA brings |
|---|---|
| Backlog refinement | Review upcoming stories for testability — can we tell when this is done? Flag missing or vague acceptance criteria. Raise risks (data, performance, accessibility) before sizing |
| Sprint planning | Estimate testing effort per story; identify test approach (manual / automated / both); raise dependencies (test data, third-party stubs, env access); split stories that are too big to test in-sprint |
| Daily standup | Testing status per story; blockers (broken build, env down, awaiting fix); fresh defects worth flagging early |
| Sprint review / demo | Demo tested features; show quality metrics (coverage, defect counts, escaped bugs); gather stakeholder feedback that becomes next sprint's input |
| Sprint retrospective | Process improvements: too much regression, slow CI, flaky environment, test-data setup pain, automation gaps. The retro is where QA practice gets better — don't sit silent |
Three Amigos meeting
When a story is unclear, get a developer, a tester, and a product person together: the three amigos. The tester's role is to keep asking "what could go wrong?" and "what are the acceptance criteria for that case?" until the story is concrete enough to estimate.
Story Testing Workflow
The same-sprint flow that healthy Agile teams use. The order matters: testing tasks are spread across the sprint, not stacked at the end.
Story enters sprint
│
▼
QA reviews AC ──── gaps? ──→ raise in standup / Three Amigos
│
▼
QA writes scenarios (shift-left, before dev finishes)
│
▼
Developer builds the feature
│
▼
QA tests on dev branch or feature environment
│
├──── bug found? ──→ communicate immediately (chat / pair > ticket)
│ └─ developer fixes ─ QA verifies
▼
Regression check (automated suite + targeted manual)
│
▼
Story → Done (DoD met) → demo at review
What gets in the way
- Story arrives in code review with no test scenarios. QA wasn't pulled in early — fix at refinement, not at the PR.
- All testing happens on the last day of the sprint. Story was too big to ship + test in one sprint. Split it.
- "It works on my machine." No shared dev/feature env, or env is broken. Treat env health as a blocker, not a fact of life.
- Bugs filed but never fixed in-sprint. Carryover compounds. Cap WIP on bugs the same way you cap stories.
Definition of Done (DoD) — Testing Criteria
A story isn't done until everything below is true. Treat this as a checklist on the story card — paste it into the description if your tracker doesn't surface it natively.
□ All acceptance criteria verified (manual or automated)
□ Unit test coverage meets team threshold (e.g. ≥ 80 %)
□ Integration tests passing
□ Regression suite passing
□ No open critical or high severity defects
□ Performance benchmarks met (if perf-sensitive)
□ Accessibility checks passed (WCAG AA)
□ Cross-browser / cross-device tested per support matrix
□ Code reviewed and approved
□ Documentation updated (user-facing, API, runbook)
□ Telemetry / logging in place
Some teams also add: feature flag added (if behind one), translations updated, analytics event wired, security review checked off.
The exact list depends on the team — but every team should have an explicit DoD. "We'll know it when we see it" is how regressions ship.
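If the tracker exposes story data, the checklist can be evaluated mechanically. This is a minimal sketch: the `Story` type, the `SHOP-1` key, and the convention that `None` means "not applicable" are assumptions for illustration, and the criteria names are shortened from the checklist above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Criteria taken from the checklist above; adjust to the team's own DoD.
DOD_CRITERIA = [
    "All acceptance criteria verified",
    "Unit test coverage meets team threshold",
    "Integration tests passing",
    "Regression suite passing",
    "No open critical or high severity defects",
    "Performance benchmarks met",
    "Accessibility checks passed",
    "Cross-browser / cross-device tested",
    "Code reviewed and approved",
    "Documentation updated",
    "Telemetry / logging in place",
]


@dataclass
class Story:
    key: str
    # True = met, False = unmet, None = not applicable (e.g. not perf-sensitive).
    checks: dict[str, Optional[bool]] = field(default_factory=dict)


def unmet_criteria(story: Story, criteria: list[str] = DOD_CRITERIA) -> list[str]:
    """Return criteria that are unmet. A criterion with no recorded value counts as unmet."""
    return [c for c in criteria if story.checks.get(c, False) is False]


story = Story("SHOP-1", {c: True for c in DOD_CRITERIA})
story.checks["Regression suite passing"] = False
story.checks["Performance benchmarks met"] = None  # not applicable for this story
print(unmet_criteria(story))  # ['Regression suite passing']
```

Treating a missing entry as unmet keeps the check strict: a story cannot pass by leaving a box blank.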
Acceptance Criteria & BDD
INVEST — what makes a good user story
| Letter | Means | Tester's lens |
|---|---|---|
| Independent | Can be developed without depending on another story | Can it be tested in isolation? |
| Negotiable | Detail can shift during refinement | Are the AC firm enough to derive cases, or still TBD? |
| Valuable | Delivers value to a user or stakeholder | Can you state the business outcome it enables? |
| Estimable | Team can size the effort | Is testing effort included in the estimate? |
| Small | Fits in one sprint | Can I test all the AC inside the sprint? |
| Testable | Acceptance criteria are verifiable | Can I write a pass/fail test for each AC? |
If you can't answer the testability question, the story isn't ready. Send it back to refinement.
Given / When / Then format
The standard structure for acceptance criteria in Agile + BDD teams. Each scenario reads as one observable outcome.
| Clause | Purpose |
|---|---|
| Given | Pre-existing state: the world as it is before the action |
| When | The action: exactly one event that triggers the behaviour |
| Then | The expected outcome: what must be true after the action |
| And / But | Additional Given/When/Then clauses |
Worked example
Given I am a logged-in user
And my cart is empty
When I add an item to my cart
Then the cart count should increase by 1
And I should see the item in the cart summary
Read top to bottom: the scenario is concrete, observable, and binary. The Then clauses are what the test will assert.
Multiple scenarios per story
Most stories need 3–6 scenarios — at minimum, one happy path plus the obvious failure modes.
Scenario: Add an item to an empty cart
Given I am a logged-in user
And my cart is empty
When I add "Mountain Bike" to my cart
Then the cart count should be 1
And the cart summary should list "Mountain Bike"
Scenario: Add an out-of-stock item
Given I am a logged-in user
And my cart is empty
When I attempt to add an out-of-stock item to my cart
Then I should see an "Out of stock" message
And the cart should remain empty
Scenario: Add an item while logged out
Given I am not logged in
When I attempt to add an item to my cart
Then I should be redirected to the login page
And the item should be added to the cart after I log in
Converting acceptance criteria to test cases
Each scenario in Given/When/Then form maps directly to a test case. The scope of the scenario determines the level at which the test runs:
| AC scenario level | Where the test runs |
|---|---|
| Pure logic / domain rule | Unit test |
| Service interaction | Integration test |
| End-to-end user flow | E2E (Cypress / Playwright / Selenium) |
The same Given/When/Then text can drive a manual test, a Cucumber/SpecFlow scenario, or be paraphrased into a Playwright test() block — pick the level that matches the AC's scope, not always the highest.
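As an illustration of the lowest level, two of the cart scenarios can be paraphrased as plain unit tests. The `Cart` class and `OutOfStockError` are hypothetical stand-ins for the real domain code; the login steps are omitted because they fall outside unit scope.

```python
class OutOfStockError(Exception):
    """Raised when an item with no remaining stock is added (hypothetical)."""


class Cart:
    """Hypothetical in-memory cart, just enough to exercise the scenarios."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)
        self.items: list[str] = []

    def add(self, name: str) -> None:
        if self._stock.get(name, 0) <= 0:
            raise OutOfStockError(name)
        self._stock[name] -= 1
        self.items.append(name)

    @property
    def count(self) -> int:
        return len(self.items)


def test_add_item_to_empty_cart() -> None:
    # Given my cart is empty
    cart = Cart(stock={"Mountain Bike": 3})
    # When I add "Mountain Bike" to my cart
    cart.add("Mountain Bike")
    # Then the cart count should be 1 and the summary lists the item
    assert cart.count == 1
    assert "Mountain Bike" in cart.items


def test_add_out_of_stock_item() -> None:
    # Given my cart is empty and the item has no stock
    cart = Cart(stock={"Mountain Bike": 0})
    # When I attempt to add the out-of-stock item
    try:
        cart.add("Mountain Bike")
        rejected = False
    except OutOfStockError:
        rejected = True
    # Then the attempt is rejected and the cart remains empty
    assert rejected
    assert cart.count == 0
```

Keeping the Given/When/Then wording as comments preserves the link back to the acceptance criteria even when no BDD tool is involved.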
Automation of acceptance tests
When AC are written in Gherkin, automation is mostly glue:
| Tool | Language | Native to |
|---|---|---|
| Cucumber | Java, JS/TS, Ruby, Python, others | Most ecosystems |
| SpecFlow | C# / .NET | Visual Studio |
| Behave | Python | pytest-adjacent |
| Robot Framework | Python (keyword-driven, BDD-like) | Acceptance + RPA |
| Karate | Java (Gherkin for API testing) | API-first BDD |
The benefit isn't speed of writing — it's that the AC become the test artefact. Product, dev, and QA all see the same Given/When/Then; nobody hand-translates between a Word doc and a code file.
The cost: discipline. If step definitions become a tangled mess of generic When I click {string} steps, you've lost the readability advantage. Keep step phrasing domain-specific, not technology-specific.
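To show what the "glue" amounts to, here is a toy step registry written with the standard library only. It is not the API of Cucumber, Behave, or any other tool listed above; `step`, `run_scenario`, and the step functions are hypothetical names. The step phrases are domain-specific (cart, item) rather than UI-specific (click, type).

```python
import re
from typing import Callable

# Registry of (compiled pattern, step function) pairs.
STEPS: list[tuple[re.Pattern, Callable[..., None]]] = []


def step(pattern: str) -> Callable:
    """Decorator that registers a function as the definition for a step phrase."""
    def register(func: Callable[..., None]) -> Callable[..., None]:
        STEPS.append((re.compile(pattern), func))
        return func
    return register


def run_scenario(lines: list[str], context: dict) -> None:
    """Run each line against the registry; fail loudly on an undefined step."""
    for line in lines:
        # Drop the leading Gherkin keyword so one definition serves Given/And/etc.
        text = re.sub(r"^(Given|When|Then|And|But)\s+", "", line.strip())
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(context, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {line!r}")


@step(r"my cart is empty")
def cart_is_empty(ctx: dict) -> None:
    ctx["cart"] = []


@step(r'I add "(.+)" to my cart')
def add_item(ctx: dict, name: str) -> None:
    ctx["cart"].append(name)


@step(r"the cart count should be (\d+)")
def cart_count_is(ctx: dict, expected: str) -> None:
    assert len(ctx["cart"]) == int(expected)


context: dict = {}
run_scenario(
    [
        "Given my cart is empty",
        'When I add "Mountain Bike" to my cart',
        "Then the cart count should be 1",
    ],
    context,
)
```

Real BDD frameworks add parsing, reporting, and hooks on top, but the core mapping from phrase to function is the same idea.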
Continuous Testing
The pipeline-driven extension of Agile testing — every commit verified through a layered test suite that gets slower as confidence grows.
Every commit triggers automated tests
The pipeline runs the same tests for every PR and every merge. Local "but it works on my machine" loses to CI as the source of truth.
The standard pipeline shape
commit
│
▼
┌──────────┐ fail-fast — runs in seconds
│ lint │
└────┬─────┘
▼
┌──────────┐ fast — isolated, no I/O
│ unit │
└────┬─────┘
▼
┌──────────────┐ medium — DB, message bus, HTTP mocks
│ integration │
└────┬─────────┘
▼
┌──────────┐ slow — full browser, real services
│ E2E │
└────┬─────┘
▼
┌──────────────┐ optional / scheduled — load, soak
│ performance │
└──────────────┘
Each stage gates the next. A failure in unit aborts the run before E2E even starts, so the cheapest checks fail first and feedback arrives sooner.
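The gating behaviour can be sketched in a few lines. In a real setup the CI system does this through its own configuration; `run_pipeline` and the stage callables here are hypothetical and only model the fail-fast ordering.

```python
from typing import Callable


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, str]]:
    """Run stages in order; once one fails, mark every later stage as skipped."""
    results: list[tuple[str, str]] = []
    failed = False
    for name, check in stages:
        if failed:
            # A previous stage failed, so this one never starts.
            results.append((name, "skipped"))
            continue
        passed = check()
        results.append((name, "passed" if passed else "failed"))
        failed = not passed
    return results


results = run_pipeline(
    [
        ("lint", lambda: True),
        ("unit", lambda: False),       # simulated unit-test failure
        ("integration", lambda: True),
        ("e2e", lambda: True),
    ]
)
for name, status in results:
    print(f"{name}: {status}")
# lint: passed
# unit: failed
# integration: skipped
# e2e: skipped
```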
Fast-feedback budget
| Stage | Target time | What this means in practice |
|---|---|---|
| Lint | < 30 s | Pre-commit hook catches it before CI fires at all |
| Unit | < 2 min | Devs trust the green light enough to keep flowing |
| Integration | < 5 min | Acceptable to wait on a PR |
| E2E | < 15 min total | Sharded across runners; each shard < 5 min |
| Performance | scheduled / nightly | Not blocking PRs, but visible to the team |
If the unit stage takes 20 minutes, devs stop running it locally. If E2E takes 90 minutes, devs stop reading the failures. Slow tests get bypassed — speed is correctness.
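A budget is only useful if something checks it. The sketch below encodes the blocking-stage targets from the table as seconds and reports overshoot; the `BUDGET_SECONDS` mapping and `over_budget` function are hypothetical names, and how durations are collected from CI is left out.

```python
# Targets from the fast-feedback budget table, in seconds.
BUDGET_SECONDS = {
    "lint": 30,
    "unit": 2 * 60,
    "integration": 5 * 60,
    "e2e": 15 * 60,
}


def over_budget(durations: dict[str, float]) -> dict[str, float]:
    """Return, per stage, how many seconds the stage exceeded its target."""
    return {
        stage: seconds - BUDGET_SECONDS[stage]
        for stage, seconds in durations.items()
        if stage in BUDGET_SECONDS and seconds > BUDGET_SECONDS[stage]
    }


# The two failure cases described above: a 20-minute unit stage and a 90-minute E2E stage.
print(over_budget({"lint": 12, "unit": 20 * 60, "e2e": 90 * 60}))
# {'unit': 1080, 'e2e': 4500}
```

Failing the build, or at least posting a warning, when this mapping is non-empty keeps the budget from eroding quietly.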
Shift-right — testing extends into production
Continuous testing doesn't stop at deploy. The complement of shift-left is shift-right: learn from production.
| Practice | What it is | What it catches |
|---|---|---|
| Synthetic monitoring | Automated probes hit production from outside (Pingdom, Datadog, Checkly, Grafana k6 cloud) | Outages, latency regressions, broken third-party integrations, cert expiry |
| Real-user monitoring (RUM) | Browser SDK reports real-user load times, errors, click flows | Browser-specific bugs, slow flows for real users on real networks |
| Canary deployments | Roll new version to 1% → 10% → 50% → 100% over hours/days | Regressions visible at low blast radius before wide rollout |
| Feature flags | Ship dark, enable for a small cohort, then everyone | Test in production safely; instant rollback without redeploy |
| Error tracking | Sentry / Rollbar / Bugsnag capture exceptions with stack + breadcrumbs | Bugs that don't reproduce locally; regressions that escape pre-prod tests |
| Chaos engineering | Deliberate failure injection — kill instances, drop traffic, slow networks | Resilience gaps; recovery timing assumptions |
Feature flags — ship dark, then test
Decouples deploy from release. Code reaches production behind a flag; the flag stays off until tested. Switch on for QA, then internal users, then real users.
deploy (flag off, no behaviour change)
│
▼
flag-on for QA-only cohort ──────────┐
│ │
▼ │
flag-on for 1% of real users ├──── monitor production
│ │
▼ │
flag-on for 100% │
│ │
▼ │
remove flag from code ◄───────────────┘
If anything goes wrong at any step, flip the flag off. No deployment rollback or redeploy is needed.
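The staged rollout above can be modelled with a master switch, a QA cohort, and a percentage. This is a sketch under assumptions: the `Flag` type, the `new-checkout` flag name, and the SHA-256 bucketing scheme are illustrative, not the behaviour of any particular feature-flag product.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    name: str
    enabled: bool = False                    # master switch: False means off for everyone
    rollout_percent: int = 0                 # share of non-QA users who see the feature
    qa_users: frozenset[str] = frozenset()   # always on while the master switch is on


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so repeated checks agree."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    if not flag.enabled:
        return False          # flipped off: nobody sees the feature
    if user_id in flag.qa_users:
        return True           # QA-only cohort stage
    return bucket(flag.name, user_id) < flag.rollout_percent


# Stage 1: deployed dark. Stage 2: QA only. Stage 3: 1% of real users.
dark = Flag("new-checkout")
qa_only = Flag("new-checkout", enabled=True, qa_users=frozenset({"qa-1"}))
one_percent = Flag("new-checkout", enabled=True, rollout_percent=1,
                   qa_users=frozenset({"qa-1"}))

print(is_enabled(dark, "qa-1"))       # False
print(is_enabled(qa_only, "qa-1"))    # True
print(is_enabled(qa_only, "user-42")) # False
```

Because the bucket depends only on the flag name and user id, a user who is in the 1% cohort stays in the cohort when the percentage is raised.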
A/B testing — validate with real users
Run the new version (B) against the old (A) for two cohorts of real users. Compare outcomes:
| What you measure | Example |
|---|---|
| Conversion | % completing the funnel |
| Engagement | Time on page, click-through |
| Errors | Crash rate, validation failure rate |
| Performance | LCP, INP, time-to-interactive |
The QA role isn't to pick the winner — it's to make sure the experiment is measurable (instrumentation present, metrics defined, sample size adequate) and that both arms are equally tested before launch.
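To check that a conversion difference is measurable rather than noise, one common option is a pooled two-proportion z-test. The sketch below is an assumption about method, not something the practice mandates; the function names and the sample numbers are illustrative, and a real experiment should fix the metric and sample size before launch.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference in conversion rate between arm B and arm A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one user")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    return (p_b - p_a) / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative numbers: 100 of 1000 users convert on A, 150 of 1000 on B.
z = two_proportion_z(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
p = two_sided_p_value(z)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With small cohorts the same rate difference yields a much larger p-value, which is why sample size belongs on the pre-launch checklist.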
Agile Testing Metrics
Metrics in Agile aren't management report fodder — they're feedback for the team. Pick a small set and watch the trend, not the absolute number.
Defect-related metrics
| Metric | Definition | Target | Smell when |
|---|---|---|---|
| Defect density | Defects ÷ stories (or ÷ KLOC) | Trends down sprint over sprint | Spikes — usually a story too big or AC too thin |
| Escaped defects | Bugs found in production that pre-prod tests missed | As close to 0 as the team can sustain | Trending up — coverage gaps; review post-mortems |
| Defect resolution time | Mean time from "reported" → "fixed and verified" | < 2 days for high-severity | Bugs piling up — WIP-cap them |
| Reopened defect rate | % of defects re-opened after marked "fixed" | < 5% | Fix verifications too shallow; missing regression coverage |
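These metrics are simple enough to compute from a tracker export. The sketch below assumes a hypothetical `Defect` record with the fields shown, and it assumes the reopened rate is taken over all defects in the list; a team may prefer a different denominator.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Defect:
    reported: datetime
    verified: Optional[datetime] = None   # None while the defect is still open
    found_in_production: bool = False
    reopened: bool = False


def escaped_defects(defects: list[Defect]) -> int:
    """Count defects that were found in production."""
    return sum(1 for d in defects if d.found_in_production)


def reopened_rate(defects: list[Defect]) -> float:
    """Share of defects that were re-opened after being marked fixed."""
    if not defects:
        return 0.0
    return sum(1 for d in defects if d.reopened) / len(defects)


def mean_resolution_days(defects: list[Defect]) -> Optional[float]:
    """Mean days from reported to fixed-and-verified, over resolved defects only."""
    durations = [
        (d.verified - d.reported).total_seconds() / 86400
        for d in defects
        if d.verified is not None
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)


defects = [
    Defect(datetime(2024, 1, 1), datetime(2024, 1, 2)),
    Defect(datetime(2024, 1, 1), datetime(2024, 1, 4), reopened=True),
    Defect(datetime(2024, 1, 1), found_in_production=True),
    Defect(datetime(2024, 1, 1), datetime(2024, 1, 3)),
]
print(escaped_defects(defects))        # 1
print(reopened_rate(defects))          # 0.25
print(mean_resolution_days(defects))   # 2.0
```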
Coverage metrics
| Metric | Definition | What it actually tells you |
|---|---|---|
| Acceptance test coverage | % of acceptance criteria with at least one automated test | Confidence the AC won't regress silently |
| Code coverage | % of source lines / branches executed by tests | Useful when trending; useless as an absolute target — 100% covered code can still be untested logic |
| Requirements coverage | % of user stories with at least one test case | Higher level than code coverage — better signal for product completeness |
Code coverage as a target is gameable; as a trend, it's a sensible early warning.
Velocity & process metrics
| Metric | Definition | Tester's read |
|---|---|---|
| Velocity impact | How testing effort affects team velocity per sprint | If velocity drops every time a sprint includes UI testing, the test debt is real |
| Sprint burndown — testing tasks | Testing work as part of the sprint burndown chart | Testing should burn down alongside dev — not stack at the end |
| Stories rolled over | Stories that couldn't be marked "Done" because testing wasn't complete | Persistent rollover means testing capacity is short of dev capacity |
| Cycle time | Time from "in progress" → "done" per story | Includes testing — long cycle times often mean late testing |
Automation metrics
| Metric | Definition | Healthy range |
|---|---|---|
| Automation ratio | Automated tests ÷ total tests | Trending up; the absolute % depends on the product |
| Automation coverage of regression suite | % of regression test cases automated | High — manual regression is the slowest path to release |
| Test execution time | Wall-clock time of the full automated suite | Stable or shrinking; growth past the "fast feedback budget" needs sharding or pruning |
| Flakiness rate | % of automated tests that fail and then pass on retry with no code change | < 1% per test, < 5% suite-wide. Above that, devs stop trusting CI |
| Test maintenance ratio | Time spent fixing tests ÷ time writing new tests | If fixes dominate, the suite is over-coupled to UI internals — refactor |
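Flakiness can be estimated from CI history. The sketch below assumes run records of the form `(test name, commit SHA, passed)` and treats a test as flaky when the same commit produced both a pass and a fail; the record shape and function names are hypothetical.

```python
from collections import defaultdict

# One record per test execution: (test name, commit SHA, passed?)
Run = tuple[str, str, bool]


def flaky_tests(runs: list[Run]) -> set[str]:
    """Tests that both passed and failed on the same commit (no code change)."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return {test for (test, _commit), seen in outcomes.items() if seen == {True, False}}


def flakiness_rate(runs: list[Run]) -> float:
    """Share of distinct tests in the history that showed flaky behaviour."""
    all_tests = {test for test, _commit, _passed in runs}
    if not all_tests:
        return 0.0
    return len(flaky_tests(runs)) / len(all_tests)


runs = [
    ("test_login", "abc123", False),   # fails...
    ("test_login", "abc123", True),    # ...then passes on retry: flaky
    ("test_cart", "abc123", True),
    ("test_search", "abc123", False),  # fails on one commit,
    ("test_search", "def456", True),   # passes after a code change: not flaky
    ("test_profile", "abc123", True),
]
print(sorted(flaky_tests(runs)))  # ['test_login']
print(flakiness_rate(runs))       # 0.25
```

Grouping by commit is what separates a flaky test from a genuine fix: a result that changes only after the code changes is the suite doing its job.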
Pick the smallest set that drives action
Reporting 12 metrics that nobody acts on is worse than reporting 3 that the team does act on. A practical starter dashboard:
- Escaped defects this release — the only one product cares about.
- CI build time — fast-feedback budget; team productivity.
- Flakiness rate — trust in the suite; if it climbs, fix it that sprint.
- Stories rolled over due to testing — capacity signal.
Add more only when you have a question those four don't answer.