Visual Verification with Vision Mode

Vision mode is the lever for the work snapshot mode can't do — comparing layouts, reviewing designs against the live page, checking responsive breakpoints, surfacing visual accessibility issues. This lesson focuses on three workflows where vision genuinely earns its (higher) cost: staging-vs-production diffs, design-mockup verification, and visual accessibility review. It also draws a clear line between what vision mode does well and what dedicated visual-regression tools (Percy, Chromatic, Playwright's own toHaveScreenshot) do better.

The mental model: vision mode is a human-style review delivered by the assistant — interpretive, descriptive, narrative. Pixel-perfect diffing is a different category of tool. The two complement each other; neither replaces the other.

Three workflows that pay off

1. Staging vs production diff.

Open https://staging.myshop.com/products in vision mode and take a full-page
screenshot. Then open https://www.myshop.com/products and take another.
 
Compare the two and list visual differences. Group findings under:
- Layout (positions, alignments, spacing)
- Text content (copy changes, missing content)
- Images (different, missing, broken)
- Colour and styling (palette, weight, type)
 
Ignore: rotating banners, randomised product order, timestamps, A/B test variations
on the staging side. If you're unsure whether a difference is intentional, flag it
under "Needs human review" rather than dropping it.

The output is a structured diff that reads like a designer's review: "the staging hero shows a different headline," "the trust badges row is missing on production," "primary buttons on staging use a darker blue (#1B4DAD vs production's #2563EB)." You scan the list, ignore the intentional changes, and triage the rest.
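Because the ignore list tends to grow each time a false positive turns up, it can help to assemble the prompt from parts rather than retype it. The sketch below is one way to do that; `buildDiffPrompt`, `DiffPromptOptions`, and `CATEGORIES` are hypothetical names introduced for illustration, not part of any tool.

```typescript
// Hypothetical helper: assembles the staging-vs-production prompt from parts,
// so the ignore list can grow without retyping the whole prompt.
interface DiffPromptOptions {
  stagingUrl: string;
  productionUrl: string;
  ignore: string[]; // elements expected to differ between the two captures
}

// The four finding groups used in the prompt above.
const CATEGORIES: string[] = [
  "Layout (positions, alignments, spacing)",
  "Text content (copy changes, missing content)",
  "Images (different, missing, broken)",
  "Colour and styling (palette, weight, type)",
];

function buildDiffPrompt(opts: DiffPromptOptions): string {
  const lines: string[] = [
    `Open ${opts.stagingUrl} in vision mode and take a full-page screenshot.`,
    `Then open ${opts.productionUrl} and take another.`,
    "",
    "Compare the two and list visual differences. Group findings under:",
    ...CATEGORIES.map((c) => `- ${c}`),
    "",
  ];
  // Only emit the ignore line when there is something to ignore.
  if (opts.ignore.length > 0) {
    lines.push(`Ignore: ${opts.ignore.join(", ")}.`);
  }
  lines.push(
    "If you're unsure whether a difference is intentional, flag it under \"Needs human review\" rather than dropping it.",
  );
  return lines.join("\n");
}

// Example: the product-listing comparison with two known noise sources.
const examplePrompt = buildDiffPrompt({
  stagingUrl: "https://staging.myshop.com/products",
  productionUrl: "https://www.myshop.com/products",
  ignore: ["rotating banners", "randomised product order"],
});
console.log(examplePrompt);
```

Keeping the ignore list as data means a false positive found during triage becomes a one-line addition before the next run.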

2. Design-mockup verification.

Here is the Figma export of the upcoming homepage redesign [attached image].
Open https://staging.myshop.com/ in vision mode. Walk through the live page and
flag every place the implementation deviates from the mockup. Be specific about
what differs and where.

The assistant alternates between the mockup and the live screenshots, calling out missing icons, off-spec spacing, type-weight mismatches. It's not pixel-perfect — but it's much faster than a designer reading a long checklist.

3. Visual accessibility review.

Take a screenshot of the checkout page at viewport 375×812 (iPhone portrait).
Identify visual accessibility issues:
 
- Low contrast text (call out the specific element + its foreground/background colour pair)
- Touch targets smaller than 44×44px
- Missing or barely-visible focus indicators (capture a screenshot after each of
  the first few Tab presses)
- Text overflowing or being clipped
- Anything that would fail a WCAG 2.1 AA review at first glance
 
Then resize to 1366×768 and repeat.

This isn't a substitute for axe-core or a full accessibility audit, but it surfaces the visible issues: the ones that affect sighted users with low vision or limited dexterity even when the markup is semantically correct. Pair it with the snapshot-mode accessibility audit from Chapter 2 (which catches semantic issues) for a credible first pass.
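Colour values read from a screenshot are approximate, so it is worth checking any reported "low contrast" pair against the WCAG formula before filing it. The sketch below implements the WCAG 2.1 relative-luminance and contrast-ratio definitions; the function names are illustrative, and it assumes the colours arrive as 6-digit hex strings.

```typescript
// Parses "#RRGGBB" (or "RRGGBB") into three 0-255 channel values.
function parseHex(hex: string): [number, number, number] {
  const m = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!m) throw new Error(`Expected a 6-digit hex colour, got "${hex}"`);
  const n = parseInt(m[1], 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Relative luminance as defined by WCAG 2.1 (sRGB channels linearised first).
function relativeLuminance(hex: string): number {
  const [r, g, b] = parseHex(hex).map((v) => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// WCAG 2.1 AA requires 4.5:1 for normal text and 3:1 for large text.
function passesAA(foreground: string, background: string, largeText = false): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}

// Example: mid-grey text on white sits right at the AA boundary.
console.log(contrastRatio("#767676", "#ffffff").toFixed(2), passesAA("#767676", "#ffffff"));
```

If the pair the assistant reports passes this check, treat the finding as "needs human review" rather than a confirmed failure, since the sampled colours may be slightly off.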

Where vision mode beats and where it loses

Vision-mode review vs dedicated visual-diff tools

Vision mode

- Interpretive: explains what changed and why it matters
- Handles "should I care?" triage
- Cheap to start, with no platform setup
- Fuzzy on small pixel changes; per-call cost

Pixel diff tools (Percy, Chromatic, toHaveScreenshot)

- Pixel-perfect, deterministic, scriptable
- Best for CI gating and regression detection
- Noisy without careful baseline management
- Hosted platforms (Percy, Chromatic) need setup and bill per screenshot; toHaveScreenshot runs inside your existing suite at no per-screenshot cost, but baseline upkeep is on you

Reach for vision mode when the question is "is this layout sensible?" or "did anything obvious change between these two builds?" Reach for pixel-diff tools when the question is "did any pixel change in this CI run vs the last green main?" Mixing the two — vision for review, pixel diff for gating — covers both modes of failure.

The Playwright-native visual baseline

For deterministic visual regression in your existing suite, you don't need a separate platform:

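import { test, expect } from '@playwright/test';

test('homepage visual regression', async ({ page }) => {
  // Relative URL: resolved against baseURL in your Playwright config
  await page.goto('/');
  // Wait for network activity to settle so late-loading assets are painted
  await page.waitForLoadState('networkidle');
  // Compare against the stored baseline; tolerate up to 1% differing pixels
  await expect(page).toHaveScreenshot('homepage.png', {
    maxDiffPixelRatio: 0.01,
  });
});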

The first run writes baseline images to a snapshots folder next to the test file; commit them to your repo. Each later run diffs against those baselines and fails when the difference exceeds the configured threshold. maxDiffPixelRatio sets the noise floor: 0.01 allows up to 1% of pixels to differ, which absorbs anti-aliasing differences. Use this for the screens that matter for branding and revenue. Use vision mode for the screens you only check on big releases.
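To get a feel for what a ratio of 0.01 tolerates, it helps to turn it into a pixel budget. The arithmetic sketch below assumes a 1280×720 capture purely as an example; the helper names are hypothetical, and Playwright's own comparison also applies a per-pixel colour threshold, so treat the result as an approximation of the budget rather than a reimplementation.

```typescript
// Approximate number of pixels that may differ before the ratio is exceeded.
function maxAllowedDiffPixels(width: number, height: number, ratio: number): number {
  return Math.floor(width * height * ratio);
}

// True when the share of changed pixels is above the allowed ratio.
function exceedsBudget(changedPixels: number, width: number, height: number, ratio: number): boolean {
  return changedPixels / (width * height) > ratio;
}

const WIDTH = 1280; // assumed capture size for this example
const HEIGHT = 720;
const RATIO = 0.01; // same value as maxDiffPixelRatio in the test above

// 1280 * 720 = 921,600 pixels, so a 1% budget is 9,216 pixels.
console.log(maxAllowedDiffPixels(WIDTH, HEIGHT, RATIO));

// A single full-width row of changed pixels (1,280) stays inside the budget...
console.log(exceedsBudget(WIDTH * 1, WIDTH, HEIGHT, RATIO));

// ...while a full-width 64px banner (81,920 pixels) is well over it.
console.log(exceedsBudget(WIDTH * 64, WIDTH, HEIGHT, RATIO));
```

The practical consequence is that a very small visual change can pass at 0.01. For screens where single-pixel shifts matter, a lower ratio trades more anti-aliasing noise for tighter detection.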

A combined workflow that works in practice

For a mid-size release where the visual surface has changed:

1. Vision-mode pre-flight: run the staging-vs-production prompt against the changed pages. Triage the structured findings. File any unexpected diffs as bugs to fix before merge.
2. Update Playwright baselines for the screens whose changes are intentional. npx playwright test --update-snapshots regenerates the stored images.
3. Run pixel-diff in CI on the next push. This catches accidental regressions on screens you didn't think had changed, such as a CSS-cascade side effect from a tweak elsewhere.
4. Vision mode for the post-deploy smoke: "compare https://www.myshop.com today against the staging screenshots from yesterday. Anything missing?" This catches deploy-time issues that the CI pixel diff couldn't see (CDN cache, asset upload, env-specific config).

That cycle — review, baseline, gate, smoke — is the realistic shape of visual testing in 2026. Vision mode covers the human-judgement steps; pixel diff covers the deterministic ones.

Costs that are easy to forget

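Vision mode sends images to the model, and a screenshot typically consumes far more tokens than a text snapshot of the same page. As a working estimate, budget roughly 5–15× the cost of a snapshot-only turn for each turn that includes a screenshot; the exact figure depends on the model and the image resolution. Three habits keep costs predictable: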

- Capture targeted regions, not full pages, when the question is local. "Take a screenshot of just the header" costs a fraction of a full-page screenshot.
- Don't run vision in tight loops. A 30-step session in default vision mode (every turn includes a screenshot) bills like ten snapshot-only sessions. Use vision opportunistically: "snapshot for navigation, screenshot only at the verification step."
- Cache baselines outside the chat. If you're running staging-vs-prod weekly, save the production screenshot once and only re-capture the staging side. The model can compare against an attached image without a fresh capture.
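The "ten snapshot-only sessions" figure can be reproduced with simple arithmetic. The sketch below counts cost in relative units, where one snapshot-only turn costs 1; the multiplier of 10 is an assumption taken from the middle of the 5–15× range above, and `SessionPlan` and `relativeCost` are hypothetical names.

```typescript
// A session described by how many turns carry a screenshot and how many do not.
interface SessionPlan {
  snapshotTurns: number; // text snapshot only
  visionTurns: number; // turn includes a screenshot
}

// Relative cost in "snapshot-turn units". visionMultiplier is an assumed
// mid-range value, not a measured price.
function relativeCost(plan: SessionPlan, visionMultiplier = 10): number {
  return plan.snapshotTurns + plan.visionTurns * visionMultiplier;
}

const snapshotOnly = relativeCost({ snapshotTurns: 30, visionTurns: 0 }); // 30 units
const allVision = relativeCost({ snapshotTurns: 0, visionTurns: 30 }); // 300 units
const opportunistic = relativeCost({ snapshotTurns: 29, visionTurns: 1 }); // 39 units

// 300 / 30 = 10: one all-vision session costs as much as ten snapshot-only ones.
console.log(allVision / snapshotOnly);

// Screenshot only at the verification step: 39 / 30 = 1.3x the snapshot-only cost.
console.log(opportunistic / snapshotOnly);
```

Under these assumptions, moving from "screenshot every turn" to "screenshot at the verification step" cuts a 30-step session from 300 units to 39.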

⚠️ Common mistakes

- Using vision mode where pixel diff is the right tool. "Did any pixel change between this run and the last main build?" is a job for toHaveScreenshot, not a vision prompt: it is deterministic, free at runtime, and gateable in CI. Vision mode for the same question is slow, costly, and noisy.
- Trusting a visual review to catch logic bugs. A page can look like it accepted a coupon and have rejected it server-side. Visual verification is about appearance, not behaviour. Pair with snapshot or network assertions whenever the test is really about whether the system did the right thing.
- Running visual reviews against unstable pages. Pages with rotating banners, randomised product order, or live counters produce a flood of "differences" that are pure noise. Either freeze the variability (seeded test data, feature-flagged stable mode) or list those elements explicitly in the "ignore" section of the prompt.

🎯 Practice task

Run all three vision workflows against your real app, then add a pixel-diff gate. 35 minutes.

1. Staging vs production: pick a page where staging is currently ahead of production. Run the diff prompt. Read the structured findings and confirm one finding manually before trusting the rest. Note any false positives that came from rotating content; add them to the "ignore" list and re-run if needed.
2. Mockup verification: if you have a Figma export of any in-flight design, attach it to a chat and run the design-deviation prompt against the staging implementation. List every flagged deviation in the design-review channel and resolve as "intentional" or "fix."
3. Visual accessibility review: pick the most-trafficked page on your app and run the accessibility prompt at both mobile and desktop viewports. Triage findings against your team's accessibility rubric.
4. Pixel-diff baseline: for one of the screens you just reviewed, add a Playwright toHaveScreenshot test. Commit the baseline. Confirm it passes, then make a clearly visible CSS change on the screen (for example, a different header background colour) and confirm it now fails. A 1px tweak may slip under the maxDiffPixelRatio tolerance, so lower the ratio if you want to catch changes that small. This is the deterministic gate that complements the vision-based review.
5. Stretch: automate the staging-vs-prod review as a daily run. (See the cost-and-latency lesson at the end of the course before scheduling vision-heavy jobs: daily-on-five-pages is fine; per-commit-on-twenty isn't.)

The final lesson of this chapter is the closing skill: when an existing test fails in CI, AI-driven debugging shrinks the "what just broke?" loop from minutes of trace-scrubbing to a single explanatory turn.