Visual Regression Testing
Catch layout shifts, colour changes, and rendering regressions automatically — before users do. Learn Playwright's built-in screenshot comparison and Percy/Chromatic for baseline-based visual diffing.
1 The Hook — Why This Matters
A Wellington SaaS company pushes a CSS refactor late on a Friday afternoon. The PR passes all 230 functional tests — every button still clicks, every form still submits, every API call still returns 200. The team merges with confidence.
By Monday morning, a client in Sydney has emailed: the sidebar font is smaller than before, the hero image overflows on mobile, and the card borders on the dashboard have disappeared entirely. Three separate visual regressions, all introduced by one CSS refactor, all invisible to functional tests.
Functional tests verify that the app works. Visual regression tests verify that the app looks correct. These are not the same thing. A button can function perfectly while being rendered completely off the visible viewport. A form can submit successfully while the submit button has zero contrast. Visual regression tests are the safety net that catches the gap.
2 The Rule — The One-Sentence Version
Visual regression tests capture a known-good baseline snapshot and fail when pixels deviate beyond a threshold. Run them in CI on every PR that touches CSS or shared components.
The first run creates the baseline. Every subsequent run compares pixel-by-pixel against it. Deviations above your configured threshold fail the build — giving the team an explicit choice: accept the change (update the baseline) or fix the regression.
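To make the rule concrete, here is a minimal sketch of a baseline comparison. It assumes raw RGBA pixel buffers and uses a deliberately simple per-channel difference. The names `Snapshot`, `compareSnapshots`, and `solid` are hypothetical and exist only for this illustration. Real tools, Playwright included, use more sophisticated colour-difference logic, so treat this as a model of the idea rather than anyone's actual algorithm.

```typescript
// Hypothetical types and helpers, for illustration only.
interface Snapshot {
  width: number;
  height: number;
  data: Uint8Array; // RGBA bytes, 4 per pixel
}

interface CompareOptions {
  threshold: number;     // per-pixel colour tolerance, 0 (strict) to 1 (lax)
  maxDiffPixels: number; // how many pixels may exceed that tolerance
}

interface CompareResult {
  diffPixels: number;
  pass: boolean;
}

function compareSnapshots(
  baseline: Snapshot,
  actual: Snapshot,
  options: CompareOptions,
): CompareResult {
  // A change in dimensions is treated as a full mismatch.
  if (baseline.width !== actual.width || baseline.height !== actual.height) {
    return { diffPixels: baseline.width * baseline.height, pass: false };
  }
  let diffPixels = 0;
  for (let i = 0; i < baseline.data.length; i += 4) {
    // Largest single-channel difference for this pixel, normalised to 0..1.
    let maxDelta = 0;
    for (let c = 0; c < 4; c++) {
      const delta = Math.abs(baseline.data[i + c] - actual.data[i + c]) / 255;
      if (delta > maxDelta) maxDelta = delta;
    }
    if (maxDelta > options.threshold) diffPixels++;
  }
  return { diffPixels, pass: diffPixels <= options.maxDiffPixels };
}

// Builds a single-colour image so the example needs no image files.
function solid(
  width: number,
  height: number,
  rgba: [number, number, number, number],
): Snapshot {
  const data = new Uint8Array(width * height * 4);
  for (let i = 0; i < data.length; i += 4) data.set(rgba, i);
  return { width, height, data };
}

const baseline = solid(10, 10, [255, 255, 255, 255]);
const candidate = solid(10, 10, [255, 255, 255, 255]);
candidate.data.set([0, 0, 0, 255], 0); // one pixel turns black

const strict = compareSnapshots(baseline, candidate, { threshold: 0.2, maxDiffPixels: 0 });
const tolerant = compareSnapshots(baseline, candidate, { threshold: 0.2, maxDiffPixels: 100 });
console.log(strict.diffPixels, strict.pass); // 1 false
console.log(tolerant.pass);                  // true
```

The two options do different jobs: `threshold` decides whether a single pixel counts as changed, and `maxDiffPixels` decides how many changed pixels the build tolerates.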
3 The Analogy — Think Of It Like...
Visual regression testing is a before/after photo comparison for your UI.
You take a photo of how the app looks when it is correct. Every time you ship, you take a new photo and compare pixel-by-pixel. Any unexpected difference is flagged. It's like an insurance photo of your house before you rent it out — you have evidence of what it looked like, and you can prove what changed.
The key word is unexpected. When a designer intentionally changes a colour, you update the baseline photo. The tool doesn't know whether a change is intentional — that judgement belongs to the team.
4 Watch Me Do It — Step by Step
Playwright has built-in visual comparison via toHaveScreenshot(). No extra dependencies. Here is how to use it on a NZ SaaS dashboard.
import { test, expect } from '@playwright/test';

test('dashboard looks correct', async ({ page }) => {
  await page.goto('/dashboard');
  // Wait for dynamic content to settle before snapshotting
  await page.waitForLoadState('networkidle');
  // Mask dynamic elements (timestamps, avatars) so they don't cause false failures
  await expect(page).toHaveScreenshot('dashboard.png', {
    maxDiffPixels: 100, // allow up to 100 differing pixels (anti-aliasing tolerance)
    threshold: 0.2, // per-pixel colour tolerance on a 0 to 1 scale (0.2 is Playwright's default)
    mask: [
      page.locator('[data-testid="last-login-time"]'),
      page.locator('[data-testid="user-avatar"]'),
    ],
  });
});

test('login page on mobile (iPhone 14 viewport)', async ({ page }) => {
  await page.setViewportSize({ width: 390, height: 844 });
  await page.goto('/login');
  await page.waitForLoadState('networkidle');
  await expect(page).toHaveScreenshot('login-mobile.png');
});
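// playwright.config.ts
import { defineConfig } from '@playwright/test';

// Store snapshots under the test directory, keyed by test file and project name
// (add {platform} to the template if you also need one baseline per OS)
export default defineConfig({
  snapshotPathTemplate: '{testDir}/__snapshots__/{testFilePath}/{arg}-{projectName}{ext}',
  expect: {
    toHaveScreenshot: {
      maxDiffPixels: 100,
      threshold: 0.2,
    },
  },
});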
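To see where a baseline file ends up, it can help to expand the template by hand. The sketch below imitates the token substitution with a hypothetical helper, `expandSnapshotTemplate`. It is not Playwright's own resolver, and the token values (a `tests` directory, a `visual.spec.ts` file, a `chromium` project) are assumptions chosen for illustration.

```typescript
// Hypothetical helper: replaces {token} placeholders with supplied values.
// Unknown tokens are left untouched so mistakes stay visible in the output.
function expandSnapshotTemplate(
  template: string,
  tokens: Record<string, string>,
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(tokens, key) ? tokens[key] : match,
  );
}

// Assumed values for a call like toHaveScreenshot('dashboard.png').
const tokens: Record<string, string> = {
  testDir: 'tests',
  testFilePath: 'visual.spec.ts',
  arg: 'dashboard',   // snapshot name without its extension
  projectName: 'chromium',
  ext: '.png',        // extension including the leading dot
  platform: 'linux',
};

const template = '{testDir}/__snapshots__/{testFilePath}/{arg}-{projectName}{ext}';
console.log(expandSnapshotTemplate(template, tokens));
// tests/__snapshots__/visual.spec.ts/dashboard-chromium.png

// Variant with one baseline per operating system.
const perPlatform = '{testDir}/__snapshots__/{testFilePath}/{arg}-{projectName}-{platform}{ext}';
console.log(expandSnapshotTemplate(perPlatform, tokens));
// tests/__snapshots__/visual.spec.ts/dashboard-chromium-linux.png
```

Because the first template has no `{platform}` token, a Mac and a Linux CI runner would read and write the same baseline file, which only works if every run happens in the same environment.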
# First run: creates baseline screenshots (commit these to the repo)
npx playwright test --update-snapshots

# Subsequent runs: fails if pixels differ beyond threshold
npx playwright test

# Update baselines for a specific test file only
npx playwright test visual.spec.ts --update-snapshots
For team-based review workflows, Percy (by BrowserStack) and Chromatic (from the Storybook maintainers) provide a web dashboard where developers and designers can approve or reject visual diffs as part of PR review. Percy integrates with Playwright in a few lines:
import { test } from '@playwright/test';
import { percySnapshot } from '@percy/playwright';

// Run through the Percy CLI so snapshots are uploaded:
//   npx percy exec -- npx playwright test
test('dashboard visual review', async ({ page }) => {
  await page.goto('/dashboard');
  await page.waitForLoadState('networkidle');
  // Sends snapshot to Percy dashboard for team approval
  await percySnapshot(page, 'Dashboard (desktop)');
});
5 Tool Comparison — Which Visual Testing Approach?
| Tool | Best for | Cost | Team review UI? |
|---|---|---|---|
| Playwright snapshots | Solo developers and small teams, CI gatekeeping, no dashboard needed | Free | No: diffs appear in the test report |
| Percy | Teams that want designers to approve diffs in a PR | Free tier, paid for volume | Yes: web-based diff UI |
| Chromatic | Storybook component libraries; component-level visual testing | Free tier, paid for volume | Yes: integrated with Storybook |
| BackstopJS | Full-page regression on static or server-rendered sites | Free, open source | HTML report, no approval flow |
A common path for mid-sized NZ teams is to start with Playwright snapshots committed to the repo, then graduate to Percy when the design team wants to be involved in approvals. Chromatic is the right choice if you already have a Storybook component library.
6 Common Mistakes — Don't Do This
🚫 Not masking dynamic content
Timestamps, "last seen" labels, user avatars, advertisements, and live data all change between test runs. If you don't mask them, every single visual test becomes flaky — failing not because of a real regression, but because a timestamp ticked over. Use mask: [page.locator('[data-testid="timestamp"]')] in Playwright, or Percy's percyCSS option to hide them.
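Why does masking remove the flakiness? A mask paints the same solid box over a region in every screenshot, so whatever changed underneath never reaches the comparison. The sketch below models that on raw RGBA buffers. `Rect`, `maskRegions`, and `sameBytes` are hypothetical names for this illustration, not part of Playwright or Percy, and the pink fill simply mirrors Playwright's default mask colour.

```typescript
// Hypothetical types and helpers, for illustration only.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Returns a copy of the image with every rectangle painted one solid colour.
function maskRegions(
  data: Uint8Array,
  imageWidth: number,
  rects: Rect[],
  rgba: [number, number, number, number] = [255, 0, 255, 255],
): Uint8Array {
  const imageHeight = data.length / 4 / imageWidth;
  const masked = Uint8Array.from(data);
  for (const rect of rects) {
    // Clamp the rectangle to the image bounds.
    const xEnd = Math.min(rect.x + rect.width, imageWidth);
    const yEnd = Math.min(rect.y + rect.height, imageHeight);
    for (let y = Math.max(rect.y, 0); y < yEnd; y++) {
      for (let x = Math.max(rect.x, 0); x < xEnd; x++) {
        masked.set(rgba, (y * imageWidth + x) * 4);
      }
    }
  }
  return masked;
}

function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

// Two 4x2 "screenshots" that differ only where a timestamp is drawn.
const shotWidth = 4;
const mondayShot = new Uint8Array(shotWidth * 2 * 4).fill(200);
const tuesdayShot = Uint8Array.from(mondayShot);
tuesdayShot.set([0, 0, 0, 255], 4); // pixel (1, 0) changed

const timestampArea: Rect = { x: 1, y: 0, width: 1, height: 1 };
const maskedMonday = maskRegions(mondayShot, shotWidth, [timestampArea]);
const maskedTuesday = maskRegions(tuesdayShot, shotWidth, [timestampArea]);

console.log(sameBytes(mondayShot, tuesdayShot));     // false
console.log(sameBytes(maskedMonday, maskedTuesday)); // true
```

The trade-off is that a masked region is no longer tested at all, so keep masks as small as the dynamic element itself.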
🚫 Running visual tests on a different OS to the baseline
Fonts render differently on macOS, Windows, and Linux — even at the same pixel dimensions. If your developer generated the baseline on a Mac and CI runs on Ubuntu, you'll see hundreds of false failures from sub-pixel differences. Pin your test environment to one OS using Docker. All baseline screenshots should be generated inside the same container that CI uses.
🚫 Setting threshold to 0% and failing on anti-aliasing
Setting threshold: 0 or maxDiffPixels: 0 demands an exact match, so tests tend to fail on anti-aliasing and subpixel rendering noise rather than on real regressions, particularly when the GPU, driver, or browser build varies between runs. Start with maxDiffPixels: 100 and threshold: 0.2 and tighten only if you need pixel-perfect accuracy on a specific critical component. Aim for stability first, precision second.
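One thing to keep in mind when tuning: a fixed budget such as maxDiffPixels: 100 is far looser on a small component screenshot than on a full page. The sketch below works through the arithmetic with a hypothetical helper, `diffBudgetShare`. The 1280x720 desktop size and the 200x50 badge are assumed sizes for illustration. If the share matters more to you than the absolute count, Playwright also accepts a `maxDiffPixelRatio` option.

```typescript
// Hypothetical helper: what fraction of a screenshot may differ
// under an absolute pixel budget.
function diffBudgetShare(
  width: number,
  height: number,
  maxDiffPixels: number,
): number {
  if (width <= 0 || height <= 0) {
    throw new RangeError('width and height must be positive');
  }
  return maxDiffPixels / (width * height);
}

const screenshots = [
  { label: 'desktop page, 1280x720 (assumed size)', width: 1280, height: 720 },
  { label: 'mobile login, 390x844', width: 390, height: 844 },
  { label: 'badge component, 200x50 (hypothetical)', width: 200, height: 50 },
];

for (const shot of screenshots) {
  const share = diffBudgetShare(shot.width, shot.height, 100);
  console.log(`${shot.label}: ${(share * 100).toFixed(3)}% of pixels may differ`);
}
```

On the badge, 100 pixels is a full 1% of the image, enough to hide a missing border, so component-level screenshots usually deserve a tighter budget than full-page ones.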
7 Now You Try — Prompt Lab
You have a Playwright test suite with 50 functional tests. Write a plan for adding visual regression coverage. Your plan should address:
- Which pages to cover first and why
- How to handle dynamic content (timestamps, user data, live charts)
- How to integrate baseline updates into the PR workflow
8 Self-Check — Can You Actually Do This?
Try answering each question before reading the answer. Three from three means you're ready for the interview round.
Q1. What is a baseline snapshot and what happens on the first vs subsequent test runs?
A baseline snapshot is a known-good screenshot of a page or component, captured when the UI is in a correct state and committed to version control. On the first run, Playwright (or Percy) creates the baseline. On subsequent runs, the tool takes a new screenshot and compares it pixel-by-pixel against the baseline. Any difference above the configured threshold fails the test. To accept an intentional UI change, you update the baseline by running with --update-snapshots and committing the new files.
Q2. What are two strategies for preventing dynamic content from causing false visual test failures?
(1) Masking: use mask: [page.locator(...)] in Playwright to cover elements that change between runs (timestamps, avatars, ads) with a solid-colour box before the comparison. (2) Stabilising: wait for networkidle, hide loading spinners, or use page.clock.setFixedTime() to freeze dates so dynamic content is consistent between runs. Percy also supports injecting CSS to hide elements via percyCSS.
Q3. Why should visual regression tests run inside a Docker container in CI?
Font rendering, sub-pixel anti-aliasing, and GPU compositing differ across operating systems. A baseline generated on macOS will produce pixel differences when compared to a screenshot taken on Ubuntu, even though the UI is identical. Running both baseline creation and CI comparisons inside the same pinned Docker image (e.g., mcr.microsoft.com/playwright:v1.x.x-focal) ensures the rendering environment is identical, eliminating OS-related false failures.
9 Interview Prep — What They'll Ask
Q1. "How do you prevent visual regression tests from becoming flaky?"
Three main levers: (1) mask dynamic content so changing data doesn't cause spurious diffs; (2) pin the test environment to a fixed OS and browser version via Docker so rendering is deterministic; (3) set a sensible pixel tolerance rather than 0% to absorb anti-aliasing variation. In practice I also wait for networkidle before snapshotting, disable CSS animations for the test run, and ensure the viewport is fixed. Flakiness in visual tests almost always comes from one of these — find the dynamic element and mask it.
Q2. "What would you do if every PR starts failing visual regression because designers make frequent intentional UI changes?"
This is a workflow problem, not a tooling problem. If designers are making frequent intentional changes, the baseline update process needs to be fast and low-friction. I'd switch from Playwright snapshots (where a developer has to run --update-snapshots and commit files) to Percy or Chromatic — tools where designers can approve visual diffs directly in the PR without needing to touch the code. I'd also scope visual tests to components most likely to regress accidentally (shared nav, layout containers) and exclude rapidly-iterating marketing pages. The goal is catching unintentional regressions, not blocking intentional design work.
10 Next Step
You now have the tools to catch both functional regressions and presentation-layer regressions. The next step is practice — apply both skills to realistic exercises in the Mid-Level practice section.