Mid-Level · Automation Engineer

Visual Regression Testing

Catch layout shifts, colour changes, and rendering regressions automatically — before users do. Learn Playwright's built-in screenshot comparison and Percy/Chromatic for baseline-based visual diffing.

Mid-Level ISTQB CTAL-TAE v2.0 — K3 Apply ~13 min read + exercise

1 The Hook — Why This Matters

A Wellington SaaS company pushes a CSS refactor late on a Friday afternoon. The PR passes all 230 functional tests — every button still clicks, every form still submits, every API call still returns 200. The team merges with confidence.

By Monday morning, a client in Sydney has emailed: the sidebar font is smaller than before, the hero image overflows on mobile, and the card borders on the dashboard have disappeared entirely. Three separate visual regressions, all introduced by one CSS refactor, all invisible to functional tests.

Functional tests verify that the app works. Visual regression tests verify that the app looks correct. These are not the same thing. A button can function perfectly while being rendered completely off the visible viewport. A form can submit successfully while the submit button has zero contrast. Visual regression tests are the safety net that catches the gap.

2 The Rule — The One-Sentence Version

Visual regression tests capture a known-good baseline snapshot and fail when pixels deviate beyond a threshold. Run them in CI on every PR that touches CSS or shared components.

The first run creates the baseline. Every subsequent run compares pixel-by-pixel against it. Deviations above your configured threshold fail the build — giving the team an explicit choice: accept the change (update the baseline) or fix the regression.
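
To make the rule concrete, here is a minimal sketch of what "compare against the baseline and fail above a threshold" means. It is deliberately simplified and is not Playwright's actual comparison algorithm: the colour distance below is a plain normalised channel difference, and the names Rgba, pixelDelta, and compareToBaseline are illustrative assumptions rather than part of any library.

```typescript
// Simplified sketch of baseline comparison. NOT Playwright's real algorithm.
type Rgba = readonly [number, number, number, number];

interface DiffOptions {
  threshold: number;     // per-pixel tolerance, 0 (strict) to 1 (lax)
  maxDiffPixels: number; // how many differing pixels are tolerated overall
}

// Normalised colour distance between two pixels, in the range 0..1.
function pixelDelta(a: Rgba, b: Rgba): number {
  const sum =
    Math.abs(a[0] - b[0]) +
    Math.abs(a[1] - b[1]) +
    Math.abs(a[2] - b[2]) +
    Math.abs(a[3] - b[3]);
  return sum / (4 * 255);
}

function compareToBaseline(
  baseline: readonly Rgba[],
  actual: readonly Rgba[],
  options: DiffOptions,
): { diffPixels: number; pass: boolean } {
  if (baseline.length !== actual.length) {
    // A size change always fails: there is nothing to compare pixel by pixel.
    return { diffPixels: Math.max(baseline.length, actual.length), pass: false };
  }
  let diffPixels = 0;
  for (let i = 0; i < baseline.length; i++) {
    // A pixel only counts as different when it exceeds the per-pixel tolerance.
    if (pixelDelta(baseline[i], actual[i]) > options.threshold) diffPixels++;
  }
  // The build fails only when too many pixels differ.
  return { diffPixels, pass: diffPixels <= options.maxDiffPixels };
}
```

The two knobs do different jobs: threshold decides whether a single pixel counts as changed, and maxDiffPixels decides how many changed pixels you are willing to accept before the test fails.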

3 The Analogy — Think Of It Like...

Analogy

Visual regression testing is a before/after photo comparison for your UI.

You take a photo of how the app looks when it is correct. Every time you ship, you take a new photo and compare pixel-by-pixel. Any unexpected difference is flagged. It's like an insurance photo of your house before you rent it out — you have evidence of what it looked like, and you can prove what changed.

The key word is unexpected. When a designer intentionally changes a colour, you update the baseline photo. The tool doesn't know whether a change is intentional — that judgement belongs to the team.

Senior engineer insight

The moment that changed how I think about visual testing was watching a NZ bank's internet banking redesign ship a perfectly functional build where the account balance figure had dropped behind the card overlay — completely hidden on 375px viewports, still correctly reading from the API. Functional tests gave it a green tick. What you actually test with visual regression is the contract between your CSS and your users' screens, which no assertion on a DOM attribute can ever cover. Once I understood that framing, I stopped treating visual tests as optional polish and started treating them as a first-class acceptance gate alongside functional tests.

The most common mistake: teams generate the baseline on a developer's MacBook, then wonder why CI (running Ubuntu) produces thousands of false failures. Pixel-perfect consistency across machines requires every screenshot, baseline and comparison alike, to be produced inside the same pinned Docker image.

From the field

A Wellington council team I worked with assumed their Playwright visual suite was clean because it had been green for six weeks. What nobody had noticed was that every screenshot in the suite was taken at the loading spinner frame — waitForLoadState('domcontentloaded') was firing before the skeleton screens resolved, so the "golden baseline" was literally a page full of grey rectangles. The tests kept passing because the grey rectangles always matched the grey rectangles. The actual dashboard had been drifting for weeks. When we switched to networkidle and added an explicit wait on the main data table's locator, we discovered four layout regressions that had accumulated since the last design sprint — including a date-picker widget that was rendering off-screen on 1280px viewports. The lesson that generalises: a green visual regression suite that has never been validated against a known real regression is not a safety net; it's a false sense of security. Inject a deliberate regression, confirm it fails, then call it trustworthy.
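
The fix described in that story can be sketched as a small readiness helper that runs before every screenshot. This is a sketch under assumptions: the two data-testid selectors are hypothetical, and the ReadyLocator and ReadyPage interfaces exist only so the snippet stands alone. Playwright's own Page and Locator objects have methods with these shapes, so a real page should be passable as-is.

```typescript
// Minimal structural types so the sketch is self-contained.
interface ReadyLocator {
  waitFor(options: { state: 'visible' | 'hidden' }): Promise<void>;
}
interface ReadyPage {
  waitForLoadState(state: 'networkidle'): Promise<void>;
  locator(selector: string): ReadyLocator;
}

// Hypothetical selectors: replace with the ones your app actually uses.
// Each is assumed to match a single element (for example a wrapper).
const SKELETON = '[data-testid="skeleton"]';
const DATA_TABLE = '[data-testid="main-data-table"]';

async function waitForDashboardReady(page: ReadyPage): Promise<void> {
  // Let in-flight requests finish first.
  await page.waitForLoadState('networkidle');
  // Then insist the loading placeholder is gone...
  await page.locator(SKELETON).waitFor({ state: 'hidden' });
  // ...and that the real content is on screen before any screenshot is taken.
  await page.locator(DATA_TABLE).waitFor({ state: 'visible' });
}
```

Calling a helper like this before toHaveScreenshot() makes the "page is ready" condition explicit. It does not replace the final check from the story: break something on purpose and confirm the suite goes red.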

4 Watch Me Do It — Step by Step

Playwright has built-in visual comparison via toHaveScreenshot(). No extra dependencies. Here is how to use it on a NZ SaaS dashboard.

TypeScript / Playwright — Built-in screenshot comparison

import { test, expect } from '@playwright/test';

test('dashboard looks correct', async ({ page }) => {
  await page.goto('/dashboard');

  // Wait for dynamic content to settle before snapshotting
  await page.waitForLoadState('networkidle');

  // Mask dynamic elements (timestamps, avatars) so they don't cause false failures
  await expect(page).toHaveScreenshot('dashboard.png', {
    maxDiffPixels: 100,  // allow up to 100 differing pixels (anti-aliasing tolerance)
    threshold: 0.2,      // per-pixel colour tolerance: 0 is strict, 1 is lax
    mask: [
      page.locator('[data-testid="last-login-time"]'),
      page.locator('[data-testid="user-avatar"]'),
    ],
  });
});

test('login page on mobile — iPhone 14', async ({ page }) => {
  await page.setViewportSize({ width: 390, height: 844 });
  await page.goto('/login');
  await page.waitForLoadState('networkidle');
  await expect(page).toHaveScreenshot('login-mobile.png');
});

Playwright config — playwright.config.ts

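import { defineConfig } from '@playwright/test';

// Store snapshots under the test directory, keyed by test file and project (browser) name.
// There is no {platform} token here: baselines are expected to come from one pinned OS image.
export default defineConfig({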
  snapshotPathTemplate: '{testDir}/__snapshots__/{testFilePath}/{arg}-{projectName}{ext}',
  expect: {
    toHaveScreenshot: {
      maxDiffPixels: 100,
      threshold: 0.2,
    },
  },
});

CLI — Updating baselines after an intentional UI change

# First run: creates baseline screenshots (commit these to the repo)
npx playwright test --update-snapshots

# Subsequent runs: fails if pixels differ beyond threshold
npx playwright test

# Update baselines for a specific test file only
npx playwright test visual.spec.ts --update-snapshots

For team-based review workflows, Percy (by BrowserStack) and Chromatic (from the Storybook maintainers) provide a web dashboard where developers and designers can approve or reject visual diffs in a PR review. Percy integrates with Playwright in a few lines:

TypeScript / Percy — Team visual review workflow

import { test } from '@playwright/test';
import percySnapshot from '@percy/playwright';

test('dashboard visual review', async ({ page }) => {
  await page.goto('/dashboard');
  await page.waitForLoadState('networkidle');
  // Sends snapshot to Percy dashboard for team approval
  await percySnapshot(page, 'Dashboard — desktop');
});

Pro tip: Always run visual regression tests inside a Docker container with a pinned OS and browser version. Fonts render slightly differently on macOS vs Linux — if your CI runs on Linux but developers run tests locally on macOS, you'll see constant false failures from sub-pixel font rendering differences.

5 Tool Comparison — Which Visual Testing Approach?

Tool	Best for	Cost	Team review UI?
Playwright snapshots	Solo developers and small teams, CI gatekeeping, no dashboard needed	Free	No — diffs in terminal/report
Percy	Teams that want designers to approve diffs in a PR	Free tier, paid for volume	Yes — web-based diff UI
Chromatic	Storybook component libraries; component-level visual testing	Free tier, paid for volume	Yes — integrated with Storybook
BackstopJS	Full-page regression on static or server-rendered sites	Free, open source	HTML report, no approval flow

In NZ, most mid-sized teams start with Playwright snapshots committed to the repo, then graduate to Percy when the design team wants to be involved in approvals. Chromatic is the right choice if you already have a Storybook component library.

6 Common Mistakes — Don't Do This

🚫 Not masking dynamic content

Timestamps, "last seen" labels, user avatars, advertisements, and live data all change between test runs. If you don't mask them, any test that captures them becomes flaky: it fails not because of a real regression, but because a timestamp ticked over. Use mask: [page.locator('[data-testid="timestamp"]')] in Playwright, or Percy's percyCSS option to hide them.
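
If you hide elements with CSS rather than masking them, it helps to keep the list of volatile selectors in one place. The helper below is a hypothetical sketch, not part of Percy or Playwright: buildHideCss and the .ad-slot selector are assumptions made for illustration.

```typescript
// Builds a CSS snippet that hides volatile elements without collapsing layout.
function buildHideCss(selectors: readonly string[]): string {
  if (selectors.length === 0) return '';
  // visibility: hidden keeps each element's box, so surrounding layout is unchanged.
  return `${selectors.join(',\n')} {\n  visibility: hidden !important;\n}`;
}

// Hypothetical selector list: adjust to the dynamic elements in your own app.
const volatileSelectors = [
  '[data-testid="timestamp"]',
  '[data-testid="user-avatar"]',
  '.ad-slot',
];

const hideCss = buildHideCss(volatileSelectors);
```

The resulting string can be passed as the percyCSS snapshot option, or injected with page.addStyleTag({ content: hideCss }) before a Playwright screenshot. Using visibility rather than display: none is a deliberate choice here, because removing the element from layout would shift everything around it and create a diff of its own.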

🚫 Running visual tests on a different OS to the baseline

Fonts render differently on macOS, Windows, and Linux — even at the same pixel dimensions. If your developer generated the baseline on a Mac and CI runs on Ubuntu, you'll see hundreds of false failures from sub-pixel differences. Pin your test environment to one OS using Docker. All baseline screenshots should be generated inside the same container that CI uses.
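
One way to enforce this, sketched under assumptions, is a small guard that refuses to regenerate baselines outside the pinned container. Everything here is hypothetical: the BaselineEnv shape, the canUpdateBaselines name, and the idea of an insideContainer flag are illustrations, not features of Playwright.

```typescript
interface BaselineEnv {
  platform: string;         // for example the value of process.platform
  insideContainer: boolean; // hypothetical flag, e.g. set by an env var in your Docker image
}

// Decides whether it is safe to regenerate baseline screenshots in this environment.
function canUpdateBaselines(env: BaselineEnv): { ok: boolean; reason: string } {
  if (env.platform !== 'linux') {
    return { ok: false, reason: `baselines must come from the Linux CI image, not ${env.platform}` };
  }
  if (!env.insideContainer) {
    return { ok: false, reason: 'run inside the pinned Docker image before updating baselines' };
  }
  return { ok: true, reason: 'running in the pinned container' };
}
```

A wrapper script around --update-snapshots could call this with process.platform and an environment variable of your choosing, then exit early with the reason when ok is false. That keeps macOS-rendered baselines from ever being committed.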

🚫 Setting threshold to 0% and failing on anti-aliasing

Setting threshold: 0 or maxDiffPixels: 0 leaves no tolerance, so tests can fail on GPU anti-aliasing and subpixel rendering differences rather than on real regressions. Start with maxDiffPixels: 100 and threshold: 0.2, and tighten only if you need pixel-perfect accuracy on a specific critical component. Aim for stability first, precision second.

7 Now You Try — Prompt Lab

📸 Exercise — Visual Regression Coverage Plan

You have a Playwright test suite with 50 functional tests. Write a plan for adding visual regression coverage. Your plan should address:

Which pages to cover first and why
How to handle dynamic content (timestamps, user data, live charts)
How to integrate baseline updates into the PR workflow

Why teams fail here

Baseline rot: developers run --update-snapshots to fix a failing CI build without reviewing the actual diff, silently accepting regressions as the new normal — within a few sprints the baseline no longer reflects design intent.
Scope creep on snapshot coverage: teams screenshot every page in the app including high-churn marketing pages, then spend more time approving legitimate design changes than they would have spent manually checking the UI — the suite becomes a friction tax that engineers route around.
No CI environment parity: Percy or Chromatic is wired up, but local developers still run snapshots natively on their machines and commit OS-specific baselines to the repo, creating a two-baseline collision that makes diffs unreadable and PR reviews meaningless.
Dynamic data not stabilised: content from live APIs — transaction lists on a banking portal, patient counts in a health dashboard, road event feeds for a TransitNZ integration — changes between runs, turning every test into a lottery rather than a regression gate.
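
The last failure mode, unstabilised dynamic data, can often be tackled at the source by freezing the clock and serving fixed data to the page. The sketch below assumes a hypothetical /api/transactions endpoint and fixture, and uses minimal structural interfaces (StubRoute, StubbablePage) so it stands alone. Playwright's page.clock.setFixedTime(), page.route(), and route.fulfill({ json }) have these shapes, so a real page should be passable directly.

```typescript
// Minimal structural types so the sketch is self-contained.
interface StubRoute {
  fulfill(response: { json: unknown }): Promise<void>;
}
interface StubbablePage {
  clock: { setFixedTime(time: Date): Promise<void> };
  route(url: string, handler: (route: StubRoute) => Promise<void>): Promise<void>;
}

// Hypothetical endpoint, date, and fixture data for illustration only.
const TRANSACTIONS_URL = '**/api/transactions';
const FIXED_NOW = new Date('2024-01-15T09:00:00Z');
const FIXED_TRANSACTIONS = [
  { id: 'tx-1', description: 'Coffee', amountCents: -550 },
  { id: 'tx-2', description: 'Salary', amountCents: 420000 },
];

// Call this BEFORE page.goto() so the first render already sees stable data.
async function stabiliseDashboard(page: StubbablePage): Promise<void> {
  // Freeze "now" so relative dates and timestamps render identically on every run.
  await page.clock.setFixedTime(FIXED_NOW);
  // Answer the live API call with the same fixture on every run.
  await page.route(TRANSACTIONS_URL, (route) => route.fulfill({ json: FIXED_TRANSACTIONS }));
}
```

Stubbing is a trade-off: the screenshot no longer exercises the real API, which is acceptable here because the visual test is only checking presentation. Keep masking for elements you cannot control, such as third-party ads.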

Key takeaway

Visual regression testing is not about pixel perfection — it is about making the decision to accept a visual change explicit and deliberate rather than silent and accidental.

8 Self-Check — Can You Actually Do This?

Try each question before reading the answer beneath it. Three from three means you're ready for the interview round.

Q1. What is a baseline snapshot and what happens on the first vs subsequent test runs?

A baseline snapshot is a known-good screenshot of a page or component, captured when the UI is in a correct state and committed to version control. On the first run, Playwright (or Percy) creates the baseline. On subsequent runs, the tool takes a new screenshot and compares it pixel-by-pixel against the baseline. Any difference above the configured threshold fails the test. To accept an intentional UI change, you update the baseline by running with --update-snapshots and committing the new files.

Q2. What are two strategies for preventing dynamic content from causing false visual test failures?

(1) Masking: use mask: [page.locator(...)] in Playwright to black out elements that change between runs (timestamps, avatars, ads) before taking the screenshot. (2) Stabilising: wait for networkidle, hide loading spinners, or use page.clock.setFixedTime() to freeze dates so dynamic content is consistent between runs. Percy also supports injecting CSS to hide elements via percyCSS.

Q3. Why should visual regression tests run inside a Docker container in CI?

Font rendering, sub-pixel anti-aliasing, and GPU compositing differ across operating systems. A baseline generated on macOS will produce pixel differences when compared to a screenshot taken on Ubuntu, even though the UI is identical. Running both baseline creation and CI comparisons inside the same pinned Docker image (e.g., mcr.microsoft.com/playwright:v1.x.x-focal) ensures the rendering environment is identical, eliminating OS-related false failures.

9 Interview Prep — What They'll Ask

Q1. "How do you prevent visual regression tests from becoming flaky?"

Three main levers: (1) mask dynamic content so changing data doesn't cause spurious diffs; (2) pin the test environment to a fixed OS and browser version via Docker so rendering is deterministic; (3) set a sensible pixel tolerance rather than 0% to absorb anti-aliasing variation. In practice I also wait for networkidle before snapshotting, disable CSS animations for the test run, and ensure the viewport is fixed. Flakiness in visual tests almost always comes from one of these — find the dynamic element and mask it.

Q2. "What would you do if every PR starts failing visual regression because designers make frequent intentional UI changes?"

This is a workflow problem, not a tooling problem. If designers are making frequent intentional changes, the baseline update process needs to be fast and low-friction. I'd switch from Playwright snapshots (where a developer has to run --update-snapshots and commit files) to Percy or Chromatic — tools where designers can approve visual diffs directly in the PR without needing to touch the code. I'd also scope visual tests to components most likely to regress accidentally (shared nav, layout containers) and exclude rapidly-iterating marketing pages. The goal is catching unintentional regressions, not blocking intentional design work.

10 Next Step

You now have the tools to catch both functional regressions and presentation-layer regressions. The next step is practice — apply both skills to realistic exercises in the Mid-Level practice section.
