Mid-Level · Automation Engineer

Flaky Test Management

A flaky test is one that fails sometimes and passes other times without any code change. Left unmanaged, flaky tests destroy team confidence in the test suite — people start ignoring red builds. This module teaches you to detect, quarantine, root-cause, and fix flaky tests systematically.

Mid-Level Senior SDET ~20 min read · ~45 min with exercises

1 The Hook

A Wellington SaaS team has 400 Playwright tests. CI is green 60% of the time. Developers have learned to re-run failing builds until they go green. Nobody has time to investigate — it’s just how it is.

A new SDET joins and pulls the retry data from the last two weeks. She finds 23 tests with a retry rate above 40%. She tags them all as quarantined, moves them to a separate suite, and runs the main pipeline without them. Suddenly CI is green 97% of the time.

The 23 quarantined tests get fixed one by one over the next sprint. The team stops re-running builds. They start trusting red as red again. The process that got them there: detect, quarantine, root-cause, fix. In that order. Never skip quarantine.

2 The Rule

A flaky test is not a test. It is a random number generator wearing a test’s clothing. Quarantine it immediately. A failing test tells you something. A flaky test teaches you to ignore failures — and that is more dangerous than having no tests at all.

3 The Analogy

Analogy

A car alarm that goes off randomly.

At first people check — it might be a break-in. After the tenth false alarm, nobody looks. The next time it goes off because someone actually broke in, everyone ignores it. The alarm hasn’t become useless because it’s broken. It became useless because it’s unreliable, and an unreliable alarm trains people to ignore it.

Flaky tests train developers to ignore failures. That is more dangerous than having no tests. Quarantine the false alarm. Fix it. Keep the alarm worth trusting.

Senior engineer insight

The moment I stopped treating flaky tests as a test problem and started treating them as a production reliability signal, everything changed. A test that fails under parallel load is often revealing a genuine race condition in the application, not just a slow CI runner. On one Harbour Bank integration suite, three "flaky" Playwright tests turned out to be correctly detecting a session-token collision bug that only surfaced under concurrent login load. The tests weren't broken; they were the only thing catching a real defect.

The most common mistake: enabling retries: 3 in CI config and calling the problem solved — now you've hidden the noise and the signal at the same time.

From the field

A TransitNZ portal team had a Playwright suite that sat at around 70% CI reliability for six months. The assumption was that their Jenkins runners were just underpowered: a hardware problem, not a test problem. A new QA lead pulled the raw retry logs and built a simple spreadsheet: test name, retry rate, last failure message. Immediately obvious: eight tests were responsible for 80% of all failures, and every one of them was hitting a shared test user account that the suite created once in a beforeAll hook. When two test workers ran concurrently, they'd race to modify that account's profile data and one would find it in the wrong state. Switching to per-test unique user creation via the API fixed all eight. The lesson that generalises: don't blame the infrastructure until you've ruled out shared state, because it's almost always shared state.

4 Watch Me Do It

The flaky test process has four steps. Run them in order every time.

Step 1: Detect — Add retry tracking to your Playwright config and surface flaky tests as a separate metric:

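// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['html'],
    ['json', { outputFile: 'test-results/results.json' }],
  ],
  // Retry only in CI, so retried-but-passed tests are reported as flaky
  retries: process.env.CI ? 2 : 0,
});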

After each CI run, identify tests that passed only after a retry — those are your flaky candidates:

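# List tests that passed only after a retry (Playwright marks them "flaky" in the JSON report).
# Reads the file written by the JSON reporter configured above; the recursive search also covers tests inside describe blocks.
jq -r '.. | objects | select(has("tests")) | select(any(.tests[]; .status == "flaky")) | .title' test-results/results.json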

Set a threshold: any test with a retry rate above 20% over two weeks goes on the quarantine list.
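
The threshold rule can be sketched as a small function. This is a minimal sketch, assuming you have already collected one record per test per CI run over the two-week window. The `RunRecord` shape, the function name, and the `minRuns` cut-off are all hypothetical and not part of Playwright.

```typescript
// Hypothetical record: one entry per test per CI run, collected over the
// two-week window (for example, extracted from the JSON reports).
interface RunRecord {
  testTitle: string;
  retried: boolean; // true if the test needed at least one retry in that run
}

interface FlakyCandidate {
  testTitle: string;
  runs: number;
  retryRate: number; // fraction of runs that needed a retry, between 0 and 1
}

// Returns tests whose retry rate is above the threshold, worst offender first.
function quarantineCandidates(
  records: RunRecord[],
  threshold = 0.2, // the 20% rule from this module
  minRuns = 5, // assumption: skip tests with too few runs to judge
): FlakyCandidate[] {
  const stats = new Map<string, { runs: number; retried: number }>();
  records.forEach((r) => {
    const s = stats.get(r.testTitle) ?? { runs: 0, retried: 0 };
    s.runs += 1;
    if (r.retried) s.retried += 1;
    stats.set(r.testTitle, s);
  });

  const candidates: FlakyCandidate[] = [];
  stats.forEach((s, testTitle) => {
    const retryRate = s.retried / s.runs;
    if (s.runs >= minRuns && retryRate > threshold) {
      candidates.push({ testTitle, runs: s.runs, retryRate });
    }
  });
  return candidates.sort((a, b) => b.retryRate - a.retryRate);
}
```

Sorting worst-first gives you a ready-made quarantine list for the next step. A test at exactly 20% is not flagged, because the rule says "above 20%".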

Step 2: Quarantine — Skip the test in CI immediately. Do not investigate first — quarantine first, so the rest of the suite stays trustworthy:

// Tag it clearly so it does not get forgotten
test('login redirects correctly', async ({ page }) => {
  // Called inside the test body, test.skip() skips only this test, not the whole file
  test.skip(!!process.env.CI, 'Quarantined: flaky, under investigation BUG-4521');
  ...
});

Raise a ticket for every quarantined test. The ticket is the commitment to fix it. No ticket means it dies quietly.
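
If you would rather keep quarantined tests running in a separate, non-blocking job (as the team in the Hook did), one option is to filter by a tag in the test title. The sketch below is an assumption-laden illustration, not a prescribed setup: the `@quarantine` title tag, the `RUN_QUARANTINED` environment variable, and the `buildConfig` helper are hypothetical names. It relies only on Playwright's `grep` and `grepInvert` config options, which filter tests by title.

```typescript
// Sketch of a quarantine switch for playwright.config.ts.
// Assumptions (illustrative, not from this module): quarantined tests carry
// "@quarantine" in their title, and RUN_QUARANTINED=1 selects the quarantine job.
const QUARANTINE_TAG = /@quarantine/;

type Env = { [name: string]: string | undefined };

interface QuarantineConfig {
  retries: number;
  grep?: RegExp; // run only tests whose title matches
  grepInvert?: RegExp; // run only tests whose title does NOT match
}

function buildConfig(env: Env): QuarantineConfig {
  const config: QuarantineConfig = { retries: env.CI ? 2 : 0 };
  if (env.RUN_QUARANTINED === '1') {
    // Quarantine job: run only the quarantined tests.
    config.grep = QUARANTINE_TAG;
  } else {
    // Main job: run everything except the quarantined tests.
    config.grepInvert = QUARANTINE_TAG;
  }
  return config;
}

// In playwright.config.ts you would finish with:
//   export default buildConfig(process.env);
```

A quarantined test would then be titled something like 'login redirects correctly @quarantine'. The main pipeline runs `npx playwright test` as usual, and a separate job that does not block merges runs the same command with `RUN_QUARANTINED=1` set.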

Step 3: Root cause — The five most common causes of Playwright flakiness:

Race conditions — missing await, relying on timing rather than state
Shared test data — parallel tests fighting over the same database record
Network variance — real external API calls without mocking
Element timing — animation not complete when the interaction fires
Test order dependency — test B relies on state left by test A

Step 4: Fix patterns — The most common fix is replacing timing with state:

// ❌ Relying on time
await page.waitForTimeout(2000);
await page.click('#submit');

// ✅ Wait for state: the button must be enabled before clicking
const submit = page.getByRole('button', { name: 'Submit' });
await expect(submit).toBeEnabled();
await submit.click();

For shared test data, use unique identifiers so parallel tests cannot collide:

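// ✅ Unique email per test: no shared state
// randomUUID() avoids the collision Date.now() alone can produce when two parallel workers start in the same millisecond
import { randomUUID } from 'node:crypto';
const uniqueEmail = `test+${randomUUID()}@resync-test.nz`;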

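Pro tip: Run a suspect test with --repeat-each 50 locally after your fix. Any failure in those 50 runs means the root cause is still there. Fifty passes in a row is strong evidence for a test that used to fail often, but a test that failed only a few percent of the time can pass 50 runs by chance, so increase the repeat count for rare flakes.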
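
How many repeats are enough depends on how often the test failed before the fix. The arithmetic can be sketched as follows, under the simplifying assumption that runs are independent (real CI runs often are not, so treat the numbers as a rough guide). The function names are illustrative.

```typescript
// Probability that a test with per-run failure rate `failureRate` passes
// `runs` times in a row, assuming independent runs.
function chanceOfCleanStreak(failureRate: number, runs: number): number {
  return Math.pow(1 - failureRate, runs);
}

// Smallest number of runs for which a test that still fails at `failureRate`
// would fail at least once with probability >= `confidence`.
function runsNeeded(failureRate: number, confidence = 0.95): number {
  if (failureRate <= 0 || failureRate >= 1) {
    throw new RangeError('failureRate must be between 0 and 1 (exclusive)');
  }
  // Solve (1 - failureRate)^n <= 1 - confidence for n.
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}
```

For a test that used to fail 35% of the time, `runsNeeded(0.35)` is 7, so 50 clean runs is more than enough. For a test that failed 2% of the time, `chanceOfCleanStreak(0.02, 50)` is about 0.36: an unfixed test would still pass all 50 runs roughly one time in three, and `runsNeeded(0.02)` is 149.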

5 When to Use It

Trigger flaky test management when any of these are true:

CI pass rate drops below 95%
Developers routinely re-run failing builds without investigating
Any single test has a retry rate above 20% over two weeks
The team says “oh, just re-run it” about a specific test

Do not wait until it is a crisis. One accepted flaky test normalises the behaviour and becomes 20. The time to quarantine is the moment you detect the first retry pattern, not after the suite has lost the team’s trust.

6 Common Mistakes

🚫 Thinking retries fix flaky tests

What I used to think: Setting retries: 2 in Playwright config makes the flaky test green, so the problem is solved.
Actually: Retries hide flaky tests. They make CI green but leave the root cause in place. Each retry is a debt notice, not a payment. The test is still failing — you just are not seeing it. Worse, some flaky tests are detecting real application bugs, and retries swallow that signal entirely.

🚫 Investigating flaky tests one at a time as they appear

What I used to think: When a test goes flaky, stop and investigate it immediately.
Actually: Triage in batches. Quarantine all flaky tests weekly, then root-cause in priority order. One-by-one investigation as they appear is whack-a-mole — you spend all your time on the noisiest test, not the most impactful one. Weekly batch triage keeps the suite clean and lets you prioritise by retry rate.

🚫 Assuming the test is broken when it flakes

What I used to think: If a test fails sometimes, the test has a bug.
Actually: More often than not, the application has a timing issue, a shared-state bug, or a race condition that the flaky test is correctly detecting — just inconsistently. Investigate the application behaviour, not just the test code. You may find a real defect hiding behind what looks like a test reliability problem.

7 Now You Try

Write your answer, then check it against the model answer.

🧪 Exercise — Fix the Flaky Insurance Test

A Playwright test for a NZ insurance portal checkout flow fails 35% of the time in CI with: TimeoutError: page.click: element is not stable. The test clicks a ‘Submit Claim’ button after filling a form. The button has a loading animation when the form validates. Write the fixed test code and explain the root cause.

Model answer

Root cause:
The button has a loading animation during form validation. Playwright tries to click it while the animation is running — the element is in the DOM and visible but not yet "stable" (it is still moving). This is an element timing / animation race condition. The test does not wait for the button to reach a stable, enabled state before interacting.

Fixed test code:
// Wait for the button to be enabled (animation complete, validation passed)
// Playwright's toBeEnabled() auto-retries until the condition is true
await expect(page.getByRole('button', { name: 'Submit Claim' })).toBeEnabled();
await page.getByRole('button', { name: 'Submit Claim' }).click();

// Alternative: wait for the loading indicator to disappear.
// page.click() already waits for visible + stable + enabled by default,
// but if the element sits inside a CSS animation container that confuses the
// stability check, wait explicitly for the spinner to go away first
// ('form-loading-spinner' is an illustrative test id; use your app's own):
await expect(page.getByTestId('form-loading-spinner')).toBeHidden();
await page.getByRole('button', { name: 'Submit Claim' }).click();

Key point: Never use waitForTimeout here. The fix is to wait for a specific state (button enabled, spinner gone), not a fixed time. This makes the test pass on fast runs AND slow CI runners.

Why teams fail here

No quarantine discipline: Teams investigate flaky tests while leaving them in the main pipeline — so CI stays unreliable during the investigation and developers keep learning to ignore red.
Retries as the fix: Setting retries: 2 or retries: 3 masks the flake rate in dashboards, so leadership thinks the suite is healthy while the root causes compound beneath the surface.
No flakiness tracking over time: Teams notice a flaky test when it blocks a deploy, fix it reactively, but never build the data pipeline (retry rates, trend by test name) that would catch the next ten before they become critical.
Quarantine without tickets: The test gets tagged skip and then quietly lives there for months — no owner, no deadline, no resolution. Quarantine without a linked Jira/Azure DevOps ticket is just permanent deletion with extra steps.
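
The "quarantine without tickets" failure can be caught mechanically. Below is a minimal sketch of a quarantine audit, assuming you keep (or can extract) a list of quarantined tests with their skip reason and quarantine date. The `QuarantinedTest` shape, the Jira-style ticket pattern, and the 30-day limit are all assumptions to adapt to your own tracker and policy.

```typescript
// Hypothetical inventory entry for one quarantined test.
interface QuarantinedTest {
  title: string;
  reason: string; // the skip description, e.g. 'Quarantined: flaky, BUG-4521'
  quarantinedOn: string; // ISO date, e.g. '2024-03-01'
}

// Assumed ticket format: Jira-style keys such as BUG-4521.
const TICKET_PATTERN = /\b[A-Z][A-Z0-9]+-\d+\b/;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns one message per problem: missing ticket, or quarantined too long.
function auditQuarantine(
  tests: QuarantinedTest[],
  today: Date,
  maxAgeDays = 30, // assumption: pick a limit that matches your sprint cadence
): string[] {
  const problems: string[] = [];
  for (const t of tests) {
    if (!TICKET_PATTERN.test(t.reason)) {
      problems.push(`${t.title}: no ticket linked`);
    }
    const ageDays = Math.floor(
      (today.getTime() - new Date(t.quarantinedOn).getTime()) / MS_PER_DAY,
    );
    if (ageDays > maxAgeDays) {
      problems.push(`${t.title}: quarantined for ${ageDays} days`);
    }
  }
  return problems;
}
```

Running a check like this in the weekly triage, and failing it when the list is non-empty, gives every quarantined test an owner and a deadline.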

Key takeaway

A test suite that team members have learned to distrust is more dangerous than no test suite at all — quarantine ruthlessly, fix systematically, and never let a retry config become a substitute for root-cause work.

8 Self-Check

Q1: What is test quarantine and why is it the right first step for a flaky test?

Quarantine means tagging the test to skip in the main CI pipeline while it is under investigation. It is the right first step because it immediately restores the reliability signal of the rest of the suite. A quarantined test does not disappear — it lives in a separate job with a ticket to fix it. The alternative (leaving it in the main suite) trains the whole team to ignore red.

Q2: A test fails only when run in parallel with other tests but passes in isolation. What is the most likely root cause?

Shared test data or shared state. When tests run in parallel they compete for the same database records, users, or configuration. One test modifies data that another test expects to be in its original state. The fix is test data isolation: each test must create its own unique data and clean up after itself.

Q3: Your CI retry rate is 18% across 200 tests. Is this acceptable?

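No. The 20% quarantine threshold applies to individual tests, but a suite-wide retry rate of 18% is a systemic reliability problem in its own right. If the figure means 18% of test executions need a retry, that averages out to roughly 36 retried tests per run across 200 tests. It does not mean 36 different tests are flaky: as the field example shows, a handful of tests usually account for most of the retries. This is well past the point of “acceptable noise”. Run the four-step process: detect which tests are the worst offenders, quarantine them, root-cause, fix. Aim to get CI pass rate (without retries) above 97%.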

9 ISTQB Mapping

CTAL-TAE Section 6.2 — Flaky tests: detection, classification, and management strategies in automated test suites.

CTAL-TA v3.1.2 Section 5.3 — Test execution: retry strategies, their legitimate uses, and the risks of using retries as a substitute for root-cause investigation.

10 Next Steps

Apply the detect-quarantine-root-cause-fix loop to your own suite, then go deeper on related topics:

Async Flakiness & Determinism
Debugging Failures
Test Reporting
