Flakiness Classification Framework

1 The Problem — Why Flakiness Is a Strategic Issue

A test suite with 5% flaky tests sounds manageable. Do the maths on a real CI pipeline and it stops sounding manageable immediately.

1,500

Wasted test runs per day
(1,000 tests × 30 CI runs × 5%)

547,500

Wasted test runs per year

40%

Of re-run "passes" may be masking a real bug

But the hidden cost is worse than the compute bill. When engineers learn that the pipeline can be unblocked by clicking Re-run, they stop investigating. The suite becomes a liar. Real regressions get lost in the noise. You don't find the bug until a customer does.

Flaky tests also compound trust decay. Once engineers mentally classify the suite as unreliable, they start ignoring failures. At that point, your entire automation investment produces negative value — it gives a false sense of coverage while masking active defects.

The solution is not to tolerate flakiness. It is to classify it, track it, and eliminate it by root cause. This lesson gives you the formal system to do that.

2 The Flakiness Classification Taxonomy

Every flaky test has a root cause that fits into one of six categories. Diagnosing the category determines the fix. Treating them the same wastes time — a timing fix will not solve a state pollution problem.

Category 1

Timing Dependency

The test passes or fails based on how long an operation takes. Usually caused by hard-coded waits or assertions that fire before async operations complete.

Detection signal

Failure message says "element not found," "timeout," or "element detached from DOM." Fails consistently on slower CI runners but never on a developer's local machine.

Fix strategy

Replace all sleep() / waitForTimeout() calls with condition-based waits: waitForSelector, waitForResponse, waitForLoadState('networkidle'). Assert on state, not on time.

Common tool

Playwright actionability checks auto-wait. Cypress cy.get() retries. Selenium: use WebDriverWait with ExpectedConditions, never Thread.sleep().

NZ example: An ACC portal test clicks Submit then immediately asserts the confirmation banner. Passes on the developer's fast SSD. Fails on GitHub Actions runners (slower start-up latency) because the banner has not yet rendered when the assertion fires.

Category 2

State Pollution

A test depends on state left behind by a previous test. If the order changes, or the previous test is skipped, this test fails.

Detection signal

Test passes when run in isolation (--grep / fit) but fails in the full suite. The failure order is inconsistent — it depends on which tests ran before it.

Fix strategy

Each test must own its full setup (beforeEach seeds its own data) and teardown (afterEach deletes what it created). Never share mutable objects between tests. Use test-scoped database transactions that roll back after each test.

Code smell

Shared global variables, beforeAll that creates records multiple tests write to, or test files that assume a particular execution order.

NZ example: Test B asserts the count of KiwiSaver members is 5. But Test A (which sometimes runs first) adds a member and does not clean up — so Test B sees 6 and fails.

Category 3

Infrastructure Race Condition

Parallel test workers compete for a shared resource: the same database row, the same test user account, or the same network port.

Detection signal

Tests pass with --workers=1 but fail with --workers=4 or higher. Failure messages mention unique constraint violations, deadlocks, or "address already in use."

Fix strategy

Partition test data per worker (prefix records with the worker ID). Assign each worker a unique test user account and its own database schema or test-database instance. Use random ephemeral ports for local test servers.

Architecture note

This category is often discovered only after scaling CI parallelism. Build for isolation from day one — retrofitting is expensive.

NZ example: IRD integration tests all share the same test tax account. Two parallel workers both trigger a refund on the account simultaneously — one wins, one gets a duplicate-transaction error.

Category 4

Network Flakiness

The test calls a real external API or service that is temporarily unavailable, rate-limited, or returning non-deterministic responses.

Detection signal

Failures include network errors, 429 Too Many Requests, 503 Service Unavailable, or connection timeout messages. Failure rate correlates with time of day or CI pipeline load.

Fix strategy

Mock all external APIs in unit and integration tests (MSW for JavaScript, WireMock for JVM). Reserve real external API calls for a dedicated integration stage that is explicitly expected to be fragile and is not a blocking gate on the main branch.

Governance rule

If a test that hits a real external API fails, it should alert the external-dependency team — not block a developer's feature branch merge.

NZ example: Tests that call the NZ Post address validation API directly. The API has a rate limit. During peak CI load at 9 am Monday, all 12 parallel workers hit the limit simultaneously and roughly a third of those tests fail.

Category 5

Environment Flakiness

The test depends on local environment state: timezone, locale, feature flags, environment variables, or runtime versions that differ between machines.

Detection signal

Test passes locally, fails in CI. Passes on one developer's machine, fails on another's. Failure messages include incorrect date formats, unexpected locale-specific strings, or missing environment variables.

Fix strategy

Pin all environment-specific values. Set TZ=Pacific/Auckland explicitly in your CI runner configuration. Lock runtime versions with .nvmrc, .tool-versions, or .python-version. Never read timezone or locale from the host OS implicitly.

NZ-specific note

NZ projects frequently fail date assertions because developers test in UTC+13 (NZDT) and CI runs in UTC. A date that is "today" locally becomes "yesterday" in UTC after midnight.

NZ example: An Inland Revenue annual returns test asserts that a filing deadline date renders as "31 March 2026." A developer in Auckland runs the test at 11 pm — the CI runner in UTC reads that as "30 March" and the assertion fails. Fix: set TZ=Pacific/Auckland in .github/workflows/ci.yml and freeze test dates to a constant fixture rather than new Date().

Category 6

Animation and Rendering Timing

The test clicks an element while a CSS animation is in progress, or takes a screenshot before a component has finished rendering.

Detection signal

Visual regression tests fail intermittently with pixel-level diffs. Click actions on animated elements throw "element is not stable" or "intercept pointer events" errors. Failures are non-deterministic even when run sequentially.

Fix strategy

Disable animations in test mode by injecting a CSS class or global style (*, *::before, *::after { animation-duration: 0s !important; transition-duration: 0s !important; }) when process.env.CI is set. Use Playwright actionability checks — the engine auto-waits for elements to be stable before acting.

Visual regression note

For visual snapshot tests, always call page.waitForLoadState('networkidle') and pause all animations before capturing. Many teams add a 100ms stabilisation delay — but this is a last resort, not a primary strategy.

NZ example: A public service portal uses a slide-in confirmation modal with a 300ms CSS transition. The E2E test clicks Confirm as soon as the modal appears — but mid-animation the element is outside the viewport clip area and Playwright throws "element is not visible." Fix: disable transitions in CI, or await page.waitForFunction(() => !document.querySelector('.modal').classList.contains('animating')).

3 Flakiness Tracking Template

Every new flaky test should be logged the moment it is identified. Review the register weekly in team standup. The act of logging makes flakiness visible — invisible flakiness never gets fixed.

Test ID / Name	Category	First Seen	Failure Rate	Root Cause (suspected)	Fix Applied	Fixed Date	Confirmed Fixed
checkout > displays confirmation banner	Timing Dependency	2026-06-01	12%	Hard-coded 500ms wait before banner renders	Replaced with `waitForSelector('.banner')`	2026-06-03	Yes (0% over 7 days)
members > count is 5	State Pollution	2026-06-05	8%	Test A leaves a member record unseeded	Added `afterEach` cleanup to Test A	—	No (in progress)
[your test here]	—	—	—	—	—	—	—

Quarantine and deletion policy

Logged: Add to register on first occurrence. Label the test in code with a tracking comment.
2 weeks without a fix: Quarantine (skip with a tracking comment). A failing test that is noisily failing is worse than a skipped test — the noise makes real failures invisible.
4 weeks quarantined without a fix: Delete the test. A test that is always skipped provides zero coverage and false assurance that it exists. Delete it, file a ticket to rewrite it properly, and move on.

Pro tip: Store the flakiness register as a CSV committed to your repo alongside your test configuration files. Version-control your flakiness history the same way you version-control your code.

4 The Flakiness Budget — Managing Flakiness as an SLO

Treat test flakiness as an operational metric — the same way SRE teams define error budgets for production services. A flakiness budget gives leadership a number they can act on, and gives engineering teams a target that is not "zero flakiness forever" (aspirational but impractical) but rather "within an agreed threshold."

Setting the SLO

Define flakiness as: percentage of test runs in a rolling week that produce at least one false failure
Recommended starting SLO: < 1% of test runs trigger a false failure per week
Calculate baseline before setting the target — you need two weeks of data first
Include the flakiness rate in your CI dashboard alongside test pass rate and build duration

When you exceed the budget

Freeze new feature test authoring until flakiness is back below the threshold — this creates immediate engineering incentive
Declare a "flakiness sprint": two or three engineers spend a full sprint on nothing but flakiness resolution
Report to leadership as a CI health metric — not as a test quality problem but as a velocity risk
Quarantine all tests above 5% individual failure rate immediately to stop bleeding signal

NZ government context: On NZ government project CI pipelines (often running on limited Azure DevOps agent pools shared across multiple teams), flaky tests are especially costly. A flaky test that triggers a re-run can add 10–30 minutes to queue time when the agent pool is saturated. Multiplied across a day, this burns agent-hour budget and delays deployments for every team sharing the pool. Present this to leadership in agent-hours, not test counts.

5 The Quarantine Pattern — Code Template

When a flaky test cannot be fixed immediately, quarantine it with a structured comment that provides a paper trail: who quarantined it, why, which category it belongs to, the tracking ticket, and a hard deletion deadline.

Playwright / Jest / Vitest — JavaScript

// FLAKY: Category=TIMING. Ticket: ENG-4521. Quarantined 2026-06-15.
// Suspected: waitForTimeout(500) before banner renders — needs waitForSelector.
// Owner: @jane.smith. Delete or fix by 2026-07-15.
test.skip("should display confirmation banner after submit", async ({ page }) => {
  await page.goto("/checkout");
  await page.click("#submit-btn");
  await page.waitForTimeout(500); // <-- the offender
  await expect(page.locator(".confirmation-banner")).toBeVisible();
});

Python / pytest

# FLAKY: Category=STATE_POLLUTION. Ticket: ENG-5103. Quarantined 2026-06-18.
# Suspected: previous test leaves member record in DB. Needs afterEach cleanup.
# Owner: @anika.patel. Delete or fix by 2026-07-18.
@pytest.mark.skip(reason="FLAKY ENG-5103 — quarantine until 2026-07-18")
def test_member_count_is_five(client, db):
    response = client.get("/api/members")
    assert response.json()["count"] == 5

Pro tip: Add a CI job that scans for quarantine comments whose deadline date has passed. If 2026-07-15 is in the past and the test is still skipped, the build should emit a warning in the log — a gentle automated nudge to either fix or delete.

6 When Re-Running Is the Wrong Answer

Most CI systems offer an automatic re-run-on-failure feature. GitHub Actions has retry-on-error. Azure DevOps has re-run failed tests. Playwright Test has --retries. These features are useful tools that have been widely misused.

When re-running is legitimate

Category 4 (Network Flakiness) when calling a real external API that has a known transient failure mode
Capped at 1 automatic retry maximum — not 3, not 5
Every retry is logged as a flakiness event regardless of the final outcome

When re-running is harmful

When engineers use it to make a build green and move on
When a retry obscures a real regression — a real bug that passes on retry 40% of the time will be missed indefinitely
When it is applied wholesale (all failed tests) rather than surgically (specific known-transient tests)

The policy recommendation for most teams:

Allow a maximum of one automatic retry per test. Never three, never five.
All retry invocations are flakiness events — even if the retry passes. Log them. Count them against your flakiness budget.
If a test retries and passes more than twice in a rolling month, it must be classified in the flakiness register and investigated. Passing on retry is not passing.
No engineer should manually click Re-run as a reflexive action without first checking the failure log for 30 seconds. This is a team culture standard, not just a tooling rule.

The 40% rule: Research in large CI pipelines consistently shows that re-running a failed test and getting a pass is not reliable evidence that the failure was a false positive. Roughly 40% of what engineers dismiss as "the flaky test acting up" is actually a real regression that re-runs until it passes. Organisations that enforce investigation-first policies catch significantly more real bugs earlier.

7 Self-Check — Can You Actually Apply This?

Click each question to reveal the answer. Aim for all five before moving on.

Q1. A test consistently passes when run with --grep "member count" but fails intermittently in the full suite. Which category is this most likely, and what is the first investigation step?

Category 2: State Pollution. The first step is to run the full suite twice consecutively while logging which test ran immediately before the failing test in each run. If a different preceding test correlates with the failure each time, that predecessor is leaving state behind. Audit its teardown logic.

Q2. Your team adds --retries=3 to the Playwright config to "fix" flaky tests before a big release. What is the principal risk of this decision?

Setting retries to 3 means a real regression can pass on the second or third attempt and be reported as a pass. Engineers lose visibility of real failures. The flakiness is hidden rather than fixed, and the flakiness budget is no longer meaningful. The correct approach is to classify each flaky test, quarantine it if needed, and fix the root cause — not mask it with retries.

Q3. A test fails with a "unique constraint violation" only when the pipeline runs with 8 parallel workers. Which category, and what is the architectural fix?

Category 3: Infrastructure Race Condition. Multiple workers are writing to the same shared resource (database row, user account, or sequence). Architectural fix: partition test data by worker ID (e.g., prefix all test records with the Playwright worker index), or give each worker its own isolated database schema that is created fresh and destroyed after the run.

Q4. You have a test that has been in the flakiness register for 5 weeks, quarantined for 4 weeks, and still has no assigned fix. What should happen to it?

Per the quarantine policy, it should be deleted. A test quarantined for 4 or more weeks without a fix provides zero test coverage (it is always skipped) but creates false assurance that coverage exists. Delete it, file a ticket to rewrite it with proper isolation, and prioritise the rewrite like any other coverage gap.

Q5. Your flakiness SLO is <1% of weekly test runs. You measure 3.2% this week. What two immediate actions should you take?

Immediate: (1) Quarantine all individual tests with a failure rate above 5% — this reduces noise and stops them masking real failures. (2) Freeze new feature test authoring until the rate is back below 1%, creating engineering incentive to resolve existing flakiness before adding new tests. Then plan a flakiness sprint and report the CI health metric to leadership as a velocity risk.