Flakiness Classification Framework
Give your team a shared vocabulary for why tests flake — then eliminate flakiness by root cause, not by re-running until green.
1 The Problem — Why Flakiness Is a Strategic Issue
A test suite with 5% flaky tests sounds manageable. Do the maths on a real CI pipeline and it stops sounding manageable immediately.
(1,000 tests × 30 CI runs × 5%)
But the hidden cost is worse than the compute bill. When engineers learn that the pipeline can be unblocked by clicking Re-run, they stop investigating. The suite becomes a liar. Real regressions get lost in the noise. You don't find the bug until a customer does.
Flaky tests also compound trust decay. Once engineers mentally classify the suite as unreliable, they start ignoring failures. At that point, your entire automation investment produces negative value — it gives a false sense of coverage while masking active defects.
The solution is not to tolerate flakiness. It is to classify it, track it, and eliminate it by root cause. This lesson gives you the formal system to do that.
2 The Flakiness Classification Taxonomy
Every flaky test has a root cause that fits into one of six categories. Diagnosing the category determines the fix. Treating them the same wastes time — a timing fix will not solve a state pollution problem.
Timing Dependency
The test passes or fails based on how long an operation takes. Usually caused by hard-coded waits or assertions that fire before async operations complete.
Failure message says "element not found," "timeout," or "element detached from DOM." Fails consistently on slower CI runners but never on a developer's local machine.
Replace all sleep() / waitForTimeout() calls with condition-based waits: waitForSelector, waitForResponse, waitForLoadState('networkidle'). Assert on state, not on time.
Playwright actionability checks auto-wait. Cypress cy.get() retries. Selenium: use WebDriverWait with ExpectedConditions, never Thread.sleep().
State Pollution
A test depends on state left behind by a previous test. If the order changes, or the previous test is skipped, this test fails.
Test passes when run in isolation (--grep / fit) but fails in the full suite. The failure order is inconsistent — it depends on which tests ran before it.
Each test must own its full setup (beforeEach seeds its own data) and teardown (afterEach deletes what it created). Never share mutable objects between tests. Use test-scoped database transactions that roll back after each test.
Shared global variables, beforeAll that creates records multiple tests write to, or test files that assume a particular execution order.
Infrastructure Race Condition
Parallel test workers compete for a shared resource: the same database row, the same test user account, or the same network port.
Tests pass with --workers=1 but fail with --workers=4 or higher. Failure messages mention unique constraint violations, deadlocks, or "address already in use."
Partition test data per worker (prefix records with the worker ID). Assign each worker a unique test user account and its own database schema or test-database instance. Use random ephemeral ports for local test servers.
This category is often discovered only after scaling CI parallelism. Build for isolation from day one — retrofitting is expensive.
Network Flakiness
The test calls a real external API or service that is temporarily unavailable, rate-limited, or returning non-deterministic responses.
Failures include network errors, 429 Too Many Requests, 503 Service Unavailable, or connection timeout messages. Failure rate correlates with time of day or CI pipeline load.
Mock all external APIs in unit and integration tests (MSW for JavaScript, WireMock for JVM). Reserve real external API calls for a dedicated integration stage that is explicitly expected to be fragile and is not a blocking gate on the main branch.
If a test that hits a real external API fails, it should alert the external-dependency team — not block a developer's feature branch merge.
Environment Flakiness
The test depends on local environment state: timezone, locale, feature flags, environment variables, or runtime versions that differ between machines.
Test passes locally, fails in CI. Passes on one developer's machine, fails on another's. Failure messages include incorrect date formats, unexpected locale-specific strings, or missing environment variables.
Pin all environment-specific values. Set TZ=Pacific/Auckland explicitly in your CI runner configuration. Lock runtime versions with .nvmrc, .tool-versions, or .python-version. Never read timezone or locale from the host OS implicitly.
NZ projects frequently fail date assertions because developers test in UTC+13 (NZDT) and CI runs in UTC. A date that is "today" locally becomes "yesterday" in UTC after midnight.
TZ=Pacific/Auckland in .github/workflows/ci.yml and freeze test dates to a constant fixture rather than new Date().Animation and Rendering Timing
The test clicks an element while a CSS animation is in progress, or takes a screenshot before a component has finished rendering.
Visual regression tests fail intermittently with pixel-level diffs. Click actions on animated elements throw "element is not stable" or "intercept pointer events" errors. Failures are non-deterministic even when run sequentially.
Disable animations in test mode by injecting a CSS class or global style (*, *::before, *::after { animation-duration: 0s !important; transition-duration: 0s !important; }) when process.env.CI is set. Use Playwright actionability checks — the engine auto-waits for elements to be stable before acting.
For visual snapshot tests, always call page.waitForLoadState('networkidle') and pause all animations before capturing. Many teams add a 100ms stabilisation delay — but this is a last resort, not a primary strategy.
await page.waitForFunction(() => !document.querySelector('.modal').classList.contains('animating')).3 Flakiness Tracking Template
Every new flaky test should be logged the moment it is identified. Review the register weekly in team standup. The act of logging makes flakiness visible — invisible flakiness never gets fixed.
| Test ID / Name | Category | First Seen | Failure Rate | Root Cause (suspected) | Fix Applied | Fixed Date | Confirmed Fixed |
|---|---|---|---|---|---|---|---|
| checkout > displays confirmation banner | Timing Dependency | 2026-06-01 | 12% | Hard-coded 500ms wait before banner renders | Replaced with waitForSelector('.banner') |
2026-06-03 | Yes (0% over 7 days) |
| members > count is 5 | State Pollution | 2026-06-05 | 8% | Test A leaves a member record unseeded | Added afterEach cleanup to Test A |
— | No (in progress) |
| [your test here] | — | — | — | — | — | — | — |
Quarantine and deletion policy
- Logged: Add to register on first occurrence. Label the test in code with a tracking comment.
- 2 weeks without a fix: Quarantine (skip with a tracking comment). A failing test that is noisily failing is worse than a skipped test — the noise makes real failures invisible.
- 4 weeks quarantined without a fix: Delete the test. A test that is always skipped provides zero coverage and false assurance that it exists. Delete it, file a ticket to rewrite it properly, and move on.
4 The Flakiness Budget — Managing Flakiness as an SLO
Treat test flakiness as an operational metric — the same way SRE teams define error budgets for production services. A flakiness budget gives leadership a number they can act on, and gives engineering teams a target that is not "zero flakiness forever" (aspirational but impractical) but rather "within an agreed threshold."
Setting the SLO
- Define flakiness as: percentage of test runs in a rolling week that produce at least one false failure
- Recommended starting SLO: < 1% of test runs trigger a false failure per week
- Calculate baseline before setting the target — you need two weeks of data first
- Include the flakiness rate in your CI dashboard alongside test pass rate and build duration
When you exceed the budget
- Freeze new feature test authoring until flakiness is back below the threshold — this creates immediate engineering incentive
- Declare a "flakiness sprint": two or three engineers spend a full sprint on nothing but flakiness resolution
- Report to leadership as a CI health metric — not as a test quality problem but as a velocity risk
- Quarantine all tests above 5% individual failure rate immediately to stop bleeding signal
NZ government context: On NZ government project CI pipelines (often running on limited Azure DevOps agent pools shared across multiple teams), flaky tests are especially costly. A flaky test that triggers a re-run can add 10–30 minutes to queue time when the agent pool is saturated. Multiplied across a day, this burns agent-hour budget and delays deployments for every team sharing the pool. Present this to leadership in agent-hours, not test counts.
5 The Quarantine Pattern — Code Template
When a flaky test cannot be fixed immediately, quarantine it with a structured comment that provides a paper trail: who quarantined it, why, which category it belongs to, the tracking ticket, and a hard deletion deadline.
Playwright / Jest / Vitest — JavaScript
// FLAKY: Category=TIMING. Ticket: ENG-4521. Quarantined 2026-06-15.
// Suspected: waitForTimeout(500) before banner renders — needs waitForSelector.
// Owner: @jane.smith. Delete or fix by 2026-07-15.
test.skip("should display confirmation banner after submit", async ({ page }) => {
await page.goto("/checkout");
await page.click("#submit-btn");
await page.waitForTimeout(500); // <-- the offender
await expect(page.locator(".confirmation-banner")).toBeVisible();
});
Python / pytest
# FLAKY: Category=STATE_POLLUTION. Ticket: ENG-5103. Quarantined 2026-06-18.
# Suspected: previous test leaves member record in DB. Needs afterEach cleanup.
# Owner: @anika.patel. Delete or fix by 2026-07-18.
@pytest.mark.skip(reason="FLAKY ENG-5103 — quarantine until 2026-07-18")
def test_member_count_is_five(client, db):
response = client.get("/api/members")
assert response.json()["count"] == 5
2026-07-15 is in the past and the test is still skipped, the build should emit a warning in the log — a gentle automated nudge to either fix or delete.6 When Re-Running Is the Wrong Answer
Most CI systems offer an automatic re-run-on-failure feature. GitHub Actions has retry-on-error. Azure DevOps has re-run failed tests. Playwright Test has --retries. These features are useful tools that have been widely misused.
When re-running is legitimate
- Category 4 (Network Flakiness) when calling a real external API that has a known transient failure mode
- Capped at 1 automatic retry maximum — not 3, not 5
- Every retry is logged as a flakiness event regardless of the final outcome
When re-running is harmful
- When engineers use it to make a build green and move on
- When a retry obscures a real regression — a real bug that passes on retry 40% of the time will be missed indefinitely
- When it is applied wholesale (all failed tests) rather than surgically (specific known-transient tests)
The policy recommendation for most teams:
- Allow a maximum of one automatic retry per test. Never three, never five.
- All retry invocations are flakiness events — even if the retry passes. Log them. Count them against your flakiness budget.
- If a test retries and passes more than twice in a rolling month, it must be classified in the flakiness register and investigated. Passing on retry is not passing.
- No engineer should manually click Re-run as a reflexive action without first checking the failure log for 30 seconds. This is a team culture standard, not just a tooling rule.
The 40% rule: Research in large CI pipelines consistently shows that re-running a failed test and getting a pass is not reliable evidence that the failure was a false positive. Roughly 40% of what engineers dismiss as "the flaky test acting up" is actually a real regression that re-runs until it passes. Organisations that enforce investigation-first policies catch significantly more real bugs earlier.
7 Self-Check — Can You Actually Apply This?
Click each question to reveal the answer. Aim for all five before moving on.
Q1. A test consistently passes when run with --grep "member count" but fails intermittently in the full suite. Which category is this most likely, and what is the first investigation step?
Category 2: State Pollution. The first step is to run the full suite twice consecutively while logging which test ran immediately before the failing test in each run. If a different preceding test correlates with the failure each time, that predecessor is leaving state behind. Audit its teardown logic.
Q2. Your team adds --retries=3 to the Playwright config to "fix" flaky tests before a big release. What is the principal risk of this decision?
Setting retries to 3 means a real regression can pass on the second or third attempt and be reported as a pass. Engineers lose visibility of real failures. The flakiness is hidden rather than fixed, and the flakiness budget is no longer meaningful. The correct approach is to classify each flaky test, quarantine it if needed, and fix the root cause — not mask it with retries.
Q3. A test fails with a "unique constraint violation" only when the pipeline runs with 8 parallel workers. Which category, and what is the architectural fix?
Category 3: Infrastructure Race Condition. Multiple workers are writing to the same shared resource (database row, user account, or sequence). Architectural fix: partition test data by worker ID (e.g., prefix all test records with the Playwright worker index), or give each worker its own isolated database schema that is created fresh and destroyed after the run.
Q4. You have a test that has been in the flakiness register for 5 weeks, quarantined for 4 weeks, and still has no assigned fix. What should happen to it?
Per the quarantine policy, it should be deleted. A test quarantined for 4 or more weeks without a fix provides zero test coverage (it is always skipped) but creates false assurance that coverage exists. Delete it, file a ticket to rewrite it with proper isolation, and prioritise the rewrite like any other coverage gap.
Q5. Your flakiness SLO is <1% of weekly test runs. You measure 3.2% this week. What two immediate actions should you take?
Immediate: (1) Quarantine all individual tests with a failure rate above 5% — this reduces noise and stops them masking real failures. (2) Freeze new feature test authoring until the rate is back below 1%, creating engineering incentive to resolve existing flakiness before adding new tests. Then plan a flakiness sprint and report the CI health metric to leadership as a velocity risk.