Mid-Level Automation · Reliability

Async Flakiness & Determinism

A flaky test is a test that lies — green when the code is broken, red when it works. It is the single fastest way to destroy a team’s trust in automation. Almost every flaky test is the same bug: the test checked before the app was ready.

Mid-Level Reliability · Async & Timing · ~25 min read · ~60 min with exercises

1 The Hook

A team at a fictional NZ logistics company, Kahu Freight, had a Playwright suite of 300 tests. It was green on every developer’s machine. But in CI, roughly one run in five went red — and never the same test twice. One day it was the checkout test, the next the login test, the next a dashboard assertion. Re-running the pipeline usually went green.

So the team did the natural thing: they turned on retries. retries: 2 in the config. Now a test that failed got two more shots, and the pipeline went green almost every time. Problem solved — or so it looked. Velocity was fine, the board was green, everyone moved on.

Three months later a real bug shipped to production: under load, the checkout page occasionally rendered the “Pay” button before the cart total finished loading, and customers could submit a payment for $0.00. When the team went back through their history, they found something uncomfortable: the checkout test had been failing intermittently in CI for months — it had actually been catching this exact race condition — but the retries swallowed the failure every time. The flaky test wasn’t noise. It was a real defect waving a flag, and the team had taped over the flag.

That is the whole lesson in one story. Flakiness is not a nuisance to be silenced with retries — it is a signal that your test and your app disagree about when something is ready. The job is not to make the red go away. The job is to make the test deterministic: same input, same result, every single time. This lesson teaches you why tests go flaky and how to build waits that don’t.

2 The Rule

A test must be deterministic: the same code under test must produce the same result every run. A test that passes and fails without the code changing isn’t giving you information — it’s giving you noise, and noise that you learn to ignore is worse than no test at all. Almost all flakiness comes from one root: the test asserts on a moment in time instead of waiting for a condition to be true.

3 The Analogy

A smoke alarm that cries wolf.

A smoke alarm that goes off every time you make toast quickly becomes an alarm nobody trusts. Within a week someone takes the battery out “just while I’m cooking.” The alarm hasn’t become useless because it’s broken — it’s become useless because it’s unreliable, and an unreliable alarm trains people to ignore it. Then the night there’s a real fire, the battery is still on the bench.

A flaky test is that alarm. Every false red trains the team to shrug and hit “re-run.” And a team that reflexively re-runs failures has quietly taken the batteries out of their whole suite — so when a test finally catches a real fire, like Kahu Freight’s $0.00 checkout, nobody’s listening. Determinism is what keeps the alarm worth trusting.

4 Why Tests Go Flaky

Flakiness feels mysterious because it’s intermittent, but it comes from a short list of causes. Almost all of them are a mismatch between the test’s timing and the app’s timing. Learn the categories and you can name any flake on sight:

  • Timing / race conditions (the big one): the test acts or asserts before the app has finished an async operation — a fetch, a render, an animation. Fast machines hide it; a loaded CI runner exposes it. This is the majority of all flakiness.
  • Hard-coded waits: sleep(2000) is a bet that the app will always be ready in two seconds. On a slow run it isn’t, and the test fails; on a fast run you waste two seconds. Either way it’s wrong.
  • Shared state / test order dependence: test B only passes if test A ran first and left data behind. Run them in a different order, or in parallel, and B fails. Each test must set up and tear down its own world.
  • Non-deterministic test data: a test that depends on “today’s date,” a random value, or a record that another test might also touch. The data shifts under the test.
  • Animations and transitions: the element is in the DOM but still sliding into place, so a click lands on the wrong spot or is intercepted.
  • Network variance: a real third-party call that’s usually fast but occasionally slow, or returns slightly different data.
Pro tip: When a test is flaky, the first diagnostic question is never “which line failed” — it’s “what was the test waiting for, and was that thing actually ready?” Nine times in ten the answer is that the test assumed something was ready when it wasn’t.
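
To make the shared-state category concrete, here is a minimal sketch in plain TypeScript. The in-memory Cart and every function name are hypothetical stand-ins for whatever state outlives a single test (a database row, a logged-in session); no test runner is involved.

```typescript
// Hypothetical in-memory cart standing in for any state that outlives one test.
type Cart = { items: string[] };

// Order-dependent style: two "tests" read and write one shared cart.
const sharedCart: Cart = { items: [] };

function addItemToSharedCart(): void {
  sharedCart.items.push('ABC');
}

function sharedCheckoutSeesOneItem(): boolean {
  // Only true if addItemToSharedCart() ran exactly once beforehand.
  return sharedCart.items.length === 1;
}

// Isolated style: the test builds exactly the state it needs.
function newCartWith(...items: string[]): Cart {
  return { items: [...items] };
}

function isolatedCheckoutSeesOneItem(): boolean {
  const cart = newCartWith('ABC'); // fresh state, owned by this test
  return cart.items.length === 1;
}
```

Run on its own, sharedCheckoutSeesOneItem() returns false because nothing seeded the cart, while isolatedCheckoutSeesOneItem() returns true in any order and under any parallelism.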

5 The Hard-Wait Anti-Pattern

The instinct of every new automator hitting a flaky test is to add a sleep. It even works — for a while. This is the single most damaging habit in UI automation, so look hard at why it’s wrong.

// ❌ The anti-pattern: a fixed sleep
await page.click('#checkout');
await page.waitForTimeout(2000); // "wait for the total to load"
await expect(page.locator('#total')).toHaveText('$42.00');

This code says: wait exactly two seconds, then check. It is wrong in both directions at once. If the total loads in 2.5 seconds on a slow CI run, the test fails even though the app is perfectly fine: a false red. If it loads in 100ms, you’ve thrown away 1.9 seconds for nothing. With just one such sleep in each of 300 tests, that is 570 seconds (about nine and a half minutes) of pure waiting in a serial run. A fixed wait is a guess about timing, and the app’s timing is exactly the thing you can’t guarantee.

The fix is a web-first assertion: an assertion that automatically retries until the condition is true or a timeout is hit. It waits for the state you care about, not for a number of milliseconds.

// ✅ Web-first: wait for the condition, not the clock
await page.click('#checkout');
await expect(page.locator('#total')).toHaveText('$42.00'); // auto-retries until true

This version passes as soon as the total is correct — whether that takes 100ms or 2.5s — and only fails if it never becomes correct within the timeout. It is faster on quick runs and robust on slow ones. The lesson: never wait for time; always wait for a condition. Playwright’s expect, Cypress’s built-in retry-ability, and Selenium’s explicit WebDriverWait with expected conditions all exist precisely so you never have to sleep.
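
To show roughly what "retries until the condition is true or a timeout is hit" means, here is a minimal polling sketch in plain TypeScript. It is an illustration of the idea only, not how any particular library implements it: waitForCondition, loadTotal, and the default timings are all hypothetical.

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: poll a condition until it holds or the deadline passes.
async function waitForCondition(
  check: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 }: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return; // condition met: stop waiting immediately
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await sleep(intervalMs); // not ready yet: poll again shortly
  }
}

// Simulated app: the total text appears after an unpredictable delay.
function loadTotal(delayMs: number): { read: () => string } {
  let text = '';
  setTimeout(() => {
    text = '$42.00';
  }, delayMs);
  return { read: () => text };
}

// Usage: passes whether the total takes 100ms or 2.5s, as long as it arrives in time.
async function demo(): Promise<string> {
  const total = loadTotal(120);
  await waitForCondition(() => total.read() === '$42.00', { timeoutMs: 3000 });
  return total.read();
}
```

The timeout here is an upper bound, not a wait: a fast run returns on the first successful poll, and only a condition that never becomes true pays the full cost.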

6 Deterministic Wait Architecture

Killing sleeps is the start. A genuinely stable suite is designed to be deterministic. Five principles:

  • Assert on conditions, never on time. Use web-first / auto-retrying assertions everywhere. If you ever type waitForTimeout or sleep, treat it as a bug to be removed.
  • Wait for the right signal. Wait for the element to be actionable (visible, stable, enabled) — or for the network response, or for a loading spinner to disappear — not just for it to exist in the DOM. “Present” and “ready” are different states.
  • Isolate state. Every test creates the data it needs and cleans up after itself. No test depends on another test’s leftovers or on running in a particular order. This is what makes parallel execution safe.
  • Control the non-deterministic inputs. Freeze the clock for date-dependent tests, seed any randomness, and pin test data. Don’t let “today” or Math.random() decide whether your test passes.
  • Mock the unstable edges. For UI tests, stub flaky third-party calls so the test exercises your code, not the reliability of someone else’s API. Test the real integration separately, deliberately, in its own suite.
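
The "control the non-deterministic inputs" principle can be sketched in plain TypeScript by passing the clock and the random source in as parameters. Everything here is hypothetical: makeOrderRef stands in for any code under test, and seededRandom is one small seeded generator among many possible choices.

```typescript
// Small seeded pseudo-random generator: the same seed always yields the same sequence.
function seededRandom(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), state | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Hypothetical code under test: it receives its clock and randomness
// instead of calling new Date() and Math.random() directly.
function makeOrderRef(now: () => Date, random: () => number): string {
  const day = now().toISOString().slice(0, 10); // UTC calendar date, YYYY-MM-DD
  const suffix = Math.floor(random() * 10000).toString().padStart(4, '0');
  return `ORD-${day}-${suffix}`;
}

// In a test, both inputs are pinned, so the result cannot vary between runs.
const fixedNow = (): Date => new Date('2026-06-05T00:00:00Z');
const firstRef = makeOrderRef(fixedNow, seededRandom(42));
const secondRef = makeOrderRef(fixedNow, seededRandom(42));
console.log(firstRef === secondRef); // true: same clock and seed, same output
```

Production code would pass () => new Date() and Math.random; only the test swaps in the pinned versions.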

Put together, these turn the “pass 50 times in a row” challenge from luck into design. A test that waits for conditions, owns its data, and controls its inputs has no reason to vary — so it doesn’t.

Pro tip: The honest test of a stable test is to run it 50–100 times in a loop locally (Playwright has a --repeat-each flag; with other runners you may need a plugin or a simple loop). If it’s green every time, you have good evidence that it’s deterministic. If it fails once in 40, you have a latent race. Find it now, because CI will find it for you at the worst possible moment.
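
If a runner has no repeat flag, the same check can be approximated with a small loop. The sketch below is plain TypeScript with a hypothetical runRepeatedly helper; it assumes the test is an ordinary function that throws on failure.

```typescript
// Hypothetical helper: run a test function N times and record which runs failed.
async function runRepeatedly(
  times: number,
  testFn: (run: number) => void | Promise<void>,
): Promise<{ passed: number; failedRuns: number[] }> {
  const failedRuns: number[] = [];
  for (let run = 1; run <= times; run++) {
    try {
      await testFn(run);
    } catch {
      failedRuns.push(run); // keep going: we want the failure rate, not the first failure
    }
  }
  return { passed: times - failedRuns.length, failedRuns };
}

// Usage: anything other than an empty failedRuns list is a latent race to investigate.
async function report(): Promise<void> {
  const result = await runRepeatedly(50, async () => {
    // call the real test body here
  });
  console.log(`${result.passed}/50 passed, failed runs: ${result.failedRuns.join(', ') || 'none'}`);
}
```

Recording the failing run numbers rather than stopping at the first failure gives a rough failure rate, which is useful when comparing before and after a fix.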

7 The Retry Trap

Test runners let you auto-retry failed tests, and used wrongly that feature is how Kahu Freight shipped a $0.00 checkout. So be precise about what retries are for.

A retry that turns red into green hides information. If a test only passes on the second attempt, something is non-deterministic — and that something might be a real bug in the app, not in the test. Blanket retries paper over exactly the signal you built the test to catch.

Retries have a legitimate, narrow use: as a safety net and a measurement tool, not a fix. The healthy pattern is:

  • Track flakiness, don’t silence it. Configure the runner to record when a test passed only on retry, and treat that count as a defect backlog — not as “green, move on.”
  • Quarantine, then fix. Move a known-flaky test out of the blocking suite so it stops eroding trust, but with a ticket to fix it — not to forget it.
  • Reserve retries for genuinely external instability you don’t control (a shared environment, a third party), and even then, log it.
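
The "track, don't silence" pattern above can be sketched as a small classifier over per-test attempt histories. Playwright, for one, reports a test that passes only on retry as "flaky"; the types and function names below are hypothetical and not tied to any runner's report format.

```typescript
type AttemptResult = 'passed' | 'failed';
type Outcome = 'passed' | 'flaky' | 'failed';

// Hypothetical record: one entry per test, with the result of each attempt in order.
interface TestRecord {
  title: string;
  attempts: AttemptResult[];
}

function classify(record: TestRecord): Outcome {
  const last = record.attempts[record.attempts.length - 1];
  if (last !== 'passed') return 'failed'; // never went green (or never ran)
  // Green on the first attempt is a clean pass; green after a red is flaky.
  return record.attempts.length === 1 ? 'passed' : 'flaky';
}

// The flaky list is the defect backlog: tests that went green only because of a retry.
function flakyBacklog(records: TestRecord[]): string[] {
  return records.filter((r) => classify(r) === 'flaky').map((r) => r.title);
}

const lastRun: TestRecord[] = [
  { title: 'login', attempts: ['passed'] },
  { title: 'checkout shows total', attempts: ['failed', 'passed'] },
  { title: 'dashboard', attempts: ['failed', 'failed', 'failed'] },
];
console.log(flakyBacklog(lastRun)); // [ 'checkout shows total' ]
```

A pipeline that only looks at the final colour would count two of these three tests as green; the backlog keeps the checkout test visible.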

The line to hold: a retry should buy you visibility while you fix the cause, never permission to ignore it. The moment retries become the fix, your green pipeline is lying to you.

8 Common Mistakes

🚫 Adding a sleep to “fix” a flaky test

Why it happens: It works on your machine right now, so it feels solved.
The fix: A fixed sleep is a guess about timing that will be wrong on a slower or faster machine. Replace it with a web-first assertion that waits for the actual condition — the element being actionable, the text being correct, the spinner being gone.

🚫 Turning on blanket retries and calling it stable

Why it happens: The pipeline goes green and velocity looks fine.
The fix: Retries hide non-determinism, which is sometimes a real app bug. Track retry-passes as a defect signal and fix the cause; reserve retries for instability you genuinely don’t control.

🚫 Waiting for an element to exist, not to be ready

Why it happens: The element is in the DOM, so it seems safe to click.
The fix: “Present” is not “actionable.” An element can be in the DOM but hidden, disabled, or still animating into place. Wait for it to be visible, stable, and enabled — which good auto-retrying assertions do for you.
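
As a rough model of the difference, the sketch below treats an element as actionable only when it is attached, visible, enabled, and has not moved between two consecutive samples. It is a simplified illustration with hypothetical types; real frameworks run more thorough checks (Playwright, for example, also verifies that the element would actually receive the click).

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical snapshot of an element's state at one instant.
interface ElementSnapshot {
  attached: boolean; // present in the DOM
  visible: boolean;
  enabled: boolean;
  box: Box | null; // position and size, if rendered
}

function sameBox(a: Box, b: Box): boolean {
  return a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;
}

// "Present" is only the first check; "stable" needs two samples to compare.
function isActionable(prev: ElementSnapshot, curr: ElementSnapshot): boolean {
  if (!curr.attached || !curr.visible || !curr.enabled) return false;
  if (!prev.box || !curr.box) return false;
  return sameBox(prev.box, curr.box); // unchanged between samples: not animating
}

// A button still sliding in: present, visible, enabled, but not yet stable.
const sliding: ElementSnapshot = { attached: true, visible: true, enabled: true, box: { x: 300, y: 40, width: 120, height: 32 } };
const settled: ElementSnapshot = { attached: true, visible: true, enabled: true, box: { x: 200, y: 40, width: 120, height: 32 } };
console.log(isActionable(sliding, settled)); // false: it moved between samples
console.log(isActionable(settled, settled)); // true: same position twice
```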

🚫 Tests that depend on order or shared data

Why it happens: Reusing data set up by an earlier test is less typing.
The fix: The moment you run in parallel or change order, those tests collapse. Each test must create and tear down its own data so it passes alone, in any order, every time.

🚫 Letting the clock or randomness decide the result

Why it happens: Using the real date or a random value seems harmlessly realistic.
The fix: “Today” rolls over midnight, hits weekends, and crosses daylight-saving boundaries; random values occasionally hit an edge. Freeze the clock and seed randomness so the input is fixed — otherwise the test is non-deterministic by construction.

9 Now You Try

Three graded exercises. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Rewrite the Flaky Test

The Playwright test below is flaky. Identify every source of flakiness, then rewrite it to be deterministic. You don’t need perfect syntax — show the technique (web-first assertions, no sleeps, isolated data, controlled inputs).

test('checkout shows total', async ({ page }) => {
  await page.goto('/cart');
  await page.waitForTimeout(3000);
  await page.click('#checkout');
  await page.waitForTimeout(2000);
  // assumes the test 'add item to cart' ran first and left an item
  const total = await page.locator('#total').textContent();
  expect(total).toBe('$42.00');
  // order created with today's date
  await expect(page.locator('#order-date')).toHaveText(new Date().toLocaleDateString());
});

List the flakiness sources and write your deterministic rewrite:

Model answer:
Flakiness sources:
1. Two fixed sleeps (waitForTimeout 3000 and 2000) — guesses about timing that break on slow/fast runs.
2. textContent() + expect(total).toBe(...) — a one-shot read with no auto-retry; it captures whatever is on screen at that instant, before the total may have loaded.
3. Order dependence / shared state — relies on a separate 'add item to cart' test having run first and left an item. Fails alone or in parallel.
4. Non-deterministic data — asserts the order date equals today's real date, which changes every day and across timezones/DST.

Deterministic rewrite:
test('checkout shows total', async ({ page }) => {
  // own your data: seed a cart with a known item via API/fixture, don't depend on another test
  await seedCartWithItem(page, { sku: 'ABC', price: 4200 });
  // freeze the clock so the date is fixed
  await page.clock.setFixedTime(new Date('2026-06-05T10:00:00+12:00'));

  await page.goto('/cart');
  await page.getByRole('button', { name: 'Checkout' }).click();   // auto-waits for actionable

  await expect(page.locator('#total')).toHaveText('$42.00');       // web-first, auto-retries
  await expect(page.locator('#order-date')).toHaveText('05/06/2026'); // fixed, not new Date()
});

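Key moves: no sleeps; web-first assertions that wait for the condition; the test seeds its own data; the clock is frozen so the date is deterministic. The expected '05/06/2026' string also assumes the browser locale and timezone are pinned (for example with Playwright's locale and timezoneId options); otherwise the rendered date format can still vary between machines. Full marks identify the sleeps, the one-shot read, the order dependence, AND the date, and replace all four.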
🔧 Exercise 2 of 3 — Diagnose the Flake

A test fails roughly 1 run in 8, only in CI, never locally, and never on the same assertion twice. The team has added retries: 2 and the pipeline is green again. For each part below, give your answer with reasoning.

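Diagnose it:
(a) What is the most likely category of flakiness, and why?
(b) Why is retries: 2 not a fix?
(c) How would you find and fix the real cause?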

Model answer:
(a) Almost certainly a timing/race condition. CI runners are slower and more loaded than a dev laptop, so async operations (fetch, render, animation) take longer and the test asserts before the app is ready. "Different assertion each time" fits a general readiness/race problem rather than one specific broken locator. Shared-state-under-parallelism is the second candidate if CI runs tests in parallel and local doesn't.

(b) Because retries hide non-determinism, and non-determinism is sometimes a real app bug (the Kahu Freight $0.00 checkout). Going green on retry tells you nothing about WHY it failed — you've silenced a signal, not fixed a cause. The pipeline is now lying about stability.

(c) Reproduce the load locally: run the test with --repeat-each 50 (and under CPU throttling / parallel workers) to make it fail on demand. Then read the trace from a failed run to see exactly what the test was waiting for and what state the app was in. Replace any sleeps with web-first assertions; wait for the real signal (network response / spinner gone / actionable element). Check whether the app itself has a race (e.g. button enabled before data loads) — that's a product bug to log, not a test fix. Keep retries only as a tracked flakiness metric while you fix it.

Marking: full marks name timing/race as the likely category WITH the CI-load reasoning, explain that retries hide a possibly-real bug, and give a concrete reproduce-then-trace plan rather than "add waits".
🏗️ Exercise 3 of 3 — Pass It 50 Times

You must write one UI test that passes 50 times in a row against this deliberately hostile page: a product list that lazy-loads items as you scroll; an “Add to cart” button that slides in with a 600ms animation; and a backend that responds in a random 200ms–4000ms. Describe your wait strategy — what you wait for at each step and what you deliberately avoid — so the test is deterministic despite all three hazards.

Model answer:
Lazy-loading list: don't assume the item is present. Scroll until it loads, then wait for the specific item locator to be visible. Playwright actions such as click() scroll an element into view before acting, but only once the element exists in the DOM, so a lazy-loaded item may need an explicit scroll to trigger the load first. Wait for the condition "this item exists and is visible", not a scroll-then-sleep.

Animated button: wait for it to be actionable, not merely present. Auto-retrying click/assertions wait for the element to be visible AND stable (not moving) AND enabled before acting, which rides out the 600ms slide. Avoid clicking immediately after it appears in the DOM.

Variable 200–4000ms backend: use web-first assertions with a timeout comfortably above the worst case (e.g. wait up to 10s for the result), so a slow 4000ms response still passes. Better: wait for the actual network response or for the loading spinner to disappear, so you key off the real signal rather than any fixed number. Don't sleep 4000ms — that's slow on fast runs and still risky.

What I avoid: every waitForTimeout/sleep; one-shot reads (textContent then assert); assuming DOM-present == ready; depending on scroll position or another test's state.

Prove determinism: run it with --repeat-each 50 (ideally under throttling and in parallel). Green 50/50 = deterministic; one failure = a latent race to fix before it ships.

Marking: full marks give a CONDITION-based wait for each of the three hazards (scroll+visible, actionable/stable, response/spinner with a generous timeout), explicitly reject sleeps, and include the repeat-each proof. Any "add a 5-second sleep" answer misses the point.

10 Self-Check

Q1: In one sentence, what is the root cause of most test flakiness?

The test asserts on a moment in time instead of waiting for a condition to be true — so it checks before the app is actually ready, and whether that’s fast enough varies run to run.

Q2: Why is waitForTimeout(2000) wrong in both directions?

If the app is slower than 2s on a given run, the test fails even though nothing is broken (false red). If it’s faster, you waste the remaining time on every run (slow suite). A fixed wait is a guess about timing you can’t guarantee; wait for the condition instead.

Q3: What is a web-first assertion and why does it kill most flakiness?

An assertion that automatically retries until the condition is true or a timeout is reached — e.g. expect(locator).toHaveText('$42.00'). It waits for the state you care about rather than a number of milliseconds, so it’s both faster on quick runs and robust on slow ones.

Q4: When are auto-retries acceptable, and when are they a trap?

Acceptable as a tracked safety net for instability you genuinely don’t control, while you fix the cause. A trap when they turn red into green and you move on — that hides non-determinism, which is sometimes a real app bug (the $0.00 checkout). Retries should buy visibility, never permission to ignore.

Q5: How do you prove a test is actually deterministic?

Run it many times in a row (e.g. --repeat-each 50), ideally under load or in parallel. Green every time means deterministic; a single failure in 40 reveals a latent race — fix it now rather than letting CI surface it later.

11 Interview Prep

Real questions asked in NZ automation interviews. Read the model answers, then practise your own version.

“One of our tests fails maybe one run in ten, only in CI. How would you approach it?”

I’d treat it as a timing/race problem first, because CI runners are slower and more loaded than a laptop, so async work takes longer and the test asserts before the app is ready. I’d reproduce it locally with a repeat-each loop, ideally under throttling, until it fails on demand, then read the trace to see exactly what the test was waiting for. The fix is usually replacing a sleep or a one-shot read with a web-first assertion that waits for the real condition. And I’d check whether the app itself has a race — sometimes a flaky test is catching a genuine bug, so I wouldn’t just paper over it with retries.

“Why not just set retries: 2 and move on?”

Because retries hide information. A test that only passes on the second go is non-deterministic, and that non-determinism is sometimes a real product bug, not a test bug — I’ve seen a flaky checkout test that was actually catching a $0-payment race, and blanket retries swallowed it for months. I do use retries, but as a flakiness metric and a temporary quarantine net while I fix the cause — never as the fix itself. A green pipeline built on retries is a pipeline that’s lying about stability.

“What makes a test suite deterministic by design?”

Five things: assert on conditions not time (web-first assertions, zero sleeps); wait for elements to be actionable, not just present; every test owns and cleans up its own data so order and parallelism don’t matter; control non-deterministic inputs by freezing the clock and seeding randomness; and mock unstable third-party edges so the UI test exercises my code, not someone else’s uptime. A test built that way has nothing left to vary, so it doesn’t.