Network & Resilience · Lesson 2

Flaky API Resilience Testing

Your app is only as reliable as the services it depends on — and they will fail, slowly, partially, and at the worst moment. This lesson teaches you to test what your app does when a dependency lets it down.

Lesson 2 of 3 · ~30 min read · ~70 min with exercises

1 The Hook

A fictional NZ logistics firm, Cargo Compass, ran a despatch dashboard that booked freight by calling a third-party carrier’s rating API to get a price and a tracking number. In testing it was flawless: enter the consignment, the API returned a price in under a second, the booking went through. It shipped, and for months it was fine.

Then one Friday the carrier’s API started returning 503 Service Unavailable under load. The despatch dashboard had never been tested against a failing carrier. Instead of telling the despatch team “carrier pricing is down, try again shortly,” the dashboard caught the error and silently retried — immediately, over and over, hundreds of times a minute. The retries piled more load onto an already struggling carrier, which made the 503s worse, which triggered more retries. Meanwhile the despatch screen just showed a spinner. Operators, seeing nothing happen, clicked “Book” again, and again.

By mid-morning two things had gone wrong. The carrier’s API was effectively taken down by Cargo Compass’s own retry storm, and a backlog of duplicate bookings had been created for the consignments where some retries did eventually land. A single dependency wobbling had turned into a self-inflicted outage and a reconciliation mess — not because the carrier failed, but because the app had no idea how to behave when it did.

Here is the lesson hidden in that story. The team had tested that the dashboard worked when the carrier API worked. They never tested what it did when the carrier returned a 503, a slow timeout, or a garbled response. Flaky-API resilience testing is the practice of deliberately making the dependency misbehave and proving your app degrades gracefully instead of making everything worse.

2 The Rule

Every dependency will fail eventually, so a feature is not done until you have tested how it behaves when its API does. A resilient app fails in a small, contained, honest way — it stops hammering a struggling service, tells the user the truth, and never turns one dependency’s wobble into its own outage or a pile of duplicates.

Senior engineer insight

The most humbling lesson I learned working on microservices at an NZ fintech was that our circuit breaker configuration lived in code comments, not in tests. We thought we had resilience because we had the pattern, but nobody had ever verified that the breaker actually opened at the threshold, that it half-opened correctly, or that the timeout we set matched what the downstream service actually needed. When our payment gateway wobbled under EOFY load, the breaker did open, but the half-open probe interval was set to 500ms, far too aggressive for a service still recovering, so the breaker kept flapping between states instead of settling closed and letting traffic resume. That experience changed how I think about resilience: the pattern is not the test.

Most common mistake: teams write a circuit breaker, confirm it does not throw during normal traffic, and call it tested. The breaker's three state transitions — closed to open, open staying open through the cool-off, half-open probing and closing on recovery — each need an explicit test driven by a mock you control. If you cannot trigger the state transitions on demand, you do not know they work.

3 The Analogy

Ringing a busy Revenue NZ phone line on the first day of a tax change.

When the line is jammed, the sensible caller hears the engaged tone, accepts it, and tries again in a while. The person who instead redials every two seconds the instant they hear the tone achieves nothing for themselves and makes the congestion worse for everyone. Multiply that impatient caller by every other caller doing the same thing and the line never clears — the callers, not the call centre, are now the problem.

A flaky API is that busy phone line, and a naive retry loop is the impatient redialler. A circuit breaker is the caller who, after a few engaged tones, decides to stop trying for a bit and do something else — giving the line a chance to recover. Flaky-API testing is checking that your app behaves like the patient caller who backs off and tries later, not the one who jams the line and then complains it is jammed.

4 The Ways an API Misbehaves

“The API failed” is not one thing. APIs fail in distinct ways, and each needs its own test because each breaks the caller differently.

Clean error responses (5xx and 4xx)

The honest failures: a 503 Service Unavailable when the service is overloaded, a 500 Internal Server Error when it has a bug, a 429 Too Many Requests when you have hit a rate limit. These at least tell you something went wrong. The Cargo Compass 503 is here. The test is whether your app reads the status, distinguishes a retryable failure (503, 429) from a permanent one (a 400 you should never retry), and acts accordingly.

Slow responses and timeouts (the 408 class)

Worse than a clean error, because the app does not know if the request failed or is just slow. The service takes 30 seconds to answer, or never answers at all. Strictly, a 408 Request Timeout means the server gave up waiting for the request itself, while a 504 Gateway Timeout means a gateway gave up waiting on an upstream service; a connection that simply hangs returns no status at all. Either way the app is left guessing. This is where an over-tight timeout fails a slow-but-valid call, and where a missing timeout lets one slow dependency freeze the whole screen.

Malformed and unexpected responses

The sneakiest failure: a 200 OK with a body that is wrong — truncated JSON, a null where a number was promised, an HTML error page where JSON was expected, a field renamed without warning. The status says success, so a naive app parses it and crashes, or worse, carries the bad data forward as if it were real. Always test that your app validates the shape of a response, not just its status code.
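A shape check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the carrier's real contract: the field names `price_cents` and `tracking_number`, the `Quote` type, and the `MalformedResponse` exception are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


class MalformedResponse(Exception):
    """Raised when a body does not match the shape the caller expects."""


@dataclass(frozen=True)
class Quote:
    price_cents: int        # hypothetical field name
    tracking_number: str    # hypothetical field name


def parse_quote(status: int, body: str) -> Quote:
    """Validate the status AND the shape before trusting a response."""
    if status != 200:
        raise MalformedResponse(f"unexpected status {status}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        # Covers truncated JSON and an HTML error page served with a 200.
        raise MalformedResponse("body is not valid JSON") from exc
    if not isinstance(data, dict):
        raise MalformedResponse("body is not a JSON object")
    price = data.get("price_cents")
    tracking = data.get("tracking_number")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(price, int) or isinstance(price, bool) or price < 0:
        raise MalformedResponse("price_cents missing or not a non-negative integer")
    if not isinstance(tracking, str) or not tracking:
        raise MalformedResponse("tracking_number missing or empty")
    return Quote(price_cents=price, tracking_number=tracking)
```

A test would feed `parse_quote` each bad body in turn (truncated JSON, a null price, an HTML page, a renamed field) and assert that every one is rejected rather than rendered.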

Intermittent flakiness

The defining condition of this lesson: the API works, then fails, then works again, with no pattern. One call in three returns a 503; a response is fine until it suddenly is not. This is the hardest to test for because it is hard to reproduce by accident — which is exactly why you must engineer it deliberately, by configuring a stub or proxy to fail a set fraction of calls.

Pro tip: You cannot rely on a real dependency to misbehave on cue. The professional approach is a mock or proxy you control that can be told to return a 503, hold a response for 40 seconds, truncate the JSON, or fail one call in three. Resilience you cannot trigger on demand is resilience you have not tested — you have only hoped.
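As a rough sketch of such a controllable stub, the Python class below fails every third call. It is deliberately deterministic rather than random so that a failing test can be reproduced. `FlakyStub`, `get_price`, and the response fields are hypothetical names for this example only; a real suite would more likely configure a mock server or proxy to do the same job.

```python
from itertools import count


class FlakyStub:
    """Hypothetical stand-in for a dependency that fails every Nth call."""

    def __init__(self, fail_every: int = 3) -> None:
        self.fail_every = fail_every
        self._call_numbers = count(1)
        self.statuses = []  # record of every status returned, for assertions

    def get_price(self, consignment_id: str):
        """Return (status, body); every `fail_every`-th call is a 503."""
        n = next(self._call_numbers)
        if n % self.fail_every == 0:
            status, body = 503, {"error": "service unavailable"}
        else:
            status, body = 200, {"consignment": consignment_id, "price_cents": 1250}
        self.statuses.append(status)
        return status, body


stub = FlakyStub(fail_every=3)
observed = [stub.get_price("CONS-001")[0] for _ in range(6)]
print(observed)  # [200, 200, 503, 200, 200, 503]
```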

From the field

A Wellington team building a microservices-based government portal assumed that because each service had health checks and a retry wrapper, the system was resilient. What they had not tested was the combination: when the document-generation service started returning slow 200s — technically alive, but taking 28 seconds per response — the retry wrapper did not trigger (no error code), the health check passed (it got a 200), and every upstream request held a thread waiting. Within minutes the thread pool was exhausted and the entire portal returned 503 to users, not just document-generation. The service that was struggling was not the one that went down — the healthy ones did, starved of threads. After that, they added bulkhead isolation so the slow service could only consume a bounded connection pool, and they added response-time SLA assertions to their test suite alongside status-code assertions. The lesson that generalises: a timeout on a 200 is not paranoia, it is the test you did not think you needed until it was too late.

5 Circuit Breakers & Graceful Degradation

Resilient apps use a handful of patterns to contain a failing dependency. You cannot test them well without naming them.

A circuit breaker stops calling a dependency that is clearly failing. After a threshold of failures it “opens” — for a cool-off period, calls fail fast locally instead of being sent, giving the struggling service room to recover. Then it “half-opens,” letting one trial call through; if that succeeds it “closes” and normal traffic resumes, and if it fails it opens again. This is precisely what Cargo Compass lacked — a breaker would have tripped after the first burst of 503s and stopped the retry storm cold. The tester checks all three states: that it opens under sustained failure, fails fast while open, and recovers when the dependency does.
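The three states can be made concrete with a small Python sketch. It is a simplified, single-threaded illustration, not a production breaker: the class and parameter names are assumptions, it counts total failures since the last success, and it does not limit concurrent probes while half-open. The injectable `clock` is the important part for testing, because it lets a test jump past the cool-off without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cool_off_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cool_off_s = cool_off_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cool_off_s:
            return "half-open"  # cool-off elapsed: a trial call is allowed
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        current = self.state
        if current == "open":
            raise CircuitOpenError("failing fast; dependency not called")
        probing = current == "half-open"
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if probing or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open and restart the cool-off
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the breaker
        return result
```

In a test, pass `clock=lambda: now[0]` with `now = [0.0]`, drive failures until the state is "open", assert that a further call raises `CircuitOpenError` without touching the mock, then advance `now[0]` past the cool-off and check both outcomes of the probe.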

Graceful degradation means the app does something sensible when a dependency is down, rather than collapsing. Show cached pricing with a “last updated” note; let the user queue the booking for later; disable just the feature that needs the failed service while the rest of the screen keeps working. The opposite is a single failed call taking the whole page down with a blank error.

A fallback is a defined alternative when the primary path fails — a secondary provider, a cached value, a default. The test is that the fallback is actually correct and safe, not a stale or wrong value dressed up as a real one, and that the app makes clear it is running on the fallback.
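One way to keep a fallback honest is to return the source and age of the value alongside the value itself, so the screen can show a "last updated" note and a test can assert on it. The Python below is a sketch under assumptions: `PriceResult`, `price_with_fallback`, and the 15-minute `max_age_s` default are illustrative choices, not values from the Cargo Compass story.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class PriceResult:
    price_cents: int
    source: str     # "live" or "cache", so the UI can label it honestly
    age_s: float    # 0.0 for a live price


class PricingUnavailable(Exception):
    """No live price and no cached price fresh enough to show."""


def price_with_fallback(fetch_live: Callable[[], int],
                        cached: Optional[Tuple[int, float]],
                        now: float,
                        max_age_s: float = 900.0) -> PriceResult:
    """Try the live price; fall back to a labelled cached price if fresh enough.

    `cached` is (price_cents, fetched_at) or None; `now` uses the same clock.
    """
    try:
        return PriceResult(fetch_live(), source="live", age_s=0.0)
    except Exception:
        if cached is not None:
            price, fetched_at = cached
            age = now - fetched_at
            if age <= max_age_s:
                return PriceResult(price, source="cache", age_s=age)
        # Too stale or nothing cached: say so rather than show a wrong value.
        raise PricingUnavailable("carrier pricing unavailable, try again shortly")
```

The test cases that matter are the two fallback branches: a fresh cached value must come back marked as `"cache"` with its age, and a stale one must be refused rather than dressed up as real.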

A bulkhead isolates dependencies so one failing service cannot exhaust the resources the others need — the carrier API hanging must not consume every connection and starve the rest of the app. The test is that a slow or dead dependency stays contained to its own feature.
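A bulkhead can be approximated with a bounded semaphore per dependency. The Python sketch below rejects a call immediately when the dependency's slots are all in use, instead of queueing behind a slow service. The `Bulkhead` class and the choice to reject rather than wait are assumptions for illustration; real systems often bound a connection pool or thread pool instead.

```python
import threading
from typing import Callable


class BulkheadFull(Exception):
    """Raised when the dependency's slot budget is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], object]) -> object:
        # Non-blocking acquire: reject now rather than wait behind a slow service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls to this dependency")
        try:
            return fn()
        finally:
            self._slots.release()  # always free the slot, even if fn raised
```

A containment test holds `max_concurrent` calls open against a mock that blocks, asserts that the next call raises `BulkheadFull` at once, and then asserts that calls to other dependencies (each with their own bulkhead) still succeed.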

6 Idempotent Retries & Timeouts

Retrying a failed call is reasonable — but, as Cargo Compass found, the details decide whether retry is a fix or a weapon.

Retry only what is safe to retry. A read (GET) is naturally safe to repeat. A write (booking freight, taking a payment) is not — a retry of a write that actually succeeded creates a duplicate, which is the duplicate-bookings half of the Cargo Compass mess. Writes must carry an idempotency key the server recognises, so a retried booking is matched to the first and ignored rather than creating a second. Test this by retrying a write that already landed and proving exactly one record exists.
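The "retry a landed write, prove one record" test can be sketched against an in-memory fake. Everything here is hypothetical (`FakeBookingServer`, `create_booking`, the use of a UUID as the key); the point is the shape of the assertion, which counts stored records rather than trusting the response.

```python
import uuid


class FakeBookingServer:
    """Hypothetical in-memory server that de-duplicates on an idempotency key."""

    def __init__(self) -> None:
        self.bookings = {}       # idempotency key -> stored booking
        self.requests_seen = 0   # every request that reached the server

    def create_booking(self, idempotency_key: str, consignment_id: str) -> dict:
        self.requests_seen += 1
        if idempotency_key in self.bookings:
            # A retry of a request already processed: replay the original result.
            return self.bookings[idempotency_key]
        booking = {"booking_id": str(uuid.uuid4()), "consignment": consignment_id}
        self.bookings[idempotency_key] = booking
        return booking


def test_retried_write_creates_one_record() -> None:
    server = FakeBookingServer()
    key = str(uuid.uuid4())  # generated once per user action, reused on every retry
    first = server.create_booking(key, "CONS-001")  # lands; pretend the response was lost
    retry = server.create_booking(key, "CONS-001")  # client retries with the SAME key
    assert server.requests_seen == 2      # both requests reached the server
    assert len(server.bookings) == 1      # but exactly one booking exists
    assert retry == first                 # and the retry got the original booking back
```

The same test run without reusing the key (a fresh key per attempt) should produce two records, which is a useful negative check that the assertion can actually fail.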

Back off, do not hammer. Retries must use exponential backoff with jitter — wait 1s, 2s, 4s, with a small random offset — and a hard cap on attempts before giving up. Immediate, unlimited retries are the retry storm that took the carrier down. The tester confirms retries space out, are limited, and respect a 429’s Retry-After hint where one is given.
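The delay schedule described above is short enough to write out. In this Python sketch the function name, the defaults (1s base, 30s cap, up to 0.5s of jitter), and the choice to let a Retry-After value override the schedule are all assumptions. It also assumes the Retry-After header has already been parsed into seconds; the header can also carry an HTTP date, which is not handled here.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                  jitter_s: float = 0.5, retry_after_s: Optional[float] = None,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential part is base_s * 2**attempt, capped at cap_s, plus a
    random offset in [0, jitter_s] so that many clients do not retry in step.
    """
    if retry_after_s is not None:
        return retry_after_s  # the server's hint wins over our own schedule
    rng = rng if rng is not None else random.Random()
    exponential = min(cap_s, base_s * (2 ** attempt))
    return exponential + rng.uniform(0.0, jitter_s)


# With the defaults, attempts 0, 1, 2 wait roughly 1s, 2s, 4s plus jitter.
schedule = [backoff_delay(a, rng=random.Random(42)) for a in range(3)]
```

Passing a seeded `random.Random` makes the jitter repeatable in tests, so an assertion can check each delay falls inside its expected window instead of matching an exact value.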

Set a timeout on every external call, and tune it. A call with no timeout can hang forever and freeze the feature behind it. A timeout too tight fails a slow-but-valid response and triggers a needless retry. Test both: that a hung dependency is cut off and surfaced as an error, and that a slow-but-good response inside the tuned window is allowed to complete.
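Both halves of that test fit in one small Python sketch using `asyncio.wait_for` from the standard library. The stub `respond_after` and the wrapper `call_with_timeout` are hypothetical names, and the tiny durations exist only to keep the example fast; a real timeout would be tuned against what the dependency needs under load.

```python
import asyncio


class DependencyTimeout(Exception):
    """The dependency did not answer inside the tuned window."""


async def call_with_timeout(make_call, timeout_s: float):
    """Await make_call(), but give up and cancel it after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise DependencyTimeout(f"no response within {timeout_s}s") from exc


async def respond_after(delay_s: float) -> str:
    """Hypothetical stub: a dependency that takes delay_s seconds to answer."""
    await asyncio.sleep(delay_s)
    return "priced"


async def demo():
    # Slow but inside the window: must be allowed to complete.
    slow_but_valid = await call_with_timeout(lambda: respond_after(0.01), timeout_s=0.2)
    # Hung past the window: must be cut off and surfaced as an error.
    try:
        await call_with_timeout(lambda: respond_after(5.0), timeout_s=0.05)
        timed_out = False
    except DependencyTimeout:
        timed_out = True
    return slow_but_valid, timed_out


result = asyncio.run(demo())
```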

Distinguish retryable from terminal. A 503 or 429 is worth retrying; a 400 Bad Request or 401 Unauthorised is not — retrying it just wastes effort and may lock an account. The test is that the app retries the transient failures and gives up immediately on the permanent ones with a clear message.
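The retryable-versus-terminal decision is easiest to test when it lives in one small function. The Python below is a sketch: the `Action` enum and `classify` are hypothetical names, and treating every status other than 429 and 503 as terminal is an assumption made to keep the example strict. A real client would decide case by case how to treat codes such as 500 or 504.

```python
from enum import Enum


class Action(Enum):
    SUCCEED = "succeed"
    RETRY = "retry"
    GIVE_UP = "give_up"


# The transient failures named in this lesson.
RETRYABLE_STATUSES = frozenset({429, 503})


def classify(status: int) -> Action:
    """Map an HTTP status to what the caller should do next."""
    if 200 <= status < 300:
        return Action.SUCCEED
    if status in RETRYABLE_STATUSES:
        return Action.RETRY
    return Action.GIVE_UP  # e.g. 400 or 401: retrying cannot help


for status in (200, 400, 401, 429, 503):
    print(status, classify(status).value)
```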

Pro tip: The two highest-value flaky-API tests are “fail one write call, retry, prove one record” (idempotency) and “return 503 to a burst of calls, prove the breaker opens and stops sending” (circuit breaker). Those two together would have prevented both halves of the Cargo Compass incident — the duplicates and the self-inflicted outage.

7 What to Test Against a Flaky API

The practical checklist for any flow that calls an external service:

Each failure mode explicitly: 503, 500, 429, a slow response, a timeout, malformed JSON, and a wrong-shape 200 — each driven deliberately with a mock, each with a defined expected behaviour.
Retryable vs terminal: transient failures (503, 429) are retried; permanent ones (400, 401) are not.
Backoff discipline: retries use exponential backoff with jitter, are capped, respect Retry-After, and never become a storm.
Idempotency on writes: a retried write that already succeeded produces exactly one record, never a duplicate.
Circuit breaker states: the breaker opens under sustained failure, fails fast while open, half-opens to probe, and closes on recovery.
Timeouts on every external call: a hung dependency is cut off and surfaced; a slow-but-valid response inside the window still completes.
Response validation: the app checks the shape of a response, not just its status — malformed or unexpected bodies are rejected, not carried forward as real data.
Error-state UX: the user sees an honest, specific, recoverable message — not an endless spinner, a blank screen, or a false success that invites a re-click.
Graceful degradation: one failed dependency does not take the whole screen down; unaffected features keep working.

8 Building Flaky-API Test Cases

A strong flaky-API test case names the exact failure being injected, drives the flow through it with a controlled mock, and asserts on both the app’s behaviour and what was actually stored. Here is a worked case written to catch the Cargo Compass bug:

Test ID:              API-FLK-021
Injected failure:     Carrier rating API returns 503 to a sustained burst of booking calls
Risk category:        Retry storm + duplicate bookings on dependency failure
Pre-conditions:       Carrier API mocked to return 503; circuit breaker configured;
                      one consignment ready to book; idempotency keys enabled.
Action:               1) Attempt the booking while the API returns 503 repeatedly.
                      2) Then switch the mock to succeed and let the app recover.
Expected result:      1) Retries use backoff with jitter and are capped, NOT an immediate flood.
                      2) After the failure threshold the circuit breaker opens and calls fail fast.
                      3) The user sees an honest “carrier pricing unavailable, try again shortly” message.
                      4) On recovery the breaker half-opens, probes, and closes; booking succeeds.
Server assertion:     Exactly ONE booking exists for this consignment; no duplicates from retries.
Call-count assertion: Outbound calls during the outage are bounded by the breaker, not unbounded.
Evidence required:    Outbound call log with timestamps (showing backoff + breaker open);
                      error message shown; server query showing a single booking.
Traceability:         Risk R-04 (dependency failure causes retry storm and duplicate bookings).
Result:               [Pass / Fail]

Notice what makes this catch the Hook bug: the failure is named and injected with a mock, not waited for; the expected result asserts backoff and a circuit breaker opening rather than an immediate flood; there is a call-count assertion to prove the retries were bounded; and the final check is on the server — exactly one booking — catching the duplicate half of the incident. The error-state UX is asserted explicitly, not left as a spinner.
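The call-count assertion is the part most often left out, so here is a rough Python sketch of automating it. It covers only the bounded-retry half of the case, not the breaker or the server check. `MockCarrier` and `book_with_retries` are hypothetical, jitter is omitted so the expected waits are exact, and `sleep` is injected so the test records the delays instead of actually waiting.

```python
from typing import Callable


class MockCarrier:
    """Hypothetical mock: returns 503 for the first `failures` calls, then 200."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0  # outbound call count, the thing the test asserts on

    def rate(self) -> int:
        self.calls += 1
        return 503 if self.calls <= self.failures else 200


def book_with_retries(carrier: MockCarrier, sleep: Callable[[float], None],
                      max_attempts: int = 4, base_s: float = 1.0) -> bool:
    """Try up to max_attempts times, waiting base_s * 2**n between attempts."""
    for attempt in range(max_attempts):
        if carrier.rate() == 200:
            return True
        if attempt < max_attempts - 1:
            sleep(base_s * 2 ** attempt)  # jitter omitted so the assertion is exact
    return False


# Sustained outage: the app gives up after the cap instead of flooding.
waits = []
carrier = MockCarrier(failures=100)
gave_up = not book_with_retries(carrier, waits.append)
print(gave_up, carrier.calls, waits)  # True 4 [1.0, 2.0, 4.0]
```

Recording the waits is what turns "retries use backoff" from a claim into evidence: the list is the outbound call schedule the test case asks for.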

9 Common Mistakes

🚫 Only testing the API when it works

Why it happens: The real dependency behaves in test, so its failures never appear and feel hypothetical.
The fix: A dependency that works in test will fail in production — that is the Cargo Compass trap. Inject each failure mode deliberately with a mock you control: 503, timeout, malformed JSON. Resilience you cannot trigger on demand is untested.

🚫 Retrying writes without idempotency or backoff

Why it happens: Wrapping a failed call in a retry loop is easy and looks robust.
The fix: Immediate unlimited retries are a storm that can take a struggling dependency down, and retrying a write that already landed creates duplicates — both halves of the Cargo Compass incident. Use backoff with jitter, a cap, and an idempotency key on writes. Test that one write produces one record.

🚫 Trusting a 200 status without checking the body

Why it happens: A 200 reads as “success,” so the body is assumed good.
The fix: A 200 can carry truncated JSON, a null where a number was promised, or an HTML error page. Validate the shape of every response, not just the status, and reject malformed bodies instead of carrying bad data forward as if it were real.

🚫 Showing an endless spinner instead of an honest error

Why it happens: The happy path never fails, so the failure UI was never designed.
The fix: A spinner that never resolves invites the user to re-click — the despatch operators clicking “Book” again. Test that every failure mode surfaces a specific, honest, recoverable message with a clear next step, and that re-clicking does not duplicate the action.

Why teams fail here

They test resilience using a real dependency that happens to be stable in the test environment — then discover in production that resilience only appeared to exist because the dependency never actually misbehaved during the test run.
They set timeouts but never test the timeout path: the value is a guess from a comment in the code, the timeout triggers nothing useful (no user message, no fallback), and nobody has verified what the downstream service actually needs under load.
They treat idempotency as a backend concern and never assert it in a test — so a retry on a write that already succeeded silently creates a duplicate order, booking, or charge that only appears in a data reconciliation audit weeks later.
They stop at “the app did not crash” rather than asserting that the failure stayed contained: unaffected features kept working, the user got an honest and actionable message, and no data was duplicated or silently dropped.

10 Now You Try

Three graded exercises across the ways an API fails. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Missing Error Handling

Read the description of a fictional insurance quote screen that calls a third-party risk-pricing API below. Identify 3 resilience gaps — ways it mishandles a flaky API — and name the failure mode and the pattern that would fix each (retry/backoff, idempotency, circuit breaker, timeout, response validation, graceful degradation, error-state UX).

Screen: get-a-quote, calling a risk-pricing API
On “Get quote” the screen calls the pricing API and waits for the answer with no timeout, showing a spinner until it returns. It assumes the response is valid JSON with a “premium” field and renders it straight to the screen. If the call returns any error, the screen retries immediately and keeps retrying until it succeeds. There is no circuit breaker, so every quote attempt keeps calling the API even while it is clearly down.

List 3 resilience gaps, the failure mode, and the fixing pattern for each:

Model answer

There are at least four real gaps here; any three well-explained earns full marks.

1. No timeout — The call waits forever, so a hung pricing API freezes the quote screen behind an endless spinner. Failure mode: slow response / timeout (408 class). Fixing pattern: a tuned timeout on the call, surfacing an honest error if it is exceeded.

2. No response validation — It assumes valid JSON with a "premium" field and renders it straight out, so a malformed body or a 200 with the wrong shape crashes the screen or shows garbage. Failure mode: malformed / unexpected response. Fixing pattern: validate the response shape before using it; reject and handle a bad body.

3. Immediate unlimited retry with no circuit breaker — Retrying instantly and forever hammers a struggling API and never lets it recover. Failure mode: clean error (503/500) handled badly. Fixing pattern: exponential backoff with jitter and a cap, plus a circuit breaker that opens under sustained failure and fails fast.

Bonus — error-state UX: an endless spinner with no honest message invites the user to re-click. Fixing pattern: a specific, recoverable error message.

The trap: every one of these passes a test where the pricing API behaves, because the only thing tested was the happy path.

🔧 Exercise 2 of 3 — Fix the Test Case

The test case below only checks the happy path against a working API. Rewrite it to inject a failure and test resilience, with these fields: Test ID, Injected failure, Risk category, Pre-conditions, Action, Expected result, Server assertion, Evidence required, Traceability. Use a fictional myIR-style tax portal calling a bank-account-validation API as the context.

Original (too shallow):
“Enter a bank account and submit. The validation API returns valid. Check the form accepts it. Pass if it submits.”

Rewrite as a flaky-API resilience test case:

Model answer

Test ID: API-FLK-018

Injected failure: Bank-account-validation API returns a slow response that exceeds the timeout, then a 503 on retry

Risk category: Frozen form / false acceptance / retry storm when validation is flaky

Pre-conditions: Validation API mocked to first hang past the timeout, then return 503; circuit breaker and a tuned timeout configured; one bank account ready to validate.

Action: 1) Submit the account while the API hangs past the timeout. 2) Let the app retry into the 503. 3) Switch the mock to succeed and let it recover.

Expected result: 1) The hung call is cut off at the timeout and surfaced as an honest "could not validate right now" message, not an endless spinner. 2) The form does NOT accept the account as validated while the check has not actually succeeded. 3) Retries back off and are capped; the breaker opens under sustained 503s and fails fast. 4) On recovery the validation completes and the form proceeds.

Server assertion: No account is recorded as "validated" unless the API actually returned a valid result; exactly one validation record on eventual success, no duplicates from retries.

Evidence required: Outbound call log showing the timeout, backoff, and breaker; the error message shown; form state during the outage; server record on recovery.

Traceability: Risk register R-06 (validation dependency flakiness causes a frozen form or a falsely accepted account).

What makes it strong: it injects specific failures (timeout then 503) with a mock, asserts the form does not falsely accept while the check failed, checks backoff and the breaker, and ends on a server assertion. The original tested only the happy path.

🏗️ Exercise 3 of 3 — Design the Resilience Test Cases for This Flaky API

Design a resilience test plan of 5 test cases for a fictional retail checkout that calls a payment-gateway API which is known to be intermittently flaky. Each case needs at least: an ID, the failure injected, an acceptance criterion, and the evidence required. Cover a 503 retry storm, a slow/timeout, a malformed-JSON response, idempotency on a retried charge, and circuit-breaker recovery.

Model answer

RES-01 | Injected failure: gateway returns 503 to a burst of charge calls | Acceptance criteria: retries use backoff with jitter and are capped; the circuit breaker opens under sustained 503s and calls fail fast; outbound calls are bounded, not a flood | Evidence required: outbound call log with timestamps; breaker-state log

RES-02 | Injected failure: gateway hangs past the timeout | Acceptance criteria: the call is cut off at the tuned timeout and surfaced as an honest, recoverable error; the charge is not treated as successful | Evidence required: timeout in the call log; error message shown; no success recorded

RES-03 | Injected failure: gateway returns 200 with malformed/truncated JSON | Acceptance criteria: the app validates the response shape, rejects the malformed body, and does not record a charge or render garbage | Evidence required: the malformed response; app behaviour; no charge recorded

RES-04 | Injected failure: a charge that actually succeeded is retried after a dropped response | Acceptance criteria: the idempotency key is honoured so exactly one charge results, never two | Evidence required: idempotency key; two send attempts in the log; single charge on the server

RES-05 | Injected failure: gateway fails, then recovers | Acceptance criteria: the breaker opens during failure, half-opens to probe on recovery, and closes; normal checkout resumes; no charge lost or duplicated across the transition | Evidence required: breaker-state transitions; server charge count before/after; recovery log

Strong plans: each case injects a specific failure, has a measurable criterion, names concrete evidence, and together they cover a retry storm (RES-01), timeout (RES-02), malformed response (RES-03), idempotency (RES-04), and breaker recovery (RES-05). Weak plans say "test the API fails gracefully" five times — that is the difference being marked.

11 Self-Check

Q1: Why can a naive retry loop turn one dependency’s wobble into your own outage?

Because immediate, unlimited retries pile more load onto an already struggling service, making its failures worse, which triggers more retries — a retry storm. That is the Cargo Compass incident: the app, not the carrier, took the carrier down. The fix is backoff with jitter, a retry cap, and a circuit breaker that stops sending once the dependency is clearly failing.

Q2: Name the four ways an API misbehaves, and which is the sneakiest.

Clean error responses (5xx/4xx), slow responses and timeouts (the 408 class), malformed or unexpected responses, and intermittent flakiness. The malformed response is the sneakiest: a 200 OK with a wrong body — truncated JSON, a null, an HTML error page — so a status-only check passes and the app carries bad data forward as if it were real. Validate the shape, not just the status.

Q3: What are the three states of a circuit breaker, and what do you test in each?

Closed — normal traffic flows; test that it opens after a failure threshold. Open — calls fail fast locally without hitting the dependency; test that no calls are sent during the cool-off. Half-open — one trial call is let through; test that a success closes the breaker and resumes traffic, while a failure re-opens it.

Q4: Which failures are worth retrying and which are not?

Transient failures — 503 Service Unavailable, 429 Too Many Requests, timeouts — are worth retrying with backoff. Permanent failures — 400 Bad Request, 401 Unauthorised — are not; retrying them wastes effort and can lock an account. Test that the app retries the transient ones and gives up immediately on the terminal ones with a clear message.

Q5: Why must writes carry an idempotency key when retried?

Because a retry of a write that actually succeeded but whose response was lost will otherwise create a duplicate — the duplicate bookings at Cargo Compass. An idempotency key lets the server recognise the retry as the same request and ignore it, so exactly one record results. Test it by retrying a landed write and proving one record exists.

12 Interview Prep

Real questions asked in NZ QA interviews for backend and integration roles. Read the model answers, then practise your own version.

“How would you test that our app handles a third-party API going down?”

I’d never wait for the real dependency to fail — I’d put a mock or proxy in front of it that I can tell to misbehave on cue. Then I test each failure mode separately: a 503, a 500, a 429, a slow response past the timeout, a hung connection, malformed JSON, and a 200 with the wrong shape. For each I assert a defined behaviour — backoff and a capped retry on the transient ones, an immediate honest error on the permanent ones, response validation rejecting bad bodies, and a circuit breaker that opens under sustained failure and recovers. And on any write I prove idempotency, so a retry produces one record, not a duplicate. The point is that resilience I cannot trigger on demand is resilience I have not tested.

“A dependency had a brief outage and we ended up with duplicate records and a flood of traffic. What went wrong?”

That is the classic retry-storm-plus-duplicates pattern. The app most likely retried immediately and without limit, so when the dependency wobbled the retries hammered it and made the outage worse — and any write whose response was lost got retried and created a duplicate. The two missing controls are a circuit breaker, which would have tripped and stopped the flood, and idempotency keys on writes, which would have collapsed the retries into one record. I’d reproduce both by mocking a 503 burst and asserting the breaker opens and the call count is bounded, then retrying a landed write and asserting one record.

“What is the difference between graceful degradation and just catching the error?”

Catching the error stops a crash but does nothing useful — often it just shows a blank screen or a spinner. Graceful degradation means the app does something sensible when a dependency is down: shows cached data with a “last updated” note, lets the user queue the action for later, or disables only the feature that needs the failed service while the rest of the screen keeps working. So in testing I do not just check the app does not crash — I check that the unaffected parts still function and the user gets an honest, recoverable path forward, not a dead end.

Key takeaway

Resilience you cannot trigger on demand is not resilience — it is hope, and hope is not a test strategy.
