Senior · Reliability Technique

Error Recovery Testing

The error message looks fine. But did the transaction roll back? Is the database consistent? Can the user retry without losing data or being charged twice?

Senior ISTQB CTAL-TA — K4 Analyse ~12 min read + exercise

1 The Hook — Why This Matters

In 2020, an NZ insurance company released a mobile claim app that let customers upload photos of storm damage. During a major Auckland weather event, thousands of users tried the feature simultaneously. It failed catastrophically.

WiFi disconnect mid-upload: Photos were lost. The user had to re-select and re-upload every image from the beginning.
Payment gateway 503: A customer paid their excess via credit card. The gateway timed out. The money left their account, but the app recorded no payment.
Refresh on form: A user refreshed the browser halfway through a multi-step claim. Every photo and description vanished.
Server restart: During peak load, a container restarted. One claim was created twice, with duplicate payments.

The company had tested error messages. "Upload failed" appeared correctly. But they had never tested what happened to the data when the error occurred. Error recovery testing is not about the message. It is about the state.

2 The Rule — The One-Sentence Version

An error is not handled until the system returns to a consistent state, the user knows what happened, and retrying does not make things worse.

A junior tester sees an error message and marks the test passed. A senior tester asks: is the database transaction rolled back? Are there orphaned records? Did partial data persist? Can the user safely retry? Is the recovery time within the business requirement? Error recovery is about system integrity, not user messaging.

Senior engineer insight

The thing that changed how I think about error recovery is realising that your error handling code runs on the worst possible day (peak traffic, degraded network, flaky third-party gateway) and has never been tested under those conditions. Most teams test happy paths under load and failure paths in isolation, so they have never seen what their circuit breakers actually do when five microservices degrade simultaneously. On CoverNZ-scale systems processing thousands of concurrent claims, I have seen recovery logic that worked perfectly in isolation produce cascading duplicate records the moment it met real retry storms.

The most common Senior mistake: defining RTO and RPO in the test plan but never actually measuring them during testing — they stay as aspirational targets on a spreadsheet, not verified SLAs.

3 The Analogy — Think Of It Like...

Analogy

A restaurant kitchen when the power cuts out.

A good kitchen does not just say "Sorry, power cut" to the diners. The freezers stay sealed to preserve food. Half-cooked meals are discarded, not served. The till remembers which tables paid. When power returns, the kitchen does not re-cook meals that were already served. Error recovery is every one of those backstage decisions, not just the apology to the customer.

From the field

A large NZ government agency was migrating a legacy debt repayment system to a new microservices platform. The team had documented an RTO of four hours for the payment processing service and was confident the rollback procedure was solid. What they had never tested was what happened when a payment confirmation webhook arrived after a rollback had already marked the transaction as failed. When the first production incident hit, the system refunded a customer who had actually paid: the webhook replayed into a state the recovery logic was not designed to handle. The fix required a manual reconciliation run across three months of transactions. The lesson that generalises everywhere: your recovery procedure and your retry/webhook pipeline are two separate failure modes, and you must test them colliding, not in sequence.

4 Watch Me Do It — Step by Step

Here is the systematic approach for testing error recovery. I will use the NZ insurance claim app as the running example.

Identify critical transactions and define recovery objectives. Map every transaction where data or money changes state: claim submission, photo upload, payment, policy update. For each, document the RTO (Recovery Time Objective: how fast must it recover?) and RPO (Recovery Point Objective: how much data can we afford to lose?). Example: claim upload RTO = 30 seconds, RPO = zero photos lost.
Design failure scenarios across all layers. Categorise failures by source: network (drop, throttle, high latency), server (restart, out-of-memory, 503 response), client (browser crash, refresh, back button), and dependency (payment gateway down, CDN timeout, database deadlock). Each critical transaction needs at least one failure scenario per layer.
Execute the failure and observe system behaviour. For each scenario, verify four outcomes: (a) the user sees a clear, actionable error message; (b) partial data is preserved where possible; (c) retry does not create duplicates or inconsistent state; (d) no orphaned records remain in the database. Example: "Upload failed. 3 of 5 photos saved. Tap to retry."
Verify database consistency after recovery. Query the database directly after each failure. Check for orphan records, partial foreign keys, negative balances, or duplicate transactions. A user-facing success message means nothing if the backend ledger is wrong.
Measure and document recovery time. Time how long from failure detection to full service restoration. Compare against the RTO. If the payment gateway returns a 503, does the system retry with exponential backoff? Does circuit breaker logic kick in after repeated failures? Document the actual MTTR (Mean Time To Recover) for each scenario.
Test idempotency of retry operations. Submit the same claim twice with identical parameters. The system must create exactly one claim and charge exactly one payment. Use the same idempotency key if the API supports it. Without idempotency, network retries become duplicate orders.
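The database check in the fourth step can be sketched roughly as follows. This is a minimal illustration against an in-memory SQLite database; the table names (claims, photos, payments), the columns, and the helper functions are all hypothetical and not taken from any real claims system. It models the "server restart mid-submit" case, where the whole transaction is expected to roll back.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway schema (hypothetical tables, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE claims (id INTEGER PRIMARY KEY, reference TEXT);
        CREATE TABLE photos (id INTEGER PRIMARY KEY, claim_id INTEGER);
        CREATE TABLE payments (id INTEGER PRIMARY KEY, claim_id INTEGER, cents INTEGER);
        """
    )
    return conn


def submit_claim(conn, reference, photo_count, fail_after_photos=None):
    """Insert a claim and its photos in one transaction; optionally fail midway."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute("INSERT INTO claims (reference) VALUES (?)", (reference,))
            claim_id = cur.lastrowid
            for i in range(photo_count):
                if fail_after_photos is not None and i == fail_after_photos:
                    raise ConnectionError("simulated server restart")
                conn.execute("INSERT INTO photos (claim_id) VALUES (?)", (claim_id,))
    except ConnectionError:
        return None
    return claim_id


def find_consistency_problems(conn):
    """Return problem name -> offending rows; empty lists mean the data is clean."""
    orphan_photos = conn.execute(
        "SELECT p.id FROM photos p LEFT JOIN claims c ON c.id = p.claim_id "
        "WHERE c.id IS NULL"
    ).fetchall()
    duplicate_payments = conn.execute(
        "SELECT claim_id, COUNT(*) FROM payments "
        "GROUP BY claim_id HAVING COUNT(*) > 1"
    ).fetchall()
    return {"orphan_photos": orphan_photos, "duplicate_payments": duplicate_payments}


conn = build_demo_db()
ok_id = submit_claim(conn, "CLM-1", photo_count=3)
failed_id = submit_claim(conn, "CLM-2", photo_count=5, fail_after_photos=2)

claim_count = conn.execute("SELECT COUNT(*) FROM claims").fetchone()[0]
photo_total = conn.execute("SELECT COUNT(*) FROM photos").fetchone()[0]
clean_report = find_consistency_problems(conn)

# Plant an orphan on purpose to confirm the check can actually detect one.
conn.execute("INSERT INTO photos (claim_id) VALUES (999)")
dirty_report = find_consistency_problems(conn)
```

Planting a known-bad row at the end is a deliberate design choice: a consistency check that has never been seen to fail gives little assurance when it passes.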

Insurance claim — failure scenario matrix

Failure	Layer	Expected recovery
WiFi drops mid-upload	Network	Partial photos saved; user sees count; retry resumes
Payment gateway 503	Dependency	No charge taken without a matching payment record; user can retry; no duplicate
Browser refresh on form	Client	Draft auto-saved; form repopulated on return
Server restart mid-submit	Server	Transaction rolled back; no orphan claim
Duplicate submit click	Client	Idempotent: one claim, one charge
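One way to keep a matrix like this honest is to check it against the "at least one failure scenario per layer" rule from the step list. The sketch below is only an illustration: the transaction assigned to each row is an assumption added here (the table itself does not name transactions), and `missing_layers` is a hypothetical helper.

```python
from collections import defaultdict

LAYERS = {"network", "server", "client", "dependency"}

# (transaction, failure, layer). The transaction column is an assumed mapping.
SCENARIOS = [
    ("photo upload", "WiFi drops mid-upload", "network"),
    ("payment", "Payment gateway 503", "dependency"),
    ("claim submission", "Browser refresh on form", "client"),
    ("claim submission", "Server restart mid-submit", "server"),
    ("claim submission", "Duplicate submit click", "client"),
]


def missing_layers(scenarios, layers=LAYERS):
    """Return transaction -> sorted list of layers with no failure scenario yet."""
    covered = defaultdict(set)
    for transaction, _failure, layer in scenarios:
        covered[transaction].add(layer)
    return {
        transaction: sorted(layers - seen)
        for transaction, seen in covered.items()
        if layers - seen
    }


gaps = missing_layers(SCENARIOS)
for transaction, layers in gaps.items():
    print(f"{transaction}: no scenario yet for {', '.join(layers)}")
```

Under this assumed mapping, the five rows are a starting point rather than full coverage: every transaction still has at least two layers without a scenario.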

Pro tip: The 3-2-1 backup rule applies to test data too: three copies, two media types, one offsite. When you are deliberately crashing systems to test recovery, you need a way back. Snapshot the database before destructive recovery tests.

5 When to Use It / When NOT to Use It

✅ Use this when...

The system handles financial transactions or sensitive data
Network reliability is variable (mobile, rural, offshore)
Microservices or third-party integrations are involved
SLAs or compliance frameworks define RTO and RPO targets
You are preparing for disaster recovery drills or audits

❌ Don't rely on this alone when...

The application is read-only with no state changes
You have not tested the happy path thoroughly first
You lack database access to verify backend consistency
You are testing in production without safeguards
The architecture has no retry, circuit breaker, or rollback mechanism

6 Common Mistakes — Don't Do This

🚫 Testing only the error message, not the state

I used to think: If the user sees "Upload failed," the error handling works.
Actually: The message is the tip of the iceberg. Did the database roll back? Are there half-written records? Can the user retry without starting over? A beautiful error message over a corrupted database is a failed test.

🚫 Not testing the retry path

I used to think: If the first attempt fails and the second succeeds, recovery is fine.
Actually: The retry might create a duplicate charge, a duplicate record, or append to a partially failed state. You must verify that retrying is idempotent — the same action produces the same result, not additional side effects.
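A minimal sketch of what an idempotent retry path looks like, and how a test might assert on it. `ClaimService` is a hypothetical in-memory stand-in, not a real API; a production system would persist the key alongside the record and usually expire keys after some window.

```python
import uuid


class ClaimService:
    """Hypothetical service that deduplicates submissions by idempotency key."""

    def __init__(self):
        self.claims = {}    # claim_id -> claim details
        self.charges = []   # (claim_id, excess_cents)
        self._by_key = {}   # idempotency_key -> claim_id

    def submit(self, idempotency_key, policy_id, excess_cents):
        # A replayed key returns the original result without new side effects.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        claim_id = f"CLM-{len(self.claims) + 1}"
        self.claims[claim_id] = {"policy_id": policy_id}
        self.charges.append((claim_id, excess_cents))
        self._by_key[idempotency_key] = claim_id
        return claim_id


service = ClaimService()
key = str(uuid.uuid4())

first = service.submit(key, policy_id="POL-42", excess_cents=50000)
retry = service.submit(key, policy_id="POL-42", excess_cents=50000)  # network retry

# A genuinely new request (new key) must still create a new claim.
other = service.submit(str(uuid.uuid4()), policy_id="POL-43", excess_cents=50000)
```

The test asserts on side effects (number of claims and charges), not only on the returned value, because a duplicate charge can hide behind an identical response.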

🚫 Testing in isolation, not under load

I used to think: A single failure scenario is enough to validate recovery logic.
Actually: Recovery behaves differently under load. A database deadlock during a single-user test may resolve cleanly. Under load, the same deadlock can cascade, causing connection pool exhaustion. Use chaos engineering tools or load testing to inject failures under realistic traffic.
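A small sketch of firing the same request from many threads at once, which is closer to a retry storm than two sequential calls. `LockedClaimStore` and `fire_concurrently` are hypothetical names, and an in-process lock is only a stand-in for whatever the real system uses (for example a unique constraint in the database).

```python
import threading


class LockedClaimStore:
    """Hypothetical store whose check-then-create step is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_key = {}
        self.created = 0

    def submit(self, key):
        with self._lock:  # without this, two threads could both pass the check
            if key not in self._by_key:
                self.created += 1
                self._by_key[key] = f"CLM-{self.created}"
            return self._by_key[key]


def fire_concurrently(submit, key, workers=20):
    """Release `workers` threads at the same moment, all sending the same key."""
    barrier = threading.Barrier(workers)
    results = [None] * workers

    def attempt(index):
        barrier.wait()  # hold every thread until all are ready
        results[index] = submit(key)

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


store = LockedClaimStore()
results = fire_concurrently(store.submit, key="same-claim-key", workers=20)
```

The barrier is the important part of the harness: it lines the requests up so they hit the check-then-create step together, which is where a race would show.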

7 Now You Try — Design a Recovery Test

🎯 Interactive Exercise

Scenario: An NZ travel booking site charges a $50 deposit when a user clicks "Reserve." The flow is: (1) validate availability, (2) charge deposit via Stripe, (3) create reservation record, (4) send confirmation email. You are told the error handling is "robust."

Question: Design three failure scenarios and state what you would verify in each. Write them down before revealing the answer.

Three critical failure scenarios:

Stripe timeout after charge succeeds. The API call to Stripe returns a timeout, but Stripe processed the charge. Verify: the booking system must reconcile with Stripe (webhook or idempotency key) and not charge again on retry. The user must see confirmation, not "payment failed."
Server crash between charge and reservation creation. The deposit is charged but the server restarts before writing the reservation record. Verify: on recovery, the system must either refund the charge or create the reservation. There must be no orphaned payment without a booking.
User double-clicks the Reserve button. Two requests fire milliseconds apart. Verify: idempotency prevents two charges and two reservations. Only one confirmation email is sent. The UI disables the button after the first click.
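The second scenario (charge taken, reservation never written) is usually verified with a reconciliation check. Here is a rough sketch under stated assumptions: `Charge`, `find_orphaned_charges`, and `recover` are hypothetical, the data is held in plain lists, and creating the missing reservation is just one possible policy (refunding would be the other).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Charge:
    charge_id: str
    booking_ref: str
    cents: int


def find_orphaned_charges(charges, reservations):
    """Return charges that have no reservation with the same booking reference."""
    reserved_refs = {r["booking_ref"] for r in reservations}
    return [c for c in charges if c.booking_ref not in reserved_refs]


def recover(charges, reservations):
    """Create the missing reservation for each orphaned charge; return what was added."""
    created = []
    for charge in find_orphaned_charges(charges, reservations):
        reservation = {"booking_ref": charge.booking_ref, "deposit_cents": charge.cents}
        reservations.append(reservation)
        created.append(reservation)
    return created


# State after a simulated crash: two $50 deposits charged, one reservation written.
charges = [Charge("ch_1", "BK-100", 5000), Charge("ch_2", "BK-101", 5000)]
reservations = [{"booking_ref": "BK-100", "deposit_cents": 5000}]

orphans_before = find_orphaned_charges(charges, reservations)
first_run = recover(charges, reservations)
second_run = recover(charges, reservations)  # recovery itself must be safe to repeat
orphans_after = find_orphaned_charges(charges, reservations)
```

Running the recovery twice is intentional: a recovery job that is not idempotent can turn one orphaned payment into two reservations.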

Tip: In distributed systems, the hardest bugs happen at the boundaries between services. Test the handoffs, not just the individual components.

Why teams fail here

Recovery tests are treated as optional scope — they get cut when the sprint is under pressure, even though they are the scenarios most likely to cost real money in production.
Teams test failure injection in dev environments where latency is near zero and there is one user — the race conditions that cause orphaned records only emerge at realistic concurrency levels.
Idempotency is assumed rather than verified: the API documentation says "supports idempotency keys" but nobody has tested what happens when the key expires mid-retry storm or when two callers use the same key format independently.
Database consistency checks are skipped because testers lack direct DB access — so a "passed" recovery test is actually just a user-facing message check with no validation of backend state, leaving corrupted ledger records undetected until a finance audit.

8 Self-Check — Can You Actually Do This?

Try to answer each question before reading the answer. If you get all three, you're ready to practice.

Q1. What does idempotent mean in testing?

Idempotent means that repeating the same action produces the same result without additional side effects. If a user clicks "Submit" twice due to a network retry, an idempotent system creates exactly one record and charges exactly once. Without idempotency, retries become duplicates.

Q2. What is the difference between RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable downtime after a failure — how fast must the system be back? RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time — how much data can we afford to lose? Example: RTO of 1 hour means the system must restore within 60 minutes. RPO of 5 minutes means you can lose at most 5 minutes of data.
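As a worked example of the arithmetic, the sketch below compares one recovery against the targets in the answer above (RTO of 1 hour, RPO of 5 minutes). The timestamps and the `recovery_report` helper are invented for illustration.

```python
from datetime import datetime, timedelta


def recovery_report(last_good_backup, failure_at, restored_at, rto, rpo):
    """Compare one measured recovery against its RTO and RPO targets."""
    downtime = restored_at - failure_at                # measured against the RTO
    data_loss_window = failure_at - last_good_backup   # measured against the RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


RTO = timedelta(hours=1)
RPO = timedelta(minutes=5)

# Hypothetical incident: last backup 09:56, failure 10:00, service restored 10:45.
good = recovery_report(
    last_good_backup=datetime(2024, 1, 1, 9, 56),
    failure_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 10, 45),
    rto=RTO,
    rpo=RPO,
)

# Same incident, but the last usable backup was taken at 09:50.
stale_backup = recovery_report(
    last_good_backup=datetime(2024, 1, 1, 9, 50),
    failure_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 10, 45),
    rto=RTO,
    rpo=RPO,
)
```

In the second case the system is back within the hour, so the RTO is met, but ten minutes of data are gone, so the RPO is not. The two objectives are measured independently.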

Q3. What is chaos engineering and why would a tester use it?

Chaos engineering is the practice of deliberately injecting failures into a system to verify resilience. Tools like Chaos Monkey randomly terminate production instances. Testers use it to discover recovery weaknesses that scripted tests miss: cascading failures, circuit breaker behaviour, and unexpected dependency interactions. It proves that recovery works in reality, not just in theory.

9 Interview Prep — Q&A

Q. What does idempotent mean and why does it matter for error recovery?

Idempotent means the same operation can be applied multiple times without changing the result beyond the initial application. In error recovery, networks are unreliable. A request may time out even though the server processed it. The client retries. If the operation is not idempotent, the retry creates a duplicate charge, duplicate order, or duplicate record. I test idempotency by replaying requests with identical parameters and verifying only one side effect occurs.

Q. How would you test a payment gateway timeout scenario?

I use a proxy or mock server to simulate a gateway that accepts the charge but never responds, or responds with a 504 after the charge succeeds. I verify three things: (1) the user does not see a generic failure that encourages them to retry manually; (2) the backend reconciles the ambiguous state via webhooks or query APIs; (3) a retry with the same idempotency key does not create a second charge. I check the database, the payment provider dashboard, and the user-facing history.
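The mock described in that answer could look something like the sketch below. `FlakyGateway` and `pay_with_retry` are hypothetical test doubles, not any real payment provider's API; the gateway records the charge and then times out once, which is the ambiguous case the answer is about. The retry uses exponential backoff, with the sleep function injected so the test does not actually wait.

```python
import time


class FlakyGateway:
    """Test double: takes the charge, then times out on the first call per key."""

    def __init__(self):
        self.charges = {}        # idempotency_key -> cents actually charged
        self._timed_out = set()

    def charge(self, idempotency_key, cents):
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = cents  # the money moves here
        if idempotency_key not in self._timed_out:
            self._timed_out.add(idempotency_key)
            raise TimeoutError("simulated 504 after the charge succeeded")
        return {"status": "succeeded", "cents": self.charges[idempotency_key]}


def pay_with_retry(gateway, idempotency_key, cents,
                   attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry on timeout with exponential backoff, reusing the same key each time."""
    for attempt in range(attempts):
        try:
            return gateway.charge(idempotency_key, cents)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


gateway = FlakyGateway()
delays = []  # record requested delays instead of sleeping
result = pay_with_retry(gateway, "claim-123-excess", 50000, sleep=delays.append)
```

The key assertion is on the gateway's side of the ledger: one key, one charge, even though the client made two calls.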

Q. What metrics do you use to measure error recovery?

I use RTO (Recovery Time Objective) to define how fast the system must restore, RPO (Recovery Point Objective) to define acceptable data loss, and MTTR (Mean Time To Recover) to measure actual recovery performance over time. I also track idempotency failure rate, orphan record count, and user-reported retry friction. These metrics turn recovery from a subjective "feels okay" into an objective SLA.
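A short sketch of turning measured recovery times into the metrics named above. The durations are invented, the 30-second RTO borrows the claim-upload example from the step list, and `mean_time_to_recover` is a hypothetical helper.

```python
from datetime import timedelta


def mean_time_to_recover(recovery_durations):
    """Average the measured recovery durations (MTTR)."""
    if not recovery_durations:
        raise ValueError("need at least one measured recovery")
    return sum(recovery_durations, timedelta()) / len(recovery_durations)


RTO = timedelta(seconds=30)

# Hypothetical measurements from three runs of the same failure scenario.
measured = [timedelta(seconds=10), timedelta(seconds=15), timedelta(seconds=65)]

mttr = mean_time_to_recover(measured)
breaches = [d for d in measured if d > RTO]
```

With these numbers the MTTR lands exactly on the 30-second target while one run took more than twice that long, which is a reason to report individual RTO breaches alongside the mean rather than the mean alone.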

Q. Describe the circuit breaker pattern and how you would test it.

A circuit breaker stops the system from repeatedly calling a failing dependency. In the Closed state, requests flow normally. After a threshold of failures, it Opens and fails fast without calling the dependency. After a timeout, it Half-Opens to test recovery. I test it by forcing the dependency to fail and verifying: fast failure in Open state, no cascading load on the failing service, graceful degradation (cached data or queued tasks), and automatic recovery when the dependency heals.
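To make the three states concrete, here is a deliberately small circuit breaker with the kind of checks described above. It is a teaching sketch, not a production implementation or any particular library's API; the class, its parameters, and the injected clock are all assumptions chosen so the test can control time.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast without calling the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; dependency not called")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call in half-open re-opens the breaker immediately.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock the test can advance
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0, clock=lambda: now[0])
calls = []


def failing():
    calls.append("fail")
    raise ConnectionError("gateway down")


def healthy():
    calls.append("ok")
    return "paid"


for _ in range(3):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(failing)
    failed_fast = False
except CircuitOpenError:
    failed_fast = True
calls_while_open = len(calls)  # the dependency must not have been called again

now[0] += 30.0
state_after_timeout = breaker.state
recovered = breaker.call(healthy)
state_after_recovery = breaker.state
```

Counting calls to the dependency is what shows the Open state actually protects the failing service; checking only for the fast error would miss a breaker that raises but still forwards the request.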

Key takeaway

Any system can handle a failure gracefully in a demo — the professional judgment is knowing that recovery only counts when it has been verified at realistic concurrency, with database consistency checked, retry idempotency proven, and the RTO measured against the clock, not assumed against a document.
