Senior · Reliability Technique

Error Recovery Testing

The error message looks fine. But did the transaction roll back? Is the database consistent? Can the user retry without losing data or being charged twice?

Senior · ISTQB CTAL-TA, K4 Analyse · ~12 min read + exercise

1 The Hook — Why This Matters

In 2020, an NZ insurance company released a mobile claim app that let customers upload photos of storm damage. During a major Auckland weather event, thousands of users tried the feature simultaneously. It failed catastrophically.

  • WiFi disconnect mid-upload: Photos were lost. The user had to re-select and re-upload every image from the beginning.
  • Payment gateway 503: A customer paid their excess via credit card. The gateway timed out. The money left their account, but the app recorded no payment.
  • Refresh on form: A user refreshed the browser halfway through a multi-step claim. Every photo and description vanished.
  • Server restart: During peak load, a container restarted. One claim was created twice, with duplicate payments.

The company had tested error messages. "Upload failed" appeared correctly. But they had never tested what happened to the data when the error occurred. Error recovery testing is not about the message. It is about the state.

2 The Rule — The One-Sentence Version

An error is not handled until the system returns to a consistent state, the user knows what happened, and retrying does not make things worse.

A junior tester sees an error message and marks the test passed. A senior tester asks: is the database transaction rolled back? Are there orphaned records? Did partial data persist? Can the user safely retry? Is the recovery time within the business requirement? Error recovery is about system integrity, not user messaging.

3 The Analogy — Think Of It Like...

A restaurant kitchen when the power cuts out.

A good kitchen does not just say "Sorry, power cut" to the diners. The freezers stay sealed to preserve food. Half-cooked meals are discarded, not served. The till remembers which tables paid. When power returns, the kitchen does not re-cook meals that were already served. Error recovery is every one of those backstage decisions, not just the apology to the customer.

4 Watch Me Do It — Step by Step

Here is the systematic approach for testing error recovery. I will use the NZ insurance claim app as the running example.

  1. Identify critical transactions and define recovery objectives. Map every transaction where data or money changes state: claim submission, photo upload, payment, policy update. For each, document the RTO (Recovery Time Objective: how fast must it recover?) and RPO (Recovery Point Objective: how much data can we afford to lose?). Example: claim upload RTO = 30 seconds, RPO = zero photos lost.
  2. Design failure scenarios across all layers. Categorise failures by source: network (drop, throttle, high latency), server (restart, out-of-memory, 503 response), client (browser crash, refresh, back button), and dependency (payment gateway down, CDN timeout, database deadlock). Each critical transaction needs at least one failure scenario per layer.
  3. Execute the failure and observe system behaviour. For each scenario, verify four outcomes: (a) the user sees a clear, actionable error message; (b) partial data is preserved where possible; (c) retry does not create duplicates or inconsistent state; (d) no orphaned records remain in the database. Example: "Upload failed. 3 of 5 photos saved. Tap to retry."
  4. Verify database consistency after recovery. Query the database directly after each failure. Check for orphan records, partial foreign keys, negative balances, or duplicate transactions. A user-facing success message means nothing if the backend ledger is wrong.
  5. Measure and document recovery time. Time how long it takes from failure detection to full service restoration. Compare against the RTO. If the payment gateway returns a 503, does the system retry with exponential backoff? Does circuit breaker logic kick in after repeated failures? Document the actual MTTR (Mean Time To Recover) for each scenario.
  6. Test idempotency of retry operations. Submit the same claim twice with identical parameters. The system must create exactly one claim and charge exactly one payment. Use the same idempotency key if the API supports it. Without idempotency, network retries become duplicate claims and duplicate charges.
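Steps 3 and 4 can be sketched as a small test for the "server restart mid-submit" scenario. This is a minimal sketch, not the insurer's implementation: the `claims` and `photos` tables, the `submit_claim` function, and the `crash_after` hook are all hypothetical, and an in-memory SQLite database stands in for the real backend.

```python
import sqlite3


class SimulatedCrash(Exception):
    """Stands in for a server restart in the middle of a submission."""


def submit_claim(conn, claim_id, photos, crash_after=None):
    """Write a claim and its photos in one transaction.

    crash_after is a hypothetical test hook: fail before writing that photo.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO claims (id) VALUES (?)", (claim_id,))
            for index, name in enumerate(photos):
                if crash_after is not None and index == crash_after:
                    raise SimulatedCrash("restart mid-submit")
                conn.execute(
                    "INSERT INTO photos (claim_id, name) VALUES (?, ?)",
                    (claim_id, name),
                )
        return True
    except SimulatedCrash:
        return False


def count_rows(conn, table):
    # The table name comes from test code only, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchall()[0][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE photos (claim_id TEXT NOT NULL, name TEXT NOT NULL)")
conn.commit()

photos = ["a.jpg", "b.jpg", "c.jpg"]

# Inject the failure, then query the database directly (step 4).
first_attempt_ok = submit_claim(conn, "CLM-1", photos, crash_after=2)
claims_after_crash = count_rows(conn, "claims")
photos_after_crash = count_rows(conn, "photos")

# Retry the same submission and check that it neither fails nor duplicates.
retry_ok = submit_claim(conn, "CLM-1", photos)
claims_after_retry = count_rows(conn, "claims")
photos_after_retry = count_rows(conn, "photos")
```

The point of the design is that the assertions look at row counts, not at the error message: after the injected crash there should be no claim and no photos, and after the retry exactly one claim with three photos.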
Insurance claim — failure scenario matrix
Failure                   | Layer      | Expected recovery
WiFi drops mid-upload     | Network    | Partial photos saved; user sees count; retry resumes
Payment gateway 503       | Dependency | No charge recorded; user can retry; no duplicate
Browser refresh on form   | Client     | Draft auto-saved; form repopulated on return
Server restart mid-submit | Server     | Transaction rolled back; no orphan claim
Duplicate submit click    | Client     | Idempotent: one claim, one charge
Pro tip: The 3-2-1 backup rule applies to test data too: three copies, two media types, one offsite. When you are deliberately crashing systems to test recovery, you need a way back. Snapshot the database before destructive recovery tests.
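One way to snapshot before a destructive test, sketched with Python's built-in `sqlite3` module and its `Connection.backup` method. The `claims` table and the helper names `snapshot`, `restore`, and `claim_ids` are illustrative assumptions; a production database would use its own snapshot or backup tooling instead.

```python
import sqlite3


def snapshot(source):
    """Copy the whole database into a separate in-memory connection."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


def restore(saved, target):
    """Overwrite target with the snapshot taken earlier."""
    saved.backup(target)


def claim_ids(conn):
    rows = conn.execute("SELECT id FROM claims ORDER BY id").fetchall()
    return [row[0] for row in rows]


live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE claims (id TEXT PRIMARY KEY)")
live.executemany("INSERT INTO claims VALUES (?)", [("CLM-1",), ("CLM-2",)])
live.commit()

saved = snapshot(live)

# Destructive recovery test, simulated here by wiping the table.
live.execute("DELETE FROM claims")
live.commit()
ids_after_damage = claim_ids(live)

# The way back: put the pre-test state in place again.
restore(saved, live)
ids_after_restore = claim_ids(live)
```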

5 When to Use It / When NOT to Use It

✅ Use this when...

  • The system handles financial transactions or sensitive data
  • Network reliability is variable (mobile, rural, offshore)
  • Microservices or third-party integrations are involved
  • SLAs or compliance frameworks define RTO and RPO targets
  • You are preparing for disaster recovery drills or audits

❌ Don't rely on this alone when...

  • The application is read-only with no state changes
  • You have not tested the happy path thoroughly first
  • You lack database access to verify backend consistency
  • You are testing in production without safeguards
  • The architecture has no retry, circuit breaker, or rollback mechanism

6 Common Mistakes — Don't Do This

🚫 Testing only the error message, not the state

I used to think: If the user sees "Upload failed," the error handling works.
Actually: The message is the tip of the iceberg. Did the database roll back? Are there half-written records? Can the user retry without starting over? A beautiful error message over a corrupted database is a failed test.

🚫 Not testing the retry path

I used to think: If the first attempt fails and the second succeeds, recovery is fine.
Actually: The retry might create a duplicate charge, a duplicate record, or append to a partially failed state. You must verify that retrying is idempotent — the same action produces the same result, not additional side effects.
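A retry-path check might look like the sketch below. It assumes the backend stores a client-supplied idempotency key with each payment and enforces uniqueness in the database; the `payments` table and `record_payment` function are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " idempotency_key TEXT PRIMARY KEY,"
    " claim_id TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)
conn.commit()


def record_payment(conn, idempotency_key, claim_id, amount_cents):
    """Insert the payment once; a replay with the same key is a no-op."""
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO payments VALUES (?, ?, ?)",
            (idempotency_key, claim_id, amount_cents),
        )
    return cursor.rowcount == 1  # True only when a new row was written


# First attempt, then a replay of the identical request.
first_write = record_payment(conn, "key-123", "CLM-1", 50000)
replay_write = record_payment(conn, "key-123", "CLM-1", 50000)

payment_count, total_cents = conn.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM payments"
).fetchall()[0]
```

The test passes only if the replay leaves one row and one amount in the ledger, which is the "same result, no additional side effects" property described above.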

🚫 Testing in isolation, not under load

I used to think: A single failure scenario is enough to validate recovery logic.
Actually: Recovery behaves differently under load. A database deadlock during a single-user test may resolve cleanly. Under load, the same deadlock can cascade, causing connection pool exhaustion. Use chaos engineering tools or load testing to inject failures under realistic traffic.

7 Now You Try — Design a Recovery Test

🎯 Interactive Exercise

Scenario: An NZ travel booking site charges a $50 deposit when a user clicks "Reserve." The flow is: (1) validate availability, (2) charge deposit via Stripe, (3) create reservation record, (4) send confirmation email. You are told the error handling is "robust."

Question: Design three failure scenarios and state what you would verify in each. Write them down before revealing the answer.

Three critical failure scenarios:

  1. Stripe timeout after charge succeeds. The API call to Stripe returns a timeout, but Stripe processed the charge. Verify: the booking system must reconcile with Stripe (webhook or idempotency key) and not charge again on retry. The user must see confirmation, not "payment failed."
  2. Server crash between charge and reservation creation. The deposit is charged but the server restarts before writing the reservation record. Verify: on recovery, the system must either refund the charge or create the reservation. There must be no orphaned payment without a booking.
  3. User double-clicks the Reserve button. Two requests fire milliseconds apart. Verify: idempotency prevents two charges and two reservations. Only one confirmation email is sent. The UI disables the button after the first click.
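Scenario 2 implies a reconciliation step on recovery. The sketch below models it in memory so the expected outcome is easy to assert. The `BookingState` structure, the charge and booking identifiers, and the rule "reserve if still available, otherwise refund" are assumptions for illustration, not the booking site's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class BookingState:
    charges: dict = field(default_factory=dict)     # charge_id -> booking_ref
    reservations: set = field(default_factory=set)  # booking_ref
    refunds: set = field(default_factory=set)       # charge_id


def find_orphaned_charges(state):
    """Charges with neither a reservation nor a refund."""
    return sorted(
        charge_id
        for charge_id, ref in state.charges.items()
        if ref not in state.reservations and charge_id not in state.refunds
    )


def reconcile(state, still_available):
    """Resolve every orphaned charge: create the reservation or refund."""
    actions = {}
    for charge_id in find_orphaned_charges(state):
        ref = state.charges[charge_id]
        if ref in still_available:
            state.reservations.add(ref)
            actions[charge_id] = "reserved"
        else:
            state.refunds.add(charge_id)
            actions[charge_id] = "refunded"
    return actions


# State left behind by a crash between "charge deposit" and "create reservation".
state = BookingState(
    charges={"ch_1": "BK-1", "ch_2": "BK-2", "ch_3": "BK-3"},
    reservations={"BK-1"},
)

first_pass = reconcile(state, still_available={"BK-2"})
second_pass = reconcile(state, still_available={"BK-2"})  # must change nothing
orphans_left = find_orphaned_charges(state)
```

Running the reconciliation twice is deliberate: the recovery job itself can be interrupted and restarted, so it needs to be idempotent too.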

Tip: In distributed systems, the hardest bugs happen at the boundaries between services. Test the handoffs, not just the individual components.

8 Self-Check — Can You Actually Do This?

Try to answer each question before reading the answer. If you get all three, you're ready to practise.

Q1. What does idempotent mean in testing?

Idempotent means that repeating the same action produces the same result without additional side effects. If a user clicks "Submit" twice, or the client resends the request after a network timeout, an idempotent system creates exactly one record and charges exactly once. Without idempotency, retries become duplicates.

Q2. What is the difference between RTO and RPO?

RTO (Recovery Time Objective) is the maximum acceptable downtime after a failure — how fast must the system be back? RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time — how much data can we afford to lose? Example: RTO of 1 hour means the system must restore within 60 minutes. RPO of 5 minutes means you can lose at most 5 minutes of data.
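The example figures can be turned into a small check. This is a sketch under assumptions: the timestamps are invented, and `evaluate_recovery` is a hypothetical helper that treats downtime as restore time minus failure time, and the data loss window as failure time minus the last recoverable point.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recoverable_point, failure_at, restored_at, rto, rpo):
    """Compare one incident against its recovery objectives."""
    downtime = restored_at - failure_at
    data_loss_window = failure_at - last_recoverable_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


result = evaluate_recovery(
    last_recoverable_point=datetime(2024, 1, 1, 9, 57),
    failure_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 10, 45),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
```

Here the system was down for 45 minutes and lost 3 minutes of data, so both objectives are met. Moving the restore time past 11:00 would breach the RTO without affecting the RPO, which is why the two are tracked separately.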

Q3. What is chaos engineering and why would a tester use it?

Chaos engineering is the practice of deliberately injecting failures into a system to verify resilience. Tools like Chaos Monkey randomly terminate production instances. Testers use it to discover recovery weaknesses that scripted tests miss: cascading failures, circuit breaker behaviour, and unexpected dependency interactions. It proves that recovery works in reality, not just in theory.

9 Interview Prep — Q&A

Q. What does idempotent mean and why does it matter for error recovery?

Idempotent means the same operation can be applied multiple times without changing the result beyond the initial application. In error recovery, networks are unreliable. A request may time out even though the server processed it. The client retries. If the operation is not idempotent, the retry creates a duplicate charge, duplicate order, or duplicate record. I test idempotency by replaying requests with identical parameters and verifying only one side effect occurs.

Q. How would you test a payment gateway timeout scenario?

I use a proxy or mock server to simulate a gateway that accepts the charge but never responds, or responds with a 504 after the charge succeeds. I verify three things: (1) the user does not see a generic failure that encourages them to retry manually; (2) the backend reconciles the ambiguous state via webhooks or query APIs; (3) a retry with the same idempotency key does not create a second charge. I check the database, the payment provider dashboard, and the user-facing history.
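A mock of that kind can be very small. The sketch below is a hand-written test double, not a real payment SDK: `FakeGateway`, `GatewayTimeout`, `pay_excess`, and the status strings are all hypothetical, and the `lookup` method stands in for whatever query API or webhook the real provider offers.

```python
class GatewayTimeout(Exception):
    """The response was lost; the charge may or may not have happened."""


class FakeGateway:
    """Test double: records the charge, then times out on the first call."""

    def __init__(self, timeouts_before_reply=1):
        self.charges = {}  # idempotency_key -> amount_cents
        self.calls = 0
        self._timeouts_left = timeouts_before_reply

    def charge(self, idempotency_key, amount_cents):
        self.calls += 1
        # Same key: keep the original charge rather than adding a second one.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise GatewayTimeout("charge accepted, response lost")
        return "succeeded"

    def lookup(self, idempotency_key):
        return idempotency_key in self.charges


def pay_excess(gateway, idempotency_key, amount_cents):
    """Client-side flow under test: resolve ambiguity before reporting."""
    try:
        return gateway.charge(idempotency_key, amount_cents)
    except GatewayTimeout:
        # Ambiguous outcome: ask the gateway before telling the user anything.
        if gateway.lookup(idempotency_key):
            return "succeeded"
        return "retry_safe"


gateway = FakeGateway()
first_status = pay_excess(gateway, "key-1", 50000)
retry_status = pay_excess(gateway, "key-1", 50000)
```

The assertions that matter are on `gateway.charges`, the fake's equivalent of the provider dashboard: one entry after the timeout, and still one entry after the retry.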

Q. What metrics do you use to measure error recovery?

I use RTO (Recovery Time Objective) and RPO (Recovery Point Objective) as the targets: how fast the system must restore and how much data loss is acceptable. I use MTTR (Mean Time To Recover) to measure actual recovery performance over time. I also track idempotency failure rate, orphan record count, and user-reported retry friction. These metrics turn recovery from a subjective "feels okay" into something you can measure against an SLA.
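As a worked example of the MTTR part, the sketch below averages recovery durations from a list of incidents and counts how many exceeded the target. The incident timings are invented, the 30-second RTO is borrowed from the claim upload example earlier, and the function names are hypothetical.

```python
from statistics import mean


def recovery_durations(incidents):
    """incidents: list of (detected_at, restored_at) pairs in seconds."""
    return [restored - detected for detected, restored in incidents]


def mttr_seconds(incidents):
    return mean(recovery_durations(incidents))


def rto_breaches(incidents, rto_seconds):
    return [d for d in recovery_durations(incidents) if d > rto_seconds]


incidents = [(0, 20), (100, 140), (300, 330)]

mttr = mttr_seconds(incidents)           # (20 + 40 + 30) / 3
breaches = rto_breaches(incidents, 30)   # only the 40-second recovery
```

An MTTR that sits exactly on the target can still hide individual breaches, so reporting the breach count alongside the mean gives a more honest picture.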

Q. Describe the circuit breaker pattern and how you would test it.

A circuit breaker stops the system from repeatedly calling a failing dependency. In the Closed state, requests flow normally. After a threshold of failures, it Opens and fails fast without calling the dependency. After a timeout, it Half-Opens to test recovery. I test it by forcing the dependency to fail and verifying: fast failure in Open state, no cascading load on the failing service, graceful degradation (cached data or queued tasks), and automatic recovery when the dependency heals.
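The three states can be sketched in a few lines, which also shows what a test needs to control. This is a simplified illustration, not a production library: `CircuitBreaker`, its parameters, and the injectable `clock` are assumptions, and the injected clock exists so a test can move past the timeout without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast without calling the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency not called")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


# Test setup: a fake clock and a dependency that always fails.
now = [0.0]
dependency_calls = []


def failing_dependency():
    dependency_calls.append(now[0])
    raise RuntimeError("gateway down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
```

A test then drives the breaker through each transition: two failures to open it, a third call that must not reach the dependency, and a successful call after advancing the fake clock past the timeout.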