Senior · Quality Engineering

Root Cause Analysis & Blameless Post-Mortems

Closing a defect is not the same as understanding it. A surface fix makes the symptom disappear; root cause analysis answers the two questions that actually make the system stronger — why did the defect exist, and why did it escape testing?

Senior Quality Engineering · Defect & Incident Analysis · ~25 min read · ~60 min with exercises

1 The Hook

Tūī Bank, a fictional NZ retail bank, shipped a routine release to its mobile app on a Friday evening. Over the weekend, a slow trickle of complaints became a flood: customers paying bills were occasionally charged twice. By Monday morning around 900 duplicate payments had gone out, the contact centre was overwhelmed, and the incident was on the front page of a news site.

The on-call engineer found the immediate cause quickly. A retry in the payments service fired a second time when the bank’s gateway was slow to respond — the first request had actually succeeded, but the app didn’t hear back in time, so it tried again. The fix was one line: make the payment call idempotent so a retry with the same reference can’t charge twice. Shipped Monday afternoon. Symptom gone.

And that is exactly where weak teams stop. The duplicate-charge bug was fixed — but nobody had yet answered the questions that matter. Why did a non-idempotent payment call exist in a banking system at all? And why did it pass through requirements, code review, and a full test cycle without anyone catching it?

When Tūī Bank’s test lead ran a proper root cause analysis, the answer was uncomfortable and far more valuable than the one-line fix. The retry behaviour had been added in a hotfix three sprints earlier under deadline pressure, with no test written for the “gateway slow but request succeeded” path. The test environment’s mock gateway always responded instantly, so the race condition could not occur in any test that existed. And the team had no test charter for timeout-and-retry behaviour anywhere in the payments suite. The real defect wasn’t a line of code. It was a blind spot in how the team tested timing.

That second analysis is what this lesson teaches. Anyone can close a bug. A senior tester finds the reason the bug was possible — and the reason it survived.

2 The Rule

A fix removes the symptom. Root cause analysis removes the class of defect. If your analysis ends at “we patched the code,” the same kind of bug will come back wearing a different shirt. Every escaped defect has two root causes — why it was introduced, and why it wasn’t caught — and the second one is the tester’s to own.

3 The Analogy

A leak in the ceiling.

You walk into the lounge and there’s a puddle on the floor. You can mop it up — that’s the hotfix. The puddle is gone and the room looks fine. But mop it every morning and you’re not a plumber, you’re a cleaner. The stain on the ceiling tells you water is coming from above; follow it and you find a cracked tile on the roof; ask why the tile cracked and you find the flashing was never installed properly when the house was built.

The puddle is the production incident. The cracked tile is the bug in the code. The missing flashing is the root cause — the decision or gap that made the whole chain possible. Mopping treats the symptom forever; fixing the flashing means you never mop again. Root cause analysis is the discipline of refusing to stop at the puddle.

4 The Two QA Questions

Developers running an RCA usually ask one question: why did this defect get into the product? That is necessary but it is only half the job, and it is the developer’s half. As a tester, you own a second question that is just as important and that nobody else in the room is incentivised to ask:

Question 1 — Why did the defect exist?
The introduction path: the code, design, requirement, or assumption that put the fault into the product. This usually leads to a code or process fix.

Question 2 — Why did testing not catch it?
The escape path: the gap in coverage, environment, data, or charter that let the defect walk past every quality gate. This leads to a testing fix — a new test, a new environment condition, a new risk on the register.

An RCA that only answers Question 1 makes the developers better and leaves QA exactly as leaky as before. The escaped-defect question is where a tester earns their seniority: every production bug is, by definition, a test that was missing or wrong. Finding which one — and adding it — is how the test suite gets stronger every time something breaks.

There is a sharper version of the escape question that matters enormously the moment a stakeholder asks “why did this reach production?” The honest answer is always one of three, and each points at a completely different fix:

“We didn’t test it.” A coverage gap — the scenario was never exercised at all. Fix: add the missing test or charter.

“We tested it and it passed.” The most valuable and most uncomfortable answer: the test existed but was wrong or too weak, or the environment couldn’t reproduce the real condition (the Tūī Bank instant-mock-gateway trap). Fix: repair the test or the environment. Adding another test of the same kind would not help, because it would also pass and also miss the bug.

“We tested the feature, but this scenario was out of scope.” Not a testing failure at all: a risk-prioritisation or sign-off decision that knowingly accepted the gap. Fix: revisit the risk decision and who owns it, not the suite.

Conflating these three is how blame lands in the wrong place. “We tested it and it passed” is a test-quality problem; “out of scope” is a governance decision someone signed off; only “we didn’t test it” is a pure coverage miss. Naming which one it actually was — honestly — is what turns a defensive post-incident meeting into a useful one, and it tells you whether the fix belongs in the suite, in the environment, or in the risk register.

Pro tip: When you join the post-incident meeting, say out loud: “Let’s also work out why this got past us, not just how it got in.” You will often be the only person who asks. That single question is the most senior thing a tester can do in the room.

5 Watch Me Do It — 5 Whys on the Tūī Bank Incident

The 5 Whys is the simplest RCA tool: you take the symptom and ask “why?” repeatedly, each answer becoming the next question, until you reach a cause you can act on rather than another symptom. “Five” is a guide, not a rule: sometimes it’s three, sometimes seven. Here is the introduction path for the duplicate charge:

Why 1: Customers were charged twice. → Because the payment request was sent to the gateway twice.
Why 2: Because the app retried after a timeout. → The first request had succeeded, but the slow gateway response made the app think it had failed.
Why 3: Because the payment call was not idempotent. → A retry with the same reference created a second charge instead of being recognised as a duplicate.
Why 4: Because the retry was added in a deadline hotfix three sprints ago, and idempotency wasn’t considered. → No design review covered the “succeeded-but-slow” case.
Why 5: Because the team had no standard requiring idempotency for payment operations, and time pressure let the gap ship.
Root cause (introduction): a missing engineering standard, not a careless engineer.

Now — and this is the half most people skip — the same technique on the escape path:

Why 1: Testing didn’t catch it. → Because no test exercised the “gateway slow but request succeeded” timing.
Why 2: Because the test environment’s mock gateway always responded instantly. → The race condition was physically impossible to reproduce in any existing test.
Why 3: Because nobody had identified timeout-and-retry as a risk worth a test charter. → Timing behaviour wasn’t on the payments risk register.
Why 4: Because the team treated “the happy path returns the right number” as sufficient coverage for payments, and never modelled adverse network conditions.
Root cause (escape): a coverage model blind to non-functional timing risk.

Two root causes, two different fixes. The introduction root cause produces an engineering action (an idempotency standard for money-moving operations). The escape root cause produces a testing action (a chaos/latency profile in the test gateway, plus a timeout-and-retry charter on the payments risk register). Patch only the one line of code and you’ve fixed one bug; fix both root causes and you’ve closed the door on an entire family of timing defects.
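
Both fixes can be seen side by side in a few lines of Python. This is a minimal sketch under stated assumptions, not Tūī Bank's code: `FakeGateway`, `pay_with_retry`, the payment reference, and the latency figures are all hypothetical. It models a gateway that always processes the charge but may answer too slowly, and a client that retries on timeout with the same reference.

```python
from dataclasses import dataclass, field


class GatewayTimeout(Exception):
    """The client stopped waiting; the gateway may still have charged."""


@dataclass
class FakeGateway:
    """Hypothetical test double with a configurable latency profile."""
    latencies_ms: list            # simulated response time for each call
    idempotent: bool = False      # True = a repeated reference is ignored
    charges: list = field(default_factory=list)
    _calls: int = 0

    def charge(self, ref: str, amount_cents: int, timeout_ms: int) -> str:
        latency = self.latencies_ms[min(self._calls, len(self.latencies_ms) - 1)]
        self._calls += 1
        already_charged = any(r == ref for r, _ in self.charges)
        if not (self.idempotent and already_charged):
            self.charges.append((ref, amount_cents))  # the charge goes through
        if latency > timeout_ms:
            raise GatewayTimeout(ref)  # succeeded, but the client never hears back
        return "ok"


def pay_with_retry(gateway, ref, amount_cents, timeout_ms=1000, attempts=2):
    """Retry on timeout, reusing the same payment reference."""
    for _ in range(attempts):
        try:
            return gateway.charge(ref, amount_cents, timeout_ms)
        except GatewayTimeout:
            continue
    return "unknown"


def charges_after_payment(gateway) -> int:
    pay_with_retry(gateway, ref="bill-001", amount_cents=12000)
    return len(gateway.charges)


# Instant mock (like the original test environment): the bug cannot appear.
instant = charges_after_payment(FakeGateway(latencies_ms=[0]))
# Slow first response, no idempotency: the retry charges a second time.
slow = charges_after_payment(FakeGateway(latencies_ms=[5000, 50]))
# Same latency profile against an idempotent gateway: one charge.
fixed = charges_after_payment(FakeGateway(latencies_ms=[5000, 50], idempotent=True))
print(instant, slow, fixed)  # 1 2 1
```

The first line of output is the escape root cause in miniature: with an instant mock, the test passes whether or not the call is idempotent. Only the slow latency profile makes the test able to fail, which is why the environment fix matters as much as the code fix.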

Pro tip: A good 5 Whys ends at a cause inside your team’s control — a standard, a process, a test, an environment setting. If your last “why” lands on “the engineer made a mistake” or “the customer did something weird,” you stopped one question too early. People are never the root cause; the system that let the mistake matter is.

6 The Techniques

5 Whys is the workhorse, but it has a weakness: it follows a single chain and can miss the fact that several causes combined to produce one failure. Two tools and one distinction cover the range:

5 Whys — for linear causes

Best when there is one clear chain from symptom to cause. Fast, needs no template, easy to do live in a meeting. Weakness: it assumes a single thread, so use something richer when a failure had multiple contributing factors.

Fishbone (Ishikawa) diagram — for multiple contributing causes

When a failure had several contributors, draw a fishbone: the problem is the fish’s head, and bones branch off by category. A practical category set for software testing:

  • People / Skills — knowledge gaps, onboarding, unclear ownership.
  • Process — missing review step, no definition of done, rushed sign-off.
  • Tooling / Automation — gaps in the pipeline, no static analysis, flaky suite ignored.
  • Environment / Data — test env unlike production, unrealistic mock data, missing edge data.
  • Requirements — ambiguous or missing acceptance criteria, untested assumptions.

The value of the fishbone is that it forces you to consider categories you’d otherwise skip. The Tūī Bank incident has bones in Process (no idempotency standard), Environment (instant mock gateway), and Requirements (timing never specified) — a 5 Whys down any one of those branches would have missed the other two.
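
If you keep a fishbone as data rather than a drawing, the "categories you'd otherwise skip" become something you can list. A minimal sketch, assuming a simple dictionary layout (the structure and the `empty_bones` helper are illustrative, not a standard format):

```python
# The five categories from the list above.
CATEGORIES = ("People", "Process", "Tooling", "Environment/Data", "Requirements")


def empty_bones(fishbone: dict) -> list:
    """Return the categories that have no cause recorded yet."""
    return [c for c in CATEGORIES if not fishbone.get(c)]


# The three bones named for the Tūī Bank incident.
tui_bank = {
    "Process": ["No idempotency standard for payment operations"],
    "Environment/Data": ["Mock gateway always responds instantly"],
    "Requirements": ["Timing behaviour never specified"],
}

print(empty_bones(tui_bank))  # ['People', 'Tooling']
```

An empty bone is a prompt to ask the question, not proof that nothing went wrong in that category.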

Contributing cause vs root cause

Not every cause is the root cause. A contributing cause made the failure more likely or worse; the root cause is the one that, if removed, would have prevented the failure entirely. The deadline pressure at Tūī Bank was a contributing cause — real, worth noting — but removing “pressure” isn’t an action you can take. The missing idempotency standard is. Aim your corrective actions at root causes you control and note contributing causes as context.

Pro tip: The test of a real root cause is the “would it have prevented this?” question. Write your candidate cause, then ask: “If this had not been true, would the incident still have happened?” If yes, keep digging — you’ve found a contributing cause, not the root.

7 The Blameless Post-Mortem

An RCA produces findings; a post-mortem is the written record and the meeting that turns those findings into action. The word that matters is blameless. A blameless post-mortem operates on a single assumption: everyone involved acted reasonably given the information they had at the time. The goal is to fix the system, never to identify a culprit.

This isn’t about being nice — it’s about getting the truth. The moment a post-mortem becomes about blame, people stop telling you what really happened. The engineer who pushed the hotfix goes quiet, the timeline develops gaps, and you lose exactly the information you need to prevent a recurrence. Blame buys silence; safety buys the truth. A psychologically safe review gets you the messy, honest account that actually contains the root cause.

A solid blameless post-mortem document has a predictable shape:

1. Summary — one paragraph: what happened, when, and the impact in plain terms.
2. Impact — quantified: customers affected, dollars, duration, reputational reach.
3. Timeline — factual sequence with timestamps, from introduction to detection to resolution. No interpretation, just events.
4. Root cause(s) — both questions: why it was introduced and why it escaped.
5. What went well — detection, response, communication worth keeping. Post-mortems aren’t only about failure.
6. Action items — specific, owned, dated, and systemic. Each maps to a root cause.
7. Lessons learned — what the whole team now knows that it didn’t before.

The action items are where post-mortems live or die. “Be more careful” is not an action item — it’s a wish. “Everyone should double-check payments code” is not an action item — it has no owner and no system change. A real action item is concrete, assigned, dated, and changes the system so the next person can’t make the same mistake: “Add an idempotency-key requirement to the payments coding standard and a pipeline check that fails the build if a money-moving endpoint lacks one — owner: Priya — by 28 June.”
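
The pipeline check in that example action item could look roughly like the sketch below. Everything here is an assumption for illustration: the `Endpoint` record, its flags, and the endpoint paths are hypothetical, and a real check would read the service's route definitions or API spec rather than a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    path: str
    moves_money: bool
    requires_idempotency_key: bool


def idempotency_violations(endpoints):
    """Return paths of money-moving endpoints that lack an idempotency key."""
    return [
        e.path for e in endpoints
        if e.moves_money and not e.requires_idempotency_key
    ]


def check(endpoints) -> int:
    """Return a process exit code: 0 = pass, 1 = fail the build."""
    violations = idempotency_violations(endpoints)
    for path in violations:
        print(f"FAIL: {path} moves money but has no idempotency key")
    return 1 if violations else 0


# Hypothetical registry of endpoints.
ENDPOINTS = [
    Endpoint("/payments/bill", moves_money=True, requires_idempotency_key=True),
    Endpoint("/payments/transfer", moves_money=True, requires_idempotency_key=False),
    Endpoint("/accounts/balance", moves_money=False, requires_idempotency_key=False),
]

exit_code = check(ENDPOINTS)  # a CI wrapper would pass this to sys.exit()
```

The point of the design is that the standard is enforced by the build, not by memory: the next hotfix under deadline pressure fails fast instead of shipping.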

In the NZ context, this discipline isn’t optional for regulated work. A bank reporting a payments incident to the Reserve Bank, or a health provider reviewing a clinical-system failure, needs exactly this artefact: a factual timeline, identified root causes, and dated corrective actions. The blameless post-mortem is both a learning tool and an audit record — the same document satisfies the team and the regulator.

8 Common Mistakes

🚫 Stopping at the first technical cause

Why it happens: Finding the buggy line feels like solving the problem, and the pressure to “close it out” is real.
The fix: The buggy line is usually a symptom of a process or design gap. Keep asking why until you reach a cause your team can change at the system level — a standard, a test charter, an environment fix — not just a patch.

🚫 Only answering “why did it exist” and skipping “why did it escape”

Why it happens: The introduction path is the developer’s instinct and it dominates the room.
The fix: As the tester, own the escape question. Every production defect is a missing or wrong test — name it and add it, or the suite stays exactly as leaky as it was.

🚫 Naming a person as the root cause

Why it happens: “Sam pushed the bad code” is a tidy, satisfying answer.
The fix: People make mistakes — that’s a constant, not a root cause. The root cause is the system that let one person’s mistake reach production: no review gate, no test, no standard. Blame ends the investigation; the system question continues it.

🚫 Action items that are wishes, not changes

Why it happens: “Be more careful” and “test more thoroughly” are easy to write and feel responsible.
The fix: If an action item has no owner, no date, and doesn’t change the system, nothing will change. Make each one specific, assigned, dated, and tied to a root cause — ideally something that makes the failure mode impossible, not just discouraged.

🚫 Treating the post-mortem as paperwork

Why it happens: The fire is out, everyone’s tired, and the document feels like a formality.
The fix: The artefact is worthless if the action items aren’t tracked to completion. Put them in the backlog with the same weight as features, and review them at the next retro. An untracked action item is a recurrence waiting to happen.
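
The "wishes, not changes" mistake is mechanical enough to lint for. The sketch below is one possible shape, assuming action items are kept as structured records; the `ActionItem` fields, the phrase list, and the due-date year are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Phrases this lesson calls out as wishes; extend to suit your team.
WISH_PHRASES = ("be more careful", "double-check", "test more thoroughly")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    root_cause: Optional[str] = None


def problems(item: ActionItem) -> list:
    """List the reasons an action item would not survive review."""
    found = []
    if not item.owner:
        found.append("no owner")
    if item.due is None:
        found.append("no due date")
    if not item.root_cause:
        found.append("not tied to a root cause")
    if any(p in item.description.lower() for p in WISH_PHRASES):
        found.append("reads as a wish")
    return found


wish = ActionItem("Everyone should double-check payments code")
real = ActionItem(
    "Add an idempotency-key requirement to the payments coding standard",
    owner="Priya",
    due=date(2025, 6, 28),  # the year is illustrative
    root_cause="No idempotency standard for payment operations",
)

print(problems(wish))
print(problems(real))  # []
```

A check like this cannot tell whether an item really changes the system; that still needs a human reviewer. It only stops the obviously empty items from being filed.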

9 Now You Try

Three graded exercises. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Both Root Causes

Read the incident below. Produce a 5 Whys for the introduction path (why the defect existed) and a separate 5 Whys for the escape path (why testing missed it). End each chain at a systemic root cause your team could control, and name the corrective action for each.

Incident: Awa Health patient portal
A fictional NZ health provider’s portal let patients book appointments. After a release, patients in Christchurch saw appointment times one hour off — a 9:00am slot showed as 10:00am. About 200 patients arrived at the wrong time over four days. The cause: the booking service stored times in NZDT but rendered them assuming NZST, so during the daylight-saving transition the offset was wrong. The dev fixed it by storing all times in UTC. It was never caught in testing because the test environment’s clock was fixed to a single date in July (standard time), and no test covered a daylight-saving boundary.

Write both chains and the two corrective actions:

Model answer:
INTRODUCTION PATH
Why 1: Patients saw times one hour off. → Because the service rendered NZDT times as if they were NZST.
Why 2: Because the code assumed a fixed +12 offset instead of handling daylight saving. → Timezone was treated as a constant, not a rule.
Why 3: Because there was no standard requiring time storage in UTC with explicit zone conversion at render. → Each developer handled time ad hoc.
Root cause: No engineering standard for time handling (store UTC, convert at the edge).
Corrective action: Add a "store UTC, render with explicit IANA zone (Pacific/Auckland)" rule to the coding standard; add a lint/review check. Owner + date.

ESCAPE PATH
Why 1: Testing missed it. → No test exercised a daylight-saving boundary.
Why 2: Because the test environment clock was pinned to one July date. → DST transitions were physically impossible to hit in any test.
Why 3: Because timezone/DST was never identified as a risk needing coverage. → Not on the risk register or any charter.
Root cause: A coverage model and environment blind to time-dependent behaviour.
Corrective action: Add DST-boundary and timezone test data (parameterise the env clock across both NZST and NZDT dates); add a "time-dependent behaviour" charter to the regression suite. Owner + date.

Marking: full marks need (a) two SEPARATE chains, (b) each ending at a system-level cause the team controls (a standard / an environment / a charter — not "the dev forgot"), and (c) a concrete corrective action per chain. A common miss is doing only the introduction path.
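
To see why a clock pinned to July hides this class of bug, here is a minimal sketch using Python's standard `zoneinfo` module (it assumes the system has IANA time zone data installed). The fixed `+12` renderer is a hypothetical stand-in for the defect, not Awa Health's actual code, and the two dates are simply one from each side of the daylight-saving change.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NZ = ZoneInfo("Pacific/Auckland")


def to_utc(local_wall_clock: datetime) -> datetime:
    """Store: interpret a naive NZ wall-clock time and convert it to UTC."""
    return local_wall_clock.replace(tzinfo=NZ).astimezone(timezone.utc)


def render_with_zone(stored_utc: datetime) -> str:
    """Render: convert at the edge using the IANA zone rules."""
    return stored_utc.astimezone(NZ).strftime("%H:%M")


def render_fixed_offset(stored_utc: datetime) -> str:
    """Hypothetical buggy renderer: treats NZ time as a constant UTC+12."""
    return (stored_utc + timedelta(hours=12)).strftime("%H:%M")


# A 9:00am slot on one standard-time date and one daylight-time date.
CLOCK_DATES = {
    "NZST": datetime(2025, 7, 15, 9, 0),
    "NZDT": datetime(2025, 1, 15, 9, 0),
}

# For each date: (zone-aware rendering, fixed-offset rendering).
results = {
    label: (render_with_zone(to_utc(slot)), render_fixed_offset(to_utc(slot)))
    for label, slot in CLOCK_DATES.items()
}
print(results)
```

On the July date both renderers agree, so a suite pinned to that date passes either way. Only the January date makes the fixed-offset renderer drift by an hour, which is the case for parameterising the environment clock across both NZST and NZDT.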

🔧 Exercise 2 of 3 — Make It Blameless

The post-mortem extract below is blameful and its action items are wishes. Rewrite it: remove the blame, reframe the root cause as a system gap, and replace the action items with specific, owned, dated, systemic ones.

Original (blameful):
“Root cause: Jordan deployed a config change on Friday without checking it properly and broke the login service for 2 hours. Jordan should have been more careful. Action items: (1) Jordan to be more careful with config. (2) Everyone should double-check before deploying. (3) Don’t deploy on Fridays.”

Rewrite the root cause and action items, blamelessly:

Model answer:
Root cause (system gap): A config change could reach production without an automated validation or a second review, so a single human error took down login. The gap is the missing guardrail, not the individual — anyone could have made the same change with the same result.

Action items:
1. Add config-schema validation to the deploy pipeline that fails the build on an invalid login-service config — owner: Platform team — by [date]. (Makes the failure mode impossible.)
2. Require a second-reviewer approval on production config changes to auth services — owner: Eng lead, via branch protection — by [date]. (Adds the missing gate.)
3. Add a fast rollback runbook + automated health check that auto-reverts login config on failed smoke test — owner: SRE — by [date]. (Cuts the 2-hour exposure.)

Note on "no Friday deploys": that's a contributing-factor mitigation, not a root-cause fix — fine to keep as a guideline, but it doesn't address the actual gap (no validation, no second review).

Marking: full marks remove all blame language and the name-as-cause, reframe the cause as a missing system guardrail, and give action items that are specific, owned, dated, and change the system. Items like "be more careful" or "double-check" score zero — they're wishes.
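
Action item 1 in the model answer is the kind of guardrail that fits in a few lines. A minimal sketch, assuming the login-service config is a flat dictionary; the key names and the example URL are hypothetical, and a real pipeline might use a schema library instead of hand-rolled checks.

```python
# Hypothetical required keys and their expected types.
REQUIRED_KEYS = {
    "auth_url": str,
    "session_timeout_s": int,
    "max_login_attempts": int,
}


def validate_login_config(config: dict) -> list:
    """Return a list of validation errors; an empty list means the config passes."""
    errors = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif type(config[key]) is not expected_type:
            actual = type(config[key]).__name__
            errors.append(f"{key}: expected {expected_type.__name__}, got {actual}")
        elif expected_type is int and config[key] <= 0:
            errors.append(f"{key}: must be positive")
    return errors


good = {
    "auth_url": "https://auth.example.test",
    "session_timeout_s": 900,
    "max_login_attempts": 5,
}
bad = {"auth_url": "https://auth.example.test", "session_timeout_s": "900"}

print(validate_login_config(good))  # []
print(validate_login_config(bad))
```

Run as a deploy-pipeline step, a non-empty error list fails the build, so the same slip by anyone on the team stops before production.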

🏗️ Exercise 3 of 3 — Fishbone the Failure

A fictional NZ logistics company’s parcel-tracking app showed delivered parcels as “in transit” for thousands of customers after a release. Build a fishbone analysis: list at least one plausible contributing cause under each category — People, Process, Tooling, Environment/Data, Requirements — then identify which one is the most likely root cause and justify why using the “would removing it have prevented the failure?” test.

Model answer:
A strong fishbone has a plausible, specific cause on every bone — not the same cause reworded:

People: The team had no one familiar with the carrier's status-code mapping; the "delivered" code was misunderstood.
Process: No review step checked the status-mapping logic against the carrier's spec before release.
Tooling: No contract test against the carrier API, so a mapping change wasn't flagged in CI.
Environment / Data: The test environment used mock carrier data that only included "in transit" statuses — "delivered" was never in the test data.
Requirements: The acceptance criteria never specified how each carrier status maps to a customer-facing state.

Most likely ROOT cause: Environment/Data — the test data contained no "delivered" status, so the mapping bug could not surface in any test. Apply the test: if the test data HAD included delivered parcels, the wrong mapping would have shown immediately and been caught → removing this gap would have prevented the escape. (Requirements is a strong second and is the introduction-side root; a top answer notes both: the requirement gap let the bug in, the data gap let it escape.)

Marking: full marks give a distinct, plausible cause per category (not five versions of one idea), pick a defensible root cause, and justify it with the counterfactual test. Bonus for separating the introduction root (requirements) from the escape root (test data).
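
The escape-side gap in this answer (test data with no "delivered" parcels) is easy to detect automatically. A minimal sketch, assuming a hypothetical status mapping and fixture format; the status codes are invented for illustration and are not any real carrier's API.

```python
# Hypothetical mapping from carrier status codes to customer-facing states.
CARRIER_TO_CUSTOMER = {
    "PICKED_UP": "In transit",
    "OUT_FOR_DELIVERY": "In transit",
    "DELIVERED": "Delivered",
}


def uncovered_statuses(mapping: dict, fixtures: list) -> list:
    """Return carrier statuses in the mapping that no test fixture exercises."""
    seen = {parcel["carrier_status"] for parcel in fixtures}
    return sorted(set(mapping) - seen)


# Test data like the environment described above: nothing is ever delivered.
fixtures = [
    {"id": "P1", "carrier_status": "PICKED_UP"},
    {"id": "P2", "carrier_status": "OUT_FOR_DELIVERY"},
]

print(uncovered_statuses(CARRIER_TO_CUSTOMER, fixtures))  # ['DELIVERED']
```

A check like this belongs in CI next to the fixtures: it does not prove the mapping is right, but it guarantees every status is at least present in the data a test could assert against.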

10 Self-Check

Q1: What are the two root-cause questions every escaped defect has, and which one does the tester own?

Why did the defect exist? (the introduction path — usually a code/process fix) and why did testing not catch it? (the escape path). The tester owns the second: every production defect is a missing or wrong test, and finding and adding it is what strengthens the suite.

Q2: How do you know when to stop asking “why” in a 5 Whys?

When you reach a cause your team can act on at the system level — a standard, a process, a test, an environment setting. If the last answer is “the engineer made a mistake” or “the customer did something odd,” you stopped one question early. People aren’t root causes; the system that let the mistake matter is.

Q3: What is the difference between a contributing cause and a root cause?

A contributing cause made the failure more likely or worse; the root cause is the one that, if removed, would have prevented the failure entirely. Test a candidate with: “If this hadn’t been true, would the incident still have happened?” If yes, it’s contributing, not root.

Q4: Why is a post-mortem “blameless,” and what does blame cost you?

Because the goal is to fix the system, not find a culprit — and because blame buys silence. The moment people fear being named, the honest account dries up and you lose the information that contains the root cause. Safety buys the truth.

Q5: What separates a real action item from a wish?

A real action item is specific, owned, dated, and systemic — it changes the system so the failure mode becomes impossible or is caught automatically, and it traces back to a root cause. “Be more careful” and “test more thoroughly” are wishes: no owner, no date, no system change.

11 Interview Prep

Real questions asked in NZ QA interviews for senior and lead roles. Read the model answers, then practise your own version.

“A production bug just got fixed. As the tester, what do you do next?”

I run a root cause analysis on two questions, not one. First, why did the defect exist — the introduction path, which usually points at a code or process gap the devs will own. Second, and this is mine to own, why did it get past testing — the escape path. Every production defect is a test that was missing or wrong, so I trace exactly which coverage, environment, or data gap let it through and I add that test. The fix closes the bug; my job is to make sure that class of bug can’t escape again. I’d capture both in a blameless post-mortem with owned, dated action items.

“Walk me through how you’d run a 5 Whys without it turning into finger-pointing.”

I keep one rule visible: people are never the root cause. Each “why” has to land on a system condition (a missing standard, an absent test, an unrealistic environment), not a person’s choice. If a chain ends at “Sam forgot,” I ask one more why: what in the system let Sam’s forgetting reach production? That reframes it from blame to a missing guardrail, which is both fairer and more useful. It also keeps people honest, because blame makes people go quiet and you lose the real story. And if the failure clearly had several contributing causes, I’d switch to a fishbone, since 5 Whys follows only one thread.

“What makes a post-mortem action item good or useless?”

A useless one is a wish — “be more careful,” “double-check deploys” — no owner, no date, no change to the system. A good one is specific, assigned, dated, and systemic: it changes the system so the failure mode is impossible or caught automatically, and it maps to a named root cause. My favourite kind makes the mistake un-makeable — a pipeline check that fails the build, a required second review on auth config — rather than just asking people to try harder. And it has to be tracked to completion in the backlog, or the post-mortem was just paperwork.