Incident Review — Chaos Day
The Tūāpapa portal has been live for three weeks. This morning, several things broke at once. Your final job: untangle the chain of failures, find the real causes, and write the blameless post-mortem that makes the team stronger instead of scared.
1 The Hook
8:40am. The Tūāpapa support queue lights up. Members can’t see their balance — the page spins and shows nothing. Others report that changing their contribution rate does nothing when they click Save. A member using a screen reader emails to say the whole contributions section has become unusable overnight. And the on-call engineer notices that a handful of rate changes submitted in the last hour seem to have been recorded twice in the fund administration system.
Four different symptoms, all at once, on a live KiwiSaver portal with 180,000 members and an FMA that cares. In the heat of it, the instinct is to find who broke it — “who deployed last night?” — and to fix each symptom in isolation. Both instincts will fail you. The symptoms are not four unrelated bugs; they are the visible ends of a smaller number of underlying causes, some of which trigger each other. And hunting for a culprit makes the engineers who understand the system best go quiet exactly when you need them talking.
This is the final skill of the capstone, and the one that separates a tester from a senior one: taking a pile of simultaneous production failures, diagnosing the chain that links them to their root causes, and writing it up in a way that fixes the system rather than punishing a person. Everything you learned in the first six weeks — requirements, design, evidence, defects, regression, release — comes back today, because every one of these failures traces to a decision made earlier.
2 The Rule
Multiple simultaneous failures are usually a chain, not a coincidence — trace the symptoms back to their root causes before you fix anything, because fixing a symptom can hide the cause that will strike again. And write the review blamelessly: incidents come from system and process gaps, not bad people, and a hunt for a culprit silences the only people who can fix it.
3 The Analogy
An air-accident investigator, not a blame court.
When an aircraft has an incident, NZ’s Transport Accident Investigation Commission does not start by asking which pilot to punish. They reconstruct the chain — this warning was missed because that alert was miscalibrated because this checklist had a gap — because they know accidents are almost never one person’s single mistake. They publish findings that change the system so the same chain cannot happen again, and they do it blamelessly so everyone tells the truth about what they saw.
A production incident review is the same discipline. You reconstruct the failure chain, you look for the system and process gaps that let it happen, and you write it so people speak openly. The moment it becomes a blame court, the truth dries up — and an incident you cannot honestly understand is one you will live through again.
4 Chaos Day: The Four Failures
Here is what the team pieces together over the morning. Four symptoms, and the technical detail behind each.
F1 — Balance won’t load. A back-end API change shipped in the overnight release renamed the balance field from balanceNZD to balance_nzd. The portal still asks for balanceNZD, gets nothing, and the page hangs waiting.
F2 — Save button does nothing. A front-end change shipped in the same release renamed the Save button’s id (also its automation locator). The portal’s own click handler was wired to the old id, so clicking Save now does nothing. The automated regression test that should have caught it was also keyed to the old id, so it “passed” against an element that no longer existed.
F3 — Screen reader can’t use contributions. The fast-follow that added the screen-reader announcement (the Week 6 condition) shipped, but it moved the focus order and removed the label from the rate selector, so a screen reader now reads it as an unlabelled control. The accessibility fix introduced a new accessibility failure.
F4 — Some rate changes recorded twice. Because of F1 the pages were slow, so impatient members clicked Save more than once. With no protection against a double submission and a race in the rate-change handler, two near-simultaneous requests both wrote to the fund administration system — a duplicate rate change.
Notice already that these are not four independent bugs. F4 is partly caused by F1. F2 and F3 both come from changes shipped in the same release that were not properly re-tested. The job is to make that structure explicit.
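The double-submission race behind F4 can be sketched in a few lines. This is a minimal illustration, not the portal's real handler: the class name, the idempotency key, and the in-memory list standing in for the fund administration system are all assumptions made for the example. The idea is that both clicks carry the same key, and the "have I seen this key?" check and the write happen under one lock.

```python
import threading


class RateChangeHandler:
    """Hypothetical rate-change handler; not the portal's actual code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen_keys = set()  # idempotency keys already processed
        self.writes = []         # stands in for the fund administration system

    def submit(self, idempotency_key, member_id, new_rate):
        # Check and write happen under one lock, so two near-simultaneous
        # requests cannot both pass the "not seen yet" check.
        with self._lock:
            if idempotency_key in self._seen_keys:
                return "duplicate_ignored"
            self._seen_keys.add(idempotency_key)
            self.writes.append((member_id, new_rate))
        return "recorded"


handler = RateChangeHandler()
results = []


def click_save():
    # Both clicks come from the same form, so they share one key (assumed).
    results.append(handler.submit("form-123", "member-42", 6))


threads = [threading.Thread(target=click_save) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))      # ['duplicate_ignored', 'recorded']
print(len(handler.writes))  # 1
```

Without the lock and the key, both requests would reach the write, which is the duplicate the on-call engineer saw. Fixing F1 makes members less likely to double-click, but it does nothing to this code path.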
5 Reading the Failure Chain
A failure chain maps how one thing led to another, separating triggers from consequences. Drawing it stops you fixing symptoms in the wrong order and reveals the few causes behind the many symptoms.
Overnight release (back-end API change + front-end change)
├─ API renamed balanceNZD → balance_nzd ........... F1 balance won’t load
│  └─ slow/spinning pages → members double-click Save
│     └─ no double-submit guard + race ........... F4 duplicate rate change
└─ front-end renamed Save id; click handler & test both on old id .. F2 Save does nothing
Fast-follow accessibility change moved focus / dropped label ........... F3 screen reader broken
Two root causes: (A) a release shipped with breaking changes that integration
and regression testing did not catch; (B) changes were not re-tested for the
effects they had on connected behaviour (accessibility, concurrency).
Once it is drawn, the four symptoms collapse into two root causes. F1 and F2 are both “a breaking change shipped untested.” F3 and F4 are both “a change’s effect on connected behaviour was not re-tested” — the exact regression thinking from Week 5. The chain is what turns “four fires” into “two things to actually fix.”
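One way to check a drawn chain is to write it down as data and walk it. The sketch below is an illustration only: the node names are shortened from the Chaos Day description, and the choice to model the chain as a "caused by" dictionary is an assumption, not a prescribed tool. Any node with no recorded cause is treated as a root cause.

```python
ROOT_A = "A: breaking change shipped untested"
ROOT_B = "B: effect on connected behaviour not re-tested"

# Each node maps to the things that directly caused it.
CAUSED_BY = {
    "F1 balance won't load": ["API field renamed"],
    "F2 Save does nothing": ["Save id renamed"],
    "F3 screen reader broken": ["accessibility fast-follow"],
    "F4 duplicate rate change": [
        "members double-click Save",
        "no double-submit guard + race",
    ],
    "members double-click Save": ["F1 balance won't load"],
    "API field renamed": [ROOT_A],
    "Save id renamed": [ROOT_A],
    "accessibility fast-follow": [ROOT_B],
    "no double-submit guard + race": [ROOT_B],
}

SYMPTOMS = [
    "F1 balance won't load",
    "F2 Save does nothing",
    "F3 screen reader broken",
    "F4 duplicate rate change",
]


def root_causes(node):
    """Follow 'caused by' links until reaching nodes with no recorded cause."""
    causes = CAUSED_BY.get(node, [])
    if not causes:
        return {node}
    roots = set()
    for cause in causes:
        roots |= root_causes(cause)
    return roots


for symptom in SYMPTOMS:
    print(symptom, "<-", sorted(root_causes(symptom)))

all_roots = set()
for symptom in SYMPTOMS:
    all_roots |= root_causes(symptom)
print(len(SYMPTOMS), "symptoms,", len(all_roots), "root causes")  # 4 symptoms, 2 root causes
```

In this model F4 reaches both root causes: it is triggered through F1 (A) and made possible by the missing guard (B). That is the structure to keep in mind when deciding what to fix and in which order.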
6 Root Cause vs Symptom
A symptom is what the member saw. A root cause is the underlying gap that, if fixed, stops the symptom recurring. The discipline is to keep asking “and why did that happen?” until you reach something you can actually change in the system or process.
Symptom: clicking Save does nothing.
Why? The click handler points at an old element id.
Why didn’t testing catch it? The regression test was keyed to the same old id, so it passed against an element that no longer existed — a test that cannot fail is worse than no test.
Why was a breaking rename shipped at all? There was no contract check between front-end and back-end, and no review step that flags a renamed id as breaking.
Root cause: the release process has no guard against breaking interface changes, and the test suite has locators that silently pass when the thing they target is gone.
Fixing the symptom is re-pointing the handler at the new id — ten minutes, and it will happen again next rename. Fixing the root cause is adding a contract check and making locators fail loudly when their target is missing. A post-mortem that stops at the symptom guarantees a repeat.
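A contract check can be very small. The sketch below assumes the portal's expectations are written down as a mapping of field names to types and compared against a sample API response in the pipeline. The field memberId, the sample values, and the function name are hypothetical; only balanceNZD and balance_nzd come from the incident.

```python
# Fields the portal reads from the balance response (hypothetical contract).
PORTAL_EXPECTS = {"memberId": str, "balanceNZD": float}


def contract_violations(response, expected=PORTAL_EXPECTS):
    """Return a list of ways the response breaks what the portal expects."""
    problems = []
    for field, expected_type in expected.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems


# Sample responses with made-up values, before and after the rename.
old_api = {"memberId": "m-42", "balanceNZD": 18250.75}
new_api = {"memberId": "m-42", "balance_nzd": 18250.75}

print(contract_violations(old_api))  # []
print(contract_violations(new_api))  # ['missing field: balanceNZD']
```

Run as a pipeline step, a non-empty list would stop the release before members ever saw a spinning page. The same idea applies to the Save button id: the front-end and its tests share an implicit contract that nothing was checking.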
7 The Blameless Post-Mortem
A post-mortem is the written record of an incident: what happened, the timeline, the chain, the root causes, and the actions to prevent recurrence. “Blameless” does not mean no accountability — it means the actions land on systems and processes, not on naming a person who made a normal mistake under normal conditions.
Summary: For ~2 hours, members could not view balances or change rates;
screen-reader users were locked out; some rate changes duplicated.
Impact: ~X members affected; Y duplicate changes corrected; FMA notified.
Timeline: 02:00 release deployed; 08:40 first reports; 09:15 root cause found;
10:30 API rolled back; 11:00 service restored.
Failure chain: (as drawn in section 5).
Root causes: (A) breaking changes shipped without an interface/contract check
or effective regression; (B) change effects on connected behaviour
(accessibility, concurrency) not re-tested.
What went well: on-call escalation was fast; rollback worked cleanly.
Actions: 1. Add a front-end/back-end contract check to the pipeline (owner, date).
2. Make UI locators fail when their target is missing (owner, date).
3. Add a double-submit guard on the rate-change handler (owner, date).
4. Re-test accessibility after every accessibility change (owner, date).
Language: No individuals named. Causes described as system/process gaps.
Every action is a change to the system, each has an owner and a date, and not a single line names who typed the rename. That is what makes people tell you the truth next time — and what makes the portal genuinely safer, which is the only point of writing it.
8 Common Mistakes
🚫 Treating simultaneous failures as unrelated bugs
Why it happens: Four symptoms look like four problems, and fixing each one feels like progress.
The fix: Draw the failure chain first. The duplicate rate change was caused by the slow pages, which were caused by the API rename — fixing them in isolation, or in the wrong order, can hide a cause that strikes again. Several symptoms usually trace to a few root causes.
🚫 Stopping at the symptom
Why it happens: Re-pointing the click handler at the new id makes Save work again, and the fire is out.
The fix: That fixes today and guarantees a repeat at the next rename. Keep asking “and why did that happen?” until you reach a system or process gap you can change — here, no contract check and locators that pass when their target is gone.
🚫 Hunting for who to blame
Why it happens: “Who deployed last night?” feels like the fastest route to accountability.
The fix: Blame makes the people who understand the system best go quiet, so the truth you need to prevent a repeat dries up. The engineer who shipped the rename did a normal thing in a process with no guard. Fix the process; the rename was the trigger, not the cause.
🚫 Actions with no owner or date
Why it happens: “We should add a contract check” feels like a conclusion, so the write-up ends there.
The fix: An action without an owner and a date is a wish that never happens, and the same incident recurs in three months. Every action needs a named owner and a deadline, and someone tracking that they actually land.
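The owner-and-date rule is easy to check mechanically. As a rough sketch, assuming actions are kept as simple records (the Action class and the wishes function are invented for this example), anything missing an owner or a date is flagged before the post-mortem is signed off:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    description: str
    owner: Optional[str] = None
    due: Optional[str] = None


def wishes(actions):
    """Return descriptions of actions that lack an owner or a date."""
    return [a.description for a in actions if not a.owner or not a.due]


actions = [
    Action("Add a front-end/back-end contract check to the pipeline",
           "Platform team", "30 June"),
    Action("Make UI test locators fail when their target is missing",
           "QA team", "30 June"),
    Action("We should add a double-submit guard"),  # no owner, no date
]

print(wishes(actions))  # ['We should add a double-submit guard']
```

The check only proves the fields are filled in. Someone still has to track that each action lands.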
9 Now You Try
Three graded exercises on the Chaos Day incident. Write your answer, get feedback on it from your test lead, then compare it to the model answer.
A teammate’s reaction to Chaos Day is below. Identify 3 problems with this approach and explain why each will make the incident worse or recur.
List 3 problems and explain each:
Model answer
This response is a textbook set of incident mistakes. Any three earn full marks.
1. Treating the four symptoms as four unrelated bugs and not drawing the chain. The duplicate rate change (F4) is caused by the slow pages (F1), which are caused by the API rename. They are not independent, and fixing them in isolation hides the structure.
2. Blaming "whoever deployed the API change". This is a blame hunt. The rename was the trigger, not the root cause; the cause is that there was no contract check or effective regression to catch a breaking change. Naming a person silences the people who can actually help and puts blame in the report, which is exactly what a blameless review avoids.
3. Stopping at symptoms and declaring the double-submit "probably won't happen again". Fixing F1 makes F4 disappear without fixing the real cause (no double-submit guard + a race). The next slow day, it duplicates rate changes again. Symptom-fixing guarantees a repeat.
Bonus: "once Save works we're done". No root-cause analysis, no actions with owners/dates, no re-test of the accessibility regression (F3) at all.
The right move is to draw the chain, reach the two root causes, and write blameless actions with owners and dates.
Using the four failures (F1 balance won’t load; F2 Save does nothing; F3 screen reader broken; F4 duplicate rate change), map the failure chain — show which failures trigger which, and collapse the four symptoms into the two root causes. Show your reasoning, not just a list.
Map the chain and name the two root causes:
Model answer
Trigger(s): the overnight release (a back-end API change + a front-end change), plus the accessibility fast-follow.
The chain:
- API renamed balanceNZD -> balance_nzd; the portal still requests the old name -> F1 balance won't load (page spins).
- F1's slow/spinning pages made impatient members click Save more than once -> with no double-submit guard and a race in the rate handler, two requests both wrote -> F4 duplicate rate change. (So F4 is a CONSEQUENCE of F1, not independent.)
- The front-end renamed the Save button's id; the click handler was still on the old id -> F2 Save does nothing. The regression test was also keyed to the old id, so it passed against an element that no longer existed and didn't catch it.
- The accessibility fast-follow moved focus order and removed the selector's label -> F3 screen reader broken (a new accessibility failure from an accessibility fix).
Two root causes:
(A) Breaking interface changes (the API field rename, the Save id rename) shipped without an interface/contract check or effective regression: F1 and F2.
(B) The effect of a change on connected behaviour was not re-tested: concurrency for the rate handler (F4) and accessibility after the accessibility change (F3).
The key insight: four symptoms, two causes; and F4 is downstream of F1, so fixing F1 alone would mask F4's real cause.
Write the blameless post-mortem for Chaos Day. Include: Summary, Impact, Timeline, Failure chain, Root causes, What went well, and Actions (each with an owner and a date). Name no individuals; every action must target a system or process.
Model answer
Post-mortem: Tūāpapa portal incident (Chaos Day)
Summary: For about two hours after an overnight release, members could not view balances or change their contribution rate, screen-reader users were locked out of the contributions section, and a small number of rate changes were recorded twice.
Impact: ~X members affected over the window; Y duplicate rate changes identified and corrected in the fund administration system; FMA notified per our obligations; no funds lost.
Timeline: 02:00 release deployed (API + front-end). 08:40 first member reports. 09:15 root cause identified (API field rename + Save id rename). 10:30 API change rolled back. 11:00 service restored; duplicate changes reconciled by midday.
Failure chain: API renamed the balance field -> balance won't load -> slow pages led members to double-click Save -> no double-submit guard + a race -> duplicate rate change. Separately, the front-end renamed the Save id while the click handler (and the regression test) stayed on the old id -> Save did nothing and the test passed against a missing element. The accessibility fast-follow moved focus and dropped the selector label -> screen reader broken.
Root causes: (A) breaking interface changes shipped without a contract check or effective regression; (B) change effects on connected behaviour (concurrency, accessibility) were not re-tested.
What went well: on-call escalation was fast; the API rollback was clean; duplicates were detected and reconciled the same day.
Actions:
1. Add a front-end/back-end contract check to the pipeline (Platform team, by 30 June).
2. Make UI test locators fail when their target element is missing (QA team, by 30 June).
3. Add a double-submit guard + fix the race on the rate-change handler (Dev team, by 27 June).
4. Mandate an accessibility re-test after any accessibility change (QA team, ongoing, from next sprint).
Language: no individuals named; all causes framed as system/process gaps.
A strong post-mortem reconstructs the chain, reaches the two root causes, lists actions that each target the system with an owner and date, and names no one. It makes the portal safer instead of making people afraid.
10 Self-Check
Q1: Why draw the failure chain before fixing any of the symptoms?
Because simultaneous failures are usually linked, and fixing one can hide the cause of another. The duplicate rate change was a consequence of the slow pages caused by the API rename — fixing the pages first would make the duplication “disappear” while the underlying double-submit race sat waiting for the next slow day. The chain turns many symptoms into a few real causes.
Q2: What is the difference between a symptom and a root cause?
A symptom is what the member saw — Save does nothing. A root cause is the underlying gap that, if fixed, stops it recurring — no contract check on breaking changes, and locators that pass when their target is gone. You reach it by repeatedly asking “and why did that happen?” until you hit something you can change in the system or process.
Q3: How did one automated test “pass” against a broken Save button?
The regression test was keyed to the same old element id that the front-end had renamed, so it was checking an element that no longer existed and reported success. A test that cannot fail is worse than no test — it gives false confidence. The fix is locators that fail loudly when their target is missing.
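The incident notes do not say exactly how the test reported success, so the following is only one plausible mechanism, shown with a toy page model. The element ids, the dictionary standing in for the page, and both check functions are invented for illustration. The lenient check skips its assertion when the element is missing; the strict check treats a missing target as a failure.

```python
# A toy "page": element id -> properties. The Save button's id was renamed
# in the release (both ids here are hypothetical).
page_after_release = {"btn-save-rate": {"enabled": True}}


def lenient_check(page, element_id):
    """Mimics a test that silently skips when its target is missing."""
    element = page.get(element_id)
    if element is not None:
        assert element["enabled"], f"{element_id} is disabled"
    return "passed"


def strict_check(page, element_id):
    """Fails loudly when the locator's target is not on the page."""
    if element_id not in page:
        raise LookupError(f"locator target not found: {element_id}")
    assert page[element_id]["enabled"], f"{element_id} is disabled"
    return "passed"


# The test is still keyed to the old id.
print(lenient_check(page_after_release, "save-btn"))  # passed

try:
    strict_check(page_after_release, "save-btn")
except LookupError as err:
    print(err)  # locator target not found: save-btn
```

The lenient version can never fail once its target is gone, which is what makes an always-green test a warning sign rather than reassurance.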
Q4: Why is a blame hunt actively harmful in an incident review?
Because it makes the people who understand the system best go quiet, so the truth you need to prevent a repeat dries up. Incidents come from system and process gaps, not from bad people. The field rename was a normal action in a process with no guard: it was the trigger, not the cause. Blameless reviews keep people talking and fix the process.
Q5: What makes a post-mortem action worth writing down?
It targets a system or process gap (not a person), and it has a named owner and a date. “We should add a contract check” with no owner is a wish that never happens and the incident recurs. An action with an owner, a deadline, and someone tracking it is what actually makes the system stronger.
11 Interview Prep
Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.
“Several things break in production at once. Walk me through how you’d approach it.”
First I stabilise: escalate, and if there’s a clean rollback, take it. But before fixing each symptom, I draw the failure chain, because simultaneous failures are usually linked and fixing them in the wrong order can mask a cause. In a recent example, a duplicate-write symptom was actually downstream of slow pages caused by an API field rename, so fixing the pages would have hidden the real concurrency bug. Once the chain is drawn, the handful of symptoms usually collapse into a couple of root causes, and I keep asking “why did that happen” until I reach a system or process gap I can actually change.
“What is a blameless post-mortem and why does it matter?”
It’s a written incident review where the actions land on systems and processes, not on naming the person who made a normal mistake. It matters because the moment a review becomes a blame court, the people who understand the system best stop telling you the truth — and an incident you can’t honestly understand is one you’ll repeat. Blameless doesn’t mean no accountability; the accountability is to fix the gap that let a normal human action cause an outage. I’d frame every cause as a system or process gap and give every action an owner and a date.
“A test passed but the feature was clearly broken in production. How does that happen?”
Usually the test was checking the wrong thing or checking something that no longer existed. In the Tūāpapa incident, a Save button’s id was renamed, but the regression test was still keyed to the old id, so it “passed” against an element that wasn’t there — a test that literally couldn’t fail. That’s worse than no test, because it gives false confidence. The fix is making locators fail loudly when their target is missing, and treating an always-green test as a smell. It’s a reminder that a passing suite is only as trustworthy as what the tests actually assert.