Capstone · Week 7 of 7

Incident Review — Chaos Day

The Tūāpapa portal has been live for three weeks. This morning, several things broke at once. Your final job: untangle the chain of failures, find the real causes, and write the blameless post-mortem that makes the team stronger instead of scared.

~35 min read · ~85 min with exercises

1 The Hook

8:40am. The Tūāpapa support queue lights up. Members can’t see their balance — the page spins and shows nothing. Others report that changing their contribution rate does nothing when they click Save. A member using a screen reader emails to say the whole contributions section has become unusable overnight. And the on-call engineer notices that a handful of rate changes submitted in the last hour seem to have been recorded twice in the fund administration system.

Four different symptoms, all at once, on a live KiwiSaver portal with 180,000 members and an FMA that cares. In the heat of it, the instinct is to find who broke it — “who deployed last night?” — and to fix each symptom in isolation. Both instincts will fail you. The symptoms are not four unrelated bugs; they are the visible ends of a smaller number of underlying causes, some of which trigger each other. And hunting for a culprit makes the engineers who understand the system best go quiet exactly when you need them talking.

This is the final skill of the capstone, and the one that separates a tester from a senior one: taking a pile of simultaneous production failures, diagnosing the chain that links them to their root causes, and writing it up in a way that fixes the system rather than punishing a person. Everything you learned in the first six weeks — requirements, design, evidence, defects, regression, release — comes back today, because every one of these failures traces to a decision made earlier.

Senior engineer insight

Testers who thrive in this phase stop asking "what broke?" and start asking "what linked these failures?" — the ability to see the failure chain before touching a single fix is what separates a methodical senior from someone who ships three hot-fixes and breaks a fourth thing. The unique pressure of Week 7 is that you have already been living with this codebase for six weeks: you think you know how it hangs together, and that familiarity is exactly what makes you skip the chain-drawing step and trust your gut instead.

Most common capstone mistake at this stage: writing a post-mortem that lists the symptoms as the root causes — "Save button id mismatch" is a symptom; "no contract check on breaking interface changes" is the root cause. One leads to a ten-minute patch, the other leads to a pipeline change that prevents the next six incidents.

From the field

A Wellington team delivering an upgrade to a Benefits NZ self-service portal assumed their post-release week would be quiet — they had green regression results and a sign-off from the accessibility consultant. What they discovered on day three was that two overnight configuration changes had interacted: a CDN cache rule was serving stale API responses to mobile users, and a front-end timeout that was supposed to show a fallback message was silently eating the errors instead, so members were seeing a blank screen with no explanation. The team's first instinct was to roll back the CDN change, which would have masked the timeout bug entirely. What changed their approach was a test lead who refused to touch the rollback until they had drawn the full failure chain on a whiteboard — it took twenty minutes they felt they didn't have, and it revealed that the real fix was a two-line timeout handler change that prevented any recurrence. The government's post-mortem template, modelled on the pattern used by the New Zealand Transport Accident Investigation Commission, requires that every action name a system or process gap rather than a person — a discipline the team later said made the debrief the most honest conversation they'd had all year.

2 The Rule

Multiple simultaneous failures are usually a chain, not a coincidence — trace the symptoms back to their root causes before you fix anything, because fixing a symptom can hide the cause that will strike again. And write the review blamelessly: incidents come from system and process gaps, not bad people, and a hunt for a culprit silences the only people who can fix it.

3 The Analogy

An air-accident investigator, not a blame court.

When an aircraft has an incident, NZ’s Transport Accident Investigation Commission does not start by asking which pilot to punish. They reconstruct the chain — this warning was missed because that alert was miscalibrated because this checklist had a gap — because they know accidents are almost never one person’s single mistake. They publish findings that change the system so the same chain cannot happen again, and they do it blamelessly so everyone tells the truth about what they saw.

A production incident review is the same discipline. You reconstruct the failure chain, you look for the system and process gaps that let it happen, and you write it so people speak openly. The moment it becomes a blame court, the truth dries up — and an incident you cannot honestly understand is one you will live through again.

4 Chaos Day: The Four Failures

Here is what the team pieces together over the morning. Four symptoms, and the technical detail behind each.

F1 — Balance won’t load. A new version of the fund administration API was deployed overnight. It renamed the balance field from balanceNZD to balance_nzd. The portal still asks for balanceNZD, gets nothing, and the page hangs waiting.

F2 — Save button does nothing. A front-end change shipped in the same release renamed the Save button’s automation locator/id. The portal’s own click handler was wired to the old id, so clicking Save now does nothing — and the automated regression test that should have caught it was also keyed to the old id, so it “passed” against an element that no longer existed.

F3 — Screen reader can’t use contributions. The fast-follow that added the screen-reader announcement (the Week 6 condition) shipped, but it moved the focus order and removed the label from the rate selector, so a screen reader now reads it as an unlabelled control. The accessibility fix introduced a new accessibility failure.

F4 — Some rate changes recorded twice. Because of F1 the pages were slow, so impatient members clicked Save more than once. With no protection against a double submission and a race in the rate-change handler, two near-simultaneous requests both wrote to the fund administration system — a duplicate rate change.

Notice already that these are not four independent bugs. F4 is partly caused by F1. F2 and F3 both come from changes that shipped without being properly re-tested. The job is to make that structure explicit.
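
The double-submit guard that F4 was missing can be sketched in a few lines of Python. This is an illustration under assumptions, not the portal's actual code: the `RateChangeHandler` class, the `request_id` idempotency key, and the in-memory `writes` list (standing in for the fund administration system) are all hypothetical. The point it shows is that the "have I seen this request?" check and the write must happen as one atomic step, otherwise two near-simultaneous clicks can both get through.

```python
import threading


class RateChangeHandler:
    """Hypothetical rate-change handler with a double-submit guard."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()   # request ids already processed
        self.writes = []     # stands in for the fund administration system

    def submit(self, request_id, member_id, new_rate):
        # Check-and-record must be atomic: without the lock, two
        # near-simultaneous requests could both pass the "seen" check.
        with self._lock:
            if request_id in self._seen:
                return "duplicate-ignored"
            self._seen.add(request_id)
            self.writes.append((member_id, new_rate))
        return "recorded"


handler = RateChangeHandler()

# An impatient member clicks Save five times; every click carries the
# same request id, so only one of them should reach the back end.
threads = [
    threading.Thread(target=handler.submit, args=("req-001", "member-42", 6))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(handler.writes))  # 1
```

In a real deployment the set of seen ids would live in shared storage rather than process memory, and disabling the Save button after the first click would be a second, front-end layer of the same guard.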

5 Reading the Failure Chain

A failure chain maps how one thing led to another, separating triggers from consequences. Drawing it stops you fixing symptoms in the wrong order and reveals the few causes behind the many symptoms.

Overnight release (API + front-end) shipped without contract / regression checks
  ├─ API renamed balanceNZD → balance_nzd ........... F1 balance won’t load
  │   └─ slow/spinning pages → members double-click Save
  │      └─ no double-submit guard + race ........... F4 duplicate rate change
  └─ front-end renamed Save id; click handler & test both on old id .. F2 Save does nothing

Fast-follow accessibility change moved focus / dropped label ........... F3 screen reader broken

Two root causes: (A) a release shipped with breaking changes that integration
  and regression testing did not catch; (B) changes were not re-tested for the
  effects they had on connected behaviour (accessibility, concurrency).

Once it is drawn, the four symptoms collapse into two root causes. F1 and F2 are both “a breaking change shipped untested.” F3 and F4 are both “a change’s effect on connected behaviour was not re-tested” — the exact regression thinking from Week 5. The chain is what turns “four fires” into “two things to actually fix.”
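
If it helps to see the collapse mechanically, the chain can be written down as a small graph and walked backwards. The sketch below is one possible encoding in Python: the node labels are shorthand for this incident, and the `CAUSED_BY` mapping and `root_causes` helper are illustrative names, not part of any tool. Note that F4 traces back to both roots: it was triggered through F1 (root cause A) and made possible by the missing guard (root cause B).

```python
# Each entry maps an effect to the things that directly caused it.
# Any node that never appears as a key is treated as a root cause.
ROOT_A = "A: breaking changes shipped without contract/regression checks"
ROOT_B = "B: change effects on connected behaviour not re-tested"

CAUSED_BY = {
    "F1 balance won't load": ["API renamed balanceNZD to balance_nzd"],
    "API renamed balanceNZD to balance_nzd": [ROOT_A],
    "F2 Save does nothing": ["Save id renamed; handler and test on old id"],
    "Save id renamed; handler and test on old id": [ROOT_A],
    "F3 screen reader broken": ["fast-follow moved focus and dropped label"],
    "fast-follow moved focus and dropped label": [ROOT_B],
    "F4 duplicate rate change": [
        "members double-click Save",
        "no double-submit guard + race",
    ],
    "members double-click Save": ["F1 balance won't load"],
    "no double-submit guard + race": [ROOT_B],
}

SYMPTOMS = [
    "F1 balance won't load",
    "F2 Save does nothing",
    "F3 screen reader broken",
    "F4 duplicate rate change",
]


def root_causes(node, seen=None):
    """Walk back from a symptom and return the set of root causes behind it."""
    seen = set() if seen is None else seen
    if node in seen:          # guard against accidental cycles
        return set()
    seen.add(node)
    parents = CAUSED_BY.get(node)
    if not parents:           # nothing caused it: this is a root
        return {node}
    roots = set()
    for parent in parents:
        roots |= root_causes(parent, seen)
    return roots


for symptom in SYMPTOMS:
    print(symptom, "<-", sorted(root_causes(symptom)))

all_roots = set().union(*(root_causes(s) for s in SYMPTOMS))
print(len(SYMPTOMS), "symptoms,", len(all_roots), "root causes")  # 4 symptoms, 2 root causes
```

A whiteboard does the same job faster in the moment; the value of writing it as data is that the chain can be attached to the post-mortem and checked by someone who was not in the room.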

Pro tip: When several things fail together, resist fixing any of them until you have drawn the chain. Fixing F1 first (so pages load) would have made F4 “disappear” without anyone realising the double-submit race was still there, waiting for the next slow day.

6 Root Cause vs Symptom

A symptom is what the member saw. A root cause is the underlying gap that, if fixed, stops the symptom recurring. The discipline is to keep asking “and why did that happen?” until you reach something you can actually change in the system or process.

Symptom: the Save button does nothing (F2).
Why? The click handler points at an old element id.
Why didn’t testing catch it? The regression test was keyed to the same old id, so it passed against an element that no longer existed — a test that cannot fail is worse than no test.
Why was a breaking rename shipped at all? There was no contract check between front-end and back-end, and no review step that flags a renamed id as breaking.
Root cause: the release process has no guard against breaking interface changes, and the test suite has locators that silently pass when the thing they target is gone.
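
The "locator that silently passes" half of that root cause is easier to recognise once you have seen one. The Python sketch below is an assumption-laden illustration: the page is a plain dictionary standing in for the DOM, the element ids are made up, and the incident notes do not say exactly how the real test managed to pass, so the "skip when missing" pattern shown here is only one common way it happens.

```python
class ElementNotFound(AssertionError):
    """Raised when a locator matches nothing on the page."""


def find_quiet(page, element_id):
    # Returns None for a missing element; callers can carry on regardless.
    return page.get(element_id)


def find_loud(page, element_id):
    # Fails immediately, naming the locator, when the target is gone.
    if element_id not in page:
        raise ElementNotFound(f"no element with id {element_id!r} on the page")
    return page[element_id]


def save_test_quiet(page):
    button = find_quiet(page, "save-btn")
    if button is not None:      # silently skipped once the id was renamed
        assert button["enabled"]
    return "pass"


def save_test_loud(page):
    button = find_loud(page, "save-btn")
    assert button["enabled"]
    return "pass"


# After the release the button carries a new id (both ids are hypothetical).
page_after_release = {"save-rate-btn": {"enabled": True}}

print(save_test_quiet(page_after_release))  # pass, although the target is missing

try:
    save_test_loud(page_after_release)
except ElementNotFound as err:
    print("FAIL:", err)
```

The quiet version asserts nothing when the element is absent, so it cannot fail; the loud version turns the rename into a red build on the night it ships.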

Fixing the symptom is re-pointing the handler at the new id — ten minutes, and it will happen again next rename. Fixing the root cause is adding a contract check and making locators fail loudly when their target is missing. A post-mortem that stops at the symptom guarantees a repeat.
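
A contract check does not have to be elaborate to catch a rename like the one behind F1. The following Python sketch assumes a hypothetical list of fields the portal reads (`PORTAL_EXPECTS`) and two made-up sample responses; only the field names `balanceNZD` and `balance_nzd` come from the incident, and `memberId` is invented for illustration. A real pipeline would more likely use a dedicated contract-testing tool, but the idea is the same: compare what the consumer expects with what the provider now returns, before release.

```python
# Fields the portal reads from the balance response (consumer expectations).
PORTAL_EXPECTS = {"memberId": str, "balanceNZD": float}


def contract_violations(expected, response):
    """Return a list of readable contract breaks; an empty list means compatible."""
    problems = []
    for field, field_type in expected.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], field_type):
            problems.append(
                f"wrong type for {field}: expected {field_type.__name__}"
            )
    return problems


# Sample responses with invented values, before and after the rename.
old_api_response = {"memberId": "M-1001", "balanceNZD": 15230.55}
new_api_response = {"memberId": "M-1001", "balance_nzd": 15230.55}

print(contract_violations(PORTAL_EXPECTS, old_api_response))  # []
print(contract_violations(PORTAL_EXPECTS, new_api_response))  # ['missing field: balanceNZD']
```

Wired into the release pipeline, a non-empty result would block the deploy, which moves the discovery of the rename from 8:40am in production to the moment the change is proposed.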

7 The Blameless Post-Mortem

A post-mortem is the written record of an incident: what happened, the timeline, the chain, the root causes, and the actions to prevent recurrence. “Blameless” does not mean no accountability — it means the actions land on systems and processes, not on naming a person who made a normal mistake under normal conditions.

Post-mortem — Tūāpapa portal incident, Chaos Day

Summary:   For ~2 hours, members could not view balances or change rates;
           screen-reader users were locked out; some rate changes duplicated.
Impact:    ~X members affected; Y duplicate changes corrected; FMA notified.
Timeline:  02:00 release deployed; 08:40 first reports; 09:15 triggers identified
           (the two renames); 10:30 API rolled back; 11:00 service restored.
Failure chain: (as drawn in section 5).
Root causes: (A) breaking changes shipped without an interface/contract check
           or effective regression; (B) change effects on connected behaviour
           (accessibility, concurrency) not re-tested.
What went well: on-call escalation was fast; rollback worked cleanly.
Actions:   1. Add a front-end/back-end contract check to the pipeline (owner, date).
           2. Make UI locators fail when their target is missing (owner, date).
           3. Add a double-submit guard on the rate-change handler (owner, date).
           4. Re-test accessibility after every accessibility change (owner, date).
Language:  No individuals named. Causes described as system/process gaps.

Every action is a change to the system, each has an owner and a date, and not a single line names who typed the rename. That is what makes people tell you the truth next time — and what makes the portal genuinely safer, which is the only point of writing it.
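
The "owner and date" rule is simple enough to check automatically before a post-mortem is circulated. The Python sketch below is one hypothetical way to do it: the `Action` record and `missing_parts` helper are illustrative names, the owners are teams rather than individuals, and the last two entries are deliberately incomplete to show what gets flagged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    description: str
    owner: Optional[str] = None   # a team or role, never a blamed individual
    due: Optional[str] = None     # deadline as written in the post-mortem


def missing_parts(action):
    """List what an action still needs before it is more than a wish."""
    missing = []
    if not action.owner:
        missing.append("owner")
    if not action.due:
        missing.append("date")
    return missing


actions = [
    Action("Add a front-end/back-end contract check to the pipeline",
           "Platform team", "30 June"),
    Action("Make UI locators fail when their target is missing",
           "QA team", "30 June"),
    Action("Add a double-submit guard on the rate-change handler", "Dev team"),
    Action("We should re-test accessibility more often"),
]

incomplete = [(a.description, missing_parts(a)) for a in actions if missing_parts(a)]
for description, missing in incomplete:
    print(f"NOT READY: {description} (missing {', '.join(missing)})")
```

A check like this only confirms the fields are filled in; someone still has to track that each action actually lands.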

Pro tip: Always include a “what went well” section. Incidents are not only failures — the fast escalation and clean rollback are practices worth keeping, and naming them keeps the review honest rather than purely negative.

8 Common Mistakes

🚫 Treating simultaneous failures as unrelated bugs

Why it happens: Four symptoms look like four problems, and fixing each one feels like progress.
The fix: Draw the failure chain first. The duplicate rate change was caused by the slow pages, which were caused by the API rename — fixing them in isolation, or in the wrong order, can hide a cause that strikes again. Several symptoms usually trace to a few root causes.

🚫 Stopping at the symptom

Why it happens: Re-pointing the click handler at the new id makes Save work again, and the fire is out.
The fix: That fixes today and guarantees a repeat at the next rename. Keep asking “and why did that happen?” until you reach a system or process gap you can change — here, no contract check and locators that pass when their target is gone.

🚫 Hunting for who to blame

Why it happens: “Who deployed last night?” feels like the fastest route to accountability.
The fix: Blame makes the people who understand the system best go quiet, so the truth you need to prevent a repeat dries up. The engineer who shipped the rename did a normal thing in a process with no guard. Fix the process; the rename was the trigger, not the cause.

🚫 Actions with no owner or date

Why it happens: “We should add a contract check” feels like a conclusion, so the write-up ends there.
The fix: An action without an owner and a date is a wish that never happens, and the same incident recurs in three months. Every action needs a named owner and a deadline, and someone tracking that they actually land.

9 Now You Try

Three graded exercises on the Chaos Day incident. Write your answer, get feedback from your test lead, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Bad Incident Response

A teammate’s reaction to Chaos Day is below. Identify 3 problems with this approach and explain why each will make the incident worse or recur.

“Right, four bugs. Let’s just fix them one by one — I’ll re-point the Save button first so that’s working again. Whoever deployed that API change last night really dropped the ball, we should put that in the report. Once Save works and the balance loads, we’re done — the double-up thing probably won’t happen again now the pages are fast.”

List 3 problems and explain each:

Model answer

This response is a textbook set of incident mistakes. Any three earn full marks.

1. Treating the four symptoms as four unrelated bugs and not drawing the chain — the duplicate rate change (F4) is caused by the slow pages (F1), which are caused by the API rename. They are not independent, and fixing them in isolation hides the structure.

2. Blaming "whoever deployed the API change" — this is a blame hunt. The rename was the trigger, not the root cause; the cause is that there was no contract check or effective regression to catch a breaking change. Naming a person silences the people who can actually help and puts blame in the report, which is exactly what a blameless review avoids.

3. Stopping at symptoms and declaring the double-submit "probably won't happen again" — fixing F1 makes F4 disappear without fixing the real cause (no double-submit guard + a race). The next slow day, it duplicates rate changes again. Symptom-fixing guarantees a repeat.

Bonus: "once Save works we're done" — no root-cause analysis, no actions with owners/dates, no re-test of the accessibility regression (F3) at all. The right move is to draw the chain, reach the two root causes, and write blameless actions with owners and dates.

🔧 Exercise 2 of 3 — Build the Failure Chain

Using the four failures (F1 balance won’t load; F2 Save does nothing; F3 screen reader broken; F4 duplicate rate change), map the failure chain — show which failures trigger which, and collapse the four symptoms into the two root causes. Show your reasoning, not just a list.

Reminders: F1 = API renamed the balance field (overnight release). F2 = front-end renamed the Save id; both the click handler and the regression test used the old id. F3 = the accessibility fast-follow moved focus and dropped the selector’s label. F4 = slow pages led members to double-click Save; no double-submit guard + a race wrote the change twice.

Map the chain and name the two root causes:

Model answer

Trigger(s): the overnight release (a back-end API change + a front-end change), plus the accessibility fast-follow.

The chain:
 - API renamed balanceNZD -> balance_nzd; the portal still requests the old name -> F1 balance won't load (page spins).
 - F1's slow/spinning pages made impatient members click Save more than once -> with no double-submit guard and a race in the rate handler, two requests both wrote -> F4 duplicate rate change. (So F4 is a CONSEQUENCE of F1, not independent.)
 - The front-end renamed the Save button's id; the click handler was still on the old id -> F2 Save does nothing. The regression test was also keyed to the old id, so it passed against an element that no longer existed and didn't catch it.
 - The accessibility fast-follow moved focus order and removed the selector's label -> F3 screen reader broken (a new accessibility failure from an accessibility fix).

Two root causes:
 (A) Breaking interface changes (the API field rename, the Save id rename) shipped without an interface/contract check or effective regression — F1 and F2.
 (B) The effect of a change on connected behaviour was not re-tested — concurrency for the rate handler (F4) and accessibility after the accessibility change (F3).

The key insight: four symptoms, two causes; and F4 is downstream of F1, so fixing F1 alone would mask F4's real cause.

🏗️ Exercise 3 of 3 — Write the Blameless Post-Mortem

Write the blameless post-mortem for Chaos Day. Include: Summary, Impact, Timeline, Failure chain, Root causes, What went well, and Actions (each with an owner and a date). Name no individuals; every action must target a system or process.

Model answer

Post-mortem — Tūāpapa portal incident (Chaos Day)

Summary: For about two hours after the first member reports, following an overnight release, members could not view balances or change their contribution rate, screen-reader users were locked out of the contributions section, and a small number of rate changes were recorded twice.

Impact: ~X members affected over the window; Y duplicate rate changes identified and corrected in the fund administration system; FMA notified per our obligations; no funds lost.

Timeline: 02:00 release deployed (API + front-end). 08:40 first member reports. 09:15 triggers identified (API field rename + Save id rename). 10:30 API change rolled back. 11:00 service restored; duplicate changes reconciled by midday.

Failure chain: API renamed the balance field -> balance won't load -> slow pages led members to double-click Save -> no double-submit guard + a race -> duplicate rate change. Separately, the front-end renamed the Save id while the click handler (and the regression test) stayed on the old id -> Save did nothing and the test passed against a missing element. The accessibility fast-follow moved focus and dropped the selector label -> screen reader broken.

Root causes: (A) breaking interface changes shipped without a contract check or effective regression; (B) change effects on connected behaviour (concurrency, accessibility) were not re-tested.

What went well: on-call escalation was fast; the API rollback was clean; duplicates were detected and reconciled the same day.

Actions:
 1. Add a front-end/back-end contract check to the pipeline — Platform team — by 30 June.
 2. Make UI test locators fail when their target element is missing — QA team — by 30 June.
 3. Add a double-submit guard + fix the race on the rate-change handler — Dev team — by 27 June.
 4. Mandate an accessibility re-test after any accessibility change — QA team — ongoing, from next sprint.

Language: no individuals named; all causes framed as system/process gaps.

A strong post-mortem reconstructs the chain, reaches the two root causes, lists actions that each target the system with an owner and date, and names no one. It makes the portal safer instead of making people afraid.

Why teams fail here

They fix symptoms in the order they're reported rather than in the order the chain demands — resolving the API rename first makes the duplicate-write bug disappear from the dashboard, and nobody realises the double-submit race is still live until the next slow deploy day.
They write post-mortem actions that describe the symptom fix rather than the system change — "re-pointed the Save handler to the new id" is a patch note, not a process improvement; the action that prevents a repeat is "add a contract check to the release pipeline."
They let a blame narrative form in the room before the chain is drawn — once someone says "whoever renamed that field broke everything," the engineer who did it stops contributing, and the team loses the person with the deepest context on exactly what changed and why.
They skip the "what went well" section because the incident feels purely negative. Omitting it produces a one-sided document that demoralises the team and lets the practices that actually limited the damage (fast escalation, clean rollback) go unrecognised and erode.

Key takeaway

Week 7 does not teach you how to fix production bugs — it teaches you that the way you document an incident determines whether your team is stronger or more frightened the next time something breaks.

10 Self-Check

Q1: Why draw the failure chain before fixing any of the symptoms?

Because simultaneous failures are usually linked, and fixing one can hide the cause of another. The duplicate rate change was a consequence of the slow pages caused by the API rename — fixing the pages first would make the duplication “disappear” while the underlying double-submit race sat waiting for the next slow day. The chain turns many symptoms into a few real causes.

Q2: What is the difference between a symptom and a root cause?

A symptom is what the member saw — Save does nothing. A root cause is the underlying gap that, if fixed, stops it recurring — no contract check on breaking changes, and locators that pass when their target is gone. You reach it by repeatedly asking “and why did that happen?” until you hit something you can change in the system or process.

Q3: How did one automated test “pass” against a broken Save button?

The regression test was keyed to the same old element id that the front-end had renamed, so it was checking an element that no longer existed and reported success. A test that cannot fail is worse than no test — it gives false confidence. The fix is locators that fail loudly when their target is missing.

Q4: Why is a blame hunt actively harmful in an incident review?

Because it makes the people who understand the system best go quiet, so the truth you need to prevent a repeat dries up. Incidents come from system and process gaps, not bad people doing normal work; the engineer who renamed the field was the trigger, not the cause. Blameless reviews keep people talking and fix the process.

Q5: What makes a post-mortem action worth writing down?

It targets a system or process gap (not a person), and it has a named owner and a date. “We should add a contract check” with no owner is a wish that never happens and the incident recurs. An action with an owner, a deadline, and someone tracking it is what actually makes the system stronger.

11 Interview Prep

Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.

“Several things break in production at once. Walk me through how you’d approach it.”

First I stabilise: escalate, and if there’s a clean rollback, take it. But before fixing each symptom, I draw the failure chain, because simultaneous failures are usually linked and fixing them in the wrong order can mask a cause. In a recent example, a duplicate-write symptom was actually downstream of slow pages caused by an API field rename, so fixing the pages alone would have hidden the real concurrency bug. Once the chain is drawn, the handful of symptoms usually collapse into a couple of root causes, and I keep asking “why did that happen” until I reach a system or process gap I can actually change.

“What is a blameless post-mortem and why does it matter?”

It’s a written incident review where the actions land on systems and processes, not on naming the person who made a normal mistake. It matters because the moment a review becomes a blame court, the people who understand the system best stop telling you the truth — and an incident you can’t honestly understand is one you’ll repeat. Blameless doesn’t mean no accountability; the accountability is to fix the gap that let a normal human action cause an outage. I’d frame every cause as a system or process gap and give every action an owner and a date.

“A test passed but the feature was clearly broken in production. How does that happen?”

Usually the test was checking the wrong thing or checking something that no longer existed. In the Tūāpapa incident, a Save button’s id was renamed, but the regression test was still keyed to the old id, so it “passed” against an element that wasn’t there — a test that literally couldn’t fail. That’s worse than no test, because it gives false confidence. The fix is making locators fail loudly when their target is missing, and treating an always-green test as a smell. It’s a reminder that a passing suite is only as trustworthy as what the tests actually assert.
