Test Manager Practice Exercise 03

Incident Post-Mortem Facilitation

A production defect slipped through your test process. You now have to lead the post-mortem without making it a blame session — and come out with real process improvements.

Post-Mortem Scenario

What Happened

You are Test Manager at Ōtautahi Health, a Christchurch digital health platform serving 45,000 patients. Last Thursday at 6:14 PM NZST, a deployed feature broke the patient appointment booking flow for almost 3 hours. Root cause: a null check was removed during a refactor, so the flow crashed for first-time patients with no appointment history.

2,340 patients affected (error page when attempting to book)
17 calls to the support line
CEO notified
Hotfix deployed at 9:08 PM NZST (2h 54m outage)
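
The outage duration in the last bullet can be checked from the two timestamps in the scenario. This is a minimal Python sketch, assuming both times fall on the same calendar day; the variable names are illustrative.

```python
from datetime import timedelta

# Times of day from the scenario, expressed as offsets from midnight (NZST).
first_failure = timedelta(hours=18, minutes=14)   # 6:14 PM
hotfix_deployed = timedelta(hours=21, minutes=8)  # 9:08 PM

outage = hotfix_deployed - first_failure
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60

print(f"Outage: {hours}h {minutes}m")  # Outage: 2h 54m
```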

The defect was introduced in a PR merged 4 days earlier. Your test suite had 87% line coverage. The failing scenario — first-time patient with no appointment history — was not in any test case.

Your task: Lead the blameless post-mortem session tomorrow morning. Answer the questions below as you plan your approach.

Question 1 of 4

You open the post-mortem meeting. What is the FIRST thing you say?

Why B is correct: Setting the blameless frame at the very start of the session is not a formality — it actively changes what people say and how honestly they participate. If the first words are about finding fault, people become defensive and the real systemic causes stay hidden. Options A, C, and D all assign blame (to the PR approver, to a coverage target, to the dev team). A post-mortem that starts with blame produces defensive silence, not process improvement.

Question 2 of 4

When reviewing the root cause, the team discovers that the removal of the null check went unnoticed in review because the PR reviewer was under time pressure from a release deadline. What is the systemic fix?

Why C is correct: There are two distinct problems here: a missing test scenario (technical gap) and a process pressure that caused a reviewer to rush (systemic gap). You need to fix both. Option A is blame, not a fix — the reviewer was responding rationally to the system they were in. Option B is a blunt instrument; 100% line coverage still would not have caught this specific scenario. Option D treats all PRs identically regardless of risk, which wastes time and creates friction that people will eventually route around.

Question 3 of 4

A developer says "our 87% coverage should have caught this." How do you respond?

Why B is correct: This is a critical distinction for Test Managers to be able to articulate clearly. Line coverage tells you which lines of code were executed by your test suite — it says nothing about which real-world states and user journeys were exercised. The null check code path may well have been "covered" by an existing test that always supplied appointment history. The missing scenario is a boundary condition (first-time user, empty dataset) that requires deliberate test design, not just more coverage percentage. Option A misses the point entirely — 95% line coverage could still skip this scenario. Option C is blame. Option D is unhelpfully dismissive.
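
The distinction can be made concrete with a small Python sketch. The function `last_visit_summary` and both checks are hypothetical and are not the platform's actual code; they only illustrate how a suite can execute every line of a function and still never exercise the first-time-patient state.

```python
from typing import List, Optional


def last_visit_summary(appointment_history: Optional[List[str]]) -> str:
    """Hypothetical post-refactor helper: the guard for a missing history is gone."""
    # The removed guard might have looked like:
    #     if not appointment_history:
    #         return "No previous visits"
    latest = appointment_history[-1]
    return f"Last visit: {latest}"


def test_returning_patient() -> None:
    # Executes every line of last_visit_summary, so a line-coverage
    # tool would report the function as fully covered.
    assert last_visit_summary(["2024-03-01"]) == "Last visit: 2024-03-01"


def first_time_patient_crashes() -> bool:
    """Return True if the untested state (no history at all) raises."""
    try:
        last_visit_summary(None)
    except TypeError:
        return True
    return False
```

In this sketch `test_returning_patient` passes and covers every line, yet `first_time_patient_crashes()` returns `True`. The coverage figure would not change until someone deliberately designs a test for the empty state.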

Question 4 of 4

At the end of the post-mortem, what is the RIGHT output document?

Why B is correct: The output of a blameless post-mortem is a structured document that creates accountability for process improvements — not for individuals. It should have a clear timeline (what happened, when), a systemic root cause (why the system allowed this), and concrete action items with owners and due dates. Without owners and dates, action items become aspirations. Option A is a disciplinary record, not a quality improvement. Option C is a blame document dressed as a report. Option D is a policy overcorrection that will be gamed (lines covered with trivial tests) rather than genuinely improving scenario coverage.

A Strong Post-Mortem Output

Work through all four questions first, then compare your answers with what a well-run post-mortem produces.

The 5 Sections of a Blameless Incident Report

1. Summary. A 3–4 sentence overview of what happened, when, how many users were affected, and what resolved it. Written for a leadership audience — no jargon.

2. Timeline. A chronological list of events from the PR merge to the hotfix deployment. Include detection time, escalation events, and key decisions. Be precise — timestamps matter.

3. Root Cause (5-Whys). Trace from symptom to systemic cause without stopping at the first human decision. The goal is the process failure, not the person.

4. Contributing Factors. Secondary conditions that made the failure more likely or harder to detect — deadline pressure, test data gaps, missing acceptance criteria for edge cases.

5. Action Items. 3–5 concrete items with named owners and due dates. No owner + no date = no action.
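
The rule in section 5 lends itself to an automated check before the report is published. The sketch below is one possible shape, not a prescribed tool: the `ActionItem` class, its field names, and the sample items are assumptions made for illustration, and the rule is interpreted strictly (an item needs both an owner and a due date).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One post-mortem action item; field names are illustrative."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def is_actionable(self) -> bool:
        # Strict reading of "No owner + no date = no action":
        # both fields must be filled in.
        return bool(self.owner and self.owner.strip()) and self.due is not None


def unactionable(items: List[ActionItem]) -> List[str]:
    """Return descriptions of items missing an owner or a due date."""
    return [item.description for item in items if not item.is_actionable()]


items = [
    ActionItem("Add first-time-patient scenarios to the regression suite",
               owner="QA lead", due=date(2030, 1, 1)),  # placeholder date
    ActionItem("Improve edge-case coverage"),  # no owner, no date
]
print(unactionable(items))  # ['Improve edge-case coverage']
```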

5-Whys: This Incident

Why 1: The booking flow crashed for first-time patients.
Why 2: A null check on appointment history was removed during a refactor.
Why 3: No test case existed for a first-time patient with no appointment history.
Why 4: The user story's acceptance criteria did not specify edge cases for empty datasets.
Why 5: No 3-Amigos session ran for this story, so QA was not involved in acceptance criteria definition.

Root cause: QA is excluded from story refinement, so edge cases are only discovered after code is written — or not at all.

3 Concrete Action Items

Action	Owner	Due
Add "first-time patient, no appointment history" test data and scenarios to the regression suite	Dev + QA lead	3 days
Add an "Edge Cases" section to the story template in Jira; make it mandatory before a story moves to In Progress	Product Owner + Tech Lead	1 week
Flag deadline-pressure PRs (created within 24h of release freeze) for extended review by a second senior engineer	Tech Lead	Process change — effective immediately
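
The third action item can be expressed as a simple rule. The sketch below assumes that "within 24h of release freeze" means the PR was created during the 24 hours leading up to the freeze; the function name, parameters, and sample timestamps are hypothetical, and wiring it into a real code-review tool is left out.

```python
from datetime import datetime, timedelta

# From the action item: PRs created within 24h of the release freeze.
REVIEW_WINDOW = timedelta(hours=24)


def needs_extended_review(pr_created_at: datetime,
                          release_freeze_at: datetime,
                          window: timedelta = REVIEW_WINDOW) -> bool:
    """True if the PR was opened during the window leading up to the freeze."""
    time_to_freeze = release_freeze_at - pr_created_at
    return timedelta(0) <= time_to_freeze <= window


# Placeholder timestamps, for illustration only.
freeze = datetime(2030, 1, 10, 17, 0)
print(needs_extended_review(datetime(2030, 1, 10, 9, 0), freeze))  # True
print(needs_extended_review(datetime(2030, 1, 8, 9, 0), freeze))   # False
```

PRs created after the freeze fall outside this rule and would need their own policy.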

NZ health context: Under the Privacy Act 2020 and the Health Information Privacy Code 2020, incidents involving patient data (including unauthorised exposure of booking details or health identifiers) may need to be notified to the Office of the Privacy Commissioner (OPC). Even where patient data was not exposed, an outage of nearly 3 hours on a health platform may trigger obligations under your organisation's privacy impact assessment. Always check with the Privacy Officer before the post-mortem closes.