Incident Post-Mortem Facilitation
A production defect slipped through your test process. You now have to lead the post-mortem without making it a blame session — and come out with real process improvements.
What Happened
You are the Test Manager at Ōtautahi Health, a Christchurch digital health platform serving 45,000 patients. Last Thursday at 6:14 PM NZST, a newly deployed feature broke the patient appointment booking flow for just under 3 hours. Root cause: a null check was removed during a refactor, so the booking flow crashed for first-time patients with no appointment history.
- 2,340 patients affected (error page when attempting to book)
- 17 calls to the support line
- CEO notified
- Hotfix deployed at 9:08 PM NZST (2h 54m outage)
The defect was introduced in a PR merged 4 days earlier. Your test suite had 87% line coverage. The failing scenario — first-time patient with no appointment history — was not in any test case.
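To make the root cause concrete, here is a minimal sketch of what a refactor like this might look like. The function names, the `clinic` field, and the `DEFAULT_CLINIC` value are all hypothetical; the actual Ōtautahi Health code is not shown in this scenario.

```python
from typing import Optional

# Hypothetical default for patients with no appointment history.
DEFAULT_CLINIC = "general"


def preferred_clinic_before(history: Optional[list]) -> str:
    """Assumed pre-refactor shape: guards against missing history."""
    if history is None or len(history) == 0:
        return DEFAULT_CLINIC
    return history[-1]["clinic"]


def preferred_clinic_after(history: Optional[list]) -> str:
    """Assumed post-refactor shape: the guard has been removed."""
    # Raises TypeError when history is None and IndexError when it is empty.
    return history[-1]["clinic"]
```

Both versions behave identically for a returning patient, which is why a suite that only ever supplies appointment history would pass before and after the change.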
Your task: Lead the blameless post-mortem session tomorrow morning. Answer the questions below as you plan your approach.
Question 1 of 4
You open the post-mortem meeting. What is the FIRST thing you say?
Why B is correct: Setting the blameless frame at the very start of the session is not a formality — it actively changes what people say and how honestly they participate. If the first words are about finding fault, people become defensive and the real systemic causes stay hidden. Options A, C, and D all assign blame (to the PR approver, to a coverage target, to the dev team). A post-mortem that starts with blame produces defensive silence, not process improvement.
Question 2 of 4
When reviewing the root cause, the team discovers that the removal of the null check went unnoticed in review because the PR reviewer was under time pressure from a release deadline. What is the systemic fix?
Why C is correct: There are two distinct problems here: a missing test scenario (technical gap) and a process pressure that caused a reviewer to rush (systemic gap). You need to fix both. Option A is blame, not a fix — the reviewer was responding rationally to the system they were in. Option B is a blunt instrument; 100% line coverage still would not have caught this specific scenario. Option D treats all PRs identically regardless of risk, which wastes time and creates friction that people will eventually route around.
Question 3 of 4
A developer says "our 87% coverage should have caught this." How do you respond?
Why B is correct: This is a critical distinction for Test Managers to be able to articulate clearly. Line coverage tells you which lines of code were executed by your test suite — it says nothing about which real-world states and user journeys were exercised. The null check code path may well have been "covered" by an existing test that always supplied appointment history. The missing scenario is a boundary condition (first-time user, empty dataset) that requires deliberate test design, not just more coverage percentage. Option A misses the point entirely — 95% line coverage could still skip this scenario. Option C is blame. Option D is unhelpfully dismissive.
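The gap between line coverage and scenario coverage can be sketched in a few lines. This is an illustration only: `latest_clinic`, the scenario names, and the `TESTED` set are hypothetical stand-ins, not the real booking code or test suite.

```python
def latest_clinic(history):
    # Hypothetical refactored function: straight-line code with no guard,
    # so a single passing call executes every line in it.
    latest = history[-1]
    return latest["clinic"]


# States a deliberate test design would enumerate (hypothetical names).
SCENARIOS = {
    "returning_patient": [{"clinic": "cardiology"}],
    "first_time_patient_empty_list": [],
    "first_time_patient_no_record": None,
}

# Assumption: the existing suite only ever supplied appointment history.
TESTED = {"returning_patient"}


def run_scenario(name):
    """Return 'ok' or the name of the exception the scenario raises."""
    try:
        latest_clinic(SCENARIOS[name])
        return "ok"
    except (TypeError, IndexError) as exc:
        return type(exc).__name__


untested = sorted(set(SCENARIOS) - TESTED)
results = {name: run_scenario(name) for name in SCENARIOS}

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The one tested scenario executes every line of `latest_clinic`, so a line-coverage report would show the function as fully covered, yet both untested scenarios fail. The percentage measures executed lines; the `SCENARIOS` table measures exercised states.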
Question 4 of 4
At the end of the post-mortem, what is the RIGHT output document?
Why B is correct: The output of a blameless post-mortem is a structured document that creates accountability for process improvements — not for individuals. It should have a clear timeline (what happened, when), a systemic root cause (why the system allowed this), and concrete action items with owners and due dates. Without owners and dates, action items become aspirations. Option A is a disciplinary record, not a quality improvement. Option C is a blame document dressed as a report. Option D is a policy overcorrection that will be gamed (lines covered with trivial tests) rather than genuinely improving scenario coverage.
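One way to keep action items from turning into aspirations is to check them mechanically before the report is published. The sketch below is an assumption about how that could be done, not a prescribed tool: the `ActionItem` class, the `incomplete_items` helper, the placeholder review date, and the deliberately vague second item are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None  # a role or a name; None means nobody owns it
    due: Optional[date] = None   # None means no deadline was agreed


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Return the actions that lack an owner or a due date."""
    return [item.action for item in items if not item.owner or item.due is None]


# Placeholder date, used only to turn a relative deadline into a concrete one.
review_date = date(2024, 1, 1)

items = [
    ActionItem(
        action="Add first-time patient scenarios to the regression suite",
        owner="Dev + QA lead",
        due=review_date + timedelta(days=3),
    ),
    # Hypothetical vague item of the kind a post-mortem should reject.
    ActionItem(action="Improve testing"),
]

print(incomplete_items(items))
```

Any item the helper returns goes back to the group for an owner and a date before the meeting closes.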
A Strong Post-Mortem Output
Work through all four questions first, then compare your answers with what a well-run post-mortem produces.
The 5 Sections of a Blameless Incident Report
5-Whys: This Incident
- Why 1: The booking flow crashed for first-time patients.
- Why 2: A null check on appointment history was removed during a refactor.
- Why 3: No test case existed for a first-time patient with no appointment history.
- Why 4: The user story's acceptance criteria did not specify edge cases for empty datasets.
- Why 5: No 3-Amigos session ran for this story, so QA was not involved in defining the acceptance criteria.
Root cause: QA is excluded from story refinement, so edge cases are only discovered after code is written — or not at all.
3 Concrete Action Items
| Action | Owner | Due |
|---|---|---|
| Add "first-time patient, no appointment history" test data and scenarios to the regression suite | Dev + QA lead | 3 days |
| Add an "Edge Cases" section to the story template in Jira; make it mandatory before a story moves to In Progress | Product Owner + Tech Lead | 1 week |
| Flag deadline-pressure PRs (created within 24h of release freeze) for extended review by a second senior engineer | Tech Lead | Process change — effective immediately |
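
The third action item is a rule that can be automated. The sketch below assumes "within 24h of release freeze" means the PR was created during the 24 hours before the freeze; the function name, the window constant, and the example timestamps are hypothetical, and wiring this into a real code-hosting platform is out of scope here.

```python
from datetime import datetime, timedelta

# Assumed window, taken from the "within 24h of release freeze" rule.
EXTENDED_REVIEW_WINDOW = timedelta(hours=24)


def needs_extended_review(
    pr_created_at: datetime,
    release_freeze_at: datetime,
    window: timedelta = EXTENDED_REVIEW_WINDOW,
) -> bool:
    """True if the PR was created in the window leading up to the freeze."""
    lead_time = release_freeze_at - pr_created_at
    return timedelta(0) <= lead_time <= window


# Placeholder timestamps for illustration only.
freeze = datetime(2024, 1, 10, 17, 0)
late_pr = freeze - timedelta(hours=10)
early_pr = freeze - timedelta(days=3)

print(needs_extended_review(late_pr, freeze))
print(needs_extended_review(early_pr, freeze))
```

Keeping the window as a parameter lets the team tune the rule after a few releases rather than hard-coding the first guess.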