Test Estimation & Planning · Lesson 2

Risk-Based Prioritisation

There is never enough time to test everything. The lead who ranks the test backlog by risk — and can defend cutting the bottom of it — ships safely on a fixed date. The lead who tests in the order things were built runs out of time on the things that matter.

Lead · Test Estimation & Planning · Lesson 2 of 3 · ~30 min read · ~70 min with exercises

1 The Hook

A fictional NZ government agency, Hononga Services, was shipping an online portal upgrade with a fixed go-live tied to a policy start date. The test backlog had about 200 test cases. With two weeks left, the lead realised there was time for maybe 130. He chose the obvious-looking approach: work through the backlog in the order the features had been built, top to bottom.

He got through the first 130. They were the features built first — the account-settings screens, the help pages, the profile editor. All tested, all green. The 70 he never reached were at the bottom of the build order: the payment step, the eligibility calculation, and the integration that pushed applications to the back-office system. The portal went live with its most dangerous components barely touched.

Within days the eligibility calculation was rejecting valid applicants and the back-office integration was silently dropping a slice of submissions. The agency had thoroughly tested the parts that almost did not matter and skipped the parts that could not be allowed to fail. The lead had not run out of time on the important things by accident — he had spent his time on the unimportant things first, by choosing the wrong order.

Here is the lesson. When you cannot test everything — which is always — the order is the whole game. Testing in build order, or alphabetical order, or whatever-is-in-front-of-you order, means the work you drop is chosen by chance. Risk-based prioritisation makes the order deliberate: test the things most likely to fail and most damaging if they do, first, so that whatever you run out of time for is the stuff that matters least.

2 The Rule

You can never test everything, so the only question that matters is order. Rank every test by risk — likelihood of failure times impact if it fails — and test highest risk first. Then whatever you run out of time for is, by design, the least dangerous thing to leave untested. Order is not an afterthought; it is the decision.

3 The Analogy

A WOF (Warrant of Fitness) inspection with the car booked in for only an hour.

A good mechanic with limited time does not start at the front bumper and work backwards, checking every bulb and trim clip before getting to the brakes. They go straight to the things that fail a car and hurt people if they are wrong — brakes, steering, tyres, seatbelts — because those are high impact and, on an older car, reasonably likely to have a problem. The chipped wing mirror gets looked at only if there is time left.

A bad mechanic checks the easy, visible stuff first because it is satisfying to tick off, and runs out of time before the brakes. Both spent the same hour. One produced a safe car and one produced a tidy list of passed bulbs and untested brakes. Risk-based prioritisation is testing like the good mechanic: the brakes before the wing mirror, every time, because the order is what decides whether the time was well spent.

4 Likelihood and Impact

Risk in testing has two dimensions, and you need both. A risk is the chance something fails multiplied by how much it hurts if it does.

Likelihood — how probable is a defect in this area? It is driven by things you can actually assess: how complex the code is, how much it changed in this release, how new the technology is, how experienced the team is with it, and how buggy it has been historically. A brand-new, complex calculation written under time pressure is high likelihood. A stable screen untouched for two years is low.

Impact — if it does fail, how bad is the damage? This is a business question, not a technical one, and it is where NZ context bites. For a bank, impact is financial loss, regulatory breach, and customer harm. For HealthNZ, it is patient safety and continuity of care. For Revenue NZ, it is wrong assessments and money moving incorrectly. A cosmetic glitch on a help page is low impact however likely it is; a wrong benefit calculation is high impact even if unlikely.

You need both because either alone misleads. A highly likely but trivial defect (a typo) is not where your scarce time goes. Neither is a catastrophic but essentially impossible failure. The work that earns your first hours is high on both axes: likely to break and devastating if it does.
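To see why either axis alone misleads, here is a minimal sketch. It assumes each axis is scored 1 (low) to 3 (high) and the two are multiplied; the three items and their scores are illustrative, not taken from a real backlog.

```python
# Illustrative scores on an assumed 1 (low) to 3 (high) scale for each axis.
items = {
    "typo on a help page": {"likelihood": 3, "impact": 1},
    "catastrophic but near-impossible failure": {"likelihood": 1, "impact": 3},
    "new benefit calculation": {"likelihood": 3, "impact": 3},
}


def risk(scores):
    """Risk is likelihood multiplied by impact."""
    return scores["likelihood"] * scores["impact"]


def top_by(key):
    """Names of the items that tie for the highest value under `key`."""
    best = max(key(s) for s in items.values())
    return sorted(name for name, s in items.items() if key(s) == best)


by_likelihood = top_by(lambda s: s["likelihood"])  # the typo ties for first place
by_impact = top_by(lambda s: s["impact"])          # the near-impossible failure ties for first place
by_risk = top_by(risk)                             # only the item high on both axes

print("By likelihood alone:", by_likelihood)
print("By impact alone:    ", by_impact)
print("By likelihood x impact:", by_risk)
```

Ranked on one axis, the typo or the near-impossible failure shares the top spot with the calculation. Only the product puts the calculation clearly first.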

Pro tip: Set impact with the business, not alone. Testers are good at judging likelihood — you can see the complexity and the change. But impact is the business’s call: only they can tell you that a half-day outage of one screen is survivable while a single wrong payment is not. A risk ranking the business helped set is one they will back when you use it to cut scope.

5 The Risk Matrix

The likelihood-times-impact idea becomes a working tool as a matrix. Score likelihood and impact each on a simple scale, say 1 (low) to 3 (high), and multiply the two for a risk priority number.

Risk priority = Likelihood (1–3) × Impact (1–3)

               Impact 1   Impact 2   Impact 3
Likelihood 3       3          6          9   ← test first
Likelihood 2       2          4          6
Likelihood 1       1          2          3
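The matrix is small enough to generate rather than memorise. A minimal sketch of the formula follows; the function name `risk_priority` and the validation are illustrative choices, not part of any standard tool.

```python
def risk_priority(likelihood: int, impact: int) -> int:
    """Risk priority = likelihood x impact, each scored 1 (low) to 3 (high)."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3, got {value!r}")
    return likelihood * impact


# Rows run from likelihood 3 down to 1, columns from impact 1 up to 3,
# matching the layout of the matrix above.
matrix = [[risk_priority(l, i) for i in (1, 2, 3)] for l in (3, 2, 1)]

for likelihood, row in zip((3, 2, 1), matrix):
    print(f"Likelihood {likelihood}: {row}")
```

Rejecting scores outside 1 to 3 keeps a stray 4 or 0 from quietly distorting the ranking.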

Worked example — Hononga portal upgrade:

  Eligibility calculation   L=3 (new, complex)    × I=3 (wrong = denied entitlements) = 9
  Back-office integration   L=3 (new interface)   × I=3 (dropped applications)        = 9
  Payment step              L=2                   × I=3 (money wrong)                 = 6
  Profile editor            L=2 (changed)         × I=1 (cosmetic)                    = 2
  Help pages                L=1 (static)          × I=1                               = 1

Now the order is obvious and defensible: test the 9s, then the 6, and the 1s and 2s are exactly what you drop if the clock runs out. Compare this to Hononga’s actual build order, which put the help pages (score 1) before the eligibility calculation (score 9). The matrix is the difference between leaving the score-1 items untested and leaving the score-9 items untested.

The numbers are a tool for thinking and a tool for the conversation, not a precise truth. A 9 versus a 6 is a real distinction worth acting on; agonising over whether something is a 5 or a 6 is not. The matrix earns its place by making the ranking visible, shared, and defensible — not by being exact.
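The Hononga worked example can be turned into a test order with a single sort. This sketch uses the scores from the worked example; the tuple layout and variable names are assumptions made for illustration, and the list starts lowest score first so the sort visibly does the work.

```python
# Scores copied from the worked example: (area, likelihood, impact).
areas = [
    ("Help pages", 1, 1),
    ("Profile editor", 2, 1),
    ("Payment step", 2, 3),
    ("Eligibility calculation", 3, 3),
    ("Back-office integration", 3, 3),
]

# Sort by likelihood x impact, highest first. sorted() is stable, so areas
# with equal scores keep their original relative order.
ranked = sorted(areas, key=lambda a: a[1] * a[2], reverse=True)

for name, likelihood, impact in ranked:
    print(f"{likelihood * impact}  {name}")
```

The two 9s come out on top and the help pages land last, so whatever falls off the end of the list is the safest thing to drop.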

6 What to Test First When Time Is Short

The matrix gives you a ranking; turning it into a test order under real time pressure needs a couple of refinements.

Highest risk first, always. Start at the top of the ranking. The score-9 items get tested even if every score-9 item takes longer than planned — they do not get pushed to make room for cheaper, lower-risk tests.
Test the riskiest things early, not just thoroughly. Finding a serious defect on day one leaves time to fix and re-test it. Finding it on the last day does not. So the highest-risk areas go first in the calendar, not just first in the ranking — you want bad news as early as possible.
A shallow test of a high-risk area beats a deep test of a low-risk one. If time forces a choice, a smoke test across all the score-9 areas is worth more than exhaustive testing of one of them and nothing on the rest. Cover the dangerous ground broadly before you go deep anywhere.
Re-rank as you learn. Risk is a forecast. If an area you scored low starts throwing defects, its likelihood just went up — move it up the order. Prioritisation is continuous, not a one-time sort.

The combined rule of thumb: broad coverage of the high-impact areas first and early, then depth where the risk is highest, re-ranking as the evidence comes in.
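One way to sketch that rule of thumb is a two-pass plan: a shallow pass across the high-risk areas, then depth in risk order until the budget runs out. Everything numeric here is hypothetical: the hour estimates, the 16-hour budget, and the choice to treat a score of 6 or more as high risk.

```python
from dataclasses import dataclass


@dataclass
class Area:
    name: str
    score: int          # likelihood x impact
    smoke_hours: float  # hypothetical cost of a shallow pass
    deep_hours: float   # hypothetical extra cost of going deep


def plan(areas, budget_hours, high_risk_at=6):
    """Return (area name, depth) steps in the order they would be run."""
    by_risk = sorted(areas, key=lambda a: a.score, reverse=True)
    steps, remaining = [], budget_hours

    # Pass 1: broad, shallow coverage of every high-risk area.
    for area in by_risk:
        if area.score < high_risk_at or area.smoke_hours > remaining:
            break
        steps.append((area.name, "smoke"))
        remaining -= area.smoke_hours

    # Pass 2: depth, highest risk first. Stop at the first area that does not
    # fit rather than skipping ahead to cheaper, lower-risk work.
    for area in by_risk:
        if area.deep_hours > remaining:
            break
        steps.append((area.name, "deep"))
        remaining -= area.deep_hours

    return steps


backlog = [
    Area("Eligibility calculation", 9, smoke_hours=2, deep_hours=8),
    Area("Back-office integration", 9, smoke_hours=2, deep_hours=8),
    Area("Payment step", 6, smoke_hours=1, deep_hours=4),
    Area("Profile editor", 2, smoke_hours=1, deep_hours=2),
    Area("Help pages", 1, smoke_hours=0.5, deep_hours=1),
]

schedule = plan(backlog, budget_hours=16)
for name, depth in schedule:
    print(f"{depth:5}  {name}")
```

With these made-up numbers, all three high-risk areas get a smoke test before any area gets depth, and the leftover hours are not spent on the profile editor just because it would fit. Re-ranking as defects come in would mean re-running `plan` with updated scores.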

7 Cutting Scope Defensibly

When the time runs out before the backlog does — and it will — you cut. The skill is cutting in a way you can defend to a delivery manager, an auditor, or yourself after a production incident. Cutting defensibly has three parts.

Cut from the bottom, by risk. You drop the lowest-scored items, and you can show the ranking that put them there. “We didn’t test the help-page changes” is defensible when you can show the help pages scored 1 and everything above them was covered. “We didn’t get to the eligibility calculation” is indefensible if it scored 9 — that is the Hononga failure.

Name the residual risk you are accepting. A cut is a risk taken on knowingly. State it plainly: “dropping the profile-editor regression means a cosmetic defect there could reach production undetected — low likelihood, low impact, acceptable.” The cut is defensible because the risk it leaves is small and stated, not because nothing was left untested.

Put the decision where it belongs and record it. If the cut leaves any non-trivial residual risk, the business accepts it, not you alone. A one-line record — what was cut, the residual risk, who accepted it — turns a quiet omission into a shared, dated decision. That record is what protects you and the team if the risk you cut materialises.
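The one-line record can be as simple as a small data structure with a rule attached. This is a sketch, not a prescribed format: the field names, the example date, the "delivery manager" acceptor, and the threshold for what counts as trivial are all assumptions to agree with your own business owner.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScopeCut:
    what_was_cut: str
    risk_score: int             # likelihood x impact from the ranking
    residual_risk: str          # plain statement of what could go wrong
    accepted_by: Optional[str]  # business owner who accepted the risk, if any
    decided_on: date


def record_line(cut: ScopeCut, trivial_at_or_below: int = 1) -> str:
    """Format a cut as one line; refuse non-trivial cuts with no business acceptance."""
    if cut.risk_score > trivial_at_or_below and not cut.accepted_by:
        raise ValueError(f"'{cut.what_was_cut}' needs a business owner to accept the risk")
    accepted = cut.accepted_by or "test lead (trivial risk)"
    return (
        f"{cut.decided_on.isoformat()} | cut: {cut.what_was_cut} | "
        f"residual risk: {cut.residual_risk} | accepted by: {accepted}"
    )


# Placeholder date and acceptor, for illustration only.
example_cut = ScopeCut(
    what_was_cut="profile-editor regression",
    risk_score=2,
    residual_risk="cosmetic defect could reach production undetected (low likelihood, low impact)",
    accepted_by="delivery manager",
    decided_on=date(2025, 1, 15),
)

print(record_line(example_cut))
```

Raising an error when a non-trivial cut has no named acceptor is the code equivalent of refusing to make that call alone.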

Pro tip: The phrase that makes a cut land well is “here is what we are choosing not to test, and here is why that is the safest thing to drop.” It shows the cut is the result of the ranking, not of running out of energy. A delivery manager will back a cut framed as a deliberate, risk-ranked choice far more readily than one that sounds like “we didn’t get to it.”

8 The Regression-versus-New-Feature Trade-Off

One trade-off comes up on almost every release and deserves its own treatment: with limited time, how much goes to testing the new feature versus regression-testing what already worked?

The instinct is to pour the time into the new feature — it is what the release is “about,” it is unproven, and it is where everyone’s attention is. That instinct is half right. The new feature is high likelihood because it is new. But the trap is forgetting that a change can break things far from itself. A new payment option can break the existing payment path it shares code with. A change to the eligibility rules can break assessments that were correct yesterday.

Apply the same likelihood-times-impact lens to both:

New feature: high likelihood (unproven), impact depends on what it does. Test it — it is usually high-risk by definition.
Regression of areas the change touches or shares code with: moderate likelihood (it could be disturbed), and the impact is whatever those existing areas do — often high, because they are core functions that have always worked and that everyone now assumes are safe.
Regression of areas the change cannot reach: low likelihood, and this is usually the first regression you cut.

So the answer is not “new feature over regression” or the reverse. It is: test the new feature, and regression-test the existing areas the change can plausibly disturb, especially the high-impact ones — and cut regression of the areas the change cannot reach. The regression that protects the existing high-impact functions a customer relies on is not optional padding; it is often higher-risk than parts of the new feature itself.
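The three categories above map naturally onto a likelihood score. In this sketch the mapping (new = 3, touched = 2, unreachable = 1) is an assumed shorthand for "high, moderate, low", and the release and its impact scores are hypothetical.

```python
LIKELIHOOD_BY_RELATION = {
    "new": 3,          # unproven, so high likelihood
    "touched": 2,      # existing area the change touches or shares code with
    "unreachable": 1,  # existing area the change cannot reach
}


def triage(areas):
    """Score each (name, relation, impact) area and sort highest risk first."""
    scored = [
        (name, LIKELIHOOD_BY_RELATION[relation] * impact)
        for name, relation, impact in areas
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical release: a new payment option added next to an existing payment path.
release = [
    ("New payment option: confirmation wording", "new", 1),
    ("New payment option: charging the customer", "new", 3),
    ("Existing payment path (shares code)", "touched", 3),
    ("Existing help pages", "unreachable", 1),
]

ranking = triage(release)
for name, score in ranking:
    print(f"{score}  {name}")
```

The existing payment path scores 6 and outranks the new feature's confirmation wording at 3, which is the point: regression of a high-impact area the change touches can matter more than parts of the new feature.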

Pro tip: The most dangerous defect on many releases is not in the new feature — it is a regression in a core function everyone assumed was safe, like an existing payment or an existing benefit calculation, broken by a change nearby. Always reserve regression time for the high-impact existing functions the change touches. That is the gap that produces the worst production incidents.

9 Common Mistakes

🚫 Testing in build order or backlog order instead of risk order

Why it happens: The backlog is already in some order, and working through it top to bottom feels organised.
The fix: Build order has no relationship to risk — the Hononga trap, where the help pages got tested and the eligibility calculation did not. Re-sort the backlog by likelihood times impact and test from the top, so whatever you drop is the least dangerous.

🚫 Ranking on likelihood or impact alone

Why it happens: Testers naturally see likelihood (the complex, changed code) and stop there, or fixate on one scary high-impact area.
The fix: Either axis alone misleads — a likely typo is not your priority, and an impossible catastrophe is not either. Score both and multiply, and set impact with the business so it reflects real cost, not a guess.

🚫 Cutting scope silently

Why it happens: Running out of time feels like a failure to admit, so the untested items just quietly do not get done.
The fix: A silent cut is an unowned risk that surfaces as a production surprise. Cut from the bottom of the ranking, name the residual risk, and have the business accept anything non-trivial — recorded in a line. A stated cut is a decision; a silent one is a landmine.

🚫 Pouring all the time into the new feature and skipping regression

Why it happens: The new feature is what the release is about and where attention naturally goes.
The fix: A change can break core functions far from itself — an existing payment or benefit calculation everyone assumed was safe. Reserve regression time for the high-impact existing areas the change can plausibly disturb; that is often where the worst defect hides.

Senior engineer insight

The insight that changed how I approach risk ranking is that likelihood and impact need to be set by different people — testers own likelihood, but impact is not our call. On a government benefit release I had confidently rated a calculation change as medium impact, only to learn from the policy team it triggered a regulatory threshold that could generate wrong debts at scale. Once I started setting impact in a room with the business owner and a policy expert, the ranking stopped being my opinion and became a shared, defensible position — one the business would stand behind when I used it to cut scope under deadline.

The most common mistake: teams score risk in isolation at the start of a sprint, file the matrix, and never update it. Risk is a forecast that should be re-ranked every time a defect area proves hotter than expected or a late change lands. A static matrix from planning week is a map of what you thought two weeks ago, not where the fires actually are.
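Keeping the matrix live can be a few lines of code run whenever the defect log changes. This is a sketch under stated assumptions: the area names, the defect counts, and the rule "three or more defects raises likelihood by one" are all hypothetical, and a real team would pick its own trigger.

```python
def rerank(areas, defects_found, bump_at=3):
    """Raise likelihood (capped at 3) for areas with `bump_at` or more defects, then re-sort."""
    updated = []
    for name, likelihood, impact in areas:
        if defects_found.get(name, 0) >= bump_at:
            likelihood = min(3, likelihood + 1)
        updated.append((name, likelihood, impact))
    # Stable sort: areas with equal scores keep their planned relative order.
    return sorted(updated, key=lambda a: a[1] * a[2], reverse=True)


# (area, likelihood, impact) as scored in planning week; values are illustrative.
planned = [
    ("Eligibility calculation", 3, 3),
    ("Payment step", 2, 3),
    ("Notification emails", 2, 2),
    ("Document upload", 1, 3),
]

# Defects logged so far this cycle (hypothetical counts).
after_first_week = rerank(planned, {"Document upload": 4})

for name, likelihood, impact in after_first_week:
    print(f"{likelihood * impact}  {name}")
```

Document upload was planned last at a score of 3; four defects later it scores 6 and moves ahead of notification emails.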

From the field

A central government agency in Wellington was migrating a legacy entitlements system to a new platform under a fixed Budget night go-live. The team used the agency's own risk register — a standard NZ Treasury-format likelihood/consequence grid — as the starting point for the test ranking, which felt rigorous and defensible. What they discovered three weeks in was that the risk register rated reputational risk above financial accuracy risk (because the comms team had a larger voice in the risk workshop), which meant the new public-facing confirmation screens were scored 9 while the back-end entitlement calculation was scored 6. Testers followed the register. The screens were pristine; the calculation had a rounding error that underpaid a subset of recipients by $4.20 per fortnight — invisible in UAT because no one had checked it carefully. The lesson: a risk register is a governance document, not a test-ranking tool. Testers need to re-score for their own domain — financial accuracy risk in a payments system is almost always higher than what a multi-stakeholder risk register will give it, because the register averages perspectives across teams who don't live in the data.

10 Now You Try

Three graded exercises: spot a prioritisation failure, re-prioritise a backlog, then build a risk ranking. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Prioritisation Failure

Read how a lead prioritised testing for the fixed-date release below. Identify 3 things wrong with the prioritisation and say what risk-based approach should have been used instead.

A bank mobile-app release, two weeks of testing, ~180 cases:
The lead sorted the test cases alphabetically by screen name and worked through them, because it was easy to track. He spent the first week doing thorough testing of the “Accounts overview” and “Alerts settings” screens (stable, barely changed this release). The new “Pay someone new” flow and the changed funds-transfer limit logic were near the end of the alphabet and got a quick once-over on the final afternoon. No regression was run on the existing transfer path, which shares code with the new flow, because “that already worked.” Nothing was written down about what was skipped.

List 3 prioritisation problems and the fix for each:

Model answer

There are at least four real problems here; any three well-explained earns full marks.

1. Sorted by an order unrelated to risk — alphabetical by screen name has no connection to likelihood or impact, so the time fell on whatever happened to start with A. Fix: rank by likelihood × impact and test from the top; the new payment flow and changed limit logic are the score-9 items that go first.

2. Spent the most time on low-risk, stable areas — "Accounts overview" and "Alerts settings" were stable and barely changed (low likelihood) and got thorough testing, while the new and changed high-risk flows got a once-over. Fix: highest-risk first and early, so serious defects are found with time to fix them; a stable unchanged screen is exactly what you cut.

3. Skipped regression on the existing transfer path that shares code with the new flow — "that already worked" ignores that a change can break a core function nearby. The existing transfer path is high impact (moving money) and now moderately likely to be disturbed. Fix: reserve regression time for high-impact existing areas the change touches.

Bonus: nothing recorded about what was skipped — a silent cut is an unowned risk. The residual risk should have been named and accepted by the business.

The pattern: the order was chosen for convenience, not risk, so the most dangerous components (new payments, changed limits, shared transfer code) got the least testing — the worst possible outcome on a money-moving app.

🔧 Exercise 2 of 3 — Re-Prioritise This Test Backlog

Below is an unprioritised test backlog for a fictional HealthNZ appointment-booking release with a fixed go-live and only enough time for about half of it. Score each item likelihood (1–3) × impact (1–3), give it a priority number, put them in test order, and state which you’d cut and the residual risk.

Backlog items:
A. New clinician-availability sync (new integration, drives whether appointments can be booked at all)
B. Changed appointment-reminder text (wording only, on an existing reminder)
C. Existing patient-search (unchanged, but shares a database layer the release touches)
D. New online cancellation flow (new feature, lets patients cancel)
E. Help / FAQ page refresh (static content, no logic)

Score, rank, order, and decide what to cut:

Model answer

A. clinician-availability sync — L=3 (new integration) × I=3 (if it fails, no appointments can be booked) = 9
D. cancellation flow — L=3 (new feature) × I=2 (a failure blocks cancellations / could double-book, real but less severe than no bookings at all) = 6
C. patient-search via shared DB layer — L=2 (unchanged but shares a layer the release touches) × I=3 (clinicians can't find patients) = 6
B. reminder text — L=1 (wording only) × I=2 (wrong reminder could cause a missed appointment) = 2
E. help/FAQ refresh — L=1 (static) × I=1 (cosmetic) = 1

Test order (highest risk first): A (9), then C and D (both 6 — do C and D before B and E; between them, do whichever blocks the core "can patients be booked/found" path first), then B (2), then E (1).

What I'd cut, with ~half the time: E (1) first, then B (2). If more must go, smoke-test the lower end of the 6s rather than dropping a 6 entirely. Residual risk of cutting E: a stale FAQ page reaches production; low likelihood, low impact, acceptable. Residual risk of cutting B: a reminder with wrong wording could contribute to a missed appointment; low likelihood, low-to-moderate impact, so keep a quick check of the wording if any time allows and otherwise name the risk for the business to accept.

What makes it strong: both axes scored with reasons, the shared-DB-layer item (C) correctly rated high impact despite being "unchanged", the cancellation flow not blindly ranked above search, cuts taken from the bottom, and residual risk named per cut. Weak answers rank the new features top just because they're new and forget that the unchanged-but-shared search is high risk.

🏗️ Exercise 3 of 3 — Build a Risk Matrix for a Release

Build a risk matrix of at least 5 areas for a fictional bank credit-card statement and payments release with a fixed regulatory go-live. For each area give likelihood, impact (in the bank’s terms), the priority number, and a one-line note. Then state your test order and how you’d handle the regression-vs-new-feature trade-off.

Model answer

Area 1 — Interest/fee calculation on statements — L=3 (changed for the new reg) × I=3 (wrong charges = regulatory breach + customer harm) = 9 — note: the reason the release exists; highest risk.
Area 2 — Payment posting to card balance — L=2 (touched indirectly) × I=3 (money posted wrong = financial loss + complaints) = 6 — note: existing core function the change can disturb.
Area 3 — Minimum-payment / due-date logic — L=2 (changed) × I=3 (wrong due date = unfair default fees) = 6 — note: regulatory-sensitive.
Area 4 — Statement PDF rendering/layout — L=2 (changed template) × I=2 (unclear statement = complaints, possible disclosure issue) = 4 — note: visible to every customer.
Area 5 — Marketing banner on statement page — L=1 (static) × I=1 (cosmetic) = 1 — note: first to cut.

Test order: Area 1 (9) first and early — find calculation defects with time to fix; then Areas 2 and 3 (both 6), prioritising the existing payment-posting path because it's a core money function the change touches; then Area 4 (4); Area 5 (1) only if time remains.

Regression-vs-new-feature decision: the "new" work is the changed interest/fee and due-date logic (Areas 1, 3); test those hard. But reserve regression time for the existing payment-posting path (Area 2), which shares the balance code and is high impact even though it "already worked". Cut regression of card areas the change cannot reach (e.g. card-replacement requests). A regression in payment posting would be one of the worst defects this release could ship, and it is the one most likely to be assumed safe, so it must not be skipped.

Strong matrices: impact expressed in the bank's real terms (regulatory breach, financial loss, customer harm), the existing-but-touched payment path rated high, a defensible order, and a regression decision that protects core money functions. Weak ones score everything by gut, rank new over existing automatically, and drop regression entirely.

Why teams fail here

They borrow the project risk register as the test ranking. Government and enterprise risk registers are stakeholder consensus documents — they average political, reputational, and financial risk across many voices. The resulting test order reflects whoever had the loudest voice in the risk workshop, not the actual likelihood and impact of software defects in this release.
They treat the risk ranking as a one-time sprint ceremony output rather than a live document. A defect cluster in an area originally scored low, a late scope change, or a new dependency can completely flip the ranking mid-sprint — and teams that don't re-rank keep testing in the original order while the real risk has moved.
They score by newness rather than by likelihood times impact. Every new feature gets rated high just because it is unproven, and every existing feature gets rated low because it previously passed. This ignores that an unchanged high-impact function disturbed by a nearby change — an existing payment path, a benefit calculation sharing a data layer — is often the highest-risk item on the release and the origin of the worst production incident.
They cut scope silently and never record it. In NZ government delivery environments, audit trails matter — a test scope reduction that is not documented as a deliberate, risk-ranked decision accepted by the business is indistinguishable from oversight. When the untested area fails in production, the absence of a record turns a defensible risk trade-off into an accountability gap.

Key takeaway

When time runs out before the backlog does — and it always does — the only thing that separates a defensible cut from a production incident is whether the order was chosen by risk or by accident.

11 Self-Check

Q1: When you cannot test everything, why is the order the most important decision?

Because the order decides what you drop. Test in build order or alphabetical order and the work you run out of time for is chosen by chance — possibly your most dangerous components, as at Hononga. Test in risk order and whatever you drop is, by design, the least dangerous thing to leave untested.

Q2: What are the two dimensions of risk, and why do you need both?

Likelihood (how probable a defect is — driven by complexity, change, newness, team experience, history) and impact (how bad the damage if it fails — a business question: financial loss, patient safety, wrong assessments). You need both because either alone misleads: a likely typo is not your priority, and an impossible catastrophe is not either. Your first hours go to items high on both.

Q3: What does it mean to cut scope defensibly?

Cut from the bottom of the risk ranking (so you can show why those items were the safest to drop), name the residual risk each cut accepts, and have the business accept anything non-trivial — recorded in a line. A defensible cut is a deliberate, risk-ranked, recorded decision, not a silent “we didn’t get to it.”

Q4: Why should you test the highest-risk areas early in the calendar, not just first in the ranking?

Because finding a serious defect on day one leaves time to fix and re-test it, while finding the same defect on the last day does not. Testing the riskiest areas early means bad news arrives when you can still act on it — the worst outcome is discovering a high-impact defect with no time left to fix it.

Q5: How do you handle the regression-versus-new-feature trade-off?

Apply likelihood times impact to both. Test the new feature (high likelihood because it is new), and reserve regression time for the existing high-impact areas the change touches or shares code with — an existing payment or benefit calculation can be broken by a nearby change and is often the worst defect on the release. Cut regression of areas the change cannot reach.

12 Interview Prep

Real questions asked in NZ QA interviews for lead and test-manager roles. Read the model answers, then practise your own version.

“You have time to run about 60% of your test cases before a fixed go-live. How do you decide which 60%?”

I rank every test by risk — likelihood of a defect times impact if it fails — and run from the top. Likelihood I can assess from complexity, how much changed, how new it is, and the area’s history. Impact I set with the business, because only they can tell me a wrong payment is unsurvivable while a cosmetic glitch is fine. I run the highest-risk areas first and early so any serious defect surfaces with time to fix. The 40% I don’t run is the bottom of that ranking — the lowest-risk work — and I name the residual risk of dropping it and have the business accept anything non-trivial. The point is that what gets cut is a deliberate choice, not whatever I happened not to reach.

“A change is small and isolated. The team says no regression is needed. Do you agree?”

Not automatically. “Small and isolated” is a hypothesis about likelihood, and I’d want to check what the change shares code with or sits near. The defects that hurt most are often regressions in a core function everyone assumed was safe — an existing payment path broken by a nearby change. So I’d agree to skip regression of areas the change genuinely cannot reach, but I’d reserve time to regression-test the high-impact existing functions it touches or shares code with. If it really is fully isolated, that is a quick confirmation; if it isn’t, that’s exactly where the worst production incident would come from.

“How do you defend a decision to cut testing scope if something later fails in production?”

By being able to show it was a ranked, shared, recorded decision rather than an accident. I cut from the bottom of a likelihood-times-impact ranking, so I can show the cut items were the lowest risk and everything above them was covered. I named the residual risk each cut accepted, and for anything non-trivial the business accepted it in a recorded line — what was cut, the risk, who agreed. If something I cut fails, the conversation is “we knowingly accepted this small risk together and it materialised”, not “testing missed it.” And if something I didn’t cut fails, the ranking shows it was covered — that’s a different problem from prioritisation.
