Test Estimation & Planning · Lesson 1

Test Estimation Techniques

“How long will testing take?” is the first question a lead is asked and the easiest one to answer badly. A single confident number you cannot hit costs you more than an honest range ever will. This lesson teaches you to estimate test effort and communicate it without setting a trap for yourself.

Lead · Test Estimation & Planning · Lesson 1 of 3 · ~30 min read · ~70 min with exercises

1 The Hook

A new test lead on a fictional NZ bank programme was asked, in a corridor, how long testing would take for a payments upgrade tied to a regulatory go-live. She wanted to look decisive, so she said “about three weeks.” She had not seen the detailed scope, the integration points, or the state of the environments. The number was a guess dressed up as a commitment.

The delivery manager wrote “3 weeks” into the plan that afternoon. It stopped being her rough guess and became the bank’s testing window. Development handed over a week late. Two of the four test environments were not ready. The change touched a settlement interface nobody had mentioned in the corridor. The real testing effort was closer to six weeks of work, and now she was the person who was “running over” — against a number she had invented under pressure.

Here is the part that matters. She was not punished for the testing taking six weeks. The work was the work. She was punished for the gap between her number and reality — a gap she created the moment she gave a single confident figure she had no basis for. A range with her assumptions written down (“four to seven weeks, assuming all four environments are ready and the scope is the eleven stories I’ve seen”) would have protected her and given the programme honest information to plan with.

The lesson hidden in that corridor: an estimate is a forecast under uncertainty, and the skill is not producing a number — it is producing a defensible range, stating what it depends on, and refusing to let it harden into a promise before you have the information to make one.

Senior engineer insight

The moment that changed how I think about estimation was watching a delivery manager circle the word “three weeks” on a whiteboard and draw an arrow to the go-live date — while the lead was still talking about assumptions. The number left the room before the caveats did. From that point on I stopped giving numbers in meetings without a written follow-up the same day that documented every condition attached to them, because verbal ranges get stripped to single figures the moment someone records the minutes.

The most common mistake: estimating execution effort only, then treating it as total test effort. Re-testing, regression after fixes, and environment recovery time are almost always larger than the initial run — and they are never on the first estimate.

2 The Rule

An estimate is a range, not a promise. A single number with no stated assumptions is a trap you set for yourself. Always give a range, always write down what it depends on, and never let a corridor guess harden into a committed date before you have the scope, the environments, and the dependencies in front of you.

3 The Analogy

Telling a mate how long the drive from Auckland to Wellington will take.

If they ask and you say “eight hours” flat, you have set a trap. Eight hours assumes no roadworks on the Desert Road, no holiday traffic out of Auckland, no stop longer than for fuel, and good weather over the central plateau. Any one of those is normal, and now you are “late” against a number that was always optimistic.

An honest answer is a range with its conditions: “eight to eleven hours — eight if the roads are clear and we barely stop, eleven if we hit holiday traffic or the Desert Road is down to one lane.” That is a useful answer. Your mate can plan around it, and you have not promised something the road might not let you keep. Test estimation is the same: the honest number is a band with its assumptions attached, not a single optimistic figure that quietly becomes a deadline.

4 Why Test Estimation Is Hard

Test estimation is harder than estimating development, for reasons worth naming so you can defend your ranges.

Testing inherits everyone else’s slippage. Testing sits at the end of the chain. When design and build run late against a fixed go-live, the testing window is what gets squeezed — so your estimate has to survive a scope and schedule you do not control.
Defects are not on the plan. You can estimate the effort to run the tests, but not how many defects you will find, how bad they will be, or how long fixes take. Re-testing and regression after fixes are real effort that no story-point count shows.
Environments and data are silent killers. Much of the testing delay on NZ bank and government programmes comes from waiting for an environment, a test data refresh, or an interface partner. These waits rarely appear in the first estimate and almost always bite.
“Done” is fuzzy. Development is done when it builds. Testing is done when enough has been tested to the agreed exit criteria — which means your estimate depends on a definition of “enough” that may not exist yet.

None of this makes estimation pointless. It makes the case for ranges, written assumptions, and re-estimating as information arrives — which is what the rest of this lesson gives you.

5 Three-Point and PERT Estimation

The most useful estimation technique for a lead is three-point estimation, because it forces you to think in a range from the start instead of reaching for a single number.

For each piece of work, you produce three figures:

Optimistic (O) — how long if everything goes right: environments ready, few defects, no surprises.
Most likely (M) — your realistic expectation given normal friction.
Pessimistic (P) — how long if the usual things go wrong: a late handover, a flaky environment, a cluster of defects.

PERT (Program Evaluation and Review Technique) combines the three into a single weighted expected value, leaning on the most-likely figure:

Expected (PERT) = (O + 4M + P) / 6

Worked example — system testing a fixed-date Revenue NZ change:

  Optimistic  O = 10 days

  Most likely M = 15 days

  Pessimistic P = 30 days

Expected = (10 + 4×15 + 30) / 6 = (10 + 60 + 30) / 6 = 100 / 6 ≈ 16.7 days

Rough spread (std dev) = (P − O) / 6 = (30 − 10) / 6 ≈ 3.3 days

Report as: ~17 days, range roughly 13–20 days (1 std dev each side).

The value is not the decimal precision — it is that PERT pulls the expected figure away from naive optimism (it is higher than the most-likely 15 because the pessimistic tail is long), and the spread gives you an honest range to report instead of a single point. A wide gap between O and P is itself information: it says “this work is uncertain, and here is how uncertain.”
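
The arithmetic above is easy to script so that every line item in an estimate gets the same treatment. The following is a minimal Python sketch, not a standard tool: the names `pert_estimate` and `PertResult` are illustrative assumptions, and the one-standard-deviation range follows the rough convention used in the worked example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PertResult:
    expected: float  # weighted expected value, in days
    spread: float    # rough standard deviation, in days
    low: float       # expected minus one spread
    high: float      # expected plus one spread


def pert_estimate(optimistic: float, most_likely: float, pessimistic: float) -> PertResult:
    """Combine three-point figures into a PERT expected value and a rough range."""
    if not optimistic <= most_likely <= pessimistic:
        raise ValueError("expected optimistic <= most_likely <= pessimistic")
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    spread = (pessimistic - optimistic) / 6
    return PertResult(expected, spread, expected - spread, expected + spread)


# Figures from the worked example: O = 10, M = 15, P = 30 days.
result = pert_estimate(10, 15, 30)
print(f"Expected ~{result.expected:.1f} days, "
      f"range roughly {result.low:.0f} to {result.high:.0f} days")
# prints: Expected ~16.7 days, range roughly 13 to 20 days
```

The ordering check is a deliberate guard: if someone enters a most-likely figure below the optimistic one, the inputs need a conversation, not an average.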

Pro tip: The size of P minus O is your uncertainty signal. If pessimistic is triple optimistic, do not bury that in an average — surface it. Tell the delivery manager “this item is the riskiest part of the estimate; if we can lock the scope and confirm the environment, the range tightens a lot.” That turns your estimate into a lever for getting what testing actually needs.

6 Analogy-Based Estimation

The fastest credible estimate comes from comparison, not calculation. Analogy-based estimation says: find a past piece of work that resembles this one, take what it actually cost, and adjust for the differences.

It works because your strongest evidence is your own history. “The last payments interface release of about this size took us four weeks of test effort, found around 40 defects, and lost a week to environment issues” is far more defensible than a number conjured from nothing. You are estimating from measured reality, not hope.

The method, in three steps:

Find the closest analogue. A previous release on the same system, of similar scope and integration complexity — ideally one you have real actuals for, not just its original estimate.
Take the actuals, not the estimate. Use what it really cost, including re-testing and environment delays. The old estimate may have been wrong; the actual is the fact.
Adjust for the differences, openly. “This release is similar but touches one more interface and the team has two new testers, so I’m adding 30%.” State the adjustments so they can be challenged.

Analogy-based estimation is most powerful combined with three-point: use a past actual to anchor your most-likely figure, then set optimistic and pessimistic around it based on how this release differs. The history grounds the centre; the three-point spread captures the uncertainty.
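
One way to make this combination concrete is sketched below in Python. It is an illustration under stated assumptions, not a prescribed method: the function name, the 20-day past actual, the split of the 30% adjustment into 20% and 10%, and the multipliers that place the optimistic and pessimistic figures are all hypothetical. Only the combined 30% echoes the example in step three.

```python
def anchored_three_point(past_actual_days, adjustments, optimistic_factor, pessimistic_factor):
    """Anchor most-likely to a past actual, then set O and P around it.

    adjustments maps a stated reason to a fractional change (0.30 means +30%).
    The two factors are judgement calls about this release (assumptions).
    """
    most_likely = past_actual_days * (1 + sum(adjustments.values()))
    optimistic = most_likely * optimistic_factor
    pessimistic = most_likely * pessimistic_factor
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    return optimistic, most_likely, pessimistic, expected


# Hypothetical inputs: the analogue actually took 20 days of test effort.
adjustments = {
    "one more interface": 0.20,
    "two new testers": 0.10,
}
o, m, p, expected = anchored_three_point(20, adjustments, 0.75, 1.75)

# Print each adjustment so it can be challenged, then the resulting figures.
for reason, fraction in adjustments.items():
    print(f"  {reason}: {fraction:+.0%}")
print(f"O = {o:.1f}, M = {m:.1f}, P = {p:.1f}, PERT expected ~{expected:.1f} days")
```

Keeping the adjustments as named entries rather than one blended percentage is the design point: each reason stays visible and can be argued with separately.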

Pro tip: Keep a simple record of actuals for every release you lead — effort, defects found, time lost to environments and re-testing. It takes minutes per release and within a year it is the most valuable estimation asset you own. A lead with three years of real actuals estimates faster and more accurately than anyone working from gut feel.

From the field

On a large NZ government programme migrating agency data to a new platform, the test team anchored their estimate to a previous release using analogy — a similar data migration that had taken four weeks. What they did not adjust for was that the previous release had used pre-loaded synthetic data, while this one required a live data extract that turned out to need three days of cleansing before testing could start. The team also used the previous estimate, not its actuals; the earlier release had actually taken five and a half weeks including two days lost to interface issues they had glossed over.

The pattern generalises: the strongest part of analogy-based estimation is the actuals, and the weakest part is the adjustment step. Every difference between the analogue and the current release needs an explicit number attached to it, not a vague “add a bit for complexity.” The team rewrote their estimation template to include a column for each adjustment reason and the days added or removed, which immediately made their ranges defensible in steering meetings.
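
A template like the one described can be as simple as a list of adjustment rows. The Python sketch below is hypothetical in its structure and names; the figures come from the story above, converted to days on the assumption of a five-day working week. It also shows the gap the team opened by anchoring to the old four-week estimate instead of the actual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adjustment:
    reason: str
    days: float  # positive adds effort, negative removes it


def adjusted_baseline(analogue_actual_days: float, adjustments: list[Adjustment]) -> float:
    """Start from the analogue's actual effort and apply each stated adjustment."""
    return analogue_actual_days + sum(a.days for a in adjustments)


WORKING_DAYS_PER_WEEK = 5  # assumption used to convert weeks to days

analogue_estimate = 4 * WORKING_DAYS_PER_WEEK   # what the team anchored to
analogue_actual = 5.5 * WORKING_DAYS_PER_WEEK   # what the release really took
adjustments = [Adjustment("live data extract needs cleansing", 3)]

baseline = adjusted_baseline(analogue_actual, adjustments)

# One row per adjustment reason, with the days added or removed.
print(f"{'Analogue actual':<40}{analogue_actual:>6.1f}")
for a in adjustments:
    print(f"{a.reason:<40}{a.days:>+6.1f}")
print(f"{'Adjusted baseline':<40}{baseline:>6.1f}")
print(f"Gap versus anchoring to the old estimate: {baseline - analogue_estimate:.1f} days")
```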

7 Estimates Are Ranges, Not Promises

This is the heart of estimation as a leadership skill. The maths above produces ranges; the job is to keep them ranges all the way to the people who decide.

A range with stated assumptions does three things a single number cannot. It tells the truth — you genuinely do not know the exact figure, and pretending otherwise is dishonest. It transfers the uncertainty to where it belongs — the people setting the date can see the risk and decide what to do about it, rather than discovering it when you “run over.” And it protects you — you are accountable to a defensible range you can explain, not to a guess that hardened into a deadline.

The way you phrase it matters. Compare:

Trap: “Testing will take three weeks.”
Defensible: “My estimate is four to seven weeks. The four assumes all four environments are ready on day one and the scope is the eleven stories I’ve seen. The seven is if we hit the environment delays we had last release, or the settlement interface is in scope. I’ll tighten this once the scope is locked and the environments are confirmed.”

The second answer is longer, and that is the point. It names the band, the assumptions behind each end, and the conditions that would let you narrow it. It hands the delivery manager something to act on — lock the scope, confirm the environments — rather than a number to hold you to.

Re-estimate as information arrives. An estimate given before scope is locked is a draft. When the scope firms up, the environments are confirmed, and the first defects come in, you revise — and you say so out loud, every time, so no one is surprised. “Updated estimate” is a sign of a lead doing the job, not failing at it.

8 The “There’s No Time for Testing” Conversation

Every lead meets this. Build ran late, the go-live will not move, and someone says testing will have to “fit into the time that’s left.” Handled badly, you either cave and own the consequences silently, or you dig in and become the blocker. Handled well, it is the most valuable conversation you have.

The move is to refuse to argue about time, and instead make the trade-off visible and put the decision where it belongs. You do not control the date or the scope — but you do control the honest statement of what testing the available time buys.

Do not say “that’s not enough time.” It sounds like complaint and invites “just make it work.”
Do say what the time buys. “In the ten days left I can fully test the four highest-risk areas and smoke-test the rest. I cannot regression-test the payment interfaces to the depth we normally would. Here is what that leaves exposed.”
Name the residual risk in their language. For a bank, that is regulatory and financial exposure; for HealthNZ, patient impact; for Revenue NZ, incorrect assessments. Make the cost of the cut concrete.
Hand them the decision. “I can fit testing into the time, but it means accepting that risk, or we move the date, or we cut scope. That’s a call for you and the business — I’ll give you what you need to make it.”

This reframes you from the person blocking the date to the person giving the business clear-eyed options. The risk-based prioritisation that makes “the four highest-risk areas” a defensible answer is exactly what Lesson 2 teaches.

Pro tip: Put the trade-off in writing, briefly, the same day — an email or a line in the plan. “Agreed: testing fits the remaining ten days; payment-interface regression reduced to smoke level; residual risk accepted by [name].” Not to cover yourself in a blame sense, but because a decision made out loud and recorded is one the whole team can plan around — and it stops the same argument repeating the night before go-live.

9 Common Mistakes

🚫 Giving a single number when asked on the spot

Why it happens: A confident single figure feels decisive, and a corridor question feels like it wants a quick answer.
The fix: The single number becomes the plan and you own the gap when reality differs — the corridor trap. Give a range with assumptions, even quickly: “roughly four to seven weeks, depending on scope and environments — let me confirm once I’ve seen the detail.”

🚫 Estimating only the test execution and forgetting everything around it

Why it happens: Running the tests is the visible work, so it is what gets counted.
The fix: Test design, environment and data setup, defect re-testing, and regression after fixes are often more effort than the first run. An estimate that omits them is optimistic by a wide margin. Build them in — the pessimistic figure exists for exactly this.

🚫 Padding silently instead of stating the range

Why it happens: You know it will take longer than the optimistic figure, so you quietly add a buffer.
The fix: Hidden padding gets negotiated away by anyone who suspects it is there, and it hides the real uncertainty. State the range and its assumptions openly instead — honest uncertainty survives challenge in a way a padded single number never does.

🚫 Treating the first estimate as final

Why it happens: Once a number is in the plan it feels fixed, and revising it feels like admitting you were wrong.
The fix: The first estimate was made with the least information you will ever have. Re-estimate as scope locks, environments confirm, and defects arrive — and say so. An updated estimate is the job done well, not a failure.
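
The second mistake above, counting execution only, is easiest to see in numbers. The Python sketch below uses entirely hypothetical figures, chosen only to show the shape of the problem; the category names follow the fix described for that mistake.

```python
# All figures are hypothetical days for one release.
effort_days = {
    "test design": 4,
    "environment and data setup": 3,
    "first execution run": 10,
    "defect re-testing": 5,
    "regression after fixes": 6,
}


def total_test_effort(effort_by_activity: dict[str, float]) -> float:
    """Sum every activity, not just the visible execution run."""
    return sum(effort_by_activity.values())


execution_only = effort_days["first execution run"]
total = total_test_effort(effort_days)
around_execution = total - execution_only

print(f"Execution only: {execution_only} days")
print(f"Everything around it: {around_execution} days")
print(f"Total test effort: {total} days "
      f"(execution is {execution_only / total:.0%} of the total)")
```

With these illustrative inputs the surrounding work is larger than the first run, which is the pattern an execution-only estimate hides.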

10 Now You Try

Three graded exercises: critique an estimate, fix one, then build one. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Critique This Estimate

A test lead gave the estimate below for a fixed-date release. Identify 3 things wrong with it as an estimate — not the number itself, but how it was produced and communicated — and say what should have been done instead.

The estimate, given in a stand-up:
“Testing for the HealthNZ patient-records cut-over will take two weeks. I counted the 24 user stories and figured roughly half a day each to test, so 12 days, call it two weeks. We’ll be fine.” The lead had not seen the integration list, did not know if the test environment was ready, and the figure covered test execution only — no environment setup, defect re-testing, or regression. It went straight into the programme plan as the testing window.

List 3 problems with this estimate and the fix for each:

Model answer

There are at least four real problems here; any three well-explained earns full marks.

1. Single number, no range or assumptions — "two weeks" is a point estimate that went straight into the plan as a commitment. It should have been a range with stated assumptions (e.g. "two to four weeks, assuming the environment is ready and the integration scope is what I've seen"), so the uncertainty was visible to the people setting the date.

2. Execution only — the figure covered test execution and ignored environment/data setup, defect re-testing, and regression, which are often more effort than the first run. The estimate is optimistic by a wide margin. Use a three-point estimate so the pessimistic figure captures this.

3. Estimated without the information — the lead had not seen the integration list or confirmed the environment, yet gave a firm number. An estimate made with that little information is a draft; it should have been flagged as provisional and re-done once scope and environments were known.

Bonus: "half a day each story" is naive uniform estimation — stories vary enormously in test effort, and a patient-records cut-over has high-risk integration points that need far more than half a day. Analogy to a past cut-over's actuals would anchor it far better.

The pattern: the number may even be roughly right, but it was produced and communicated like a promise instead of a forecast under uncertainty.

🔧 Exercise 2 of 3 — Fix the Estimate

Rewrite the weak estimate below into a three-point estimate communicated as a defensible range. Show your O / M / P figures, the PERT expected value, the range you would report, and the assumptions behind each end. Use a fictional KiwiFirst Bank online-banking release as the context.

Original (a trap):
“Testing the online-banking release will take 15 days. That should be plenty.”

Rewrite as a three-point estimate with a reported range:

Model answer

Optimistic (O) = 12 days — assumes: all environments ready day one, scope is the stories seen, few defects, no new integrations.

Most likely (M) = 18 days — assumes: normal defect volume and re-testing, one short environment delay, scope roughly as expected (anchored to the last online-banking release, which actually took ~17 days).

Pessimistic (P) = 32 days — assumes: a late build handover, a flaky environment for several days, a cluster of defects in the payments path, or one extra integration pulled into scope.

PERT expected = (12 + 4×18 + 32) / 6 = (12 + 72 + 32) / 6 = 116 / 6 ≈ 19.3 days

Range I would report: about 19 days expected, realistic range roughly 16–23 days (≈1 std dev = (32−12)/6 ≈ 3.3 days each side).

What I'd say to the delivery manager: "My estimate is roughly 16 to 23 days, expected around 19. The low end assumes all environments are ready on day one and no new integrations; the high end is if we hit the environment delays we had last release or a defect cluster in payments. I've anchored the middle to the last online-banking release, which actually took 17 days. Lock the scope and confirm the environments and I can tighten this."

What makes it strong: three figures with explicit assumptions, the PERT calculation shown, a reported range rather than a single number, an analogy to a real past actual anchoring the middle, and a clear ask (lock scope, confirm environments) that would narrow the range. The original had none of these — it was an unbacked single number framed as "plenty".

🏗️ Exercise 3 of 3 — Build the Estimate and the Conversation

For a fictional Revenue NZ change with a fixed start-of-tax-year go-live, build a short test estimate AND script the “there’s no time for testing” conversation. Cover: a three-point estimate with assumptions; how you’d use a past release as an analogy; and 4 lines of what you’d say when told testing must fit a window shorter than your estimate.

Model answer

1. Three-point estimate: O = 8 days (environments ready, scope is the seen changes, low defect volume); M = 14 days (normal defects and re-testing, one short environment wait); P = 26 days (late handover, an extra calculation rule in scope, or a defect cluster in the assessment logic). PERT = (8 + 4×14 + 26) / 6 = 90/6 = 15 days; spread = (26 − 8) / 6 = 3 days; report ~15 days, range 12–18 (1 std dev each side).

2. Analogy: last year's start-of-tax-year change to the same assessment module actually took 13 days of test effort and found ~25 defects, with 2 days lost to a data refresh. This change is similar but touches one more calculation rule, so I add ~15% — which lines up with the ~15 day PERT figure. Using the real actual, not last year's original estimate, anchors it.

3. What the time buys: "In the 9 days available I can fully test the changed assessment calculations and the highest-risk integration, and smoke-test the unchanged paths. I cannot run the full regression on the existing assessment rules to our normal depth."

4. Residual risk in Revenue NZ's language: "The exposure that leaves is incorrect tax assessments on the paths I can only smoke-test — wrong amounts going out to taxpayers at the start of the year, which is a correction and reputational cost, not just a defect."

5. Handing back the decision: "So the options are: fit testing into the 9 days and accept that regression risk, move the go-live, or cut the lower-risk changes from this release. That's a business call — I'll give you the risk breakdown you need to make it, and I'll put whatever we decide in writing today."

Strong answers: a three-point estimate anchored to a real past actual, a clear statement of what the shorter window buys and does not buy, residual risk expressed as Revenue NZ's actual cost (wrong assessments), and the decision handed back with concrete options. Weak answers argue "there isn't enough time" without making the trade-off visible or offering the business a choice.

Why teams fail here

They treat Wideband Delphi (a group technique in which estimators submit anonymous estimates over several rounds) as a one-round vote rather than a structured convergence process. The first anonymous round surfaces genuine disagreement, but without facilitated discussion of the outliers the estimate just averages noise instead of surfacing knowledge.
They anchor three-point estimates to the optimistic case rather than the most-likely case, so the PERT expected value looks better than reality and the pessimistic figure is too close to the optimistic to capture real uncertainty.
They use the old estimate as the analogue rather than the actuals — meaning they compare one guess to another instead of comparing to measured history, so optimism compounds across releases.
They give a range verbally and then let it be minuted as a single number, never following up in writing with the assumptions, so the range they gave evaporates and one hard figure becomes the plan.

11 Self-Check

Q1: Why is giving a single confident number worse than giving a range?

Because the single number stops being your guess and becomes the plan — the commitment everyone holds you to — while the real uncertainty stays hidden until you “run over.” A range with stated assumptions tells the truth, transfers the uncertainty to the people setting the date, and leaves you accountable to something defensible rather than to a guess that hardened into a deadline.

Q2: What is the PERT formula, and why does it lean on the most-likely figure?

Expected = (O + 4M + P) / 6. It weights the most-likely figure four times because that is your best single guess, while still letting the optimistic and pessimistic ends pull the expected value — so a long pessimistic tail raises the expected figure above naive optimism. The spread (P−O)/6 gives you the range to report.

Q3: What makes analogy-based estimation defensible, and what is the key rule when using it?

It estimates from measured history rather than from hope — a past release of similar size is your strongest evidence. The key rule is to use the past release’s actuals, not its original estimate (the old estimate may have been wrong; the actual is a fact), and to state your adjustments for the differences openly so they can be challenged.

Q4: Why is test estimation harder than estimating development?

Testing sits at the end of the chain, so it inherits everyone else’s slippage against a fixed date; defect volume and fix times cannot be predicted, so re-testing and regression effort is unknown up front; environment and test-data delays are common and rarely in the first estimate; and “done” depends on agreed exit criteria that may not exist yet.

Q5: How do you handle “there’s no time for testing” without becoming the blocker?

Do not argue about time. State what the available time buys (which areas you can fully test, which you can only smoke-test), name the residual risk in the stakeholder’s own language (regulatory exposure, patient impact, wrong assessments), and hand the decision back with options — fit it and accept the risk, move the date, or cut scope. Then put the agreed trade-off in writing the same day.

Key takeaway

A test estimate is not a number — it is a range with conditions attached, and the moment you let it harden into a single figure you have transferred all the risk from the schedule to yourself.

12 Interview Prep

Real questions asked in NZ QA interviews for lead and test-manager roles. Read the model answers, then practise your own version.

“A delivery manager asks you in a meeting how long testing will take. You haven’t seen the detail yet. What do you say?”

I’d give a range with assumptions and flag it as provisional, rather than a single number. Something like: “Roughly four to seven weeks — the low end if all environments are ready and the scope is what I’ve seen, the high end if we hit the environment delays we had last release or extra scope comes in. Let me confirm once I’ve seen the integration list and the environment status.” The point is to be useful and honest at once: I give them something to plan with, I make the uncertainty visible, and I don’t let a corridor guess harden into a committed date before I have the information to commit.

“Walk me through how you’d estimate test effort for a release on a system you’ve tested before.”

I’d anchor with analogy and refine with three-point. First I find the closest past release and take its actuals — real effort, defects found, time lost to environments and re-testing — not its original estimate. That anchors my most-likely figure. Then I set optimistic and pessimistic around it based on how this release differs: more integrations, new testers, tighter environment situation. I run PERT, (O + 4M + P) / 6, to get an expected value and use (P−O)/6 for the spread, so I report a range. And I include everything around execution — design, environment setup, defect re-testing, regression — because those are usually where the real effort hides.

“The build is late, the go-live can’t move, and you’re told to make testing fit. How do you respond?”

I don’t argue about the time — I make the trade-off visible and hand the decision to the business. I’d say what the remaining time actually buys: the highest-risk areas fully tested, the rest smoke-tested, and the specific regression I can’t do to our normal depth. Then I name the residual risk in their terms — for a bank that’s regulatory and financial exposure. Then I lay out the options: fit it and accept that risk, move the date, or cut lower-risk scope. That’s a business call, and I give them the risk breakdown to make it. And I put whatever we agree in writing the same day, so the whole team can plan around it and we don’t reopen the argument the night before go-live.
