Test Estimation Techniques
“How long will testing take?” is the first question a lead is asked and the easiest one to answer badly. A single confident number you cannot hit costs you more than an honest range ever will. This lesson teaches you to estimate test effort and communicate it without setting a trap for yourself.
1 The Hook
A new test lead on a fictional NZ bank programme was asked, in a corridor, how long testing would take for a payments upgrade tied to a regulatory go-live. She wanted to look decisive, so she said “about three weeks.” She had not seen the detailed scope, the integration points, or the state of the environments. The number was a guess dressed up as a commitment.
The delivery manager wrote “3 weeks” into the plan that afternoon. It stopped being her rough guess and became the bank’s testing window. Development handed over a week late. Two of the four test environments were not ready. The change touched a settlement interface nobody had mentioned in the corridor. The real testing effort was closer to six weeks of work, and now she was the person who was “running over” — against a number she had invented under pressure.
Here is the part that matters. She was not punished for the testing taking six weeks. The work was the work. She was punished for the gap between her number and reality — a gap she created the moment she gave a single confident figure she had no basis for. A range with her assumptions written down (“four to seven weeks, assuming all four environments are ready and the scope is the eleven stories I’ve seen”) would have protected her and given the programme honest information to plan with.
The lesson hidden in that corridor: an estimate is a forecast under uncertainty, and the skill is not producing a number — it is producing a defensible range, stating what it depends on, and refusing to let it harden into a promise before you have the information to make one.
2 The Rule
An estimate is a range, not a promise. A single number with no stated assumptions is a trap you set for yourself. Always give a range, always write down what it depends on, and never let a corridor guess harden into a committed date before you have the scope, the environments, and the dependencies in front of you.
3 The Analogy
Telling a mate how long the drive from Auckland to Wellington will take.
If they ask and you say “eight hours” flat, you have set a trap. Eight hours assumes no roadworks on the Desert Road, no holiday traffic out of Auckland, no stop longer than for fuel, and good weather over the central plateau. Any one of those is normal, and now you are “late” against a number that was always optimistic.
An honest answer is a range with its conditions: “eight to eleven hours — eight if the roads are clear and we barely stop, eleven if we hit holiday traffic or the Desert Road is down to one lane.” That is a useful answer. Your mate can plan around it, and you have not promised something the road might not let you keep. Test estimation is the same: the honest number is a band with its assumptions attached, not a single optimistic figure that quietly becomes a deadline.
4 Why Test Estimation Is Hard
Test estimation is harder than estimating development, for reasons worth naming so you can defend your ranges.
- Testing inherits everyone else’s slippage. Testing sits at the end of the chain. When design and build run late against a fixed go-live, the testing window is what gets squeezed — so your estimate has to survive a scope and schedule you do not control.
- Defects are not on the plan. You can estimate the effort to run the tests, but not how many defects you will find, how bad they will be, or how long fixes take. Re-testing and regression after fixes are real effort that no story-point count shows.
- Environments and data are silent killers. On NZ bank and government programmes, a large share of testing delay is time spent waiting for an environment, a test data refresh, or an interface partner. These waits rarely appear in the first estimate and almost always bite.
- “Done” is fuzzy. Development is done when it builds. Testing is done when enough has been tested to the agreed exit criteria — which means your estimate depends on a definition of “enough” that may not exist yet.
None of this makes estimation pointless. It makes the case for ranges, written assumptions, and re-estimating as information arrives — which is what the rest of this lesson gives you.
5 Three-Point and PERT Estimation
The most useful estimation technique for a lead is three-point estimation, because it forces you to think in a range from the start instead of reaching for a single number.
For each piece of work, you produce three figures:
- Optimistic (O) — how long if everything goes right: environments ready, few defects, no surprises.
- Most likely (M) — your realistic expectation given normal friction.
- Pessimistic (P) — how long if the usual things go wrong: a late handover, a flaky environment, a cluster of defects.
PERT (Program Evaluation and Review Technique) combines the three into a single weighted expected value, leaning on the most-likely figure:
Expected = (O + 4M + P) / 6
Rough spread (std dev) = (P − O) / 6
Worked example — system testing a fixed-date IRD change:
Optimistic O = 10 days
Most likely M = 15 days
Pessimistic P = 30 days
Expected = (10 + 4×15 + 30) / 6 = (10 + 60 + 30) / 6 = 100 / 6 ≈ 16.7 days
Rough spread (std dev) = (P − O) / 6 = (30 − 10) / 6 ≈ 3.3 days
Report as: ~17 days, range roughly 13–20 days (1 std dev each side).
The value is not the decimal precision — it is that PERT pulls the expected figure away from naive optimism (it is higher than the most-likely 15 because the pessimistic tail is long), and the spread gives you an honest range to report instead of a single point. A wide gap between O and P is itself information: it says “this work is uncertain, and here is how uncertain.”
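The arithmetic in the worked example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `ThreePoint` class and its method names are hypothetical, and rounding the reported range to whole days is an assumption that matches how the example above is reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreePoint:
    """Three-point estimate in days: optimistic, most likely, pessimistic."""
    optimistic: float
    most_likely: float
    pessimistic: float

    def expected(self) -> float:
        # PERT weighted mean: (O + 4M + P) / 6
        return (self.optimistic + 4 * self.most_likely + self.pessimistic) / 6

    def spread(self) -> float:
        # Rough standard deviation: (P - O) / 6
        return (self.pessimistic - self.optimistic) / 6

    def reported_range(self) -> tuple[int, int]:
        # One spread either side of the expected value, rounded to whole days
        return (round(self.expected() - self.spread()),
                round(self.expected() + self.spread()))


# Figures from the worked example: O = 10, M = 15, P = 30
estimate = ThreePoint(optimistic=10, most_likely=15, pessimistic=30)
low, high = estimate.reported_range()
print(f"Expected {estimate.expected():.1f} days, spread {estimate.spread():.1f} days")
print(f"Report as: ~{round(estimate.expected())} days, range roughly {low}-{high} days")
```

Keeping the three figures together in one object, rather than passing a single number around, makes it harder for the range to be dropped on its way into a plan.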
6 Analogy-Based Estimation
The fastest credible estimate comes from comparison, not calculation. Analogy-based estimation says: find a past piece of work that resembles this one, take what it actually cost, and adjust for the differences.
It works because your strongest evidence is your own history. “The last payments interface release of about this size took us four weeks of test effort, found around 40 defects, and lost a week to environment issues” is far more defensible than a number conjured from nothing. You are estimating from measured reality, not hope.
The method, in three steps:
- Find the closest analogue. A previous release on the same system, of similar scope and integration complexity — ideally one you have real actuals for, not just its original estimate.
- Take the actuals, not the estimate. Use what it really cost, including re-testing and environment delays. The old estimate may have been wrong; the actual is the fact.
- Adjust for the differences, openly. “This release is similar but touches one more interface and the team has two new testers, so I’m adding 30%.” State the adjustments so they can be challenged.
Analogy-based estimation is most powerful combined with three-point: use a past actual to anchor your most-likely figure, then set optimistic and pessimistic around it based on how this release differs. The history grounds the centre; the three-point spread captures the uncertainty.
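Combining the two techniques can be sketched as follows. Everything numeric here is hypothetical: the 20-day past actual assumes "four weeks" means 5-day weeks, the split of the 30% uplift into 20% and 10% is invented for illustration, and the optimistic and pessimistic figures are placeholders you would set by judgement. The function names are also assumptions of this sketch.

```python
def analogy_anchor(past_actual_days: float, adjustments: dict[str, float]) -> float:
    """Scale a past release's actual effort by openly stated adjustments.

    Each adjustment is a fraction of the past actual (0.20 means +20%).
    Adjustments are summed rather than compounded, which is a simplification.
    """
    return past_actual_days * (1 + sum(adjustments.values()))


def pert_expected(o: float, m: float, p: float) -> float:
    # PERT weighted mean: (O + 4M + P) / 6
    return (o + 4 * m + p) / 6


# Hypothetical figures, for illustration only.
past_actual_days = 20.0  # "four weeks of test effort", assuming 5-day weeks
adjustments = {
    "one more interface": 0.20,  # assumed split of the 30% uplift
    "two new testers": 0.10,
}

most_likely = analogy_anchor(past_actual_days, adjustments)
optimistic = past_actual_days    # assumed: this release is no harder than the last
pessimistic = most_likely * 1.5  # assumed placeholder for a bad run

# Print each adjustment so it can be challenged, then the anchored figures.
for reason, uplift in adjustments.items():
    print(f"+{uplift:.0%} for {reason}")
print(f"Most likely (anchored): {most_likely:.0f} days")
print(f"PERT expected: {pert_expected(optimistic, most_likely, pessimistic):.1f} days")
```

Naming each adjustment separately is the point of the dictionary: a reviewer can dispute one line without rejecting the whole estimate.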
7 Estimates Are Ranges, Not Promises
This is the heart of estimation as a leadership skill. The maths above produces ranges; the job is to keep them ranges all the way to the people who decide.
A range with stated assumptions does three things a single number cannot. It tells the truth — you genuinely do not know the exact figure, and pretending otherwise is dishonest. It transfers the uncertainty to where it belongs — the people setting the date can see the risk and decide what to do about it, rather than discovering it when you “run over.” And it protects you — you are accountable to a defensible range you can explain, not to a guess that hardened into a deadline.
The way you phrase it matters. Compare:
Weak: “About three weeks.”
Defensible: “My estimate is four to seven weeks. The four assumes all four environments are ready on day one and the scope is the eleven stories I’ve seen. The seven is if we hit the environment delays we had last release, or the settlement interface is in scope. I’ll tighten this once the scope is locked and the environments are confirmed.”
The second answer is longer, and that is the point. It names the band, the assumptions behind each end, and the conditions that would let you narrow it. It hands the delivery manager something to act on — lock the scope, confirm the environments — rather than a number to hold you to.
Re-estimate as information arrives. An estimate given before scope is locked is a draft. When the scope firms up, the environments are confirmed, and the first defects come in, you revise — and you say so out loud, every time, so no one is surprised. “Updated estimate” is a sign of a lead doing the job, not failing at it.
8 The “There’s No Time for Testing” Conversation
Every lead meets this. Build ran late, the go-live will not move, and someone says testing will have to “fit into the time that’s left.” Handled badly, you either cave and own the consequences silently, or you dig in and become the blocker. Handled well, it is the most valuable conversation you have.
The move is to refuse to argue about time, and instead make the trade-off visible and put the decision where it belongs. You do not control the date or the scope — but you do control the honest statement of what testing the available time buys.
- Do not say “that’s not enough time.” It sounds like complaint and invites “just make it work.”
- Do say what the time buys. “In the ten days left I can fully test the four highest-risk areas and smoke-test the rest. I cannot regression-test the payment interfaces to the depth we normally would. Here is what that leaves exposed.”
- Name the residual risk in their language. For a bank, that is regulatory and financial exposure; for Te Whatu Ora, patient impact; for IRD, incorrect assessments. Make the cost of the cut concrete.
- Hand them the decision. “I can fit testing into the time, but it means accepting that risk, or we move the date, or we cut scope. That’s a call for you and the business — I’ll give you what you need to make it.”
This reframes you from the person blocking the date to the person giving the business clear-eyed options. The risk-based prioritisation that makes “the four highest-risk areas” a defensible answer is exactly what Lesson 2 teaches.
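One way to work out "what the time buys" is sketched below. It is a simple greedy allocation under stated assumptions, not the method from Lesson 2: the `Area` type, the risk scores, the area names, and every effort figure are hypothetical, and the sketch assumes each area can be either fully tested or smoke-tested, with nothing in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    name: str
    risk: int          # higher means riskier; the scoring scheme is assumed
    full_days: float   # effort to test to normal depth
    smoke_days: float  # effort for a smoke test only


def what_the_time_buys(areas: list[Area], days_available: float):
    """Split areas into fully tested and smoke-tested within a fixed window.

    Greedy sketch: budget a smoke test for every area first, then upgrade
    areas to full depth in descending risk order whenever the upgrade fits.
    """
    remaining = days_available - sum(a.smoke_days for a in areas)
    if remaining < 0:
        raise ValueError("window is too short even to smoke-test everything")
    full, smoke = [], []
    for area in sorted(areas, key=lambda a: a.risk, reverse=True):
        upgrade_cost = area.full_days - area.smoke_days
        if upgrade_cost <= remaining:
            remaining -= upgrade_cost
            full.append(area.name)
        else:
            smoke.append(area.name)
    return full, smoke


# Hypothetical areas for a ten-day window.
areas = [
    Area("funds transfer", risk=5, full_days=3.0, smoke_days=0.5),
    Area("settlement", risk=5, full_days=2.5, smoke_days=0.5),
    Area("login", risk=4, full_days=2.0, smoke_days=0.5),
    Area("statements", risk=3, full_days=1.5, smoke_days=0.5),
    Area("notifications", risk=2, full_days=2.0, smoke_days=0.5),
    Area("admin reports", risk=1, full_days=1.5, smoke_days=0.5),
]

full, smoke = what_the_time_buys(areas, days_available=10)
print("Fully tested:", ", ".join(full))
print("Smoke-tested only:", ", ".join(smoke))
```

The smoke-tested list is the residual risk you name in the conversation; the code only organises the trade-off, it does not make the decision.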
9 Common Mistakes
🚫 Giving a single number when asked on the spot
Why it happens: A confident single figure feels decisive, and a corridor question feels like it wants a quick answer.
The fix: The single number becomes the plan and you own the gap when reality differs — the corridor trap. Give a range with assumptions, even quickly: “roughly four to seven weeks, depending on scope and environments — let me confirm once I’ve seen the detail.”
🚫 Estimating only the test execution and forgetting everything around it
Why it happens: Running the tests is the visible work, so it is what gets counted.
The fix: Test design, environment and data setup, defect re-testing, and regression after fixes are often more effort than the first run. An estimate that omits them is optimistic by a wide margin. Build them in — the pessimistic figure exists for exactly this.
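To see how much an execution-only figure leaves out, each activity can be given its own three-point estimate and the results added up. The sketch below uses hypothetical (O, M, P) figures throughout. Adding the expected values is straightforward; combining the spreads as a root-sum-of-squares assumes the activities vary independently, which is an assumption of this sketch and may understate the range when one cause (a flaky environment, say) delays several activities at once.

```python
import math


def pert(o: float, m: float, p: float) -> tuple[float, float]:
    """Return (expected, spread) using (O + 4M + P) / 6 and (P - O) / 6."""
    return (o + 4 * m + p) / 6, (p - o) / 6


# Hypothetical (O, M, P) figures in days for each activity.
activities = {
    "test design": (3, 5, 8),
    "environment and data setup": (1, 3, 10),
    "test execution (first run)": (8, 12, 18),
    "defect re-testing": (2, 5, 12),
    "regression after fixes": (3, 5, 10),
}

results = {name: pert(*figures) for name, figures in activities.items()}

# Expected values add; spreads combine as root-sum-of-squares (independence assumed).
total_expected = sum(expected for expected, _ in results.values())
total_spread = math.sqrt(sum(spread ** 2 for _, spread in results.values()))
execution_expected = results["test execution (first run)"][0]

print(f"Execution only: {execution_expected:.1f} days")
print(f"All activities: {total_expected:.1f} days, spread {total_spread:.1f} days")
print(f"Execution share of total: {execution_expected / total_expected:.0%}")
```

With these illustrative figures the first run is well under half of the total, which is the gap an execution-only estimate hides.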
🚫 Padding silently instead of stating the range
Why it happens: You know it will take longer than the optimistic figure, so you quietly add a buffer.
The fix: Hidden padding gets negotiated away by anyone who suspects it is there, and it hides the real uncertainty. State the range and its assumptions openly instead — honest uncertainty survives challenge in a way a padded single number never does.
🚫 Treating the first estimate as final
Why it happens: Once a number is in the plan it feels fixed, and revising it feels like admitting you were wrong.
The fix: The first estimate was made with the least information you will ever have. Re-estimate as scope locks, environments confirm, and defects arrive — and say so. An updated estimate is the job done well, not a failure.
10 Now You Try
Three graded exercises: critique an estimate, fix one, then build one. Write your answer, then compare it to the model answer.
A test lead gave the estimate below for a fixed-date release. Identify 3 things wrong with it as an estimate — not the number itself, but how it was produced and communicated — and say what should have been done instead.
“Testing for the Te Whatu Ora patient-records cut-over will take two weeks. I counted the 24 user stories and figured roughly half a day each to test, so 12 days, call it two weeks. We’ll be fine.” The lead had not seen the integration list, did not know if the test environment was ready, and the figure covered test execution only — no environment setup, defect re-testing, or regression. It went straight into the programme plan as the testing window.
List 3 problems with this estimate and the fix for each:
Model answer
There are at least four real problems here; any three well-explained earns full marks.
1. Single number, no range or assumptions. “Two weeks” is a point estimate that went straight into the plan as a commitment. It should have been a range with stated assumptions (e.g. “two to four weeks, assuming the environment is ready and the integration scope is what I’ve seen”), so the uncertainty was visible to the people setting the date.
2. Execution only. The figure covered test execution and ignored environment/data setup, defect re-testing, and regression, which are often more effort than the first run. The estimate is optimistic by a wide margin. Use a three-point estimate so the pessimistic figure captures this.
3. Estimated without the information. The lead had not seen the integration list or confirmed the environment, yet gave a firm number. An estimate made with that little information is a draft; it should have been flagged as provisional and re-done once scope and environments were known.
Bonus: “half a day each story” is naive uniform estimation. Stories vary enormously in test effort, and a patient-records cut-over has high-risk integration points that need far more than half a day. Analogy to a past cut-over’s actuals would anchor it far better.
The pattern: the number may even be roughly right, but it was produced and communicated like a promise instead of a forecast under uncertainty.
Rewrite the weak estimate below into a three-point estimate communicated as a defensible range. Show your O / M / P figures, the PERT expected value, the range you would report, and the assumptions behind each end. Use a fictional Kiwibank online-banking release as the context.
“Testing the online-banking release will take 15 days. That should be plenty.”
Rewrite as a three-point estimate with a reported range:
Model answer
- Optimistic (O) = 12 days. Assumes: all environments ready day one, scope is the stories seen, few defects, no new integrations.
- Most likely (M) = 18 days. Assumes: normal defect volume and re-testing, one short environment delay, scope roughly as expected (anchored to the last online-banking release, which actually took ~17 days).
- Pessimistic (P) = 32 days. Assumes: a late build handover, a flaky environment for several days, a cluster of defects in the payments path, or one extra integration pulled into scope.
PERT expected = (12 + 4×18 + 32) / 6 = (12 + 72 + 32) / 6 = 116 / 6 ≈ 19.3 days
Range I would report: about 19 days expected, realistic range roughly 16–23 days (≈1 std dev = (32 − 12) / 6 ≈ 3.3 days each side).
What I’d say to the delivery manager: “My estimate is roughly 16 to 23 days, expected around 19. The low end assumes all environments are ready on day one and no new integrations; the high end is if we hit the environment delays we had last release or a defect cluster in payments. I’ve anchored the middle to the last online-banking release, which actually took 17 days. Lock the scope and confirm the environments and I can tighten this.”
What makes it strong: three figures with explicit assumptions, the PERT calculation shown, a reported range rather than a single number, an analogy to a real past actual anchoring the middle, and a clear ask (lock scope, confirm environments) that would narrow the range. The original had none of these: it was an unbacked single number framed as “plenty”.
For a fictional IRD change with a fixed start-of-tax-year go-live, build a short test estimate AND script the “there’s no time for testing” conversation. Cover: a three-point estimate with assumptions; how you’d use a past release as an analogy; and 4 lines of what you’d say when told testing must fit a window shorter than your estimate.
Model answer
1. Three-point estimate: O = 8 days (environments ready, scope is the seen changes, low defect volume); M = 14 days (normal defects and re-testing, one short environment wait); P = 26 days (late handover, an extra calculation rule in scope, or a defect cluster in the assessment logic). PERT = (8 + 4×14 + 26) / 6 = 90 / 6 = 15 days; report ~15 days, range 12–18.
2. Analogy: last year’s start-of-tax-year change to the same assessment module actually took 13 days of test effort and found ~25 defects, with 2 days lost to a data refresh. This change is similar but touches one more calculation rule, so I add ~15%, which lines up with the ~15 day PERT figure. Using the real actual, not last year’s original estimate, anchors it.
3. What the time buys: “In the 9 days available I can fully test the changed assessment calculations and the highest-risk integration, and smoke-test the unchanged paths. I cannot run the full regression on the existing assessment rules to our normal depth.”
4. Residual risk in IRD’s language: “The exposure that leaves is incorrect tax assessments on the paths I can only smoke-test: wrong amounts going out to taxpayers at the start of the year, which is a correction and reputational cost, not just a defect.”
5. Handing back the decision: “So the options are: fit testing into the 9 days and accept that regression risk, move the go-live, or cut the lower-risk changes from this release. That’s a business call. I’ll give you the risk breakdown you need to make it, and I’ll put whatever we decide in writing today.”
Strong answers: a three-point estimate anchored to a real past actual, a clear statement of what the shorter window buys and does not buy, residual risk expressed as IRD’s actual cost (wrong assessments), and the decision handed back with concrete options. Weak answers argue “there isn’t enough time” without making the trade-off visible or offering the business a choice.
11 Self-Check
Q1: Why is giving a single confident number worse than giving a range?
Because the single number stops being your guess and becomes the plan — the commitment everyone holds you to — while the real uncertainty stays hidden until you “run over.” A range with stated assumptions tells the truth, transfers the uncertainty to the people setting the date, and leaves you accountable to something defensible rather than to a guess that hardened into a deadline.
Q2: What is the PERT formula, and why does it lean on the most-likely figure?
Expected = (O + 4M + P) / 6. It weights the most-likely figure four times because that is your best single guess, while still letting the optimistic and pessimistic ends pull the expected value — so a long pessimistic tail raises the expected figure above naive optimism. The spread (P−O)/6 gives you the range to report.
Q3: What makes analogy-based estimation defensible, and what is the key rule when using it?
It estimates from measured history rather than from hope — a past release of similar size is your strongest evidence. The key rule is to use the past release’s actuals, not its original estimate (the old estimate may have been wrong; the actual is a fact), and to state your adjustments for the differences openly so they can be challenged.
Q4: Why is test estimation harder than estimating development?
Testing sits at the end of the chain, so it inherits everyone else’s slippage against a fixed date; defect volume and fix times cannot be predicted, so re-testing and regression effort is unknown up front; environment and test-data delays are common and rarely in the first estimate; and “done” depends on agreed exit criteria that may not exist yet.
Q5: How do you handle “there’s no time for testing” without becoming the blocker?
Do not argue about time. State what the available time buys (which areas you can fully test, which you can only smoke-test), name the residual risk in the stakeholder’s own language (regulatory exposure, patient impact, wrong assessments), and hand the decision back with options — fit it and accept the risk, move the date, or cut scope. Then put the agreed trade-off in writing the same day.
12 Interview Prep
Real questions asked in NZ QA interviews for lead and test-manager roles. Read the model answers, then practise your own version.
“A delivery manager asks you in a meeting how long testing will take. You haven’t seen the detail yet. What do you say?”
I’d give a range with assumptions and flag it as provisional, rather than a single number. Something like: “Roughly four to seven weeks — the low end if all environments are ready and the scope is what I’ve seen, the high end if we hit the environment delays we had last release or extra scope comes in. Let me confirm once I’ve seen the integration list and the environment status.” The point is to be useful and honest at once: I give them something to plan with, I make the uncertainty visible, and I don’t let a corridor guess harden into a committed date before I have the information to commit.
“Walk me through how you’d estimate test effort for a release on a system you’ve tested before.”
I’d anchor with analogy and refine with three-point. First I find the closest past release and take its actuals — real effort, defects found, time lost to environments and re-testing — not its original estimate. That anchors my most-likely figure. Then I set optimistic and pessimistic around it based on how this release differs: more integrations, new testers, tighter environment situation. I run PERT, (O + 4M + P) / 6, to get an expected value and use (P−O)/6 for the spread, so I report a range. And I include everything around execution — design, environment setup, defect re-testing, regression — because those are usually where the real effort hides.
“The build is late, the go-live can’t move, and you’re told to make testing fit. How do you respond?”
I don’t argue about the time — I make the trade-off visible and hand the decision to the business. I’d say what the remaining time actually buys: the highest-risk areas fully tested, the rest smoke-tested, and the specific regression I can’t do to our normal depth. Then I name the residual risk in their terms — for a bank that’s regulatory and financial exposure. Then I lay out the options: fit it and accept that risk, move the date, or cut lower-risk scope. That’s a business call, and I give them the risk breakdown to make it. And I put whatever we agree in writing the same day, so the whole team can plan around it and we don’t reopen the argument the night before go-live.