1 The report that lied with true numbers

“The test report said 94% pass rate. Release went out. Three P1 defects found in production within 6 hours.”

The 94% was accurate. It was also completely useless. Every number in the report was correct. The conclusion it implied — that the software was ready — was wrong. That is a metrics problem, not a testing problem.

This scenario plays out regularly in NZ delivery teams. The dashboard looks green. The report lands in the exec’s inbox with a confident percentage at the top. The go/no-go call is made on the strength of that number. Then production disagrees.

What went wrong? The team measured activity — how many tests ran, how many passed — instead of outcome. The 94% came from 1,400 automated tests that covered well-trodden happy paths. The three P1 defects lived in a payment exception flow that had been added two sprints ago and never had a test case written for it. The pass rate measured what was tested, not what mattered.

This lesson is about metrics that tell the truth. You will learn which numbers to trust, which to ignore, how to frame quality data for different audiences, and how to build a QA OKR that a delivery manager and a CTO both care about. The goal is not a prettier report — it is a report that would have caught that release.

2 The fundamental problem: measuring activity, not outcomes

Most QA metrics were invented to answer the question “how busy are the testers?” not “how safe is the software?” Those are different questions with different answers.

Goodhart’s Law in QA

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Apply it to QA:

  • Target: pass rate. Teams write tests they know will pass, or mark intermittent failures as “known issues” to keep the dashboard green.
  • Target: test cases per sprint. Testers write shallow, low-value cases to inflate the count. Coverage of risky edge cases drops.
  • Target: defects found. Testers log trivial cosmetic defects to hit a number, which masks the real signal: genuine defect discovery is slowing down because the obvious problems have already been found.
  • Target: automation percentage. Teams automate easy, low-risk paths first. The percentage climbs. The critical, complex paths stay manual and fragile.

The gaming problem: Any metric reported up the chain without context becomes an incentive to optimise the metric, not the quality. Pair every metric with its definition and its known failure modes. If you cannot explain what the metric would miss, do not report it.

Activity vs outcome: the distinction

Activity metric (be careful) | Outcome metric (prefer)
Test cases executed | Defect Escape Rate
Pass rate % | Change Failure Rate (DORA)
Defects found in sprint | Defects found in production vs pre-release
Automation % of total test cases | Automation coverage of critical paths
Tests run per day | Mean Time to Recovery after a defect reaches prod

Activity metrics are not worthless — they give context. A pass rate of 94% means something if you also know the requirement coverage is 98% and the Defect Escape Rate over 12 releases is 0.3%. Without that context, the 94% is noise.

3 DORA metrics for QA

DORA (DevOps Research and Assessment) metrics come from the DORA research programme, which is now part of Google Cloud, and are grounded in annual State of DevOps surveys covering tens of thousands of practitioners. They are the closest thing to validated, evidence-based benchmarks for software delivery performance. The performance bands quoted below shift from one annual report to the next, so treat them as indicative. QA teams that understand DORA metrics can speak the language of engineering leadership and CTO-level conversations instead of being trapped in a tester-only vocabulary.

DORA #1
Deployment Frequency

How often the team deploys to production. Elite teams deploy multiple times per day. High: once per week to once per day.

QA angle: fast deployment frequency demands fast feedback. If your regression suite takes 4 hours, it is the bottleneck on deployment frequency.

DORA #2
Lead Time for Changes

Time from a commit being merged to that code running in production. Elite: <1 hour. High: <1 day.

QA angle: testing time is a component of lead time. Manual regression gates inflate lead time. Shift-left and automation compress it.

DORA #3
Change Failure Rate

Percentage of deployments that cause a production incident requiring rollback, hotfix, or degradation. Elite: 0–5%. High: 5–15%.

QA angle: this is the primary DORA metric QA owns. Better regression coverage, risk-based testing, and earlier defect detection directly reduce CFR.

DORA #4
Mean Time to Recovery

How long it takes to restore service after a production incident. Elite: <1 hour. High: <1 day.

QA angle: faster defect detection (in staging, UAT, or canary) means you catch failures before they become long outages. Testing that monitoring and alerting actually fire when something breaks reduces MTTR.
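The four definitions above can be turned into a small calculation over a deployment log. The following is a minimal sketch, not a reference implementation: the `Deployment` record and its field names are hypothetical, and recovery time is approximated as the gap between the deployment and the restoration of service, which assumes the incident starts at deploy time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; a real pipeline will expose different fields.
    merged_at: datetime                     # when the change was merged
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # needed rollback, hotfix, or caused degradation
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Return the four DORA-style figures for one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.merged_at for d in deployments]
    # Assumption: the incident starts at deploy time.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deployments) / (period_days / 7),
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate_pct": 100 * len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Illustrative data only: four deployments in a two-week period, one failure.
t0 = datetime(2025, 7, 1, 9, 0)
log = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_at=t0 + timedelta(hours=10)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=6)),
]
summary = dora_summary(log, period_days=14)
```

With this sample log the Change Failure Rate is 25% (one failure in four deployments) and the mean recovery time is two hours.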

How QA directly affects each DORA metric

DORA metric | QA lever | What to do
Deployment Frequency | Test suite speed | Parallelise test execution; eliminate flaky tests; run risk-tiered gates (smoke → regression → full) so low-risk changes can deploy on a fast path.
Lead Time for Changes | Shift-left coverage | Move testing earlier: unit tests at commit, integration tests at PR, end-to-end tests in CI pipeline. Reduces the manual “testing window” at the end of a sprint.
Change Failure Rate | Regression quality & risk coverage | Track which production defects were not caught pre-release. Trace back to missing coverage. Automate that path. CFR is the QA report card for release quality.
Mean Time to Recovery | Early detection & diagnosis speed | Test logging and alerting. Verify that error messages are actionable. Run production smoke suites post-deploy. Synthetic monitoring in staging catches the same defects before prod.

NZ team benchmarks

For context, here is where NZ teams tend to sit, based on 2024 State of DevOps data and typical NZ bank/government delivery profiles:

  • NZ government agencies (large): Deployment frequency typically monthly to quarterly; CFR 15–30%. Significant opportunity in regression automation.
  • NZ banks and fintechs: Weekly to fortnightly deployments; CFR 8–18%. Best performers in NZ financial sector hit CFR <8% with mature regression suites.
  • NZ SaaS / product companies: Daily to weekly; CFR 5–12%. Constrained by manual regression gates. Automation investment has fastest ROI here.
  • Elite benchmark (global): Multiple deploys/day; CFR <5%; MTTR <1 hour.

Knowing where your team sits against these benchmarks lets you have a concrete conversation with engineering leadership about where testing investment will move the numbers — rather than asking for headcount or tools in the abstract.

4 Defect metrics that actually matter

Most teams track the wrong defect metrics and then wonder why the data does not tell them anything useful. Here are the five defect metrics worth tracking, with their formulas and why they matter.

The five defect metrics worth tracking

Metric 1
Defect Escape Rate (DER)

Production defects ÷ (production defects + pre-release defects) × 100.

Target: <5% for mature teams. <10% is acceptable. Above 15% signals a regression coverage gap. DER is your report card: it tells you what QA missed.

Metric 2
Defect Detection Effectiveness (DDE)

Defects found before release ÷ total defects (including post-release) × 100.

Target: ≥90% for production-critical systems. Banking and health systems in NZ often set this at ≥95%. DDE is the complement of DER (DDE = 100% − DER), so use whichever frame resonates with your stakeholder.
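The two formulas above can be checked against each other in a few lines. This is a minimal sketch with hypothetical function names and made-up defect counts; it only illustrates that the two figures always sum to 100%.

```python
def defect_escape_rate(pre_release: int, production: int) -> float:
    """DER: production defects as a percentage of all defects."""
    total = pre_release + production
    if total == 0:
        raise ValueError("no defects recorded; DER is undefined")
    return 100 * production / total


def defect_detection_effectiveness(pre_release: int, production: int) -> float:
    """DDE: pre-release defects as a percentage of all defects."""
    total = pre_release + production
    if total == 0:
        raise ValueError("no defects recorded; DDE is undefined")
    return 100 * pre_release / total


# Illustrative counts: 57 defects caught before release, 3 found in production.
der = defect_escape_rate(pre_release=57, production=3)              # 5.0
dde = defect_detection_effectiveness(pre_release=57, production=3)  # 95.0
```

Because both figures share a denominator, reporting one fixes the other. The choice between them is about framing, not information.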

Metric 3
Defect Removal Effectiveness (DRE)

Defects removed before delivery ÷ total defects × 100, often measured per phase (unit, integration, system, UAT).

Use: shows where in the pipeline defects are being caught. If DRE at unit test is low but high at system test, defects are leaking past cheap early catches and being caught expensively late.
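A per-phase view can be sketched as follows. The definition used here is an assumption, since the text does not fix one: each phase's DRE is the defects caught in that phase divided by the defects still present when the phase began (everything caught in that phase or later, including production). The counts are invented for illustration.

```python
def phase_dre(found_by_phase: dict[str, int], production: int) -> dict[str, float]:
    """Per-phase DRE, with phases given in pipeline order.

    Assumed definition: defects caught in the phase divided by defects
    still present at the start of the phase.
    """
    remaining = sum(found_by_phase.values()) + production
    result = {}
    for phase, caught in found_by_phase.items():
        result[phase] = 100 * caught / remaining if remaining else 0.0
        remaining -= caught
    return result


# Illustrative counts: most defects are only caught at system test.
found = {"unit": 10, "integration": 20, "system": 60, "uat": 5}
dre = phase_dre(found, production=5)
# unit: 10 of 100 -> 10.0; integration: 20 of 90 -> ~22.2;
# system: 60 of 70 -> ~85.7; uat: 5 of 10 -> 50.0
```

In this sample the unit-test DRE is 10% while the system-test DRE is about 86%, which is the late-and-expensive pattern described above.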

Metric 4
Age of Open Defects

Average and maximum days a defect has been open without resolution, segmented by severity.

Watch for: P1/P2 defects open >5 days indicate blocked resolution. Defect age by component surfaces which teams are not actioning QA findings. Old open defects become invisible technical debt.
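Defect age is straightforward to derive from a tracker export. The sketch below assumes each open defect is available as a `(severity, opened_date)` pair, which is a hypothetical shape; the 5-day limit for P1/P2 comes from the guidance above.

```python
from collections import defaultdict
from datetime import date


def age_by_severity(open_defects: list[tuple[str, date]], today: date) -> dict:
    """Average and maximum age in days of open defects, grouped by severity."""
    ages = defaultdict(list)
    for severity, opened in open_defects:
        ages[severity].append((today - opened).days)
    return {
        sev: {"avg_days": sum(a) / len(a), "max_days": max(a)}
        for sev, a in ages.items()
    }


def stale_high_severity(open_defects, today: date, limit_days: int = 5):
    """P1/P2 defects that have been open longer than limit_days."""
    return [
        (sev, opened)
        for sev, opened in open_defects
        if sev in {"P1", "P2"} and (today - opened).days > limit_days
    ]


# Illustrative data only.
today = date(2025, 7, 15)
defects = [
    ("P1", date(2025, 7, 13)),  # 2 days old
    ("P2", date(2025, 7, 3)),   # 12 days old
    ("P2", date(2025, 7, 11)),  # 4 days old
    ("P3", date(2025, 6, 15)),  # 30 days old
]
ages = age_by_severity(defects, today)
stale = stale_high_severity(defects, today)
```

Here the P2 average is 8 days and only the 12-day-old P2 is flagged as stale.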

Metric 5
Arrival vs Resolution Rate

New defects logged per sprint vs defects closed per sprint, plotted as a trend line.

Signal: if arrivals consistently outpace resolutions, defect debt is accumulating. The trend line matters more than any single sprint. On a cumulative chart, the point where the resolution line meets the arrival line is your “burning down” date, when the open backlog reaches zero.
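The gap between the two lines is simply a running total of arrivals minus resolutions. A minimal sketch, with a hypothetical function name and invented sprint figures:

```python
from itertools import accumulate


def open_backlog(arrivals: list[int], resolutions: list[int],
                 starting_backlog: int = 0) -> list[int]:
    """Open defect count at the end of each sprint."""
    net = [a - r for a, r in zip(arrivals, resolutions)]
    # accumulate(..., initial=...) yields the starting value first; drop it.
    return list(accumulate(net, initial=starting_backlog))[1:]


# Illustrative data: arrivals outpace resolutions in every sprint.
arrivals = [12, 15, 14, 18]
resolutions = [10, 11, 12, 13]
backlog = open_backlog(arrivals, resolutions, starting_backlog=20)
debt_is_growing = backlog[-1] > backlog[0]
```

With these numbers the backlog climbs from 22 to 33 over four sprints, even though no single sprint looks alarming on its own.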

Vanity metrics to stop reporting

Stop reporting these

  • Total defects found — higher is not better or worse without context. Depends entirely on scope tested.
  • Tests executed — running 2,000 tests that all pass is not evidence of quality.
  • Pass rate % — the opening example. Measures what was tested, not whether the right things were tested.
  • Defects by severity count (static) — a snapshot. “12 P3 open” means nothing without the trend and the component.
  • Hours of testing — effort is not outcome. Eight hours of exploratory testing on a payment flow is more valuable than 80 hours of scripted regression on low-risk screens.

Report these instead

  • Defect Escape Rate trend — are we getting better release over release?
  • DRE by phase — where are we catching defects in the pipeline?
  • P1/P2 open count with age — actionable, time-sensitive, severity-ranked.
  • Defect arrival vs resolution trend — are we burning down or accumulating debt?
  • Critical path automation coverage — are the flows that matter protected?

5 Test coverage metrics — and what they miss

Coverage metrics are attractive because they are quantitative and appear objective. They are also some of the most misunderstood numbers in software delivery. Here is what each type measures and, critically, what it does not.

Requirement coverage

The percentage of documented requirements that have at least one corresponding test case. If you have 120 requirements and 96 have test cases, requirement coverage is 80%.

Useful for: demonstrating to auditors, clients, or regulators (FMA, RBNZ, MPI) that contractually required features have been tested. Strong for compliance-heavy NZ government and financial services work.

Misses: untested requirements due to poor specification, integration behaviour that falls between requirements, non-functional requirements often left out of traceability matrices entirely.
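Requirement coverage falls out of a traceability mapping. The sketch below assumes a hypothetical mapping from test-case IDs to the requirement IDs they verify; the IDs are invented. It returns the list of uncovered requirements alongside the percentage, because that list is usually the more useful output.

```python
def requirement_coverage(requirements: set[str],
                         tests: dict[str, set[str]]) -> tuple[float, list[str]]:
    """Return (coverage %, sorted list of requirements with no test case)."""
    if not requirements:
        raise ValueError("no requirements supplied")
    # Ignore any requirement IDs referenced by tests but not in the baseline.
    covered = set().union(*tests.values()) & requirements
    uncovered = sorted(requirements - covered)
    return 100 * len(covered) / len(requirements), uncovered


# Illustrative data only.
reqs = {"REQ-1", "REQ-2", "REQ-3", "REQ-4", "REQ-5"}
traceability = {
    "TC-1": {"REQ-1", "REQ-2"},
    "TC-2": {"REQ-2", "REQ-3"},
    "TC-3": {"REQ-4"},
}
pct, missing = requirement_coverage(reqs, traceability)
```

Here coverage is 80% and `missing` names REQ-5 as the gap.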

Risk coverage

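The percentage of identified risk areas that have corresponding test coverage, ideally weighted by risk score (likelihood × impact). Unweighted, a team with a risk register of 40 identified risks has 90% risk coverage if 36 of those risks have at least one test designed to detect them. Weighted, the figure is the covered risks' share of the total risk score, so one untested high-scoring risk costs more than several untested minor ones.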

Useful for: communicating to a risk-conscious stakeholder (CTO, CISO, board) that the things most likely to hurt the organisation are the things being tested most thoroughly. This is the metric that should accompany a go/no-go recommendation.

Misses: risks not identified in the risk register. The risks that hurt you are often the ones nobody thought to add.
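The difference between a simple count and a score-weighted figure can be sketched as follows. The `Risk` record, the 1 to 5 scales, and the register entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale of 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale of 1 (minor) to 5 (severe)
    covered: bool    # at least one test is designed to detect this risk

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def risk_coverage(risks: list[Risk]) -> tuple[float, float]:
    """Return (unweighted %, score-weighted %) risk coverage."""
    if not risks:
        raise ValueError("empty risk register")
    unweighted = 100 * sum(r.covered for r in risks) / len(risks)
    total_score = sum(r.score for r in risks)
    covered_score = sum(r.score for r in risks if r.covered)
    return unweighted, 100 * covered_score / total_score


# Illustrative register: the single uncovered risk is the highest-scoring one.
register = [
    Risk("payment exception flow", likelihood=4, impact=5, covered=False),
    Risk("login", likelihood=2, impact=5, covered=True),
    Risk("search", likelihood=3, impact=2, covered=True),
    Risk("profile page", likelihood=2, impact=2, covered=True),
]
unweighted_pct, weighted_pct = risk_coverage(register)
```

Three of four risks are covered, so the unweighted figure is 75%, but the weighted figure is only 50% because the untested risk carries half of the total score.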

Code coverage

The percentage of lines, branches, or paths in the codebase executed by the test suite. Measured by tools like JaCoCo (Java), Istanbul/nyc (JavaScript), Coverage.py (Python).

Warning: 100% code coverage does not mean zero bugs. Code coverage measures whether a line was executed, not whether the behaviour was correct. You can have a test that calls calculateTax(100) and asserts the result is not null — it will give you code coverage without testing that the tax rate is correct. Coverage without assertions is theatre. High coverage with weak assertions is arguably worse than lower coverage with strong assertions, because it creates false confidence.
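The `calculateTax` scenario can be made concrete. In this minimal sketch, `calculate_tax` is a hypothetical function with a deliberate bug, and the 15% rate is an assumption chosen for illustration. Both checks execute every line of the function, so both earn full line coverage, but only one of them notices the bug.

```python
def calculate_tax(amount: float) -> float:
    """Hypothetical tax function with a deliberate bug:
    the rate was typed as 0.51 instead of the intended 0.15."""
    return amount * 0.51


def weak_check() -> bool:
    # Runs the function and confirms something came back.
    # This is enough for 100% line coverage of calculate_tax.
    return calculate_tax(100) is not None


def strong_check() -> bool:
    # Asserts on behaviour: 15% of 100 should be 15.
    return abs(calculate_tax(100) - 15.0) < 1e-9


weak_passes = weak_check()      # passes despite the bug
strong_passes = strong_check()  # fails, exposing the bug
```

A coverage tool would report the same percentage for either check, which is why the number alone says little about whether a regression would be caught.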

Common code coverage targets and their context:

  • 70–80%: Typical commercial target. Pragmatic for most NZ product teams. Ensures the mainline is exercised.
  • 80–90%: Appropriate for payment processing, health data handling, financial calculations. Worth the cost.
  • 90%+: Often mandated for safety-critical or regulated systems. Returns diminish sharply above 85% — the last 10–15% of coverage usually costs more to write and maintain than the risk it protects.
  • 100%: Usually a sign that someone is gaming the metric. Some code is genuinely untestable (error handling for conditions that cannot be simulated). Chasing 100% produces low-value tests that slow the suite.

The right question to ask

Instead of “what is our code coverage?”, ask: “What percentage of our highest-risk user journeys are covered by automated tests with assertions that would catch a regression?” That question is harder to game and more directly connected to production safety.

6 OKRs for QA: connecting testing to business outcomes

OKRs (Objectives and Key Results) are the planning framework used by most NZ technology companies at leadership level. A QA team that can write OKRs in the same language as engineering and product leadership will get budget, headcount, and tooling approvals that a team reporting in QA-only metrics will not.

The structure

An Objective is qualitative, inspirational, and time-bound. It answers “where do we want to be?” A Key Result is quantitative, measurable, and unambiguous to score: at the end of the period you either hit the number or you did not. It answers “how will we know we got there?”

Example QA OKR — aligned to a delivery team

Objective: Ship faster with higher confidence in production safety — Q3 FY2025

KR1 Change Failure Rate reduced from 12% to ≤5% by end of Q3, measured across all production deployments.
KR2 P1 defect escape rate <1 per release across all releases in Q3 (currently averaging 2.4 per release).
KR3 Automated test coverage of the 8 critical user journeys reaches 85% by end of Q3 (currently 52%).
KR4 Average regression suite execution time reduced from 4.2 hours to ≤45 minutes, enabling same-day deployment capability.
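Because each KR above states a baseline and a target, mid-quarter progress can be expressed as the fraction of the distance travelled. This is a sketch under that assumption; the function name and the "current" readings are hypothetical, and the same formula works whether the KR is a reduction or an increase.

```python
def kr_progress(baseline: float, target: float, current: float) -> float:
    """Fraction of the way from baseline to target, clamped to 0..1."""
    if baseline == target:
        raise ValueError("baseline and target must differ")
    progress = (current - baseline) / (target - baseline)
    return max(0.0, min(1.0, progress))


# KR1: Change Failure Rate from 12% to 5%; hypothetical mid-quarter reading of 8.5%.
kr1 = kr_progress(baseline=12, target=5, current=8.5)
# KR3: critical-journey coverage from 52% to 85%; hypothetical reading of 68.5%.
kr3 = kr_progress(baseline=52, target=85, current=68.5)
```

Both sample readings come out at 0.5, meaning halfway to target. A reading that moves the wrong way clamps to 0 rather than going negative.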

Example QA OKR — NZ government/health context

Objective: Zero P1 production defects on regulated patient-facing systems in H2 FY2025

KR1 Defect Detection Effectiveness ≥95% for all releases to production on Patient Portal (currently 87%).
KR2 All 23 HIPC compliance test scenarios automated and integrated into CI pipeline by end of July.
KR3 MTTR for production defects reduced from 6.2 hours to ≤2 hours through synthetic monitoring coverage of all critical API endpoints.

How to align QA OKRs to business OKRs

A QA OKR must trace to a business OKR or it will not survive the planning cycle. The mapping looks like this:

Business OKR | QA OKR that supports it
Increase revenue by shipping 3 new product features per quarter | Reduce regression suite runtime to ≤30 min so CI does not gate feature velocity
Reduce customer churn by improving app reliability | Reduce Change Failure Rate from 14% to ≤5%; P1 escape rate <1 per quarter
Achieve ISO 27001 certification | 100% coverage of security test scenarios in automated suite; DDE ≥95% on auth flows
Reduce operational cost by 20% | Reduce manual regression effort from 80 hrs/release to ≤20 hrs through automation of critical paths

The pitch to leadership: “We are not asking for tooling budget to improve our pass rate. We are asking for tooling budget to reduce Change Failure Rate from 12% to 5%, which, at our current deployment frequency of 24 releases per year, eliminates approximately 1.7 production incidents per year. At our average incident cost of $18,000 (SLA penalties, ops time, customer credit), that is $30,000 annual saving from a $15,000 tooling investment.” That is a conversation that gets approved.
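The arithmetic behind that pitch is worth showing, because leadership will ask how the figures were derived. The sketch below uses only the numbers quoted in the pitch; the function and variable names are hypothetical.

```python
def incidents_avoided(releases_per_year: int,
                      cfr_before_pct: float,
                      cfr_after_pct: float) -> float:
    """Expected production incidents avoided per year by lowering CFR."""
    return releases_per_year * (cfr_before_pct - cfr_after_pct) / 100


avoided = incidents_avoided(releases_per_year=24,
                            cfr_before_pct=12,
                            cfr_after_pct=5)      # about 1.68 incidents
annual_saving = avoided * 18_000                  # about $30,240
net_first_year = annual_saving - 15_000           # about $15,240 after tooling cost
```

The pitch rounds 1.68 incidents to 1.7 and $30,240 to $30,000. The saving scales linearly with deployment frequency, so a team releasing more often than 24 times a year has a stronger case from the same CFR improvement.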

7 What to report to whom

The same data, sent to the wrong audience, wastes everyone’s time and trains people to ignore your reports. Different stakeholders need different signals at different frequencies. One report for all audiences is a report nobody reads.

Audience | Frequency | What they need to know | Key metrics
Developers | Daily / per-PR | Which tests failed and why; which components have the most defects; flaky test rate; test run time trend | Defect age by component; flaky test rate %; CI run time; test failure rate by suite
Scrum Master / PM | End of sprint | Testing completion vs plan; blockers that need action; defect status by priority; anything that threatens the release commitment | Planned vs executed test cases; open P1/P2 count; blockers list; defect arrival vs resolution trend
Product Owner | End of sprint / release | Requirement coverage: which accepted stories have been tested; which P1/P2 defects are open and blocking sign-off; release risk summary | Requirement coverage %; P1/P2 open by story; release recommendation (go/no-go with rationale); DDE for this release
Exec / Client | Release / monthly | Is the system safe to release? Are we improving over time? What is the production defect trend? | Release recommendation (one sentence); Defect Escape Rate trend (last 6 releases); Change Failure Rate; P1 production defects this quarter vs last

The release recommendation

The most important report a QA lead writes is the release recommendation. It goes to the exec or client, it is one page maximum, and it answers one question: should we ship?

RELEASE RECOMMENDATION: Sprint 24 / Release 4.7.2
Date: 2026-06-27

RECOMMENDATION: GO

SUMMARY
All exit criteria met. 3 known P3 defects deferred with PO sign-off (logged in JIRA as tech debt). No P1/P2 open. Regression suite: 847/847 passed. Critical paths: 100% automated coverage.

RISK ACCEPTED
DEF-2341 (P3): Cart item count flicker on slow 3G. Deferred, affects <1% of users, NZ-monitored.

EVIDENCE
Requirement coverage: 98% (192/196 requirements tested)
DDE this release: 93% (28 defects found pre-release, 2 found in UAT staging, 0 in prod so far)
Regression pass rate: 100% (847/847)
Critical path coverage: 100% automated
Defect Escape Rate (last 6 releases): 4.1% → 3.8% → 3.2% → 2.9% → 2.7% → 2.1% ✓ trend

SIGN-OFF
QA Lead: [name] | Date: 2026-06-27
PO acceptance of deferred P3: [name] | Date: 2026-06-26

Note what this report does not contain: test execution counts, hours spent, which testers worked on what. Executives do not need that information. They need a clear recommendation with the evidence behind it and the risks they are accepting.

8 The QA dashboard: trend lines, not snapshots

A dashboard built from snapshots tells you where you are. A dashboard built from trend lines tells you where you are going. For quality assurance, direction matters more than current state. A team with a 12% escape rate that has improved from 25% over 6 months is in a better position than a team with an 8% escape rate that has risen from 4%.
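Direction can be quantified with a least-squares slope over the last few releases. This is a minimal sketch: the function name is hypothetical, and the two series are invented to match the endpoints in the comparison above (25% down to 12%, and 4% up to 8%).

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (change per release)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical escape-rate series (%), oldest release first.
improving_team = [25, 22, 19, 16, 14, 12]   # worse snapshot, better direction
worsening_team = [4, 5, 5, 6, 7, 8]         # better snapshot, worse direction

improving_slope = trend_slope(improving_team)   # negative: escape rate falling
worsening_slope = trend_slope(worsening_team)   # positive: escape rate rising
```

A snapshot comparison would favour the second team (8% against 12%). The slopes show the opposite picture, which is the point of plotting trends.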

Core dashboard components

  • Defect Escape Rate over time — 12-release rolling trend, goal line visible. This is your headline quality metric.
  • Change Failure Rate (DORA) — per deployment or per sprint, with industry benchmark overlay.
  • Automation coverage of critical paths — percentage with trend, broken down by product area.
  • Defect arrival vs resolution — two lines on one chart. The gap is your defect debt. The trend tells you if it is growing or shrinking.
  • CI pipeline health — pass rate, execution time, and flaky test rate over the last 30 days. A rising flaky rate is an early warning of test suite rot.
  • Mean Time to Recovery — rolling average per quarter. Tracks whether early detection is reducing the blast radius of production defects.
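The flaky test rate in the CI health panel needs a working definition before it can be charted. The sketch below assumes one common definition: a test is flaky if it has both passed and failed against the same commit. The input shape, a list of `(test_name, commit_sha, passed)` tuples, is hypothetical.

```python
from collections import defaultdict


def flaky_test_rate(runs: list[tuple[str, str, bool]]) -> float:
    """Percentage of tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    tests = set()
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
        tests.add(test)
    # Two distinct outcomes for one (test, commit) pair means pass and fail.
    flaky = {test for (test, _), seen in outcomes.items() if len(seen) == 2}
    return 100 * len(flaky) / len(tests) if tests else 0.0


# Illustrative CI history: test_checkout flips on commit abc123 with no code change.
history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", True),
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),
    ("test_search", "abc123", False),   # consistently failing, not flaky
    ("test_search", "abc123", False),
    ("test_profile", "def456", True),
]
rate = flaky_test_rate(history)
```

One of the four tests is flaky here, giving a rate of 25%. A test that fails consistently is a real failure, not a flaky one, and is excluded by this definition.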

Tools for NZ teams

Tool | Best for | Notes
Azure DevOps Analytics | Teams using Azure DevOps for both work tracking and CI/CD | Built-in test results trending, work item queries for defect age and velocity. Most NZ banks and government agencies already have ADO, so there is no additional tooling cost.
Jira Dashboards + eazyBI or Tableau | Teams tracking defects in Jira | Default Jira dashboards are limited. eazyBI or a Tableau connector gives proper trend line visualisation. Common in NZ telcos and retailers.
Grafana + data pipeline | Engineering-led teams with CI/CD metrics already in Prometheus/InfluxDB | Best for DORA metrics visualisation. DORA metrics from deployment pipeline → Prometheus → Grafana gives real-time CFR and MTTR dashboards. Popular in NZ SaaS and product companies.
Xray (Jira plugin) dashboards | Teams using Xray for test management | Native test execution dashboards. Requirement traceability matrix built in. Good for regulatory compliance reporting.

Dashboard design principle: one dashboard is not enough

Build three views off the same data: a team dashboard (daily, operational, granular), a sprint dashboard (for the SM/PM — completion vs plan, blockers, defect status), and an executive summary (one page, trend lines, release recommendation). Same underlying data, different zoom levels and different audiences.

What to put on the wall

If your team uses a physical information radiator (whiteboard, TV screen, shared monitor), the most valuable live metrics to display are the CI pipeline status and the current defect count by severity. Do not display a pass rate percentage, because that number creates false comfort. Open P1/P2 defects create urgency and focus. A failing CI pipeline is visible immediately. Both drive correct team behaviour.

9 Self-check


A test report shows 94% pass rate on 1,400 automated tests. The release goes out and two P1 defects are found in production within 24 hours. A developer says “but we had a 94% pass rate—QA signed off”. How do you explain what went wrong, and what metric would have surfaced the problem before release?

The 94% pass rate measured what was tested, not whether the right things were tested. If the two P1 defects were in flows that had no test coverage, a 100% pass rate on the existing suite would still have missed them. The metric that would have surfaced this is requirement or risk coverage — specifically, whether the payment exception flow that produced the P1s had any test cases written for it. You would also look at Defect Escape Rate from prior releases: if this team had an escape rate of 8% across the previous 6 releases, the 94% pass rate should have prompted “what are the other 6% likely to be?” not “we’re good to ship.”

Your team’s Change Failure Rate is 18%. Engineering leadership wants to improve it. Explain what CFR is, what QA can do to directly reduce it, and what a realistic target would be.

Change Failure Rate (CFR) is the percentage of production deployments that result in an incident requiring rollback, hotfix, or degraded service — one of the four DORA metrics. At 18% it is in the “medium performer” band; elite performers run 0–5%. QA reduces CFR by (a) improving regression coverage so more defects are caught pre-release, (b) risk-stratifying test selection so the highest-risk changes get the most scrutiny, (c) running post-deploy smoke suites immediately after release to catch issues before they become incidents, and (d) using shift-left practices to catch defects at unit/integration level rather than system test, reducing the chance of late surprises. A realistic 6-month target for a team at 18% is ≤10%; 12-month target ≤6%.

A product manager asks why you are reporting Defect Escape Rate instead of just defect counts. They say “12 defects this sprint vs 18 last sprint looks like improvement.” How do you explain why DER is a more reliable signal?

Defect count depends heavily on scope. If this sprint had half the new development of last sprint, 12 defects could represent a worse relative defect density than 18. Also, raw defect count does not tell you whether defects are being caught pre-release or post-release, which is the distinction that matters. Defect Escape Rate (production defects ÷ total defects) shows whether the defects that exist are being caught in testing or escaping to production. A sprint with 5 defects found, all in testing, is far safer than a sprint with 5 defects found, 3 of which are P1s in production. DER separates those two situations. Raw defect count does not.

Write a KR for the following objective: “Make our NZ banking app reliable enough to be our primary acquisition channel.” The current P1 production escape rate is 3.2 per release and MTTR is 5.4 hours.

Good example KRs: KR1: P1 defect escape rate reduced from 3.2 to ≤0.5 per release across all Q3 releases, measured by post-release production incident log. KR2: MTTR for production P1 defects reduced from 5.4 hours to ≤1 hour by end of Q3, through synthetic monitoring of all critical payment and authentication flows. KR3: Automated regression coverage of the 12 critical acquisition and onboarding journeys reaches 90% by end of July (currently 44%). The KRs should be quantitative, include a baseline, include a target, and specify the measurement method. Vague KRs like “improve test coverage” are not KRs — they are activities.

You are building a QA dashboard for a NZ government client. They use Azure DevOps. What five metrics would you put on the executive view, and why those five?

The five for an executive view: (1) Release recommendation — go/no-go with one-line rationale; this is what executives need before every release. (2) Defect Escape Rate trend (last 6 releases) — shows whether quality is improving over time, which is the fundamental QA promise. (3) Change Failure Rate — DORA metric that connects directly to service reliability and public trust, which government agencies care deeply about. (4) P1/P2 production defects this quarter vs prior quarter — the outcome measure that maps directly to ministerial or client reporting requirements. (5) Requirement coverage % for current release — for a government client, demonstrating that contractually or legislatively required features have been tested is often a compliance obligation. ADO’s Analytics views can surface all five without custom tooling.
