20 min read · 9 self-checks · Updated June 2026

Modern Practice · Cost & Carbon

Sustainable Testing

Also called green testing: cut the cost and the carbon footprint of testing without weakening the safety net. Run the tests that actually tell you something, right-size the pipeline that runs them, and stop paying — in dollars and emissions — for full regression on every commit when a fraction of it would do.


1 The Hook

A Wellington SaaS team had a CI rule they were proud of: every push to any branch ran the full regression suite — 9,000 tests, 40 minutes, on a fleet of cloud runners spun up on demand. It felt rigorous. It also felt slow, and the cloud bill kept climbing.

When someone finally read the numbers, the picture was stark. Most pushes touched one or two files, yet every one fired all 9,000 tests across a dozen parallel runners. On top of that, roughly 70% of all pipeline runs were nightly and weekend builds against branches nobody had changed since the last green run. The suite was passing the same tests, on the same unchanged code, again and again, burning compute, cash, and electricity for no new information.

They made three small changes: only run the tests affected by the files that changed, drop scheduled full builds on idle branches, and reuse the dependency cache instead of rebuilding it each time. The suite still caught real regressions — but the monthly cloud spend fell by more than a third, the median pull request went green in eight minutes instead of forty, and the energy the pipeline drew dropped with it. Faster, cheaper, and lower-emission turned out to be the same change.


Senior Engineer Insight

The counterintuitive part: teams that start with test impact analysis (TIA) usually make their pipeline worse before they make it better. They spend a sprint configuring TIA, get it mostly right, and then immediately drop the full-regression backstop because they assume the mapping is complete. It never is. Shared fixtures, generated code, and environment config do not show up in a file-change graph. The real win order is: fix your flaky tests first, then reuse caches, then go ephemeral on environments. Do those three and you typically recover 50–60% of your waste before TIA is even configured. TIA is the icing. The teams I've seen on NZ government projects (Revenue NZ, Benefits NZ) that skipped the fundamentals and went straight to impact analysis spent months chasing ghost failures they couldn't reproduce, because they were debugging the mapping, not the software.

2 The Rule

A test run that gives you no new information is pure waste — of time, money, and energy. Run the tests the change actually affects, right-size the pipeline that runs them, and a lean fast suite is also the low-cost, low-carbon one.

3 The Analogy

Analogy

Boiling a full jug to make one cup of tea.

If you want a hot drink, you fill the jug with one cup of water and boil that. Filling the jug to the brim and boiling it for a single cup wastes power, takes longer, and the extra water just sits there cooling — you paid for heat you never used. Running a full 9,000-test regression to check a one-line copy change is boiling a full jug for one cup.

Sustainable testing is boiling only the water you need, and boiling it when power is cheapest and cleanest. You still get your tea — the confidence that the change is safe — but you stop paying, in dollars and emissions, for heat that does nothing.

Senior engineer insight

The moment sustainable testing clicked for me was when I stopped thinking about it as "running fewer tests" and started thinking about it as "not paying twice for the same answer." Every time you re-run a test against code nobody touched, you're buying information you already own. That reframe changes which questions you ask — instead of "how do we make the suite faster?" you ask "which runs are producing net-new signal, and which are just receipts?"

The most common mistake: teams implement test impact analysis and immediately drop the periodic full-regression backstop, assuming the dependency map is complete. It never is — shared config, generated code, and environment variables don't show up in file-change graphs. Keep the backstop. TIA governs the per-commit run; full regression governs what ships.

From the field

A Wellington fintech team I worked with had invested three years in a 7,000-test Playwright suite they were rightfully proud of — it caught real bugs and the team trusted it. The problem was it cost $4,200 NZD a month to run and took 52 minutes per commit on 16 parallel runners. When the CTO asked for a 30% cost reduction, the first instinct was to delete the slowest 1,500 tests. That would have worked on the invoice but hollowed out two years of regression coverage built after painful production incidents. Instead, we spent a week introducing test impact analysis and moving full regression to merge-to-main only. Monthly cost dropped by 38%, median PR feedback went from 52 minutes to 9, and not a single test was deleted. The lesson: the waste is almost never in the tests themselves — it's in the scheduling of them.

What it is

Sustainable testing (often called green testing) is the practice of reducing the environmental and cost footprint of testing while keeping the same confidence in quality. Every automated test run consumes compute, and compute consumes electricity, which has a carbon cost — especially when it runs on cloud infrastructure billed by the minute. The aim is to spend that compute deliberately: run what tells you something, skip what does not, and time and size the work so it is efficient rather than wasteful.

It is not about testing less for the sake of it. It is about removing redundant work — the runs that re-confirm code nobody changed, the parallelism that sits idle, the environments that stay up overnight doing nothing — so the genuine safety net stays intact while the waste around it disappears.

Why it matters now

Three forces have made this a live 2026 concern. Cloud CI is billed by the compute-minute, so wasted runs show up directly on the invoice. Suites have grown to tens of thousands of tests, so “run everything, every time” is now genuinely expensive. And organisations increasingly report on emissions, so the energy a pipeline draws is no longer invisible. The happy result is that the cheap option and the low-carbon option are usually the same option — a fast, lean pipeline costs less and emits less at the same time.

Efficient test selection

The biggest savings come from not running tests that cannot tell you anything new:

Test impact analysis: map which tests exercise which code, so a change runs only the tests that touch the changed files (and their dependents) — not the whole suite.
Risk-based trimming: weight test effort towards the highest-risk areas and run lower-risk suites less often. See risk-based testing for how to rank that risk.
Avoid redundant full regression: reserve the full regression run for merges to the main line or releases, rather than firing it on every commit to every branch.
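
As a rough illustration of the selection step, the sketch below maps tests to the source files they exercise and picks the tests to run for a change. The map, the file names, and the `select_tests` helper are all hypothetical; a real mapping would come from a coverage tool rather than being written by hand. It also assumes that any changed file missing from the map should trigger the full suite, which is one conservative way to handle indirect dependencies that a file-change graph cannot see.

```python
from typing import Dict, Iterable, Set

# Hypothetical coverage map: test name -> source files it exercises.
COVERAGE_MAP: Dict[str, Set[str]] = {
    "test_login": {"auth/login.py", "auth/session.py"},
    "test_invoice_total": {"billing/invoice.py"},
    "test_profile_page": {"web/profile.py", "auth/session.py"},
}


def select_tests(changed_files: Iterable[str],
                 coverage_map: Dict[str, Set[str]]) -> Set[str]:
    """Return the tests to run for a change.

    Falls back to the full suite when a changed file is unknown to the map.
    """
    changed = set(changed_files)
    mapped_files = set().union(*coverage_map.values())
    # A changed file the map has never seen (config, generated code, fixtures)
    # cannot be reasoned about, so run everything rather than guess.
    if changed - mapped_files:
        return set(coverage_map)
    # Otherwise keep only the tests that touch at least one changed file.
    return {test for test, files in coverage_map.items() if files & changed}
```

With this map, a change to `auth/session.py` selects the two tests that touch it, while a change to an unmapped file such as `config/settings.yaml` selects the whole suite.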

Right-sizing CI

Once you are running the right tests, make the pipeline that runs them efficient:

Parallelism without waste: parallel runners cut wall-clock time, but over-provisioning spins up runners that finish early and idle. Match runner count to the actual shard sizes.
Cache reuse: reuse dependency, build, and container-layer caches instead of rebuilding from scratch each run — rebuilds are some of the most wasteful compute in a pipeline.
Ephemeral environments: spin test environments up on demand and tear them down immediately after, so nothing runs idle overnight.
Scheduled vs every-commit: decide deliberately what runs on every commit (fast, affected tests) versus on a schedule (broader suites) — and stop scheduled runs against branches that have not changed.
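
To make the parallelism point concrete, here is a minimal sketch that estimates wall-clock time and idle runner time for a given runner count, then picks the smallest count that meets a feedback target. The greedy longest-first assignment is an assumption about how tests are sharded, not a description of any particular CI tool, and the function names are illustrative.

```python
import heapq
from typing import List, Optional, Tuple


def shard_loads(durations: List[float], runners: int) -> List[float]:
    """Greedy longest-first sharding: each test goes to the least-loaded runner."""
    loads = [0.0] * runners
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return sorted(loads)


def wall_clock_and_idle(durations: List[float], runners: int) -> Tuple[float, float]:
    """Return (wall-clock minutes, total minutes runners spend waiting for the slowest shard)."""
    loads = shard_loads(durations, runners)
    wall = loads[-1]
    idle = sum(wall - load for load in loads)
    return wall, idle


def smallest_runner_count(durations: List[float], target_minutes: float,
                          max_runners: int = 64) -> Optional[int]:
    """Smallest runner count whose wall-clock time meets the target, or None."""
    for count in range(1, max_runners + 1):
        wall, _ = wall_clock_and_idle(durations, count)
        if wall <= target_minutes:
            return count
    return None
```

For test durations of 8, 4, 4, 2 and 2 minutes, two runners finish in 10 minutes with no idle time, while eight runners finish in 8 minutes but leave 44 runner-minutes waiting on the slowest shard. That gap is the waste the bullet on parallelism describes.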

Carbon-aware pipelines

Beyond running less, you can run smarter about when and where. A carbon-aware pipeline shifts non-urgent work — nightly suites, heavy data jobs — to times or regions where the electricity grid is cleaner. New Zealand’s grid is largely renewable but its carbon intensity still varies through the day; scheduling a heavy nightly run for a low-intensity window, or choosing a lower-carbon cloud region, trims emissions for the same work. Urgent feedback on a pull request still runs immediately — carbon-awareness applies to the work that can safely wait.
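
One way to picture the scheduling decision is as a search over a grid-intensity forecast for the cleanest window that fits the job. The sketch below assumes an hourly forecast in gCO2e per kWh is already available as a list; where that forecast comes from, and the `best_start_hour` helper itself, are assumptions for illustration.

```python
from typing import Sequence


def best_start_hour(forecast: Sequence[float], job_hours: int) -> int:
    """Return the start index of the window with the lowest total grid intensity.

    `forecast` holds one intensity value per hour; `job_hours` is how many
    consecutive hours the deferrable job needs.
    """
    if job_hours < 1 or job_hours > len(forecast):
        raise ValueError("job must fit inside the forecast")
    # Sliding-window sum: add the hour entering the window, drop the one leaving.
    window = sum(forecast[:job_hours])
    best_total, best_start = window, 0
    for start in range(1, len(forecast) - job_hours + 1):
        window += forecast[start + job_hours - 1] - forecast[start - 1]
        if window < best_total:
            best_total, best_start = window, start
    return best_start
```

Only deferrable stages would go through a function like this; pull-request feedback skips it and runs straight away.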

Measuring the footprint

You cannot improve what you do not measure. The useful signals are compute-minutes consumed per pipeline, the cloud cost attached to them, and an estimate of energy or emissions derived from that compute. Track them as test metrics alongside the usual pass-rate and flake numbers, so a regression in efficiency is as visible as a regression in quality. The trend over time — minutes per merged pull request, cost per release — matters more than any single figure.
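
The arithmetic behind those signals is simple enough to sketch. The class below is a hypothetical helper, and every coefficient in it (billing rate, average runner power draw, grid intensity) is an assumption you would replace with figures from your own provider and grid data. The energy estimate in particular is a coarse proxy, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class PipelinePeriod:
    compute_minutes: float   # total runner minutes in the period
    merged_prs: int          # pull requests merged in the same period
    cost_per_minute: float   # assumed billing rate per runner minute
    runner_kw: float         # assumed average power draw per runner, in kW
    grid_g_per_kwh: float    # assumed grid intensity, gCO2e per kWh

    def minutes_per_merged_pr(self) -> float:
        if self.merged_prs == 0:
            raise ValueError("no merged pull requests in this period")
        return self.compute_minutes / self.merged_prs

    def cost(self) -> float:
        return self.compute_minutes * self.cost_per_minute

    def energy_kwh(self) -> float:
        # minutes -> hours, then hours * kW = kWh
        return (self.compute_minutes / 60) * self.runner_kw

    def emissions_kg(self) -> float:
        # kWh * g/kWh = grams; divide by 1000 for kilograms
        return self.energy_kwh() * self.grid_g_per_kwh / 1000
```

Computing the same figures for each week or release gives the trend line the paragraph above recommends tracking.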

Real-world NZ Example: a Wellington SaaS team trims the nightly build

A Wellington SaaS team audited a CI pipeline that ran full regression on every push and a heavy nightly suite against all branches. Their changes:

Test impact analysis: per-commit runs execute only the tests affected by the changed files, not all 9,000.
Pruned schedules: nightly full builds skip branches with no commits since the last green run.
Cache reuse + ephemeral envs: dependency caches are reused and test environments torn down straight after each run.
Carbon-aware timing: the remaining heavy nightly suite is scheduled for a lower grid-intensity window.

Result: cloud spend down by more than a third, median pull request feedback from 40 minutes to 8, and a measurable drop in pipeline energy — with the regression safety net for the main line unchanged.
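
The pruned-schedule rule is the easiest of these to express in code. A minimal sketch, assuming the scheduler can look up each branch's current head commit and the commit of its last green run (both dictionaries here are hypothetical inputs):

```python
from typing import Dict, List, Optional


def branches_to_build(head_commits: Dict[str, str],
                      last_green: Dict[str, Optional[str]]) -> List[str]:
    """Return branches whose head differs from the commit of their last green run.

    A branch with no recorded green run is always built.
    """
    return sorted(
        branch for branch, head in head_commits.items()
        if last_green.get(branch) != head
    )
```

A branch whose head still matches its last green commit is skipped, so the nightly job spends compute only where something has changed.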

Worked example

A team classifies its test stages by how often each truly needs to run. The principle: match the frequency and footprint of a stage to the information it gives back.

Right-sizing a pipeline — before and after

Stage	Before (wasteful)	After (sustainable)
Unit + affected tests	Full suite, every push	Only tests impacted by changed files, every push
Full regression	Every push, every branch	On merge to main and on release only
Dependency build	Rebuilt from scratch each run	Restored from cache; rebuilt only on lockfile change
Test environments	Long-lived, idle overnight	Ephemeral — created on demand, torn down after
Heavy nightly suite	All branches, fixed 2am slot	Changed branches only, low grid-intensity window

The key insight: none of these changes weakens the safety net for the code that ships. Full regression still guards the main line and every release. What disappears is the redundant work — re-running unchanged code, rebuilding cached artefacts, idling environments — which is exactly the work that costs money and emits carbon without producing new information.
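
The "after" column of the table can be read as a small policy: given what triggered the pipeline, which stages should run? The sketch below encodes that policy under assumed names; the `Trigger` values and stage labels are illustrative and do not correspond to any specific CI system's configuration.

```python
from enum import Enum
from typing import List


class Trigger(Enum):
    PUSH = "push"
    MERGE_TO_MAIN = "merge_to_main"
    RELEASE = "release"
    NIGHTLY = "nightly"


def stages_for(trigger: Trigger, lockfile_changed: bool,
               branch_changed_since_green: bool) -> List[str]:
    """Return the stages to run, following the 'after' column of the table."""
    # Nightly runs against an unchanged branch produce no new information.
    if trigger is Trigger.NIGHTLY and not branch_changed_since_green:
        return []
    # Dependencies are rebuilt only when the lockfile changed.
    stages = ["rebuild_dependencies" if lockfile_changed else "restore_cache"]
    if trigger is Trigger.PUSH:
        stages.append("affected_tests")
    elif trigger in (Trigger.MERGE_TO_MAIN, Trigger.RELEASE):
        stages.append("full_regression")
    else:
        stages.append("heavy_nightly_suite")
    return stages
```

Writing the policy down this way also documents why each stage runs when it does, which helps stop a wasteful pattern creeping back in.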

Common mistakes

✗ Cutting tests instead of cutting waste

Sustainable testing removes redundant runs, not coverage. Deleting valuable tests to save minutes trades a one-off saving for an open-ended risk. Trim the re-running of unchanged code, not the safety net itself.

✗ Trusting test impact analysis blind

Impact mapping can miss indirect dependencies — config, shared fixtures, generated code. Keep a periodic full regression on the main line as a backstop so a gap in the mapping cannot quietly ship a regression.

✗ Over-parallelising

Throwing more runners at a suite cuts wall-clock time but can raise total compute and cost if runners finish early and sit idle. Size parallelism to the shard work, and measure compute-minutes, not just elapsed time.

✗ Treating every-commit full regression as “safe by default”

Running everything on every commit feels safe but mostly re-confirms unchanged code. It is expensive, slows feedback, and rarely catches more than affected-test selection plus a gated full run on merge.

✗ Optimising what you do not measure

Without compute-minute, cost, and energy metrics you are guessing. Track them as first-class test metrics so an efficiency regression is as visible as a quality one, and so savings can be proven rather than claimed.

4 Industry Reality

🏭 What you actually encounter on the job

Test impact analysis is rarely plug-and-play. Most tools require weeks of configuration to correctly map test-to-code coverage, especially in polyglot codebases or where tests call shared infrastructure like databases. NZ teams running Rails or .NET monoliths often spend a sprint just getting the dependency graph accurate enough to trust.
The "carbon-aware" conversation is still new here. Most NZ engineering teams have never calculated the carbon cost of their pipeline. You may be the first person to raise it. Expect scepticism, but lead with cost savings — those land immediately, and the emissions reduction is a bonus argument for ESG-reporting clients.
Full regression "just in case" is a cultural habit, not a technical requirement. Senior engineers who built the suite often resist trimming it because they remember the last regression it caught. The sustainable approach doesn't remove that backstop — you move it to merge time — but you'll need to make that distinction clearly and repeatedly.
Flaky tests make impact analysis less trusted. If 5% of your suite fails randomly, people distrust selective runs even more than full runs. Before sustainable testing really sticks, the team usually needs to stabilise the flaky tests, or the targeted run still produces noise.
Cloud cost visibility is often siloed. In many organisations the CI bill goes to a central platform team, not the feature team generating the waste. Sustainable testing advocacy often starts with getting that cost data in front of the teams whose pipelines produce it — which requires cross-team conversations before any pipeline change happens.
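
To put a rough number on the flaky-test point above, the sketch below estimates how often a run needs a retry and what that does to expected compute. It assumes flaky failures are independent, that each flaky test fails at the same rate, and that a failed run is retried once in full; real suites will deviate from all three, so treat the output as an order-of-magnitude guide.

```python
def retry_probability(flaky_tests: int, fail_rate: float) -> float:
    """Chance that at least one flaky test fails in a run (independent failures)."""
    return 1 - (1 - fail_rate) ** flaky_tests


def expected_minutes_per_run(run_minutes: float, flaky_tests: int,
                             fail_rate: float) -> float:
    """Expected compute minutes when a spurious failure triggers one full retry."""
    return run_minutes * (1 + retry_probability(flaky_tests, fail_rate))
```

Under these assumptions, ten flaky tests that each fail 1% of the time push roughly one run in ten into a retry, which is why stabilising them tends to pay off before any selection logic is added.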

5 When to Use It — and When Not To

⚡ Decision guide

✓ Use it when

Your CI pipeline takes more than 15 minutes on a routine commit — the feedback loop is already hurting developer velocity.
Your cloud CI bill is visible and growing — you have a concrete cost to reduce that can be measured before and after.
You run the same full suite on every branch push, including branches nobody has touched in days — the redundant-run problem is real and easy to prove.
Your test suite is large enough (2,000+ tests) that a targeted selection can meaningfully reduce per-commit runtime while still covering the change.
Your team is under pressure to meet sustainability or ESG reporting targets — carbon-aware pipelines become a credible contribution with measurable numbers.

✗ Skip it when

Your suite runs in under 5 minutes total: the overhead of configuring impact analysis exceeds the saving. Spend the effort on fixing flaky tests instead.
Your codebase has poor modularity, so almost every change touches shared code and triggers the entire suite anyway — impact analysis won't help much until the architecture improves.
You don't yet have reliable test coverage — sustainable optimisation on top of a broken or sparse suite just makes the gaps harder to see.
The pipeline is already optimised (selective runs, cache reuse, ephemeral envs in place) — you're at diminishing returns; incremental gains aren't worth the configuration effort.
Compliance requires a full regression on every build (e.g., certain medical device or financial software standards) — you can still optimise cost and carbon, but selective runs may be off the table entirely.

Context guide

How the right level of sustainable testing effort changes based on project context.

Context	Priority	Why
Benefits NZ or CoverNZ benefits platform — large .NET suite on cloud CI, 50+ deployments/month	Essential	High commit cadence means redundant full-regression runs accumulate fast. Test impact analysis and cache reuse can cut per-commit cloud cost by 60%+ without touching coverage of the payment and eligibility logic.
TransitNZ tolling or licensing system (high compliance, mandated audit trail)	High	Full regression must still gate every release; sustainability wins come from trimming per-commit and branch-level runs. Carbon-aware scheduling is straightforward for overnight jobs because urgency requirements don’t apply to them.
Harbour Bank or Pacific Bank — core banking feature team with a 3,000-test Playwright suite	High	Cloud CI bill is large and visible to the CTO. Risk-based test selection combined with ephemeral environments typically halves monthly spend. Financial sector compliance still requires a full regression gate on every production release.
Spark or Pacific Air — internal microservices team, 12 services, mixed tech stack	Medium	Polyglot codebases make TIA harder to configure correctly — the dependency map spans language boundaries. Start with cache reuse and ephemeral environments for immediate wins; defer impact analysis until the suite is modular enough for the mapping to be reliable.
HealthNZ system with Privacy Act 2020 data — regulated, full regression mandated per build	Low	Selective test runs may conflict with audit requirements. Sustainability effort is better directed at carbon-aware scheduling of the mandated suite and at fixing flaky tests that inflate retry costs — not at skipping tests that the compliance regime requires.
Small NZ startup — 200-test suite, 5-minute CI run, no cloud cost pressure	Low	The overhead of configuring TIA exceeds the saving at this scale. Fix any flaky tests, set up cache reuse, and revisit sustainable testing when the suite exceeds 1,500 tests or the CI bill becomes visible on the P&L.

Trade-offs

What you gain and what you give up when you choose sustainable testing.

Advantage	Disadvantage	Use instead when…
Faster PR feedback — median time from push to green drops from 40+ minutes to under 10 when TIA is combined with cache reuse, directly accelerating developer velocity.	Impact analysis configuration takes weeks to tune for a large codebase and can miss indirect dependencies (shared config, generated artefacts, database fixtures) if the dependency graph is incomplete.	The codebase is a tightly coupled monolith (e.g. a legacy Revenue NZ or TransitNZ system) where nearly every change touches shared utilities — TIA expands to most of the suite anyway, so the saving is minimal.
Measurable cost and carbon reduction with no coverage trade-off — cloud spend typically falls 30–60% by removing redundant runs, and the safety net for shipped code is unchanged because full regression still gates every release.	Cultural resistance is real — senior engineers who built the suite may distrust selective runs and insist on full regression everywhere. Winning that argument requires proof data (escaped defect rates, compute-minute trends), not just theory.	A compliance mandate (e.g. HealthNZ audit requirements, certain financial software standards) requires a full regression on every build — spend the effort on carbon-aware scheduling of the mandated runs instead.
Becomes a credible ESG contribution: organisations subject to New Zealand’s mandatory climate-related disclosures (large listed issuers, banks, insurers, and investment managers) can report a quantified reduction in Scope 3 digital emissions from pipeline optimisation.	Optimising a flaky suite with TIA amplifies existing gaps: if 8% of tests fail randomly, selective runs make those gaps harder to spot because fewer tests run per commit and noise is harder to attribute.	The suite has significant flakiness or sparse coverage: stabilise and fill coverage first; applying sustainable testing on top of an unreliable suite just makes the quality problems less visible.
Ephemeral environments eliminate inter-run pollution and remove one of the most common sources of flaky failures, improving suite reliability as a side effect of cutting costs.	Carbon-aware scheduling adds orchestration complexity: pipelines need to query grid intensity APIs or use tooling such as the Green Software Foundation’s Carbon Aware SDK, and non-NZ cloud regions (AWS Sydney, Azure Australia East) require separate grid data.	The pipeline is already well-optimised (selective runs, cache reuse, ephemeral environments in place), so further incremental gains may not justify the additional configuration and maintenance overhead.

Enterprise reality

How Sustainable Testing changes at 200–300-developer scale in NZ enterprise — where the waste is structural, the compliance obligations are non-negotiable, and getting it wrong is expensive in ways a small team never encounters.

At scale, test impact analysis and caching move from nice-to-have to essential infrastructure — Revenue NZ's cloud CI footprint across 30+ squads means even a 10% reduction in redundant runs saves hundreds of thousands of dollars annually; in small teams the same change saves a few hundred and nobody notices.
Governance and audit obligations constrain what you can skip: the Privacy Act 2020 and NZISM (NZ Information Security Manual) require documented evidence that security-relevant controls were tested before deployment — selective runs must still produce a full regression audit trail for any release touching personal data or security boundaries, regardless of what impact analysis says is "unaffected."
Tooling decisions compound at volume. At 10+ squads you cannot have each team choose its own CI runner, caching strategy, or environment provisioning tool; organisations like TechServNZ and TeleNZ standardise on a shared platform (GitHub Actions with self-hosted runners, Terraform-managed ephemeral environments) so sustainable-testing gains replicate across every team rather than being re-invented squad by squad.
Cross-team coordination at 10+ squad scale means a single team's decision to add 500 expensive end-to-end tests to a shared regression suite affects every other squad's pipeline time and cloud bill; without a central test-architecture guild and a clear policy on what belongs at each layer, enterprise suites degrade into thousands of redundant end-to-end tests that no individual team owns or can safely delete.

◆ What I would do

Professional judgment — when to reach for sustainable testing, when to skip it, and what to watch for.

If…

I was joining a Benefits NZ benefit-payment portal team whose CI suite runs the full 8,000 tests on every branch push, the cloud bill is $6,000 NZD/month and growing, and median PR feedback takes 45 minutes

I would…

Start with a one-week measurement sprint: export compute-minutes broken down by branch type, time-of-day, and stage. Fix any flaky tests first (each flaky test multiplies retry cost across thousands of runs). Then introduce test impact analysis for per-commit runs, move full regression to merge-to-main only, and add lockfile-keyed caching. I would not touch parallelism until after TIA is running and the effective test count per commit has settled — otherwise I am sizing runners to a workload that is about to change dramatically.

If…

I was working on a TransitNZ road-tolling system where a compliance requirement mandated a full regression on every build and the platform lead asked me to “make it greener”

I would…

Accept that selective test runs are off the table and focus on what is achievable within the mandate. Move the full regression to a lower grid-intensity window (NZ grid intensity varies meaningfully between 2am hydro-surplus and 6pm demand-peak). Switch to lockfile-keyed dependency caches so the dependency install stage disappears on most runs, and tear down test environments immediately after each run. Finally, confirm whether the cloud region is NZ-based or Australian: if it is running in AWS Sydney, switching to Catalyst Cloud (NZ-based, cleaner grid) for the mandated jobs is a concrete emissions reduction with no compliance risk.

If…

I was the lead tester on a FamiliesNZ case-management system rebuild (a new React/Node app, 18-month project, 4 developers, team just starting to write automated tests) and was asked to set up CI from scratch

I would…

Build sustainability in from day one rather than retrofitting it: cache restoration before every run, ephemeral environments as the default, full regression on merge-to-main only, and a dashboard tracking compute-minutes per merged PR from the first sprint. I would hold off on test impact analysis until the suite reaches 1,000 tests — at smaller scale the configuration cost outweighs the saving — but having the metrics baseline from the start means I can demonstrate the ROI of TIA to stakeholders the moment it makes sense to introduce it.

The bottom line: Sustainable testing is almost never about removing tests. It is about no longer scheduling tests against code that has not changed. The organisations that get it wrong conflate the two and discover six months later that their efficiency win came with an invisible coverage debt. Run each test exactly as often as it can tell you something new. Not once more.

6 Best Practices

✓ What experienced testers do

Measure before you cut. Run a week of pipeline analytics — compute-minutes, cost, where time goes — before making any change. Gut feelings about what's slow are often wrong, and you need a baseline to prove savings.
Treat full regression as a gate, not a default. Reserve the complete suite for merges to main and release cuts. This is the contract that keeps the safety net intact while removing waste everywhere else.
Keep a full-regression backstop behind impact analysis. Always run a full regression on a schedule (weekly or nightly against main only), even when using impact analysis per-commit. This catches indirect dependencies the mapping misses, such as config drift, shared fixtures, and generated code.
Cache dependencies aggressively, invalidate precisely. Lock cache keys to the lockfile hash (package-lock.json, Gemfile.lock, go.sum). Rebuild only when the lockfile changes. This alone can shave 3–8 minutes off most Node or Ruby pipelines.
Tear down environments immediately. Long-lived shared test environments are one of the most common hidden costs. Ephemeral environments — created per run, destroyed on completion — cost less and remove inter-run pollution at the same time.
Separate urgent from deferrable work explicitly. Mark pipeline stages as immediate (PR feedback) or deferred (heavy nightly suites). Only the deferred stages are candidates for carbon-aware scheduling and low-intensity windows.
Track efficiency metrics alongside quality metrics. Compute-minutes per merged PR, cost per release, and energy per run should appear in the same dashboard as pass rate and flake rate. An efficiency regression is as worth catching as a quality regression.
Don't over-parallelise. Size runner count to the shard work. Over-provisioning runners that idle wastes total compute even when wall-clock time drops. The right question is compute-minutes consumed, not elapsed seconds.
Communicate the approach as a developer experience win first. "Faster PR feedback" lands better than "green testing" with most engineering teams. Lead with the velocity improvement; the cost and carbon savings follow.
Document what each pipeline stage covers and why it runs when it does. Shared understanding prevents teammates from adding new tests to the wrong stage, or restoring a wasteful pattern because they don't know why it was changed.
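
The "invalidate precisely" practice above comes down to deriving the cache key from the lockfile contents. A minimal sketch, assuming the pipeline can read the lockfile as bytes and keeps a set of keys it has already stored; the function names and key format are illustrative, not any CI provider's API:

```python
import hashlib
from typing import Set


def cache_key(prefix: str, lockfile_contents: bytes) -> str:
    """Build a cache key that changes only when the lockfile contents change."""
    digest = hashlib.sha256(lockfile_contents).hexdigest()
    return f"{prefix}-{digest[:16]}"


def should_rebuild(current_key: str, stored_keys: Set[str]) -> bool:
    """Rebuild dependencies only when no stored cache entry matches the key."""
    return current_key not in stored_keys
```

Because the key depends only on the lockfile, edits to source files leave it unchanged and the dependency install stage can be restored from cache rather than rebuilt.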

7 Common Misconceptions

❌ Myth: Sustainable testing means running fewer tests and accepting more risk.

Reality: Sustainable testing removes redundant runs of the same tests against unchanged code — not the tests themselves. Full regression still guards every release; what disappears is re-running it on every commit to every branch. Coverage stays; the redundant scheduling of it is what goes.

❌ Myth: Carbon-aware pipelines are an academic concern — NZ's grid is mostly renewable anyway.

Reality: New Zealand's grid is largely renewable but carbon intensity still varies significantly through the day and by island. Beyond that, cloud CI often runs in overseas regions with higher carbon intensity. Shifting deferrable work to cleaner windows or regions has a real, measurable impact — and the NZ government's emissions reporting requirements mean organisations are increasingly expected to account for Scope 3 digital emissions.

❌ Myth: Once you set up test impact analysis, you can skip full regression entirely.

Reality: Impact analysis maps which tests exercise which code, but it can miss indirect dependencies — shared config, database fixtures, environment variables, generated artefacts. A periodic full regression (at minimum on every merge to main) is a non-negotiable backstop. Teams that drop it entirely typically discover a gap in the mapping only when a bug ships to production.

8 Now You Try

Three graded exercises: spot, fix, then build. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot: find the waste

A pipeline runs the full 9,000-test suite on every push to every branch, rebuilds all dependencies from scratch each run, keeps a shared test environment up 24/7, and runs a heavy nightly suite against all branches at a fixed time. Identify the wasteful practice in each of the four items and name the sustainable alternative for each.

Model answer

1. Full suite every push, every branch — Waste: re-runs tests against unchanged code that give no new information. Fix: test impact analysis — run only the tests affected by the changed files per push, and reserve full regression for merges to main and releases.

2. Rebuild all dependencies each run — Waste: rebuilding cached artefacts is some of the most wasteful compute in a pipeline. Fix: reuse the dependency / build / container-layer cache; rebuild only when the lockfile actually changes.

3. Shared test environment up 24/7 — Waste: environments idle overnight still consume compute and cost. Fix: ephemeral environments created on demand and torn down immediately after the run.

4. Heavy nightly suite, all branches, fixed time — Waste: runs against branches nobody changed, at a time that ignores grid carbon intensity. Fix: run only against changed branches, and schedule the heavy work for a lower grid-intensity (carbon-aware) window.

The common thread: remove runs that produce no new information, then time and size the rest efficiently.

🔧 Exercise 2 of 3 — Fix: repair an over-aggressive plan

A team read about green testing and proposed the plan below to cut costs. It saves money but weakens the safety net and misses cheaper wins. Rewrite it so it cuts waste without reducing real coverage.

Flawed plan:
1. Delete the 2,000 slowest tests to save minutes.
2. Stop running regression entirely, including on releases.
3. Add 50 parallel runners so any suite finishes fast.
4. We’ll know it worked because the bill looks lower.

Rewrite as a sound sustainable-testing plan:

Model answer

Sound sustainable-testing plan:

1. Keep the tests; apply test impact analysis so each push runs only the tests affected by the changed files. Slow tests that still add value are kept but run less often, not deleted.
2. Keep full regression as a gate on merges to main and on every release; trim it from every-commit and idle-branch runs instead.
3. Size parallelism to the shard work and reuse caches; over-provisioning 50 runners that idle raises total compute and cost even if wall-clock time drops.
4. Measure compute-minutes, cost, and estimated energy per merged pull request and per release — a lower bill alone could just mean coverage was cut.

What was wrong with the original:
- It cut coverage (deleting tests, dropping release regression) rather than cutting redundant runs — that trades a saving for open-ended risk.
- Over-parallelising can increase total compute and cost, the opposite of the goal.
- "The bill looks lower" is not a metric; you need compute, cost, and energy figures to prove waste fell without coverage falling.

🏗️ Exercise 3 of 3 — Build: design a sustainable pipeline

You join a NZ SaaS team whose CI runs the full suite on every push, rebuilds dependencies each time, and runs a heavy nightly suite against every branch. Design a sustainable pipeline: state what runs per commit, what runs on merge, what runs on a schedule, how you reuse compute, how you make it carbon-aware, and which three metrics you would track to prove it worked.

Model answer

Sustainable pipeline design:

Per commit: run only the tests affected by the changed files (test impact analysis), plus a fast smoke set. Fast feedback, minimal compute.

On merge to main / release: run the full regression suite as a gate — this is the safety net, and it runs when it matters, not on every branch push.

On a schedule: a broader nightly or weekly suite, but only against branches with commits since the last green run; skip idle branches entirely.

Compute reuse: restore dependency, build, and container-layer caches; rebuild only when the lockfile changes. Use ephemeral environments created on demand and torn down after each run.

Carbon-aware choice: schedule the heavy non-urgent nightly suite for a lower grid-intensity window (or a lower-carbon region); keep urgent pull-request feedback immediate.

Three metrics to track: (1) compute-minutes per merged pull request, (2) cloud cost per release, (3) estimated energy or emissions per pipeline run — trended over time alongside pass rate and flake rate.

A senior would add: size parallelism to the shard work to avoid idle runners, and keep a periodic full regression as a backstop in case impact analysis misses an indirect dependency.
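The scheduling rule in this design (run the nightly suite only on branches with commits since the last green run) could be sketched as below. This is a minimal illustration under assumptions: the `Branch` record, its field names, and the sample SHAs are hypothetical, and a real pipeline would read the equivalent data from its CI provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Branch:
    """Hypothetical record of a branch and its last fully green run."""
    name: str
    head_sha: str
    last_green_sha: Optional[str]  # None if the branch has never been green


def branches_needing_nightly(branches):
    """Return names of branches whose head differs from the last green run.

    A branch whose head already passed has nothing new to tell us,
    so the scheduled suite skips it.
    """
    return [b.name for b in branches if b.head_sha != b.last_green_sha]


# Illustrative data only.
BRANCHES = [
    Branch("main", "a1f3", "a1f3"),            # unchanged since last green: skip
    Branch("feature/payments", "c7d2", "b990"),  # new commits: run
    Branch("spike/old-idea", "e001", "e001"),  # idle: skip
    Branch("feature/new", "f4f4", None),       # never green: run
]

to_run = branches_needing_nightly(BRANCHES)
```

Comparing commit SHAs rather than timestamps keeps the check independent of when the scheduler happens to fire.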

Why teams fail here

Conflating "fewer tests" with "fewer test runs" — deleting valuable tests instead of trimming redundant scheduling of them, trading permanent coverage loss for a temporary cost saving.
Dropping the full-regression backstop after enabling test impact analysis — assuming the dependency map is complete when it almost never captures shared fixtures, config, or generated artefacts.
Over-parallelising without measuring compute-minutes — adding runners cuts wall-clock time but raises total compute cost when shards finish early and runners sit idle, the opposite of the goal.
Optimising before measuring — making pipeline changes without a baseline of compute-minutes, cost, and energy data, so savings (and regressions) can't be proven and leadership can't see the value.

Key takeaway

A sustainable test suite doesn't run fewer tests — it runs each test exactly as often as it can produce new information, and not once more.

How this has changed

The field moved. Here is how Sustainable Testing evolved from its origins to current practice.

Pre-2010

Test suites grow without governance. Teams add tests freely, rarely remove or refactor them. Slow, flaky, or duplicate tests accumulate. A "test debt" pattern emerges — suites become a maintenance burden that slows delivery without adding proportional value.

2010s

Continuous delivery pressure makes slow and flaky tests a delivery blocker rather than an inconvenience. Teams begin treating tests as production code — requiring review, refactoring, and deletion when no longer valuable. The "testing ice cream cone" anti-pattern is named.

2015

The testing pyramid (unit → integration → end-to-end) becomes the standard architectural guidance for sustainable test portfolios. Teams audit their suites and shift tests down the pyramid to improve speed and reliability.

2019

Test flakiness receives serious research attention. Google publishes data showing flaky tests are the leading cause of CI pipeline degradation. Flakiness quarantine, retry logic, and flakiness detection become standard CI practices.

Now

AI tools can identify flaky tests by analysing failure patterns across many runs, suggest test refactoring, and flag tests that duplicate coverage already provided by faster tests. The conversation has matured from "how do we get coverage?" to "how do we keep the test suite fast, reliable, and maintainable as the codebase scales?"

Self-Check

Interview Questions

What NZ hiring managers ask about Sustainable Testing — and what strong answers look like.

What makes a test flaky, and how do you diagnose and fix flakiness?

Strong answer: Flakiness is non-deterministic test failure — the test passes and fails without code changes. Common causes: shared mutable state between tests (test B depends on side effects from test A), time-dependent logic (tests that sleep for a fixed duration or assert on timestamps), asynchronous operations that complete at varying speeds (waiting for a UI element that sometimes takes longer), and environment-dependent behaviour (network calls to external services). To diagnose: run the test 100 times in isolation and 100 times in the full suite — if it fails more in the suite, the cause is test interaction. If it fails in isolation, it is intrinsically flaky. Fix by making tests independent, using explicit waits instead of sleeps, and mocking external dependencies.

Junior/Mid
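The "explicit waits instead of sleeps" fix in the answer above can be illustrated with a small polling helper. This is a framework-neutral sketch, not any test library's API: `wait_until` and its parameters are hypothetical names, and most UI test frameworks ship their own equivalent that should be preferred.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Unlike a fixed sleep, this returns as soon as the condition holds,
    and fails deterministically once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage sketch: instead of `time.sleep(2); assert job.done`, write
#     assert wait_until(lambda: job.done, timeout=2.0)
# where `job` is whatever object the test is observing.
```

A fast environment pays only for the time the operation actually takes, and a slow one gets the full timeout instead of a spurious failure.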

The team wants to keep all tests, no matter how old or slow, because "they might catch something." How do you make the case for test deletion?

Strong answer: I acknowledge the fear: deleting a test that catches a future regression is a real cost. But the actual cost of keeping all tests is a slow, unreliable suite that teams begin to distrust and skip. I propose a concrete standard: any test that (1) has not caught a regression in the last 12 months AND (2) provides coverage that is already provided by faster tests at a lower level is a candidate for deletion. I run the deletion past the team for review — they may know of a risk I do not. I also track the coverage impact. For high-risk tests, I retire rather than delete: mark them as inactive but keep them in the repo, reviewable if a related bug emerges.

Mid/Senior

Q1: What is the core idea of sustainable testing in one sentence?

Remove redundant test runs — the ones that re-confirm unchanged code or rebuild cached work — so the same confidence in quality is delivered for less time, cost, and energy. It cuts waste, not coverage.

Q2: How does test impact analysis reduce footprint without losing coverage?

It maps which tests exercise which code, so a commit runs only the tests affected by the changed files and their dependents instead of the whole suite. Coverage of the change is preserved; what disappears is the re-running of tests against code nobody touched. A periodic full regression on the main line backstops any gaps in the mapping.
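A minimal sketch of that selection logic follows, assuming a hand-written coverage map. The file paths, `COVERAGE_MAP`, and the list of full-run triggers are all hypothetical; real tools derive the map from per-test coverage data or a build graph. The sketch fails safe: anything the map cannot explain runs the whole suite.

```python
# Hypothetical map from source file to the tests that exercise it.
COVERAGE_MAP = {
    "billing/invoice.py": {"tests/test_invoice.py", "tests/test_checkout.py"},
    "billing/tax.py": {"tests/test_tax.py", "tests/test_invoice.py"},
    "accounts/login.py": {"tests/test_login.py"},
}
ALL_TESTS = {
    "tests/test_invoice.py",
    "tests/test_checkout.py",
    "tests/test_tax.py",
    "tests/test_login.py",
}

# Changes a file-level map cannot see through: shared fixtures,
# dependency pins, environment config. These force a full run.
FULL_RUN_FILES = {"conftest.py", "requirements.txt"}
FULL_RUN_DIRS = ("config/",)


def forces_full_run(path):
    """True if a change to `path` could affect tests the map does not list."""
    filename = path.rsplit("/", 1)[-1]
    return filename in FULL_RUN_FILES or path.startswith(FULL_RUN_DIRS)


def select_tests(changed_files, coverage_map=COVERAGE_MAP, all_tests=ALL_TESTS):
    """Return the set of tests to run for a list of changed file paths."""
    selected = set()
    for path in changed_files:
        if forces_full_run(path):
            return set(all_tests)
        if path in all_tests:
            selected.add(path)              # a changed test runs itself
        elif path in coverage_map:
            selected |= coverage_map[path]  # tests that exercise this file
        else:
            return set(all_tests)           # unmapped file: fail safe
    return selected
```

Falling back to the full suite for unmapped files trades some savings for safety, which matches the role of the periodic full regression as a backstop.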

Q3: Why can adding more parallel runners sometimes make things worse?

Parallelism cuts wall-clock time, but over-provisioning spins up runners that finish their shard early and sit idle, raising total compute-minutes and cost. The goal is lower total footprint, so parallelism should be sized to the shard work and judged on compute-minutes, not just elapsed time.
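One way to see the trade-off is a toy model of wall-clock time against billed compute. The sketch below makes assumptions that may not match a given provider: tests shard evenly, every runner pays a fixed start-up overhead (`startup_minutes`, a hypothetical figure), and billing rounds each runner up to a whole minute.

```python
import math


def pipeline_footprint(test_minutes, runners, startup_minutes=2.0):
    """Return (wall_clock_minutes, billed_compute_minutes) for one run.

    Assumes an even split of `test_minutes` across `runners`, a fixed
    per-runner start-up cost, and per-runner billing rounded up to
    the next whole minute.
    """
    per_runner = startup_minutes + test_minutes / runners
    wall_clock = per_runner
    billed = runners * math.ceil(per_runner)
    return wall_clock, billed


# 120 minutes of test work spread over different fleet sizes.
FOOTPRINTS = {n: pipeline_footprint(120, n) for n in (4, 12, 50)}

for n, (wall, billed) in FOOTPRINTS.items():
    print(f"{n:>2} runners: {wall:5.1f} min wall-clock, {billed} billed minutes")
```

Under these assumptions wall-clock time keeps falling as runners are added while billed minutes keep rising, which is why the answer judges parallelism on compute-minutes rather than elapsed time.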

Q4: What does a carbon-aware pipeline actually change?

It shifts non-urgent work — nightly suites, heavy data jobs — to times or regions where the electricity grid is cleaner, trimming emissions for the same work. Urgent pull-request feedback still runs immediately; carbon-awareness applies only to work that can safely wait.
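As a rough sketch of the scheduling decision, the function below picks the start of the lowest-intensity window from an hourly forecast. The forecast values are made up for illustration, and `greenest_window` is a hypothetical helper; a real pipeline would fetch the forecast from a grid-data source for the region its runners actually use.

```python
def greenest_window(intensity, duration):
    """Return the start index of the contiguous window with the lowest total.

    `intensity` is a list of forecast grid carbon intensities, one per
    hour (for example in gCO2e/kWh); `duration` is the job length in hours.
    """
    if duration <= 0 or duration > len(intensity):
        raise ValueError("duration must be between 1 and len(intensity)")
    starts = range(len(intensity) - duration + 1)
    return min(starts, key=lambda s: sum(intensity[s:s + duration]))


# Illustrative forecast for eight consecutive overnight hours.
FORECAST = [120, 110, 80, 60, 65, 90, 140, 150]

# A two-hour nightly suite would start at the hour with this index.
best_start = greenest_window(FORECAST, duration=2)
```

Only deferrable work should go through a chooser like this; pull-request feedback bypasses it and runs immediately.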

Q5: Which metrics prove a sustainability change worked, and why isn’t a lower bill enough on its own?

Track compute-minutes per merged pull request, cloud cost per release, and estimated energy or emissions per run — trended over time. A lower bill alone could simply mean coverage was cut; pairing cost with compute and energy metrics, alongside pass and flake rates, shows waste fell without the safety net weakening.

Q: Your team is building the new Benefits NZ benefit-payment portal and the CI suite takes 35 minutes per commit on 12 parallel runners. The platform lead wants to cut cloud costs by 40% before the next budget cycle. What sustainable-testing changes do you propose first, and which one would you leave until last?

A: Start with test impact analysis so each commit runs only the tests touching the changed files — this alone typically cuts per-commit runtime by 60-80% on a suite of this size without any coverage trade-off. Then reuse the dependency cache (lockfile-keyed) and move full regression to merge-to-main only. Leave parallelism tuning until last: once the effective test count per commit drops dramatically, the current 12 runners may already be over-provisioned, so measuring compute-minutes after TIA is applied tells you exactly how many runners you actually need rather than guessing.
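The "lockfile-keyed" cache mentioned in that answer can be sketched as a key derived from the lockfile's contents, so the cache is reused until a dependency actually changes. The function name, the key layout, and the sample lockfile bytes are assumptions for illustration; CI providers offer their own cache-key mechanisms.

```python
import hashlib


def dependency_cache_key(lockfile_bytes, runner_os, prefix="deps"):
    """Build a cache key that changes only when the lockfile content changes.

    Including the runner OS keeps caches for different platforms apart.
    """
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{runner_os}-{digest}"


# Illustrative lockfile contents.
lock_v1 = b"requests==2.31.0\nurllib3==2.0.7\n"
lock_v2 = b"requests==2.32.0\nurllib3==2.0.7\n"

key_a = dependency_cache_key(lock_v1, "linux")
key_b = dependency_cache_key(lock_v1, "linux")  # same lockfile: cache hit
key_c = dependency_cache_key(lock_v2, "linux")  # changed pin: rebuild
```

Hashing content rather than using a timestamp or commit SHA means unrelated commits never invalidate the cache.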

Q: What is the key difference between sustainable testing and risk-based testing, given that both result in running fewer tests on any given commit?

A: Risk-based testing selects tests based on the business impact and likelihood of failure in different areas — it is a coverage strategy that answers "which areas deserve the most testing effort?" Sustainable testing selects tests based on what actually changed in the code — it is an efficiency strategy that answers "which tests can tell us anything new right now?" The two are complementary: risk-based testing shapes your suite design and release gating; sustainable testing shapes how often and on what trigger each part of that suite runs.

Q: A developer says "We shouldn't bother with carbon-aware scheduling — New Zealand runs mostly on renewables so our pipeline is already green." What is wrong with this reasoning and how do you respond?

A: There are two flaws. First, NZ's grid carbon intensity still varies significantly through the day and between islands, so scheduling a heavy nightly suite at a cleaner window is a real, measurable gain even domestically. Second, cloud CI often executes in overseas data centres — AWS Sydney, Azure Australia East — whose grids are considerably more carbon-intensive than NZ's. The response: "Our pipeline isn't running on the Manapouri turbines — it runs in Sydney. And even on NZ's own grid, intensity at 2am hydro-peak differs from 6pm demand-peak. Let's check where our runner region is and what the grid looks like before we assume it's free."

Q: An interviewer asks you to describe a situation where you would advise a team NOT to implement test impact analysis, even though their CI pipeline is slow. What scenarios disqualify it?

A: Three situations make TIA a bad fit. First, if the codebase has poor modularity — for example a large Revenue NZ or TransitNZ legacy monolith where nearly every change touches shared utilities — the impact graph expands to most of the suite anyway and the configuration effort exceeds the saving. Second, if the existing test suite is unreliable or sparse, applying TIA just makes the coverage gaps harder to spot; stabilise flaky tests and fill coverage first. Third, if a compliance regime — such as certain financial software standards or health sector audit requirements relevant to HealthNZ systems — mandates a full regression on every build, selective runs may not be permissible regardless of efficiency gains.

Interview Prep

“What is sustainable or green testing, and why is it relevant now?”

It is reducing the cost and carbon footprint of testing while keeping the same confidence in quality — by running only the tests a change affects, right-sizing CI, and timing heavy work for cleaner energy. It matters now because cloud CI is billed by the compute-minute, suites have grown huge, and organisations report on emissions, so the cheap option and the low-carbon option are usually the same one.

“A team wants to halve its CI bill. What do you do first?”

Measure before cutting — capture compute-minutes, cost, and where the time goes. Then attack the biggest waste: test impact analysis so each commit runs only affected tests, full regression gated to merges and releases, and cache reuse instead of rebuilding dependencies. I would not delete valuable tests or drop release regression; that cuts the safety net rather than the waste.

“How do you make sure efficiency changes don’t let a regression slip through?”

Keep full regression as a gate on the main line and every release, so impact analysis only governs per-commit feedback, never what ships. Run a periodic full backstop in case the impact mapping misses an indirect dependency like shared fixtures or config. And track flake and escaped-defect rates alongside the cost metrics, so a drop in quality would show up immediately.

Sustainable testing decides what to run and how often, so it leans directly on regression testing — the suite whose redundant runs you are trimming — and on risk-based testing to weight effort towards the areas that most deserve it.

You cannot prove the savings without numbers, so pair it with test metrics to track compute, cost, and energy over time and confirm that waste fell while coverage held.

Where sustainable testing pays off most: large suites on cloud CI, teams running full regression on every commit, and pipelines with long-lived environments or rebuilt-every-time dependencies. The bigger and busier the pipeline, the more redundant compute there is to remove — and the larger the joint cost and carbon saving.

9 Measuring Test Suite Environmental Impact

AI and cloud testing have real energy costs. A 10,000-test Playwright suite running 20 times per day on cloud runners consumes measurable electricity — it just rarely appears on anyone’s radar because the cost is buried in a monthly invoice and the emissions are invisible.

What to measure

Signal	What it tells you	Where to find it
CI runner minutes per day	Total compute consumed; spot idle or redundant runs	GitHub Actions Usage tab, GitLab CI analytics
Peak concurrent runners	Whether you’re over-provisioning parallelism	CI billing dashboard, infrastructure cost reports
Test infrastructure cloud cost	Energy proxy: cost tracks compute which tracks electricity	AWS Cost Explorer, Azure Cost Management

Quick wins

Kill redundant tests. Tests covering identical ground give no extra confidence but consume real compute. Audit for coverage overlap before optimising anything else.
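Fix flaky tests first. Every retry re-executes the test, so a test that needs one or two retries costs 2–3× the energy of one that passes first time. A 5% flake rate on 500 tests is 25 flaky tests; at three retries each, that is roughly 75 extra test executions per run before you’ve done anything useful.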
Run only affected tests on non-main branches. Selective runs via test impact analysis are the single largest lever. Most teams see a 60–80% reduction in per-commit compute from this change alone.

The 80/20 Rule

As a rule of thumb, roughly 20% of your tests cover 80% of your real risk. The other 80% are often regression tests written to catch historical bugs that were fixed long ago and that no longer fail on any regular code path. Running all of them on every commit is the testing equivalent of auditing last year’s accounts every morning to make sure they still balance.

Pro tip: Track CI runner minutes per merged pull request as a trend metric — not per run. A rising trend signals accumulating waste even if individual runs look normal.
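A small sketch of that trend metric is shown below. The weekly figures are invented, and `is_rising` uses a deliberately simple rule (the latest window's mean against the previous window's mean); both function names are hypothetical.

```python
from statistics import mean


def minutes_per_merged_pr(weekly_totals):
    """Convert (runner_minutes, merged_prs) pairs into minutes per merged PR.

    Weeks with no merged pull requests are skipped to avoid dividing by zero.
    """
    return [minutes / prs for minutes, prs in weekly_totals if prs > 0]


def is_rising(series, window=3):
    """True if the mean of the latest `window` points exceeds the window before it."""
    if len(series) < 2 * window:
        return False  # not enough history to compare two windows
    recent = mean(series[-window:])
    previous = mean(series[-2 * window:-window])
    return recent > previous


# Illustrative data: (CI runner minutes, merged pull requests) per week.
WEEKLY = [(4200, 60), (4300, 58), (4500, 61), (5200, 59), (5600, 62), (6100, 60)]

trend = minutes_per_merged_pr(WEEKLY)
waste_accumulating = is_rising(trend)
```

Normalising by merged pull requests separates "the team shipped more" from "each change now costs more to verify".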

10 AI Testing Energy Costs

Running AI-powered tests has an energy cost that traditional testing does not. A single LLM inference call is orders of magnitude more compute-intensive than executing a unit test — and the numbers add up fast in a CI context.

The numbers in context

Example: 500-token prompt eval
  × 1,000 test runs per CI execution
  × 100 CI runs per month
  ────────────────────────────────
  = 50,000,000 tokens / month

At typical API pricing (~$0.003 / 1K input tokens):
  = $150/month just for the eval calls

Add output tokens, retries, and parallelism → costs compound quickly.

That’s a single eval suite. Teams running AI-powered test generation, LLM-as-judge frameworks, or semantic similarity checks at scale can easily push into four figures per month if the evals run on every commit.
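The same arithmetic can be parameterised so a team can plug in its own figures. The sketch below reuses the numbers from the example above; the price is the illustrative rate quoted there rather than any provider's actual tariff, and the 30-runs-per-month nightly case is an added assumption for comparison.

```python
def monthly_eval_cost(prompt_tokens, evals_per_run, runs_per_month, usd_per_1k_tokens):
    """Return (input_tokens_per_month, usd_per_month) for an eval suite.

    Counts input tokens only; output tokens and retries add to this.
    """
    tokens = prompt_tokens * evals_per_run * runs_per_month
    cost = tokens / 1000 * usd_per_1k_tokens
    return tokens, cost


# Figures from the worked example: evals run on every CI execution.
per_commit_tokens, per_commit_cost = monthly_eval_cost(500, 1_000, 100, 0.003)

# Assumed alternative: the same evals run once per night (about 30 runs).
nightly_tokens, nightly_cost = monthly_eval_cost(500, 1_000, 30, 0.003)
```

Because cost scales linearly with run count, moving evals from every CI run to a nightly schedule cuts it in direct proportion to the runs removed.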

Green strategies for AI testing

Strategy	How it helps
Use smaller models for deterministic checks	If you’re checking JSON schema conformance or exact field presence, a fast small model costs 10–50× less than a frontier model
Cache model responses	Where non-determinism is not the point of the test, cache the response and skip the inference call entirely
Run AI evals nightly, not on every commit	Reserve expensive LLM-as-judge checks for scheduled runs; fast unit checks on every push
Use semantic similarity, not exact match	Exact-match failures on semantically correct output trigger unnecessary retries, multiplying cost and energy for no quality gain
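The response-caching strategy in the table could look roughly like this. `ResponseCache` is a hypothetical in-memory sketch, and `call_model` stands in for whatever client function a team already uses; no real LLM API is assumed.

```python
import hashlib
import json


class ResponseCache:
    """Memoise model responses by (model, prompt) so repeat evals skip inference."""

    def __init__(self, call_model):
        self._call_model = call_model  # injected callable: (model, prompt) -> str
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        payload = json.dumps([model, prompt]).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response


# Stand-in for a real inference call, used here only to show the flow.
def fake_model(model, prompt):
    return f"{model} says: {prompt.upper()}"


cache = ResponseCache(fake_model)
first = cache.get("small-model", "is this valid json?")
second = cache.get("small-model", "is this valid json?")  # served from cache
```

This only suits checks where a fixed response is acceptable; tests that exist to probe the model's variability should bypass the cache.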

NZ context: New Zealand’s electricity is 84% renewable, but cloud computing still carries a carbon footprint. When you call an LLM API, that inference runs in a data centre in Virginia, Dublin, or Sydney — not on NZ’s grid. Catalyst Cloud (NZ-based) publishes carbon metrics and runs on a significantly cleaner grid than offshore providers. For AI workloads where latency permits, NZ-based compute is a meaningful green choice.

11 Flaky Test Environmental Cost

Flaky tests are usually framed as a quality problem: they erode trust in the suite and mask real failures. But they are also a sustainability problem with a concrete number attached.

The maths

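5% flaky rate × 500-test suite = 25 flaky tests per run
Each retried 3 times                = 75 extra test executions per run

Run CI 20× per day:
  75 × 20 = 1,500 wasted executions / day
           × 365 = 547,500 wasted executions / year

At 100ms average test duration:
  1,500 × 0.1s   = 150 seconds    = 2.5 minutes of wasted compute / day
  547,500 × 0.1s = 54,750 seconds = ~15 hours of wasted compute / year

At 10s average test duration (an illustrative figure for slow end-to-end tests):
  1,500 × 10s    = 15,000 seconds = ~4 hours of wasted compute / day

More than half a million test executions a year, producing nothing. No new information, no safety signal, no coverage benefit. For fast unit tests that adds up to about fifteen hours of compute a year; for slow end-to-end tests the same flake rate costs hours every day. These figures count test execution only, so any runner start-up or whole-job rerun triggered by a flaky failure comes on top. Pure waste.

Framing this for leadership

Teams often struggle to get budget for flaky test remediation because it feels like housekeeping. The sustainability frame gives you a second argument: flaky tests are not just a quality risk, they are a cost and emissions issue with a quantifiable number. “We pay for more than half a million wasted test executions a year because tests fail at random” lands differently than “some of our tests are a bit unreliable.”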
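To produce a number like this for your own suite, the calculation can be wrapped in a small function. This is a sketch under the same simplifying assumptions as the maths above: every flaky test is retried the full number of times on every run, and only test execution time is counted. The function and key names are hypothetical.

```python
def wasted_flaky_compute(suite_size, flake_rate, retries, runs_per_day, avg_test_seconds):
    """Estimate executions and compute time spent on flaky-test retries."""
    flaky_tests = round(suite_size * flake_rate)
    extra_per_run = flaky_tests * retries
    per_day = extra_per_run * runs_per_day
    per_year = per_day * 365
    return {
        "extra_per_run": extra_per_run,
        "executions_per_day": per_day,
        "executions_per_year": per_year,
        "hours_per_day": per_day * avg_test_seconds / 3600,
        "hours_per_year": per_year * avg_test_seconds / 3600,
    }


# The scenario from the maths above: 500 tests, 5% flaky, 3 retries,
# 20 CI runs a day, 100 ms per test.
unit_suite = wasted_flaky_compute(500, 0.05, 3, 20, 0.1)
```

Swapping in a measured average test duration and the real CI run count turns the leadership argument from an illustration into the team's own figure.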

✗ Don’t measure carbon before fixing determinism

Calculating the exact carbon footprint of a flaky test is premature when the test itself is broken. The right sequence is: fix test determinism first, then measure the reduced compute cost and emissions. Accurate numbers come from a stable suite — chasing carbon metrics on a flaky suite just creates unreliable sustainability data on top of unreliable test data.

The Leaky Tap Analogy

A flaky test suite is like a tap that drips 30% of the time. You can calculate how many litres per year are wasted, but the right response is to fix the washer, not to start tracking water consumption more precisely. Fix the drip first — the savings take care of themselves.

