Feature Flags & Progressive Delivery
A feature flag lets you ship code to production with the feature switched off — so “deployed” and “released” stop meaning the same thing. That is powerful, and it doubles what you have to test. This lesson teaches you to test the flag, not just the feature.
1 The Hook
A fictional NZ utility, Moana Power, was rolling out a new direct-debit billing engine behind a feature flag. The plan was textbook progressive delivery: deploy the code to production with the flag off, switch it on for internal staff, then 5% of customers, then everyone — backing out instantly by flicking the flag if anything went wrong. The new engine was tested hard, in its on state, and it worked.
The incident came from the state nobody tested. A second, older flag — “legacy-rounding” — had been left on for a small group of grandfathered commercial accounts. When the new billing engine was switched on for those accounts, the two flags collided: the new engine and the legacy-rounding path made different assumptions about where a fractional cent went, and a few hundred commercial customers were billed a few cents wrong on every line item. Individually trivial. Across a month of usage charges, it was a reconciliation mess and an awkward set of phone calls.
Two lessons hide in that story. The first: the team tested the new feature in its on state and never tested the system in the off state or, worse, in combination with the other flags already live. A feature flag does not add one code path — it adds two, and every other live flag multiplies them. The second: the legacy-rounding flag was years old. It had long since served its purpose and should have been removed, but it sat in the code as a quiet, untested branch waiting for exactly this collision. A flag that outlives its rollout is not a convenience — it is technical debt with a trigger.
This lesson teaches you to test in a flagged world: both states of every flag, the combinations that matter, and the stale flags that turn into defects months after anyone remembers them.
2 The Rule
A feature flag is not one feature to test — it is at least two code paths (on and off) that must both work, plus every combination with the other flags already live. “Deployed” no longer means “released”, so the flag itself is under test: its on state, its off state, its default if the flag service is unreachable, and its interaction with everything else switched on. And a flag that has finished its rollout but stayed in the code is an untested branch waiting to fail.
3 The Analogy
The light switches in a house you have just wired.
Every switch you add has two positions, and an electrician signing off the house tests both — not just that the light comes on, but that it also goes properly off. Now add a second switch on the same circuit, a two-way switch on a stairwell, and the job doubles again: you have to check every combination, because a wiring fault often only shows when this switch is up and that one is down. A feature flag is a switch on a live circuit. Testing only the “on” position is signing off a house having flicked each switch once and never checking what happens when two are thrown together.
And the stale flag? It is the mystery switch in the hallway that does not seem to do anything — until one day someone discovers it was cross-wired to the hot-water cylinder all along. Moana Power had a mystery switch.
4 Deploy Is Not Release
The single idea that makes flags worth the trouble: a feature flag separates deploying code from releasing a feature. Deploying is shipping the code to production servers. Releasing is turning it on for users. Without flags they are the same event; with flags they are two events you control independently.
That separation is what enables progressive delivery — turning a feature on for a widening audience rather than all at once:
- Internal / dogfood: on for staff only. Real production, friendly users, fast feedback.
- Percentage rollout: on for 1%, then 5%, then 25% of customers. Blast radius stays small while signal grows.
- Full release: on for everyone. The flag has done its job and should now be retired.
- Kill switch: flick it off instantly if metrics turn bad. Rollback without a redeploy takes seconds, not minutes.
For a tester this reframes the job. The kill switch is only real if the off path actually works — so “flip the flag off and confirm the system returns cleanly to the old behaviour” is itself a test, and arguably the most important one. A kill switch you have never exercised is a fire escape no one has checked is unlocked.
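The rollout stages above can be sketched in a few lines. This is a minimal illustration, not how any particular flag product works: the function names, the `staff_ids` parameter, and the choice of SHA-256 bucketing are all assumptions made for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..9999 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag_name, user_id, rollout_percent, *,
               kill_switch=False, staff_ids=frozenset()):
    """Decide whether the flag is on for one user (hypothetical rules)."""
    if kill_switch:            # the kill switch overrides every other rule
        return False
    if user_id in staff_ids:   # internal / dogfood stage
        return True
    # The bucket never changes for a given user and flag, so widening the
    # percentage only adds users; nobody who was on gets switched off.
    return bucket(flag_name, user_id) < rollout_percent * 100
```

Because the bucket is derived from the user and the flag name, a customer who is in the 5% group stays in the 25% group, which keeps their experience stable and makes an incident at a given stage easier to reproduce.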
5 Testing Both States — and the Default
The core discipline of flag testing is simple to state and easy to skip: every flag has two states, and both ship to production, so both must be tested. There are really three things to verify.
- Flag on — the new behaviour: the new feature works as specified. This is the test everyone remembers to write.
- Flag off — the old behaviour intact: with the flag off, the system behaves exactly as it did before the feature existed. This is the kill-switch guarantee and the most commonly skipped test. If turning the flag off does not cleanly restore the previous behaviour, you have no real rollback.
- The default — flag service unreachable: flags are usually served by a flag-management service. If that service is slow or down, what state does the flag fall back to? The safe default is almost always off (old, known-good behaviour). Test that an unreachable flag service does not, for example, accidentally enable a half-finished feature for everyone.
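The three checks above can be written as three small tests against one resolution function. The sketch below assumes a hypothetical flag client with a `get` method and a hypothetical `FlagServiceUnavailable` error; a real flag-management SDK will have its own names and its own way of configuring defaults.

```python
class FlagServiceUnavailable(Exception):
    """Raised by the hypothetical client when the flag service cannot be reached."""


class InMemoryFlagClient:
    """Stand-in for a real flag-service client, used only for this sketch."""

    def __init__(self, flags, reachable=True):
        self._flags = dict(flags)
        self.reachable = reachable

    def get(self, name):
        if not self.reachable:
            raise FlagServiceUnavailable(name)
        return self._flags[name]


def resolve_flag(client, name, safe_default=False):
    """Return the flag state, falling back to the safe default on any failure."""
    try:
        return bool(client.get(name))
    except (FlagServiceUnavailable, KeyError):
        # Unreachable service or unknown flag: fail closed to the old behaviour.
        return safe_default


def choose_engine(client):
    """Pick the billing engine from the flag state."""
    return "engine-v2" if resolve_flag(client, "billing-engine-v2") else "engine-v1"


# One check per state: on, off, and flag service unreachable.
assert choose_engine(InMemoryFlagClient({"billing-engine-v2": True})) == "engine-v2"
assert choose_engine(InMemoryFlagClient({"billing-engine-v2": False})) == "engine-v1"
assert choose_engine(InMemoryFlagClient({}, reachable=False)) == "engine-v1"
```

The third assertion is the one that is easiest to forget: it proves the failure mode is the old engine, not the new one.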
6 Flag Combinations — the Real Risk
One flag is two paths. Two independent flags are four combinations. Ten independent flags give 2^10 = 1,024. You cannot test every combination of a large flag set, because the count doubles with each flag added, so flag testing is a risk-based selection problem, not an exhaustive one. The Moana Power defect was a combination defect: the new billing flag was fine alone, and only failed in combination with legacy-rounding.
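The arithmetic is easy to confirm by enumerating the states. A short sketch, using the two Moana Power flag names plus ten placeholder flags invented for the count:

```python
from itertools import product


def all_combinations(flag_names):
    """Yield every on/off assignment for the given flags as a dict."""
    for states in product([False, True], repeat=len(flag_names)):
        yield dict(zip(flag_names, states))


two_flags = list(all_combinations(["billing-engine-v2", "legacy-rounding"]))
ten_flags = list(all_combinations([f"flag-{i}" for i in range(10)]))

print(len(two_flags))  # 4
print(len(ten_flags))  # 1024
```

Enumeration like this is practical for two or three flags that genuinely interact. Beyond that, the list is a reminder of why selection matters rather than a test plan.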
You decide which combinations to test by asking where flags actually interact:
- Flags that touch the same data or calculation: two flags that both affect billing, rounding, or a customer record are prime collision candidates — test them together. Two flags on unrelated screens almost never interact.
- A new flag against every other flag currently on in production: the combination that matters most is your new feature crossed with what is already live for real customers right now — not every theoretical pairing.
- Flags with shared dependencies: if two features both call the same downstream service or write to the same table, turning both on can change load or ordering in ways neither does alone.
- Mutually exclusive flags: some combinations should be impossible (two different billing engines on at once). Test that the system prevents or safely handles the invalid combination, rather than assuming it can never occur.
The discipline: keep a current inventory of which flags are on in production, and when you test a new flag, deliberately test it against the live ones that share data or dependencies. That is a handful of targeted tests, not a thousand — and it is exactly the test Moana Power was missing.
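If the inventory records what each flag touches, the selection can be partly mechanical. The sketch below is one possible shape for that inventory. The `touches` labels and the two extra flags (`outage-map-v2`, `paperless-billing`) are hypothetical and exist only to show a flag being deselected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveFlag:
    name: str
    on_in_production: bool
    touches: frozenset  # data, calculations, or services the flag affects


def combinations_to_test(new_flag, inventory):
    """Return (flag name, shared resources) for live flags that overlap the new one."""
    selected = []
    for flag in inventory:
        if not flag.on_in_production:
            continue  # only flags real customers have on right now
        shared = new_flag.touches & flag.touches
        if shared:
            selected.append((flag.name, sorted(shared)))
    return selected


new_flag = LiveFlag("billing-engine-v2", False,
                    frozenset({"invoice-total", "line-item-rounding"}))
inventory = [
    LiveFlag("legacy-rounding", True, frozenset({"line-item-rounding"})),
    LiveFlag("outage-map-v2", True, frozenset({"outage-map-ui"})),      # unrelated screen
    LiveFlag("paperless-billing", False, frozenset({"invoice-total"})),  # not live
]

print(combinations_to_test(new_flag, inventory))
# [('legacy-rounding', ['line-item-rounding'])]
```

The output is only as good as the `touches` data, so a tester still has to challenge the inventory. A flag recorded as UI-only that quietly writes to a billing table will be missed by any automated selection.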
7 The Stale-Flag Risk
A flag is meant to be temporary scaffolding for a rollout. Once a feature is fully released and stable, the flag and its old code path should be removed. Flags that are never cleaned up accumulate into a quiet, expensive problem.
- Combinatorial explosion: every flag left in the code multiplies the combinations a tester theoretically owns. Dead flags inflate real risk.
- Hidden coupling: a stale flag’s default can silently change behaviour when something else moves. Nobody remembers what it does.
- Audit ambiguity: for a regulated NZ system, “which code path actually ran for this customer?” gets harder to answer with each orphaned flag.
The tester’s role here is partly hygiene and partly evidence. Flag the flags: as part of release work, ask which flags are now permanently on, fully rolled out, and safe to remove — and make removing them part of “done.” A flag inventory with an owner and a planned removal date for each temporary flag turns an invisible, growing risk into a tracked, shrinking one.
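A flag inventory with owners and removal dates can be checked automatically as part of release work. This is a minimal sketch under assumed field names. The owners, dates, and the `outage-map-v2` flag are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str
    rollout_percent: int
    remove_by: Optional[date]  # None means no removal date was ever planned


def overdue_flags(inventory, today):
    """Flags at 100% whose removal date is missing or already past."""
    return [
        record.name
        for record in inventory
        if record.rollout_percent == 100
        and (record.remove_by is None or record.remove_by < today)
    ]


inventory = [
    FlagRecord("billing-engine-v2", "billing-team", 5, date(2025, 9, 30)),
    FlagRecord("legacy-rounding", "unknown", 100, None),
    FlagRecord("outage-map-v2", "web-team", 100, date(2025, 12, 1)),
]

print(overdue_flags(inventory, today=date(2025, 6, 1)))
# ['legacy-rounding']
```

A check like this could run in a pipeline and fail the build, or simply feed a report. Either way it turns "nobody remembers what it does" into a named item with an owner.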
8 A Flag-Driven Test Strategy
Pulling it together, a test approach for a flagged feature should read as a small, auditable plan rather than a vague “we tested the feature.” Here it is for the Moana Power billing flag:
Type / lifespan: Release flag — temporary, remove after full rollout
On-state test: New engine bills sample accounts correctly to the cent.
Off-state test: With flag off, billing output is byte-identical to the current production engine for the same accounts (rollback proof).
Default test: Flag service unreachable → flag resolves OFF (old engine).
Combination tests: billing-engine-v2 ON × legacy-rounding ON for grandfathered commercial accounts: cents reconcile. (The missed test.)
Rollout gate: Promote 1% → 5% → 25% → 100% only while billing-error rate and complaint rate stay within tolerance.
Cleanup: Remove flag and old engine path within one sprint of 100%.
Traceability: Risk R-05 (billing collision between concurrent flags).
Notice what makes this a strategy rather than a checklist: it names both states, the default, the specific combination that carries risk, a measurable rollout gate, and a cleanup obligation with a deadline. The combination line is the one Moana Power was missing, and it is the one a flag-aware tester writes by reflex, because they keep an inventory of what is already live. That inventory and those rollout gates carry straight into Lesson 3, where the metrics that gate a percentage rollout become the metrics that gate a canary.
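The rollout gate in the plan only counts as measurable if "within tolerance" is written down as numbers. A sketch of one way to express it follows. The metric names and tolerance values are assumptions for illustration, not figures from the plan.

```python
STAGES = [1, 5, 25, 100]  # rollout percentages from the plan

# Hypothetical tolerances: the highest acceptable rate for each metric.
TOLERANCES = {"billing_error_rate": 0.001, "complaint_rate": 0.005}


def next_stage(current_percent, metrics, tolerances=TOLERANCES):
    """Return (decision, percent, breached_metrics) for the rollout."""
    breached = sorted(
        name
        for name, limit in tolerances.items()
        # A missing metric counts as a breach: no evidence, no promotion.
        if metrics.get(name) is None or metrics[name] > limit
    )
    if breached:
        return ("halt", current_percent, breached)
    later = [stage for stage in STAGES if stage > current_percent]
    if not later:
        return ("complete", current_percent, [])
    return ("promote", later[0], [])


print(next_stage(1, {"billing_error_rate": 0.0002, "complaint_rate": 0.001}))
# ('promote', 5, [])
print(next_stage(5, {"billing_error_rate": 0.004, "complaint_rate": 0.001}))
# ('halt', 5, ['billing_error_rate'])
```

This sketch halts on a breach and leaves the decision to flick the kill switch to a person. A team could equally choose to switch off automatically, and that choice belongs in the plan.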
9 Common Mistakes
🚫 Testing only the flag’s on state
Why it happens: The new feature is the interesting work, so testing focuses on proving it works.
The fix: The off state ships too, and it is the kill-switch guarantee. If flicking the flag off does not cleanly restore the old behaviour, the “we can always turn it off” safety net is fiction. Write the off-state test first and treat it as the rollback criterion.
🚫 Ignoring flag combinations with what is already live
Why it happens: A new flag is tested in isolation, as if it were the only flag in the system.
The fix: This is the Moana Power collision. Keep an inventory of which flags are on in production and deliberately test your new flag against the live ones that share data, calculations, or dependencies. It is a handful of targeted tests, not every theoretical pairing.
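A combination test for two flags that share a calculation can be as small as the sketch below. The `bill` function is a toy with a deliberately planted defect, invented to show what "cents reconcile" means as an assertion. It is not a reconstruction of any real billing engine.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP
from itertools import product

CENT = Decimal("0.01")


def bill(line_amounts, engine_v2, legacy_rounding):
    """Toy billing function: returns (rounded line items, invoice total)."""
    line_mode = ROUND_DOWN if legacy_rounding else ROUND_HALF_UP
    lines = [amount.quantize(CENT, rounding=line_mode) for amount in line_amounts]
    if engine_v2:
        # Planted defect: v2 recomputes the total assuming half-up rounding,
        # ignoring the rounding mode actually applied to the line items.
        total = sum((a.quantize(CENT, rounding=ROUND_HALF_UP) for a in line_amounts),
                    Decimal("0"))
    else:
        total = sum(lines, Decimal("0"))
    return lines, total


def unreconciled_combinations(line_amounts):
    """Return the (engine_v2, legacy_rounding) states where lines and total disagree."""
    failures = []
    for engine_v2, legacy_rounding in product([False, True], repeat=2):
        lines, total = bill(line_amounts, engine_v2, legacy_rounding)
        if sum(lines, Decimal("0")) != total:
            failures.append((engine_v2, legacy_rounding))
    return failures


print(unreconciled_combinations([Decimal("10.005"), Decimal("3.337")]))
# [(True, True)]
```

Each flag passes on its own, and only the state with both on fails. A suite that tested the new engine in isolation would have reported all green.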
🚫 Leaving flags in the code after full rollout
Why it happens: Once a feature is 100% on, the team moves to the next thing; removing the flag feels like low-value cleanup.
The fix: A stale flag is an untested branch and a multiplier on combinatorial risk, which is exactly what made legacy-rounding dangerous. Treat a fully-rolled-out flag as an open item until the flag and the old code path are removed.
🚫 Not testing the flag service’s failure default
Why it happens: The flag service is assumed to always answer, so nobody asks what happens when it does not.
The fix: If the flag service is down or slow, the flag must resolve to a safe default — almost always off / old behaviour. Test that an unreachable flag service cannot accidentally switch a half-finished feature on for everyone.
10 Now You Try
Three graded exercises: spot the flag risks, fix a flag test plan, then build a combination-test matrix. Write your answer, then compare it to the model answer.
Read the description of a flagged release for a fictional IRD myIR online-services portal below. Identify 3 feature-flag risks that could cause an incident, and name the test that addresses each.
Flag: refund-calc-v2
The team tested refund-calc-v2 thoroughly in its on state and it calculates refunds correctly. They plan to roll it out 5% → 50% → 100%. The flag is served by a flag-management service; nobody has checked what the app does if that service times out. There is already a live flag, provisional-tax-rules, on for self-employed customers, which also touches the refund figure. Once refund-calc-v2 hits 100% the team plans to move straight to the next epic. The old calculator code will stay in place “in case we need it.”
List 3 risks and the test for each:
Model answer:
There are at least four real risks; any three well-explained earns full marks.
1. Off-state / kill switch never tested: only the on state was tested, so there is no proof that flicking refund-calc-v2 off cleanly restores the old calculator. The rollback safety net is unverified. Test: with the flag off, refund output is identical to the current production calculator for the same customers (rollback proof).
2. Flag combination ignored: provisional-tax-rules is already live for self-employed customers and also touches the refund figure, so the two flags can collide on exactly those customers. Test: refund-calc-v2 ON × provisional-tax-rules ON for self-employed accounts, with the refund reconciling to the cent.
3. Flag-service default untested: nobody knows what happens if the flag service times out; it could fail open and enable v2 for everyone. Test: with the flag service unreachable, the flag resolves to a safe default (OFF / old calculator), not on.
Bonus risk: Stale flag planned in. Keeping the old calculator "in case we need it" after 100% leaves an untested branch and combinatorial debt. Test/action: schedule removal of the flag and old path within a sprint of 100%, and treat the rollout as incomplete until then.
The trap: every one of these is invisible if you only test the new feature in its on state, which is exactly what the team did.
The flag test plan below is incomplete in the classic way. Rewrite it into a complete flag-driven test plan for a fictional Waka Kotahi online vehicle-relicensing feature behind flag relicense-v2, with these elements: On-state test, Off-state test, Default test, Combination tests, Rollout gate, Cleanup, Traceability.
“Turn the flag on and check the new relicensing flow works. If it works, roll it out to everyone.”
Rewrite as a complete flag-driven test plan:
Model answer:
Flag: relicense-v2
Type / lifespan: Release flag, temporary; remove after full rollout.
On-state test: With relicense-v2 ON, a sample of vehicle relicensing transactions (car, motorcycle, heavy vehicle, expired vs current) complete correctly end to end, including payment and confirmation.
Off-state test: With relicense-v2 OFF, the relicensing flow is identical to the current production flow for the same scenarios: same outputs, same records written. This is the kill-switch / rollback proof.
Default test: With the flag-management service unreachable or timing out, relicense-v2 resolves to a safe default of OFF (old flow), never on. Verify no half-finished v2 path is exposed when the flag service is down.
Combination tests: relicense-v2 ON crossed with any live flag that touches the same payment or vehicle-record path (e.g. a concurrent payments-gateway flag); confirm the combination still completes and reconciles. Test relicense-v2 against the flags actually on in production right now, not every theoretical pairing.
Rollout gate: Promote internal → 5% → 50% → 100% only while transaction success rate, payment-error rate, and support-contact rate stay within agreed tolerance; halt or kill on breach.
Cleanup: Remove relicense-v2 and the old relicensing code path within one sprint of reaching 100%. Treat the rollout as incomplete until removed.
Traceability: Linked to the DevOps risk register (e.g. R-06: relicensing release regression / payment path collision).
The original tested only the on state and equated "works" with "release to everyone", missing the off state, the default, every combination, the rollout gate, and cleanup.
A fictional ANZ mobile-banking app has three flags live in production: new-dashboard (UI), instant-payments (touches the payment path), and spending-insights (reads transaction data). You are releasing a fourth flag, scheduled-payments-v2 (also touches the payment path). Design a risk-based combination-test matrix of 3 combinations you would actually test, and for each: which flags are on, why this combination is risky, and the acceptance criterion.
Model answer:
Combination 1
Flags ON: scheduled-payments-v2 + instant-payments
Why risky: both touch the payment path and may share the same payment service, transaction table, or idempotency logic. This is the highest-risk pairing.
Acceptance criterion: a scheduled payment and an instant payment for the same account both complete, with no double-charge, no lost payment, and correct ordering/idempotency.
Combination 2
Flags ON: scheduled-payments-v2 + spending-insights
Why risky: spending-insights reads transaction data; a new scheduled-payment record could be miscounted, double-counted, or shown before it settles.
Acceptance criterion: spending-insights reflects scheduled payments correctly, with pending vs settled handled right and totals reconciling to the ledger.
Combination 3
Flags ON: scheduled-payments-v2 + instant-payments + spending-insights (all three payment/data-related flags)
Why risky: the realistic production state for an active customer; emergent issues (load on the shared payment service, ordering across all three) only appear with everything on.
Acceptance criterion: end-to-end, all features function and the spending totals reconcile with the payment records; no errors under the combined load.
Would NOT prioritise: scheduled-payments-v2 + new-dashboard alone. new-dashboard is UI-only and does not touch the payment path or transaction data, so a data/payment collision is implausible. A light smoke check that the new dashboard renders the scheduled-payments entry is enough; it does not warrant a full combination test.
Strong matrices: pick combinations where flags share data, calculations, or dependencies (here, the payment path), include the realistic all-on production state, give measurable criteria, and consciously DESELECT the low-risk pairing with a reason. That judgement is the skill being marked.
11 Self-Check
Q1: Why does a feature flag separate “deployed” from “released”, and why does that matter to a tester?
Deploying ships the code to production; releasing turns it on for users. A flag lets you do them separately: code can be live in production with the feature off, so you can exercise it under production conditions without exposing it to customers, then turn it on progressively. For a tester it means the flag itself is under test: its on state, its off state, and its safe default.
Q2: Why is the off-state test the most important one, and what does it prove?
Because the off state is the kill switch — the “we can always turn it off” safety net. If flicking the flag off does not cleanly restore the old behaviour (same outputs, same data handling), there is no real rollback no matter how well the on state works. The off-state test is the rollback acceptance criterion.
Q3: You cannot test every flag combination. How do you choose which to test?
By where flags actually interact: flags that touch the same data, calculation, or downstream dependency; your new flag crossed with whatever is already live in production right now; and any combinations that should be impossible (test they are prevented or handled safely). Keep an inventory of live flags so you can target the handful of risky combinations rather than thousands of theoretical ones.
Q4: Why is a flag left in the code after full rollout a defect, not just untidy?
The flagged-off old path stops being tested but still ships, so it becomes an untested landmine, like Moana Power’s legacy-rounding. It also multiplies the combinations a tester owns and, for a regulated NZ system, muddies the audit question of which code path actually ran for a customer. Treat a fully-rolled-out flag as an open item until the flag and old path are removed.
Q5: What should a flag resolve to if the flag-management service is unreachable, and why test it?
A safe default — almost always off, the old known-good behaviour. Test it because a flag service that fails open could switch a half-finished feature on for everyone at the worst possible moment. The unreachable-service case is a negative test that proves the failure mode is safe.
12 Interview Prep
Real questions asked in NZ QA interviews for DevOps-adjacent roles. Read the model answers, then practise your own version.
“We use feature flags everywhere. How does that change your test approach?”
It roughly doubles the surface and adds a combination dimension. For each flag I test both states — the on state for the new behaviour, and the off state to prove the kill switch cleanly restores the old behaviour, which is the rollback guarantee people most often skip. I test the safe default if the flag service is unreachable. Then I look at combinations: I keep an inventory of which flags are on in production and deliberately test a new flag against the live ones that share data, calculations, or dependencies, because that is where collisions hide. And I treat a fully-rolled-out flag as unfinished work until the flag and its old code path are removed, so stale flags do not pile up into untested branches.
“A feature was tested and worked, but caused an incident in production. It was behind a flag. What probably went wrong?”
My first guess is a combination defect: the feature was tested alone, in its on state, but collided with another flag already live that touched the same data or path — the classic concurrent-flags billing or calculation clash. My second guess is the off-state or default was never tested, so a rollback or a flag-service outage exposed a broken path. I’d pull the inventory of flags that were on for the affected users, reproduce the specific combination, and check what the flag resolves to when its service is unavailable. “Tested and worked” almost always means “tested in isolation, on state only” — and that is precisely the gap.
“How do you stop feature flags becoming a maintenance and risk problem over time?”
By treating them as temporary scaffolding with an owner and an expiry, not permanent furniture. Each temporary flag goes in an inventory with who owns it and a planned removal date, and removing the flag and its old code path is part of “done” for the rollout — I treat a 100%-on flag as an open item until that happens. That keeps the combinatorial space small, removes untested branches before they become landmines like the legacy-rounding flag, and — important for a regulated NZ system — keeps it answerable which code path actually ran for a given customer. The hygiene is cheap; the stale-flag incident it prevents is not.