Applying 42119 in a Real NZ Project
Five lessons of theory come together here. This is what it looks like when a team applies 42119 from the first sprint to the live system — and why doing it that way is the difference between catching a problem in week one and discovering it in a regulator’s letter.
1 The Hook
For once, a story where it goes right. Tahi Bank, a fictional NZ bank, set out to deploy an AI fraud-detection system on its retail transactions. The QA lead had read 42119 and made a decision at the very start: this would not be a project where testing showed up at the end. It would be built in from sprint one.
So in sprint 1, before a model existed, the team built the AI risk register — naming the data, model, fairness, explainability, and drift risks the system carried, each with a likelihood, an impact, and a test approach. During development, data quality test cases were written alongside the model: representativeness across customer segments and regions, provenance and lawful basis for the transaction data, label correctness on the confirmed-fraud examples. At UAT, model performance was tested against agreed precision and recall thresholds, with adversarial tests probing whether fraudulent patterns could slip past. Before go-live, fairness testing confirmed the model did not flag legitimate transactions at different rates across customer groups. And throughout, every test produced an audit-ready artefact, traced to a numbered risk.
Three months after go-live, the continuous validation suite did its job. A scheduled drift test showed fraud-detection recall slipping — a new scam pattern was emerging that the model had not been trained on. The alert fired, the team retrained, and the gap closed before any customer lost money. No six-month silent decline, because someone had built the drift test in at design time.
Then the FMA, reviewing AI use across the sector, asked Tahi Bank to demonstrate how the system had been tested. The QA lead did not panic. She produced the risk register, the test plan tracing every risk to its tests, the per-group fairness tables, the drift time series with the caught-and-fixed event, and the dated sign-offs. The whole conversation took an afternoon. That is what good looks like — and every piece of it came from a lesson in this module.
2 The Rule
ISO/IEC 42119 is not a compliance checkbox done at the end of a project. It is a test approach applied throughout the AI system lifecycle — from the risk register in sprint one, through data and model testing during development, fairness testing before go-live, and drift detection running for as long as the system is live. Applied early, it is cheap and powerful. Bolted on late, it is expensive and usually too late.
3 The Analogy
A Great Walk, not a single weather check at the car park.
Nobody walks the Routeburn by checking the forecast once in the car park and then ignoring the mountains for three days. You plan against the known risks before you leave, you check conditions and your gear at every hut, and DOC monitors the track the whole season — closing it when a slip or a storm makes it unsafe. Safety is not a gate you pass at the start; it is attention applied the whole way through.
Applying 42119 is the same. The risk register is your trip plan. Data and model testing are the gear checks at each hut. Fairness testing before go-live is the last check before the exposed alpine section. And continuous validation is DOC watching the track after you have set out — because the conditions that were fine at the start do not stay fine on their own. A team that tests AI only at go-live is checking the forecast in the car park and hoping.
4 End-to-End Walkthrough
Here is how 42119 maps onto a typical NZ AI project, sprint by sprint, using the Tahi Bank fraud system as the worked example.
| Phase | 42119 activity | Artefact produced |
|---|---|---|
| Sprint 1 — Design | Build the AI risk register (Lesson 1). Define quality characteristics and acceptance thresholds. Decide the fairness definition the system owes. Plan the label-collection needed for drift testing later. | AI risk register; draft test plan with coverage rationale. |
| Development | Data quality testing alongside model build (Lesson 2): representativeness, provenance, label correctness. Begin model performance testing as the model matures (Lesson 3). | Data test results; provenance records; early performance metrics — all risk-traced. |
| UAT | Full model testing (Lesson 3): performance against thresholds, adversarial tests, explainability checks on sample decisions. System and integration testing of the model inside the application. | Performance and adversarial results by group; explainability sample; integration test results. |
| Before go-live | Fairness testing (Lesson 4): demographic parity and counterfactual pairs across the groups in the register. Final test summary report with open risks (Lesson 5). | Per-group fairness tables; counterfactual results; signed summary report with go/no-go recommendation. |
| Post-deployment | Continuous validation (Lesson 3): scheduled drift tests on fresh labelled data, with thresholds and alerts. Periodic fairness re-checks for feedback-loop bias. | Dated drift time series; alert and retraining records — the artefacts that answered the FMA. |
The shape to notice: testing is continuous and front-loaded, not a phase before release. The most important decisions — what to test, the fairness definition, the drift label-collection — are made in sprint 1, and the largest single block of testing (continuous validation) runs after go-live for the life of the system.
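The risk-traced artefacts in the table above rest on one mechanical check: every risk has at least one test, and every test points at a risk that exists in the register. A minimal Python sketch of that check follows. The class names, field names, and IDs are illustrative assumptions, not structures defined by 42119.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One row of a hypothetical AI risk register."""
    risk_id: str
    category: str       # e.g. data, model, fairness, explainability, drift
    likelihood: str
    impact: str
    test_approach: str


@dataclass(frozen=True)
class PlannedTest:
    """One planned test and the risk IDs it claims to cover."""
    test_id: str
    phase: str
    risk_ids: tuple


def untested_risks(register, tests):
    """Return IDs of risks that no planned test traces to."""
    covered = {rid for t in tests for rid in t.risk_ids}
    return [r.risk_id for r in register if r.risk_id not in covered]


def untraced_tests(register, tests):
    """Return IDs of tests that trace to no risk, or to an unknown risk."""
    known = {r.risk_id for r in register}
    return [t.test_id for t in tests
            if not t.risk_ids or not set(t.risk_ids) <= known]


# Illustrative entries only (assumed, loosely modelled on the fraud example).
register = [
    Risk("R1", "data", "med", "high", "representativeness testing"),
    Risk("R2", "fairness", "med", "high", "per-group flag rates"),
    Risk("R3", "drift", "high", "high", "scheduled recall check"),
]
tests = [
    PlannedTest("TC-DATA-01", "development", ("R1",)),
    PlannedTest("TC-FAIR-01", "before go-live", ("R2",)),
    PlannedTest("TC-MISC-01", "UAT", ()),
]

print(untested_risks(register, tests))   # ['R3']
print(untraced_tests(register, tests))   # ['TC-MISC-01']
```

Running a check like this whenever the test plan changes is one way to keep the coverage rationale honest: a drift risk with no test, as in this toy data, shows up in sprint 1 rather than after go-live.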
5 When to Do What
If you remember one timing map from this module, make it this one:
- Data quality testing — during development. You test the data as it is assembled and before the model trains on it. Finding a representativeness gap after the model ships means retraining; finding it during development means fixing the dataset.
- Model testing — during development and UAT. Performance and adversarial testing start as the model matures and intensify at UAT, when the model is stable enough to test against final thresholds.
- Fairness testing — before go-live. Once the model is final, before it touches real customers. Fairness is the last gate, because it is the failure that most directly harms people and most attracts regulators.
- Drift testing — post-deployment, continuously. It only makes sense once the model is live and the world can move under it. But the mechanism — the label collection — must be designed in sprint 1.
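The drift item above only works if an agreed threshold decides when a scheduled check becomes an alert. A minimal Python sketch, assuming monthly counts of caught and missed fraud taken from freshly labelled data. The function names, the 0.90 floor, the 0.05 maximum drop, and the sample figures are all hypothetical.

```python
def recall(true_positives, false_negatives):
    """Share of confirmed positive cases the model caught."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no confirmed positive cases in this window")
    return true_positives / total


def drift_alerts(series, baseline, floor, max_drop):
    """Check each dated window against an absolute floor and a drop from baseline.

    series: list of (period, true_positives, false_negatives).
    Returns a list of (period, recall, reasons) for windows that breach a limit.
    """
    alerts = []
    for period, tp, fn in series:
        r = recall(tp, fn)
        reasons = []
        if r < floor:
            reasons.append("below floor")
        if baseline - r > max_drop:
            reasons.append("dropped from baseline")
        if reasons:
            alerts.append((period, round(r, 3), reasons))
    return alerts


# Hypothetical monthly results from labelled production data.
monthly = [
    ("2025-01", 188, 12),   # recall 0.94
    ("2025-02", 186, 14),   # recall 0.93
    ("2025-03", 170, 30),   # recall 0.85
]
alerts = drift_alerts(monthly, baseline=0.94, floor=0.90, max_drop=0.05)
print(alerts)
# [('2025-03', 0.85, ['below floor', 'dropped from baseline'])]
```

The dated series and the alert list are themselves the evidence: kept with the thresholds that produced them, they show when degradation started and when it was caught.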
6 Raising 42119 with a Team That Has Not Heard of It
You will often be the only person in the room who knows 42119 exists. Leading with “we must comply with ISO/IEC TS 42119-2:2025” alienates developers and product managers who hear it as process overhead. Lead with the risk instead.
- Talk in failures, not clause numbers. “If the model performs worse for one region and we can’t show we checked, that’s a complaint and a headline” lands better than “the standard requires demographic parity testing.”
- Frame it as protecting the team. The risk register and audit-ready artefacts are what let everyone answer a regulator calmly — the Tahi Bank afternoon versus the Taurangi Wealth scramble. It is insurance, not bureaucracy.
- Start with the cheap, high-value tests. A representativeness comparison and a drift plan cost little and cover the most common, most damaging failures. Win trust with those before proposing the fuller programme.
- Make the developers allies. Bias mitigation is their craft; bias testing is yours. Position 42119 as giving their good work the evidence it deserves, not as checking up on them.
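To show a sceptical team how cheap the representativeness comparison mentioned above can be, a few lines are enough. The sketch below compares each segment's share of the training data with its share of the live population and flags differences beyond a tolerance in percentage points. The segment names, counts, and the 10-point tolerance are hypothetical.

```python
from collections import Counter


def shares(labels):
    """Return each segment's share of the total, as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def representativeness_gaps(train, live, tolerance_pp):
    """Return segments whose training share differs from the live share
    by more than tolerance_pp percentage points (positive = over-represented)."""
    train_shares, live_shares = shares(train), shares(live)
    gaps = {}
    for segment in sorted(set(train_shares) | set(live_shares)):
        diff_pp = (train_shares.get(segment, 0.0) - live_shares.get(segment, 0.0)) * 100
        if abs(diff_pp) > tolerance_pp:
            gaps[segment] = round(diff_pp, 1)
    return gaps


# Hypothetical region labels for training rows and live customers.
train = ["Auckland"] * 70 + ["Wellington"] * 25 + ["Southland"] * 5
live = ["Auckland"] * 50 + ["Wellington"] * 30 + ["Southland"] * 20

print(representativeness_gaps(train, live, tolerance_pp=10))
# {'Auckland': 20.0, 'Southland': -15.0}
```

A result like this is easy to put in front of developers and product managers: it names the under-represented segment and gives a number, without mentioning a clause.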
7 The 42119 Roadmap
42119-2 is the overview part, and it is a Technical Specification — deliberately an early, evolving document. More parts are coming, and a senior tester should know what is on the horizon:
- Part 3 — Verification and Validation (expected 2026): deeper treatment of how AI systems are verified and validated, building on the overview.
- Part 7 — Red Teaming: structured adversarial testing of AI systems by dedicated teams trying to break them — an expansion of the adversarial testing in Lesson 3.
- Part 8 — Generative AI Testing: test approaches specific to generative models, where output is open-ended text or images rather than a classification — the area the Test with AI / CT-GenAI modules already touch.
The practical takeaway: adopt 42119-2 now as best practice, but write your test approach so it can absorb the new parts as they land. And keep framing organisational claims as “aligned to ISO/IEC TS 42119-2:2025” — the series is still growing, and certification against a TS is not a thing.
8 NZ Regulatory Alignment
42119 does not exist in a vacuum in Aotearoa. It operationalises commitments that several NZ frameworks already make — which is exactly why adopting it is a practical move, not just an international nicety.
NZ Government Algorithm Charter
Signatory agencies commit to transparency, regular review, and managing the risks of algorithmic decision-making. 42119’s risk register, fairness testing, and audit-ready artefacts are concrete ways to evidence those commitments.
Public Service AI work programme
Central guidance for the safe, responsible adoption of AI across NZ government. A documented 42119 test approach is a direct way for an agency to show its AI use is assured and accountable, not ad hoc.
GCDO / GCDA AI assurance direction
The Government Chief Digital Officer’s assurance expectations for digital and AI systems point in the same direction as 42119: identify risks, test against them, and keep evidence. Aligning to 42119 helps satisfy that assurance lens.
NZ government AI guidance is evolving quickly — confirm the current name and owner of each programme with your agency’s digital or assurance team before citing it in a formal document.
9 Common Mistakes
🚫 Treating 42119 as a phase before release instead of a lifecycle approach
Why it happens: Traditional testing is a stage near the end, so AI testing gets slotted into the same place.
The fix: The highest-value 42119 work — the risk register, the fairness definition, the drift label-collection — happens in sprint 1, and continuous validation runs for the life of the system. Bolt it on at the end and you have missed the cheap, decisive moves.
🚫 Leaving drift’s label-collection until after go-live
Why it happens: Drift testing runs in production, so teams assume it can be designed in production.
The fix: Drift testing runs late but must be planned first. Without a mechanism to learn real outcomes, you cannot measure drift — the model degrades invisibly. Design the label-collection in sprint 1.
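One way to picture a label-collection mechanism: log every prediction against an ID at decision time, record the confirmed outcome against the same ID when it becomes known, and join the two. The sketch below assumes that shape. The record layout, names, and sample data are hypothetical.

```python
def join_labels(predictions, outcomes):
    """Pair each logged prediction with its confirmed outcome, where one exists.

    predictions: {case_id: predicted_label}
    outcomes:    {case_id: confirmed_label}, filled in later
    Returns (list of (predicted, confirmed) pairs, share of predictions labelled).
    """
    labelled = [(predictions[cid], outcomes[cid])
                for cid in predictions if cid in outcomes]
    coverage = len(labelled) / len(predictions) if predictions else 0.0
    return labelled, coverage


def recall_on(labelled, positive):
    """Recall for one class over (predicted, confirmed) pairs."""
    confirmed_positive = [p for p, c in labelled if c == positive]
    if not confirmed_positive:
        raise ValueError("no confirmed positive cases to measure recall on")
    caught = sum(1 for p in confirmed_positive if p == positive)
    return caught / len(confirmed_positive)


# Hypothetical prediction log and the outcomes confirmed so far.
predictions = {"t1": "fraud", "t2": "ok", "t3": "ok", "t4": "fraud"}
outcomes = {"t1": "fraud", "t3": "fraud"}

labelled, coverage = join_labels(predictions, outcomes)
print(labelled)                      # [('fraud', 'fraud'), ('ok', 'fraud')]
print(coverage)                      # 0.5
print(recall_on(labelled, "fraud"))  # 0.5
```

If the prediction log or the outcome feed does not exist, there is nothing to join and no recall to measure, which is why the mechanism has to be designed before go-live. Label coverage is worth tracking too: a recall figure based on a small or skewed share of cases is weak evidence.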
🚫 Selling 42119 to the team with clause numbers instead of risk
Why it happens: The standard is the tester’s frame, so it is the first thing out of their mouth.
The fix: Developers and PMs hear “the standard requires” as overhead. Lead with the concrete failure and the protection the artefacts give the whole team — the calm afternoon with the regulator instead of the scramble.
🚫 Claiming “certified against 42119”
Why it happens: “Certified” sounds stronger in a tender or board paper.
The fix: 42119-2 is a Technical Specification, not a certifiable International Standard, and the series is still growing. Claim “aligned to ISO/IEC TS 42119-2:2025.” Over-claiming is itself a compliance risk.
10 Now You Try
Three graded exercises bringing the whole module together. Write your answer, then compare it to the model answer.
Below is a project plan for a fictional Auckland Council permit assessment AI (it recommends approve / decline / refer for building and resource consent applications). Identify 5 points where 42119-aligned testing activity should be added but is currently missing, and say which phase each belongs in.
- Sprint 1–2: gather requirements, design the model, source 6 years of historical consent decisions for training.
- Sprint 3–6: build and train the model.
- Sprint 7: functional testing of the web form and the recommendation API; load testing.
- Sprint 8: UAT with the consents team; fix defects.
- Go-live: release to all council consent officers.
- Post-go-live: monitor system uptime and API error rates.
List 5 missing 42119 testing activities and the phase each belongs in:
Model answer
The plan tests the plumbing (form, API, load, UAT, uptime) but includes no AI-specific testing at all. Five gaps:
- Gap 1: AI risk register. Phase: Sprint 1–2 (design). There is no risk register, so nothing drives or justifies AI test scope. This is the foundation everything else traces to.
- Gap 2: Data quality testing (representativeness, provenance, label correctness). Phase: Sprint 3–6 (development), alongside training. The 6 years of historical decisions must be tested for representativeness (all consent types, regions, applicant types), provenance (lawful basis), and label correctness (were past decisions correct, or do they encode past officer bias?). Training on untested history risks reproducing it.
- Gap 3: Model performance testing by segment. Phase: Sprint 7–8 (UAT). Functional testing checks that the API works; nothing checks whether the recommendations are accurate, with precision/recall against adjudicated outcomes, broken down by consent type and area.
- Gap 4: Fairness testing. Phase: before go-live. There is no demographic parity or counterfactual testing: does the model refer or decline at different rates across areas or applicant groups for equivalent applications? Consent decisions about people and property are high fairness-risk.
- Gap 5: Drift testing / continuous validation. Phase: post-go-live. "Monitor uptime and API errors" is not model monitoring. There is no drift test on recommendation quality as building patterns, costs, and rules change, and no label-collection planned to enable it (it should have been designed in Sprint 1).
Bonus: explainability testing. Consent decisions must be explainable to applicants. Audit-ready artefacts and traceability are also absent throughout.
The existing test strategy for the same Auckland Council permit AI covers only functional and regression testing. Add a 42119 AI testing annex that specifies: which AI test types are required, at which lifecycle phase, and with what evidence requirements.
Write the 42119 AI testing annex:
Model answer
42119 AI TESTING ANNEX: Auckland Council Permit Assessment AI
| AI test type | Lifecycle phase | What it verifies | Evidence required |
|---|---|---|---|
| 1. Data quality (representativeness, provenance, label correctness) | Development | Training data covers all consent types/areas/applicant types; lawful basis under Privacy Act; past-decision labels are correct, not encoded bias | Distribution tables vs live population; lineage + lawful-basis records; label re-check + inter-labeller agreement |
| 2. Model performance (segmented) | Development + UAT | Recommendation accuracy vs adjudicated outcomes, precision/recall by consent type and area, against agreed thresholds | Per-segment precision/recall tables; benchmark set ID; model version |
| 3. Adversarial | UAT | Model behaves safely on edge/boundary/malformed applications and cannot be gamed | Input transformations; per-case results vs robustness threshold |
| 4. Fairness (demographic parity + counterfactual) | Before go-live | Refer/decline rates do not differ across areas/applicant groups for equivalent applications | Per-group outcome tables; counterfactual pairs + results; chosen fairness metric + tolerance |
| 5. Drift / continuous validation | Post-deployment, scheduled | Recommendation quality does not degrade as building patterns, costs, and rules change | Dated metric time series; intervention threshold; alert + retraining records; fresh labelled-data snapshots |
| 6. Explainability | UAT + sampled in production | Each decision carries a defensible reason an applicant and auditor can rely on | Sampled decisions + explanations + expert confirmation |
Traceability approach: Every test case above traces to a numbered risk in the AI risk register (built in Sprint 1) AND to any related requirement. The test summary report at go-live shows coverage by risk and states open risks. Drift label-collection is designed in Sprint 1 so post-deployment validation is possible.
Strong annex: names specific AI test types (not "test the AI"), places each in the right phase, gives concrete evidence per type, and ties everything to risk-based traceability.
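The fairness row of the annex names two kinds of evidence: per-group outcome tables and counterfactual pairs. A minimal Python sketch of both follows. The group names, counts, application fields, and the deliberately biased `toy_model` are hypothetical stand-ins, not the council system.

```python
def outcome_rates(records):
    """records: iterable of (group, is_adverse), where is_adverse is True
    for a refer/decline recommendation. Returns the adverse rate per group."""
    totals, adverse = {}, {}
    for group, is_adverse in records:
        totals[group] = totals.get(group, 0) + 1
        adverse[group] = adverse.get(group, 0) + int(is_adverse)
    return {group: adverse[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Largest difference in adverse rate between any two groups."""
    return max(rates.values()) - min(rates.values())


def counterfactual_failures(model, applications, attribute, values):
    """Re-run each application with only `attribute` varied across `values`.
    Returns the IDs of applications whose recommendation changes."""
    failures = []
    for app in applications:
        outcomes = {model({**app, attribute: value}) for value in values}
        if len(outcomes) > 1:
            failures.append(app["id"])
    return failures


def toy_model(app):
    """Deliberately biased stand-in for the model under test."""
    if app["area"] == "south" and app["floor_area_m2"] > 200:
        return "refer"
    return "approve"


# Hypothetical recommendations: 100 per area.
records = ([("north", False)] * 80 + [("north", True)] * 20
           + [("south", False)] * 65 + [("south", True)] * 35)
rates = outcome_rates(records)
print(rates)                         # {'north': 0.2, 'south': 0.35}
print(round(parity_gap(rates), 2))   # 0.15

apps = [
    {"id": "A1", "area": "north", "floor_area_m2": 250},
    {"id": "A2", "area": "north", "floor_area_m2": 120},
]
print(counterfactual_failures(toy_model, apps, "area", ["north", "south"]))
# ['A1']
```

The parity gap only means something against the tolerance the annex asks for, and the rate comparison assumes the applications in each group are broadly equivalent. The counterfactual pairs test that directly, by holding everything else fixed.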
Write a 42119 test approach for a new fictional NZ government or financial-services AI system of your choice. Include: a system description, an AI risk register of 5 risks, the test types selected per risk, the lifecycle phase for each, and one sample test case per AI test type used. This pulls together all six lessons.
Model answer (one worked example)
SYSTEM: Te Whatu Ora ED triage support AI
Description: An AI that suggests a triage priority (1–5) for patients arriving at a hospital emergency department, from presenting symptoms, vitals, age, and history. It supports, and does not replace, the triage nurse.
AI RISK REGISTER:
| Risk | Category | Likelihood | Impact | Test type | Lifecycle phase |
|---|---|---|---|---|---|
| 1. Training data under-represents Māori and Pasifika presentations | Data | Med | High | Data representativeness testing | Development |
| 2. Model under-triages a genuinely urgent patient (misses a priority-1) | Model performance | Med | Critical | Model performance testing (recall on priority-1) | Dev + UAT |
| 3. Triage priority differs by ethnicity for equivalent presentations | Fairness | Med | Critical | Demographic parity + counterfactual testing | Before go-live |
| 4. Accuracy degrades as presentation patterns change (e.g. a new seasonal illness) | Drift | High | High | Continuous validation | Post-deployment |
| 5. Nurse cannot see why a priority was suggested | Explainability | Med | High | Explainability testing | UAT + production sample |
SAMPLE TEST CASES (one per type used):
| ID | Category | Test type | What it checks | Pass criterion | Evidence | Risk |
|---|---|---|---|---|---|---|
| TC-DATA-01 | Data | Representativeness | Verify training data covers ethnicity groups in proportion to the ED's live population | Each group within ±3pp of live ED presentations (rolling 12mo) | Distribution table vs live snapshot; query; date | Risk 1 |
| TC-PERF-01 | Model | Performance | Measure recall on adjudicated priority-1 cases | Recall on true priority-1 ≥ 0.98 (a missed priority-1 is critical) | Confusion matrix; benchmark set ID; model version | Risk 2 |
| TC-FAIR-01 | Fairness | Demographic parity + counterfactual | Compare triage-priority distribution across ethnicity for matched-equivalent presentations; plus counterfactual pairs varying only ethnicity proxy | Priority distribution within tolerance across groups; counterfactual pairs return identical priority | Per-group tables; counterfactual pairs + results | Risk 3 |
| TC-DRIFT-01 | Drift | Continuous validation | Monthly recall on priority-1 for recent patients with confirmed outcomes | Alert if recall drops below 0.95 or 3pp below baseline | Dated metric series; threshold; alert log; labelled snapshot | Risk 4 |
| TC-EXPL-01 | Explainability | Explainability | On a sample of suggestions, confirm the stated reason matches the driving factors and a nurse accepts it | Reason present, accurate, and nurse-confirmed in ≥95% of sample | Sampled cases + explanations + nurse sign-off | Risk 5 |
Strong answers: a clear system description; 5 risks each with a category, likelihood, impact, test type and phase; and one sample test case per AI test type with a measurable criterion, concrete evidence, and traceability to a risk. The hallmark of mastery is that the priority metric is chosen for the harm (recall ≥0.98 on priority-1 because under-triage can kill), not a generic accuracy target.
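A measurable criterion such as the one in TC-PERF-01 can be checked directly from adjudicated cases. The sketch below computes precision and recall for priority-1 and compares recall with the 0.98 threshold from the test case. The benchmark counts are invented for illustration, and the function name is an assumption.

```python
def precision_recall(pairs, positive):
    """pairs: iterable of (actual, predicted). Returns (precision, recall)
    for the class `positive`, treating every other class as negative."""
    pairs = list(pairs)
    tp = sum(1 for actual, pred in pairs if actual == positive and pred == positive)
    fp = sum(1 for actual, pred in pairs if actual != positive and pred == positive)
    fn = sum(1 for actual, pred in pairs if actual == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


RECALL_THRESHOLD = 0.98  # pass criterion taken from TC-PERF-01

# Hypothetical adjudicated benchmark: (actual priority, suggested priority).
benchmark = ([(1, 1)] * 197 + [(1, 2)] * 3      # 200 true priority-1, 3 missed
             + [(2, 1)] * 10 + [(2, 2)] * 90)   # 100 true priority-2

precision, recall = precision_recall(benchmark, positive=1)
passed = recall >= RECALL_THRESHOLD

print(round(recall, 3))      # 0.985
print(round(precision, 3))   # 0.952
print(passed)                # True
```

Testing on recall rather than overall accuracy follows the point made above: the three missed priority-1 patients are the harm that matters, and an accuracy figure would dilute them among the easier cases.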
11 Self-Check
Q1: Why is 42119 described as a lifecycle approach rather than a phase before release?
Because its highest-value activities happen at the start (the risk register, the fairness definition, the drift label-collection in sprint 1) and its largest block of testing (continuous validation) runs after go-live for the life of the system. Slotting AI testing into a pre-release phase misses both ends.
Q2: Map each AI test type to the lifecycle phase it belongs in.
Data quality — during development; model performance & adversarial — development and UAT; fairness — before go-live; drift / continuous validation — post-deployment, but planned (its label-collection) in sprint 1.
Q3: What is the one timing mistake that quietly wrecks AI projects?
Leaving drift’s label-collection until after go-live. Without a mechanism to learn real outcomes you cannot measure drift, so the model degrades invisibly — the Meridian Digital failure. Drift testing runs late but must be planned first.
Q4: How should you raise 42119 with a team that has never heard of it?
Lead with risk and failures, not clause numbers. Frame the risk register and audit-ready artefacts as protecting the whole team (the calm afternoon with a regulator), start with cheap high-value tests (representativeness, a drift plan), and position bias testing as giving the developers’ mitigation work the evidence it deserves.
Q5: Why claim “aligned to” rather than “certified against” 42119, and what is coming next?
Because 42119-2 is a Technical Specification, not a certifiable International Standard, and the series is still growing — Part 3 (Verification and Validation, ~2026), Part 7 (Red Teaming), Part 8 (Generative AI Testing). Over-claiming certification is itself a compliance risk.
12 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“Walk me through how you’d apply AI testing across a project, start to finish.”
In sprint 1, before any model exists, I build the AI risk register and decide what we’ll test against — including the fairness definition and the label-collection drift will need later. During development I run data quality testing alongside the model build — representativeness, provenance, label correctness — and start model performance testing as it matures. At UAT I do full model testing, adversarial and explainability checks, and the usual system and integration testing. Before go-live I run fairness testing and write a summary report that states open risks honestly. Then continuous validation runs in production for the life of the system — scheduled drift tests with alerts. The theme is that testing is continuous and front-loaded, not a phase at the end, and every test produces an audit-ready artefact traced to a risk.
“The developers and PM have never heard of 42119 and think it’s overhead. How do you bring them along?”
I don’t lead with the standard — I lead with the risk. “If the model performs worse for one region and we can’t show we checked, that’s a complaint and a headline” lands better than quoting a clause. I frame the risk register and the artefacts as protecting the whole team when a regulator calls — insurance, not bureaucracy. I’d start with the cheap, high-value tests like a representativeness check and a drift plan to build trust, and I’d make the developers allies by being clear that de-biasing is their craft and testing just gives their good work the evidence it deserves.
“Where is the AI testing standards space heading, and how do you keep your approach current?”
42119-2 is the overview and it’s a Technical Specification, so it’s deliberately early and evolving. More parts are coming — Part 3 on verification and validation around 2026, Part 7 on red teaming, Part 8 on generative AI testing. I adopt 42119-2 now as best practice but write the test approach so it can absorb those parts as they land, and I keep claims framed as “aligned to” rather than “certified against”. In an NZ context I also keep an eye on the Government Algorithm Charter and the public-service AI guidance, because 42119 is a concrete way to evidence the commitments those frameworks make.