Applying 42119 in a Real NZ Project

Five lessons of theory come together here. This is what it looks like when a team applies 42119 from the first sprint to the live system — and why doing it that way is the difference between catching a problem in week one and discovering it in a regulator’s letter.

Test with AI ISO/IEC TS 42119-2:2025 — Lesson 6 of 6 ~35 min read · ~80 min with exercises

1 The Hook

For once, a story where it goes right. Tahi Bank, a fictional NZ bank, set out to deploy an AI fraud-detection system on its retail transactions. The QA lead had read 42119 and made a decision at the very start: this would not be a project where testing showed up at the end. It would be built in from sprint one.

So in sprint 1, before a model existed, the team built the AI risk register — naming the data, model, fairness, explainability, and drift risks the system carried, each with a likelihood, an impact, and a test approach. During development, data quality test cases were written alongside the model: representativeness across customer segments and regions, provenance and lawful basis for the transaction data, label correctness on the confirmed-fraud examples. At UAT, model performance was tested against agreed precision and recall thresholds, with adversarial tests probing whether fraudulent patterns could slip past. Before go-live, fairness testing confirmed the model did not flag legitimate transactions at different rates across customer groups. And throughout, every test produced an audit-ready artefact, traced to a numbered risk.

Three months after go-live, the continuous validation suite did its job. A scheduled drift test showed fraud-detection recall slipping — a new scam pattern was emerging that the model had not been trained on. The alert fired, the team retrained, and the gap closed before any customer lost money. No six-month silent decline, because someone had built the drift test in at design time.

Then the FMA, reviewing AI use across the sector, asked Tahi Bank to demonstrate how the system had been tested. The QA lead did not panic. She produced the risk register, the test plan tracing every risk to its tests, the per-group fairness tables, the drift time series with the caught-and-fixed event, and the dated sign-offs. The whole conversation took an afternoon. That is what good looks like — and every piece of it came from a lesson in this module.

2 The Rule

ISO/IEC 42119 is not a compliance checkbox done at the end of a project. It is a test approach applied throughout the AI system lifecycle — from the risk register in sprint one, through data and model testing during development, fairness testing before go-live, and drift detection running for as long as the system is live. Applied early, it is cheap and powerful. Bolted on late, it is expensive and usually too late.

3 The Analogy

A Great Walk, not a single weather check at the car park.

Nobody walks the Routeburn by checking the forecast once in the car park and then ignoring the mountains for three days. You plan against the known risks before you leave, you check conditions and your gear at every hut, and DOC monitors the track the whole season — closing it when a slip or a storm makes it unsafe. Safety is not a gate you pass at the start; it is attention applied the whole way through.

Applying 42119 is the same. The risk register is your trip plan. Data and model testing are the gear checks at each hut. Fairness testing before go-live is the last check before the exposed alpine section. And continuous validation is DOC watching the track after you have set out — because the conditions that were fine at the start do not stay fine on their own. A team that tests AI only at go-live is checking the forecast in the car park and hoping.

4 End-to-End Walkthrough

Here is how 42119 maps onto a typical NZ AI project, sprint by sprint, using the Tahi Bank fraud system as the worked example.

Phase	42119 activity	Artefact produced
Sprint 1 — Design	Build the AI risk register (Lesson 1). Define quality characteristics and acceptance thresholds. Decide the fairness definition the system owes. Plan the label-collection needed for drift testing later.	AI risk register; draft test plan with coverage rationale.
Development	Data quality testing alongside model build (Lesson 2): representativeness, provenance, label correctness. Begin model performance testing as the model matures (Lesson 3).	Data test results; provenance records; early performance metrics — all risk-traced.
UAT	Full model testing (Lesson 3): performance against thresholds, adversarial tests, explainability checks on sample decisions. System and integration testing of the model inside the application.	Performance and adversarial results by group; explainability sample; integration test results.
Before go-live	Fairness testing (Lesson 4): demographic parity and counterfactual pairs across the groups in the register. Final test summary report with open risks (Lesson 5).	Per-group fairness tables; counterfactual results; signed summary report with go/no-go recommendation.
Post-deployment	Continuous validation (Lesson 3): scheduled drift tests on fresh labelled data, with thresholds and alerts. Periodic fairness re-checks for feedback-loop bias.	Dated drift time series; alert and retraining records — the artefacts that answered the FMA.

The shape to notice: testing is continuous and front-loaded, not a phase before release. The most important decisions — what to test, the fairness definition, the drift label-collection — are made in sprint 1, and the largest single block of testing (continuous validation) runs after go-live for the life of the system.
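The traceability that runs through the walkthrough (every test traced to a numbered risk) can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not something 42119 prescribes: the `Risk` and `PlannedTest` classes, their field names, and the example entries are hypothetical and exist only to illustrate the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    risk_id: int
    category: str      # e.g. "Data", "Fairness", "Drift"
    likelihood: str
    impact: str


@dataclass(frozen=True)
class PlannedTest:
    case_id: str
    risk_id: int       # the numbered risk this test traces to
    phase: str


def traceability_gaps(risks, planned_tests):
    """Return (risk ids with no test, test ids that point at an unknown risk)."""
    known = {r.risk_id for r in risks}
    covered = {t.risk_id for t in planned_tests}
    untested = sorted(known - covered)
    orphaned = sorted(t.case_id for t in planned_tests if t.risk_id not in known)
    return untested, orphaned


# Hypothetical register entries, for illustration only.
risks = [
    Risk(1, "Data", "Med", "High"),
    Risk(2, "Fairness", "Med", "Critical"),
    Risk(3, "Drift", "High", "High"),
]
planned_tests = [
    PlannedTest("TC-DATA-01", 1, "Development"),
    PlannedTest("TC-FAIR-01", 2, "Before go-live"),
    PlannedTest("TC-MISC-01", 9, "UAT"),  # traces to a risk not in the register
]

untested, orphaned = traceability_gaps(risks, planned_tests)
print("Risks with no test:", untested)
print("Tests with no valid risk:", orphaned)
```

Run in sprint 1 against a draft plan, a check like this would show the drift risk has no test planned yet, which is the gap the lesson warns about.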

5 When to Do What

If you remember one timing map from this module, make it this one:

Data quality testing — during development. You test the data as it is assembled and before the model trains on it. Finding a representativeness gap after the model ships means retraining; finding it during development means fixing the dataset.
Model testing — during development and UAT. Performance and adversarial testing start as the model matures and intensify at UAT, when the model is stable enough to test against final thresholds.
Fairness testing — before go-live. Once the model is final, before it touches real customers. Fairness is the last gate, because it is the failure that most directly harms people and most attracts regulators.
Drift testing — post-deployment, continuously. It only makes sense once the model is live and the world can move under it. But the mechanism — the label collection — must be designed in sprint 1.
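The data quality item above mentions finding a representativeness gap during development. One way that comparison might look is sketched below. It is a sketch under assumptions: the function name, the region counts, and the 3 percentage-point tolerance are illustrative choices, not values set by the standard.

```python
def representativeness_gaps(training_counts, live_counts, tolerance_pp=3.0):
    """Compare each group's share of the training data with its share of the
    live population. Returns {group: gap in percentage points} for every
    group whose gap exceeds the tolerance."""
    train_total = sum(training_counts.values())
    live_total = sum(live_counts.values())
    gaps = {}
    for group in sorted(set(training_counts) | set(live_counts)):
        train_share = 100.0 * training_counts.get(group, 0) / train_total
        live_share = 100.0 * live_counts.get(group, 0) / live_total
        diff = train_share - live_share
        if abs(diff) > tolerance_pp:
            gaps[group] = round(diff, 1)
    return gaps


# Made-up counts by region, for illustration only.
training = {"Auckland": 700, "Wellington": 200, "Canterbury": 80, "Otago": 20}
live = {"Auckland": 500, "Wellington": 200, "Canterbury": 200, "Otago": 100}

print(representativeness_gaps(training, live))
```

A positive gap means the group is over-represented in training relative to the live population, and a negative gap means it is under-represented. Either result is cheaper to fix in the dataset than in a shipped model.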

Pro tip: The one timing mistake that quietly wrecks projects is leaving drift’s label-collection until after go-live. By then there is no clean way to learn the real outcomes, so you cannot measure drift, so the model degrades invisibly — the Meridian Digital story from Lesson 3. Drift testing runs late, but it must be planned first.
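To make the dependency on label-collection concrete, here is a minimal sketch of a scheduled drift check. Everything in it is an assumption for illustration: the function names, the window labels, and the baseline, floor, and drop thresholds are invented. The input is a list of (model flagged it, outcome later confirmed) pairs, which is exactly the data that cannot exist unless label-collection was designed up front.

```python
def recall(outcomes):
    """outcomes: list of (predicted_positive, confirmed_positive) booleans.
    Recall = caught positives / all confirmed positives.
    Returns None when there are no confirmed positives to measure against."""
    caught = sum(1 for predicted, actual in outcomes if actual and predicted)
    total = sum(1 for _, actual in outcomes if actual)
    return caught / total if total else None


def drift_alerts(windows, baseline, floor, max_drop):
    """Return the labels of windows whose recall is below `floor`, or more
    than `max_drop` below `baseline`. Windows with no confirmed outcomes are
    flagged too, because drift cannot be measured without labels."""
    alerts = []
    for label, outcomes in windows:
        r = recall(outcomes)
        if r is None or r < floor or (baseline - r) > max_drop:
            alerts.append(label)
    return alerts


# Made-up monthly windows of confirmed-fraud cases: (caught, missed).
windows = [
    ("month-1", [(True, True)] * 19 + [(False, True)] * 1),
    ("month-2", [(True, True)] * 18 + [(False, True)] * 2),
    ("month-3", [(True, True)] * 15 + [(False, True)] * 5),
    ("month-4", []),  # no labelled outcomes collected this month
]

print(drift_alerts(windows, baseline=0.95, floor=0.85, max_drop=0.08))
```

The empty fourth window is the case the tip describes. Without confirmed outcomes the check has nothing to compute, so the sketch treats missing labels as an alert rather than as a pass.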

6 Raising 42119 with a Team That Has Not Heard of It

You will often be the only person in the room who knows 42119 exists. Leading with “we must comply with ISO/IEC TS 42119-2:2025” alienates developers and product managers who hear it as process overhead. Lead with the risk instead.

Talk in failures, not clause numbers. “If the model performs worse for one region and we can’t show we checked, that’s a complaint and a headline” lands better than “the standard requires demographic parity testing.”
Frame it as protecting the team. The risk register and audit-ready artefacts are what let everyone answer a regulator calmly — the Tahi Bank afternoon versus the Taurangi Wealth scramble. It is insurance, not bureaucracy.
Start with the cheap, high-value tests. A representativeness comparison and a drift plan cost little and cover the most common, most damaging failures. Win trust with those before proposing the fuller programme.
Make the developers allies. Bias mitigation is their craft; bias testing is yours. Position 42119 as giving their good work the evidence it deserves, not as checking up on them.

7 The 42119 Roadmap

42119-2 is the overview part, and it is a Technical Specification — deliberately an early, evolving document. The 42119 family is being built out in parts by ISO/IEC JTC 1/SC 42, with further parts covering areas such as verification and validation, adversarial and red-team testing, and the testing of generative AI. A senior tester should know the series is still growing, and check ISO’s SC 42 programme for the current published parts rather than relying on a fixed list.

The practical takeaway: adopt 42119-2 now as best practice, but write your test approach so it can absorb the new parts as they land. And keep framing organisational claims as “aligned to ISO/IEC TS 42119-2:2025”, because a Technical Specification is not something an organisation can be certified against.

8 NZ Regulatory Alignment

42119 does not exist in a vacuum in Aotearoa. It operationalises commitments that several NZ frameworks already make — which is exactly why adopting it is a practical move, not just an international nicety.

NZ Government Algorithm Charter

Signatory agencies commit to transparency, regular review, and managing the risks of algorithmic decision-making. 42119’s risk register, fairness testing, and audit-ready artefacts are concrete ways to evidence those commitments.

Public Service AI work programme

Central guidance for the safe, responsible adoption of AI across NZ government. A documented 42119 test approach is a direct way for an agency to show its AI use is assured and accountable, not ad hoc.

Government Chief Digital Officer (GCDO) assurance direction

The Government Chief Digital Officer’s assurance expectations for digital and AI systems point in the same direction as 42119: identify risks, test against them, and keep evidence. Aligning to 42119 helps satisfy that assurance lens.

NZ Compliance Layer — Beyond 42119

A real NZ project runs its 42119-aligned testing alongside Privacy Act 2020 obligations. The OPC has published guidance on generative AI that adds test obligations not found in the standard itself:

Transparency notices: where the system makes decisions about individuals using their personal information, a transparency notice is required before collection.
Human review of automated decisions: plan for individuals to be able to ask that a decision made wholly or partly by an automated system be reviewed by a human. Confirm the source of this expectation against the current OPC guidance before citing a principle number in a formal document. Your test plan must document which decisions qualify and how the review process works.
Data residency: sending training data that contains NZ personal information offshore for model training can trigger IPP 12 (disclosure of personal information outside New Zealand). This must appear in your risk register.

Add an "NZ Compliance" column to your risk register alongside the 42119 test type column. See privacy.org.nz for the current OPC AI guidance document.

NZ government AI guidance is evolving quickly — confirm the current name and owner of each programme with your agency’s digital or assurance team before citing it in a formal document.

Senior engineer insight

The NZ Government Algorithm Charter sounds abstract until a signatory agency has to evidence it. On a DIA-adjacent project where we were building a document classification AI, we realised in sprint 3 that our “aligned to the Charter” claim had no test artefacts behind it: nothing to show how we had checked transparency, nothing on representativeness, no dated sign-offs. We spent two weeks retrospectively assembling evidence that would have taken two hours if we had run the risk register in sprint 1. That scramble rewired how I think about this: the Charter commitments and the 42119 artefacts are the same work; you just have to do it at the right end of the project.

The most common mistake: treating the AI risk register as a governance document written after testing is done, rather than as the instrument that drives and justifies test scope from day one.

From the field

A central government agency procuring an AI-assisted eligibility tool assumed the vendor’s internal testing covered 42119 obligations — the tender had asked for “evidence of AI testing” and the vendor had provided a 40-page functional test report. What the agency discovered during a GCDO assurance review was that the report contained no fairness testing, no data representativeness evidence for Māori and Pasifika applicants, and no drift plan at all. The vendor had tested the plumbing, not the model. The agency had to go back to the vendor mid-contract, extend the timeline by three months, and fund an additional testing round — all costs that would have been vendor obligations if the procurement had specified 42119 test types by name in the acceptance criteria. The lesson that generalises: “AI testing” in a contract means nothing unless you enumerate the specific test types (data quality, model performance, fairness, drift) and the evidence each must produce.
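The closing lesson (enumerate the test types and the evidence each must produce) can be written down as a checkable list. The sketch below is hypothetical throughout: the dictionary keys, the evidence wording, and the shape of the vendor submission are assumptions made for illustration, not contract language.

```python
# Hypothetical acceptance criteria: test type -> evidence the contract asks for.
REQUIRED_EVIDENCE = {
    "data quality": "distribution tables, provenance records, label re-check",
    "model performance": "per-segment precision/recall tables",
    "fairness": "per-group outcome tables and counterfactual results",
    "drift": "drift plan with a label-collection mechanism",
}


def missing_test_types(submitted):
    """submitted: {test type: list of evidence documents supplied}.
    Returns the required test types for which no evidence was supplied."""
    return sorted(t for t in REQUIRED_EVIDENCE if not submitted.get(t))


# A submission like the one in the story: functional testing only.
vendor_submission = {"functional": ["40-page functional test report"]}

for test_type in missing_test_types(vendor_submission):
    print(f"Missing: {test_type} (expected {REQUIRED_EVIDENCE[test_type]})")
```

Against a functional-only report, all four AI test types come back as missing, which is the gap the assurance review found after the contract was signed.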

9 Common Mistakes

🚫 Treating 42119 as a phase before release instead of a lifecycle approach

Why it happens: Traditional testing is a stage near the end, so AI testing gets slotted into the same place.
The fix: The highest-value 42119 work — the risk register, the fairness definition, the drift label-collection — happens in sprint 1, and continuous validation runs for the life of the system. Bolt it on at the end and you have missed the cheap, decisive moves.

🚫 Leaving drift’s label-collection until after go-live

Why it happens: Drift testing runs in production, so teams assume it can be designed in production.
The fix: Drift testing runs late but must be planned first. Without a mechanism to learn real outcomes, you cannot measure drift — the model degrades invisibly. Design the label-collection in sprint 1.

🚫 Selling 42119 to the team with clause numbers instead of risk

Why it happens: The standard is the tester’s frame, so it is the first thing out of their mouth.
The fix: Developers and PMs hear “the standard requires” as overhead. Lead with the concrete failure and the protection the artefacts give the whole team — the calm afternoon with the regulator instead of the scramble.

🚫 Claiming “certified against 42119”

Why it happens: “Certified” sounds stronger in a tender or board paper.
The fix: 42119-2 is a Technical Specification, not a certifiable International Standard, and the series is still growing. Claim “aligned to ISO/IEC TS 42119-2:2025.” Over-claiming is itself a compliance risk.

10 Now You Try

Three graded exercises bringing the whole module together. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Find the Gaps in the Plan

Below is a project plan for a fictional Auckland Council permit assessment AI (it recommends approve / decline / refer for building and resource consent applications). Identify 5 points where 42119-aligned testing activity should be added but is currently missing, and say which phase each belongs in.

Project plan (as written):
• Sprint 1–2: gather requirements, design the model, source 6 years of historical consent decisions for training.
• Sprint 3–6: build and train the model.
• Sprint 7: functional testing of the web form and the recommendation API; load testing.
• Sprint 8: UAT with the consents team; fix defects.
• Go-live: release to all council consent officers.
• Post-go-live: monitor system uptime and API error rates.

List 5 missing 42119 testing activities and the phase each belongs in:

Model answer

The plan tests the plumbing (form, API, load, UAT, uptime) but includes no AI-specific testing at all. Five gaps:

Gap 1: AI risk register — Phase: Sprint 1–2 (design). There is no risk register, so nothing drives or justifies AI test scope. This is the foundation everything else traces to.

Gap 2: Data quality testing (representativeness, provenance, label correctness) — Phase: Sprint 3–6 (development), alongside training. The 6 years of historical decisions must be tested for representativeness (all consent types, regions, applicant types), provenance (lawful basis), and label correctness (were past decisions correct, or do they encode past officer bias?). Training on untested history risks reproducing it.

Gap 3: Model performance testing by segment — Phase: Sprint 7–8 (UAT). Functional testing checks the API works; nothing checks whether the recommendations are accurate, with precision/recall against adjudicated outcomes, broken down by consent type and area.

Gap 4: Fairness testing — Phase: before go-live. No demographic parity or counterfactual testing — does the model refer or decline at different rates across areas or applicant groups for equivalent applications? Consent decisions about people and property are high fairness-risk.

Gap 5: Drift testing / continuous validation — Phase: post-go-live. "Monitor uptime and API errors" is not model monitoring. There is no drift test on recommendation quality as building patterns, costs, and rules change — and no label-collection planned to enable it (should have been designed in Sprint 1).

Bonus: explainability testing — consent decisions must be explainable to applicants, and audit-ready artefacts/traceability are absent throughout.
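Gap 3 asks for precision and recall broken down by segment. A minimal sketch of that breakdown follows. The function name, the segment labels, and the counts are invented for illustration, and "positive" here stands for whatever outcome the team chooses to measure (for example, a decline recommendation that matches the adjudicated decision).

```python
from collections import defaultdict


def precision_recall_by_segment(records):
    """records: iterable of (segment, predicted_positive, actual_positive).
    Returns {segment: (precision, recall)}; a value is None where undefined."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for segment, predicted, actual in records:
        c = counts[segment]  # registers the segment even if it has only negatives
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    result = {}
    for segment, c in counts.items():
        flagged = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        precision = c["tp"] / flagged if flagged else None
        recall = c["tp"] / positives if positives else None
        result[segment] = (precision, recall)
    return result


# Made-up adjudicated records, for illustration only.
records = (
    [("residential", True, True)] * 8
    + [("residential", True, False)] * 2
    + [("residential", False, True)] * 2
    + [("residential", False, False)] * 20
    + [("commercial", True, True)] * 3
    + [("commercial", True, False)] * 1
    + [("commercial", False, True)] * 3
    + [("commercial", False, False)] * 10
)

for segment, (precision, recall) in precision_recall_by_segment(records).items():
    print(segment, "precision:", precision, "recall:", recall)
```

In this made-up data the commercial segment has much lower recall than the residential one, a difference that a single overall figure would hide.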

🔧 Exercise 2 of 3 — Add a 42119 Annex

The test strategy below covers only functional and regression testing for the same Auckland Council permit AI. Add a 42119 AI testing annex that specifies: which AI test types are required, at which lifecycle phase, and with what evidence requirements.

Existing test strategy (traditional only): “We will perform functional testing of all user-facing features, integration testing of all APIs, and regression testing on each release. Performance testing will confirm response times under load. Defects are tracked in Jira and triaged daily.”

Write the 42119 AI testing annex:

Model answer

42119 AI TESTING ANNEX — Auckland Council Permit Assessment AI

AI test type | Lifecycle phase | What it verifies | Evidence required

1. Data quality (representativeness, provenance, label correctness) | Development | Training data covers all consent types/areas/applicant types; lawful basis under Privacy Act; past-decision labels are correct, not encoded bias | Distribution tables vs live population; lineage + lawful-basis records; label re-check + inter-labeller agreement

2. Model performance (segmented) | Development + UAT | Recommendation accuracy vs adjudicated outcomes, precision/recall by consent type and area, against agreed thresholds | Per-segment precision/recall tables; benchmark set ID; model version

3. Adversarial | UAT | Model behaves safely on edge/boundary/malformed applications and cannot be gamed | Input transformations; per-case results vs robustness threshold

4. Fairness (demographic parity + counterfactual) | Before go-live | Refer/decline rates do not differ across areas/applicant groups for equivalent applications | Per-group outcome tables; counterfactual pairs + results; chosen fairness metric + tolerance

5. Drift / continuous validation | Post-deployment, scheduled | Recommendation quality does not degrade as building patterns, costs, and rules change | Dated metric time series; intervention threshold; alert + retraining records; fresh labelled-data snapshots

6. Explainability | UAT + sampled in production | Each decision carries a defensible reason an applicant and auditor can rely on | Sampled decisions + explanations + expert confirmation

Traceability approach: Every test case above traces to a numbered risk in the AI risk register (built in Sprint 1) AND to any related requirement. The test summary report at go-live shows coverage by risk and states open risks. Drift label-collection is designed in Sprint 1 so post-deployment validation is possible.

Strong annex: names specific AI test types (not "test the AI"), places each in the right phase, gives concrete evidence per type, and ties everything to risk-based traceability.
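The fairness row of the annex asks for per-group outcome tables and a chosen tolerance. A demographic parity check along those lines might look like the sketch below. The group names, the counts, and the 0.05 tolerance are all made up, and the sketch compares raw decline rates only: it does not control for whether the applications are equivalent, which a real test would also need to do.

```python
def decline_rates(decisions):
    """decisions: iterable of (group, outcome). Returns {group: decline rate}."""
    totals, declines = {}, {}
    for group, outcome in decisions:
        totals[group] = totals.get(group, 0) + 1
        declines[group] = declines.get(group, 0) + (outcome == "decline")
    return {group: declines[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Largest difference in rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Made-up decisions for two areas, for illustration only.
decisions = (
    [("North", "decline")] * 10 + [("North", "approve")] * 90
    + [("South", "decline")] * 18 + [("South", "approve")] * 82
)

TOLERANCE = 0.05  # illustrative tolerance, agreed with stakeholders in practice

rates = decline_rates(decisions)
gap = parity_gap(rates)
print("Decline rate by group:", rates)
print("Parity gap:", round(gap, 3), "within tolerance:", gap <= TOLERANCE)
```

The per-group table and the tolerance are the evidence. The pass or fail result follows from comparing the two.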

🏗️ Exercise 3 of 3 — Write a Full Test Approach

Write a 42119 test approach for a new fictional NZ government or financial-services AI system of your choice. Include: a system description, an AI risk register of 5 risks, the test types selected per risk, the lifecycle phase for each, and one sample test case per AI test type used. This pulls together all six lessons.

Model answer (one worked example)

SYSTEM: HealthNZ ED triage support AI
Description: An AI that suggests a triage priority (1–5) for patients arriving at a hospital emergency department, from presenting symptoms, vitals, age, and history. It supports — does not replace — the triage nurse.

AI RISK REGISTER:
Risk | Category | Likelihood | Impact | Test type | Lifecycle phase
1. Training data under-represents Māori and Pasifika presentations | Data | Med | High | Data representativeness testing | Development
2. Model under-triages a genuinely urgent patient (misses a priority-1) | Model performance | Med | Critical | Model performance testing (recall on priority-1) | Dev + UAT
3. Triage priority differs by ethnicity for equivalent presentations | Fairness | Med | Critical | Demographic parity + counterfactual testing | Before go-live
4. Accuracy degrades as presentation patterns change (e.g. a new seasonal illness) | Drift | High | High | Continuous validation | Post-deployment
5. Nurse cannot see why a priority was suggested | Explainability | Med | High | Explainability testing | UAT + production sample

SAMPLE TEST CASES (one per type used):
ID | Category | Test type | What it checks | Pass criterion | Evidence | Traces to

TC-DATA-01 | Data | Representativeness | Verify training data covers ethnicity groups in proportion to the ED's live population | Each group within ±3pp of live ED presentations (rolling 12mo) | Distribution table vs live snapshot; query; date | Risk 1

TC-PERF-01 | Model | Performance | Measure recall on adjudicated priority-1 cases | Recall on true priority-1 ≥ 0.98 (a missed priority-1 is critical) | Confusion matrix; benchmark set ID; model version | Risk 2

TC-FAIR-01 | Fairness | Demographic parity + counterfactual | Compare triage-priority distribution across ethnicity for matched-equivalent presentations; plus counterfactual pairs varying only ethnicity proxy | Priority distribution within tolerance across groups; counterfactual pairs return identical priority | Per-group tables; counterfactual pairs + results | Risk 3

TC-DRIFT-01 | Drift | Continuous validation | Monthly recall on priority-1 for recent patients with confirmed outcomes | Alert if recall drops below 0.95 or 3pp below baseline | Dated metric series; threshold; alert log; labelled snapshot | Risk 4

TC-EXPL-01 | Explainability | Explainability | On a sample of suggestions, confirm the stated reason matches the driving factors and a nurse accepts it | Reason present, accurate, and nurse-confirmed in ≥95% of sample | Sampled cases + explanations + nurse sign-off | Risk 5

Strong answers: a clear system description; 5 risks each with a category, likelihood, impact, test type and phase; and one sample test case per AI test type with a measurable criterion, concrete evidence, and traceability to a risk. The hallmark of mastery is that the priority metric is chosen for the harm (recall ≥0.98 on priority-1 because under-triage can kill) — not a generic accuracy target.
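The counterfactual half of TC-FAIR-01 (vary only the proxy, expect an identical priority) can be sketched as follows. This is an illustration under assumptions: `toy_model` is a stand-in rule with a deliberate flaw, not a real triage model, and the field names `severity_score` and `group_proxy` are hypothetical.

```python
def counterfactual_failures(model, cases, attribute, values):
    """For each case, set `attribute` to each of `values` in turn and collect
    the ids of cases where the model's output changes."""
    failures = []
    for case in cases:
        outputs = {model({**case, attribute: value}) for value in values}
        if len(outputs) > 1:
            failures.append(case["id"])
    return failures


def toy_model(case):
    """Stand-in scoring rule with a deliberate flaw; not a real triage model."""
    priority = 1 if case["severity_score"] >= 8 else 3
    if case["group_proxy"] == "B" and priority == 3:
        priority = 4  # the flaw: the proxy alone changes the outcome
    return priority


# Made-up cases, for illustration only.
cases = [
    {"id": "P1", "severity_score": 9, "group_proxy": "A"},
    {"id": "P2", "severity_score": 4, "group_proxy": "A"},
]

print(counterfactual_failures(toy_model, cases, "group_proxy", ["A", "B"]))
```

Each failing id is a counterfactual pair whose priority changed when only the proxy changed, which is the evidence the test case asks for.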

Why teams fail here

Conflating uptime monitoring with model monitoring. Post-go-live dashboards show API response times and error rates but nothing about recommendation quality. The model is degrading invisibly while every operational metric is green.
Designing drift testing in production instead of sprint 1. By the time the model is live there is no clean mechanism to learn real outcomes, so drift cannot be measured — the label-collection that makes continuous validation possible must be architected before the model is built, not retrofitted after.
Writing the AI risk register as a governance artefact after testing. Risk registers created to satisfy a tender, Charter sign-off, or GCDO assurance review after the fact have no test cases traced to them — they document what was done rather than driving what to do. The register has no value if it does not exist before the first test is written.
Treating fairness testing as a UAT activity instead of a go-live gate. Fairness problems found in UAT, with no agreed gate behind them, create intense pressure to ship anyway: the timelines, the sunk cost, the developer relationships. Treating fairness as a pre-release gate (not a mid-project nice-to-have) is the only way it stays non-negotiable.
Claiming “certified against ISO/IEC 42119” in a tender or board paper. 42119-2 is a Technical Specification, not a certifiable standard. Procurement teams and auditors are catching this more frequently as NZ government AI assurance matures, and over-claiming certification is itself a compliance risk that can unwind a contract.

11 Self-Check

Q1: Why is 42119 described as a lifecycle approach rather than a phase before release?

Because its highest-value activities happen at the start (the risk register, the fairness definition, the drift label-collection in sprint 1) and its largest block of testing (continuous validation) runs after go-live for the life of the system. Slotting AI testing into a pre-release phase misses both ends.

Q2: Map each AI test type to the lifecycle phase it belongs in.

Data quality: during development. Model performance and adversarial: development and UAT. Fairness: before go-live. Drift / continuous validation: post-deployment, with its label-collection planned in sprint 1.

Q3: What is the one timing mistake that quietly wrecks AI projects?

Leaving drift’s label-collection until after go-live. Without a mechanism to learn real outcomes you cannot measure drift, so the model degrades invisibly — the Meridian Digital failure. Drift testing runs late but must be planned first.

Q4: How should you raise 42119 with a team that has never heard of it?

Lead with risk and failures, not clause numbers. Frame the risk register and audit-ready artefacts as protecting the whole team (the calm afternoon with a regulator), start with cheap high-value tests (representativeness, a drift plan), and position bias testing as giving the developers’ mitigation work the evidence it deserves.

Q5: Why claim “aligned to” rather than “certified against” 42119, and what is coming next?

Because 42119-2 is a Technical Specification, not a certifiable International Standard, and the series is still growing — Part 3 (Verification and Validation, ~2026), Part 7 (Red Teaming), Part 8 (Generative AI Testing). Over-claiming certification is itself a compliance risk.

12 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“Walk me through how you’d apply AI testing across a project, start to finish.”

In sprint 1, before any model exists, I build the AI risk register and decide what we’ll test against — including the fairness definition and the label-collection drift will need later. During development I run data quality testing alongside the model build — representativeness, provenance, label correctness — and start model performance testing as it matures. At UAT I do full model testing, adversarial and explainability checks, and the usual system and integration testing. Before go-live I run fairness testing and write a summary report that states open risks honestly. Then continuous validation runs in production for the life of the system — scheduled drift tests with alerts. The theme is that testing is continuous and front-loaded, not a phase at the end, and every test produces an audit-ready artefact traced to a risk.

“The developers and PM have never heard of 42119 and think it’s overhead. How do you bring them along?”

I don’t lead with the standard — I lead with the risk. “If the model performs worse for one region and we can’t show we checked, that’s a complaint and a headline” lands better than quoting a clause. I frame the risk register and the artefacts as protecting the whole team when a regulator calls — insurance, not bureaucracy. I’d start with the cheap, high-value tests like a representativeness check and a drift plan to build trust, and I’d make the developers allies by being clear that de-biasing is their craft and testing just gives their good work the evidence it deserves.

“Where is the AI testing standards space heading, and how do you keep your approach current?”

42119-2 is the overview and it’s a Technical Specification, so it’s deliberately early and evolving. More parts are coming — Part 3 on verification and validation around 2026, Part 7 on red teaming, Part 8 on generative AI testing. I adopt 42119-2 now as best practice but write the test approach so it can absorb those parts as they land, and I keep claims framed as “aligned to” rather than “certified against”. In an NZ context I also keep an eye on the Government Algorithm Charter and the public-service AI guidance, because 42119 is a concrete way to evidence the commitments those frameworks make.

Key takeaway

The artefact that answers a regulator calmly on a Friday afternoon is the risk register you built on a Tuesday in sprint one — 42119 is only valuable if you apply it at the beginning, not as a retrospective justification of decisions already made.
