Test with AI · ISO/IEC 42119

Model Testing

A model that passes every test at deployment can fail silently three months later, with no code change and no error. Model testing is not a gate you pass once. It is a process that runs for the life of the system.

Test with AI · ISO/IEC TS 42119-2:2025 · Lesson 3 of 6 · ~35 min read · ~80 min with exercises

1 The Hook

Meridian Digital, a fictional NZ energy retailer, built an AI model to predict which customers were about to churn so the retention team could reach them first. On the held-out test set it hit 91% accuracy. The product team celebrated, signed it off, and shipped it. The model went into the retention workflow and everyone moved on to the next thing.

Six months later, someone finally checked. Churn prediction accuracy had quietly fallen to 67%. The retention team had been chasing the wrong customers for months — missing people who were actually leaving and pestering people who were not. No error had been thrown. No test had failed. No code had changed.

What changed was the world. Over those six months wholesale energy prices spiked, a competitor ran an aggressive switching campaign, and customer behaviour shifted — people who would never have churned before were now shopping around. The model was still making predictions based on the patterns it learned from a calmer market that no longer existed. It had drifted.

The deeper failure was in the test approach, not the model. The test suite ran once, before deployment, and then stopped. For a deterministic billing engine that would be fine — nothing changes after release unless someone changes the code. But a model lives in a moving world, and a test approach that ends at go-live cannot see the model slowly going wrong.

This lesson covers the model test types in 42119 — performance, adversarial, and explainability — and then the one that catches teams from a traditional background every time: drift, and the continuous validation that runs long after release.

2 The Rule

Model testing is not a single event at deployment — it is a continuous process across the model’s whole life. A model can pass every pre-deployment test and then degrade silently as the world it operates in changes. Under ISO/IEC 42119, testing that stops at go-live is an incomplete test approach.

3 The Analogy

A Warrant of Fitness, not a factory inspection.

When a car rolls off the line it passes a one-off factory check — everything works on day one. But we do not let that check stand for the life of the car. We require a Warrant of Fitness on a schedule, because a car that was safe last year wears out, and the only way to know it is still safe is to re-test it periodically against the same standard.

Pre-deployment model testing is the factory check. Continuous validation is the WOF. A model that passed at go-live, like a car that passed at the factory, tells you nothing about whether it is still fit six months on. Drift is the AI equivalent of brake pads wearing down: gradual, invisible from the driver’s seat, and dangerous precisely because nothing feels wrong until you test for it.

4 The Three Model Test Types

42119 defines three kinds of model testing. Each answers a different question about the model itself — distinct from the data tests in Lesson 2.

Model performance testing

The question: is the model accurate enough, in the ways that matter? This measures how well the model does its job against agreed thresholds, using metrics like accuracy, precision, recall, and F1 (next section). The key phrase is “in the ways that matter”: a single accuracy number can hide a model that is useless on the cases you most care about.

Adversarial testing

The question: can the model be tricked, and how does it behave under hostile or unusual input? Adversarial testing deliberately probes the model with crafted inputs designed to make it fail — edge cases, manipulated inputs, inputs that exploit known weaknesses. For an identity-verification model, that means trying to fool it. For a content classifier, it means inputs engineered to be misread.

Explainability testing

The question: can the model’s decisions be explained and justified? When an AI declines a loan or flags a claim, someone may be entitled to know why. Explainability testing verifies that the model produces a defensible, accurate reason for its decisions — not a plausible-sounding rationalisation, but an explanation that actually reflects what drove the decision and that a caseworker, customer, or auditor can rely on.

5 Model Performance Metrics a Tester Needs

You do not need the maths. You need to know what each metric means, what it hides, and when to insist on it. Take a fictional Benefits NZ benefits eligibility model that flags applications as “likely eligible” or “needs review”.

Metric	Plain meaning	What it hides
Accuracy	Of all predictions, how many were right.	Useless on imbalanced data. If 95% of applications are eligible, a model that says “eligible” every time scores 95% accuracy and catches zero problem cases.
Precision	When the model flags “needs review”, how often it is right to.	You can get high precision by only flagging the most obvious cases — while missing most of the real ones.
Recall	Of all the applications that truly needed review, how many the model actually caught.	You can get high recall by flagging almost everything — burying the review team and destroying precision.
F1 score	A single number balancing precision and recall.	It is a blend, so it can mask a model that is strong on one and weak on the other. Always look at precision and recall separately too.

The tester’s job is to insist on the right metric for the stakes. For benefits eligibility, a missed problem case (low recall) might mean someone gets a payment they should not, or is wrongly approved and later pursued for debt — so recall matters enormously. For a system that auto-declines, precision matters more, because a false decline directly harms a person. The wrong question is “what’s the accuracy?” The right question is “what does a false positive cost, what does a false negative cost, and which metric protects against the worse one?”
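
To make the accuracy trap concrete, here is a minimal Python sketch that computes the four metrics from confusion-matrix counts. The batch of 1,000 applications and the function name are assumptions made up for illustration; "positive" means the model flagged "needs review", and precision is reported as 0.0 when nothing is flagged, which is a convention rather than a rule.

```python
def metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: flagged "needs review" and truly needed review
    fp: flagged "needs review" but was actually fine
    fn: passed as "likely eligible" but truly needed review
    tn: passed as "likely eligible" and was fine
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model never flags anything.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical batch: 950 eligible applications, 50 that truly need review.
# A model that says "likely eligible" every time never flags anything.
always_eligible = metrics(tp=0, fp=0, fn=50, tn=950)
print(always_eligible)
```

The always-eligible model scores 95% accuracy with recall of zero, which is why the accuracy figure alone cannot be the sign-off metric.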

Pro tip: When a team reports a single accuracy figure for a model that makes decisions about people, treat it as a red flag, not a result. Ask for precision and recall broken down by the groups the model decides about. A 91% headline can hide 60% recall for one region — which is both a performance failure and the start of a fairness failure (Lesson 4).

6 Adversarial Testing

Adversarial testing asks how the model behaves when someone is actively trying to break it, or when reality hands it something strange. For NZ business systems this is not theoretical: an identity-verification model, a fraud model, or a content moderation model all face inputs designed to fool them.

How to construct adversarial inputs

Perturbation: take a known-good input and change it slightly — a RealMe identity photo with altered lighting, a slight crop, or compression artefacts — and check the decision does not flip in ways it should not.
Evasion: craft inputs designed to slip past the model — a fraudulent loan application engineered to look exactly like the legitimate ones the model approves.
Boundary probing: push inputs to the edges of what the model has seen — an income figure far above the training range, an application in te reo Māori when the training data was overwhelmingly English.
Stress and nonsense: feed malformed, empty, or contradictory inputs and confirm the model fails safely rather than producing a confident wrong answer.

The defining property of a good adversarial test is the same as any good test: it is specific and repeatable. “Test the model with unusual inputs” is not a test — it cannot be run the same way twice or handed to someone else. “Submit the reference identity photo with brightness reduced 40% and confirm the match score stays above the 0.85 acceptance threshold” is a test.
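
The brightness example can be sketched as code to show what "specific and repeatable" looks like in practice. This is an illustration under stated assumptions: images are tiny grayscale grids of 0 to 255 values, and `toy_score` is a made-up stand-in so the sketch runs. A real test would pass in the scoring call of the model under test.

```python
def reduce_brightness(pixels, percent):
    """Scale every 0-255 grayscale pixel value down by `percent` percent."""
    factor = 1 - percent / 100
    return [[round(value * factor) for value in row] for row in pixels]


def run_brightness_test(score_fn, reference, selfie, percent=40, threshold=0.85):
    """Apply the named perturbation and return a record usable as evidence."""
    perturbed = reduce_brightness(selfie, percent)
    score = score_fn(reference, perturbed)
    return {
        "transformation": f"brightness -{percent}%",
        "threshold": threshold,
        "score": score,
        "passed": score > threshold,  # match score must stay above threshold
    }


def toy_score(reference, candidate):
    """Hypothetical scorer: 1.0 for identical images, lower as pixels differ."""
    diffs = [abs(a - b)
             for ref_row, cand_row in zip(reference, candidate)
             for a, b in zip(ref_row, cand_row)]
    return 1 - (sum(diffs) / len(diffs)) / 255


reference = [[200, 180], [160, 140]]
selfie = [[200, 180], [160, 140]]
result = run_brightness_test(toy_score, reference, selfie)
print(result)
```

The toy scorer is deliberately sensitive to brightness, so this run fails the 0.85 threshold. The point is the shape of the test: the transformation and its parameter are named, the threshold is fixed, and the returned record can be rerun and compared by anyone.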

7 Explainability Testing

Explainability is a testable quality characteristic, not an ethical aspiration. In an NZ context it is often a hard requirement: the Privacy Act 2020 gives people the right to access personal information an agency holds about them, which can include the information a decision about them was based on, and a model that cannot explain a decline may put the organisation in breach.

What an explainability test checks

An explanation exists: for every decision type (especially adverse ones), the system produces a reason a human can read.
The explanation is accurate: the stated reason actually reflects what drove the decision — not a generic template. If the explanation says “declined due to income” but the model was really driven by postcode, that is a failed explainability test and a likely fairness problem.
The explanation is actionable and defensible: a caseworker can stand behind it to a customer, and an auditor can accept it.

How to write one: take a sample of real decisions — particularly declines and flags — and for each, check that the system’s explanation is present, that it is consistent with the input data, and that a domain expert agrees it is the genuine reason. A KiwiFirst Bank loan decline explained as “insufficient serviceability based on declared income and existing commitments” passes; “the model scored you 0.31” fails — it is true but not an explanation.
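
The mechanical part of that check can be automated as a first filter before the domain expert review. The sketch below is a rough illustration, not a complete explainability test: the record fields `explanation` and `top_drivers` are hypothetical, the drivers are assumed to come from whatever attribution method the team already trusts, and the substring match is deliberately crude.

```python
import re

# Matches explanations that are nothing more than a score,
# e.g. "the model scored you 0.31".
SCORE_ONLY = re.compile(r"^(the model )?scored( you)? [0-9.]+\.?$", re.IGNORECASE)


def check_explanation(decision):
    """Return the list of failed checks for one decision (empty list = pass)."""
    text = (decision.get("explanation") or "").strip()
    if not text:
        return ["explanation missing"]
    failures = []
    if SCORE_ONLY.match(text):
        failures.append("raw score given instead of a reason")
    # Accuracy check: the stated reason should name at least one feature
    # that actually drove the decision.
    drivers = decision.get("top_drivers", [])
    if drivers and not any(d.lower() in text.lower() for d in drivers):
        failures.append("stated reason does not mention any top driver")
    return failures


good = {"explanation": "Insufficient serviceability based on declared income "
                       "and existing commitments",
        "top_drivers": ["income", "commitments"]}
score_only = {"explanation": "The model scored you 0.31", "top_drivers": ["income"]}
mismatch = {"explanation": "Declined due to income", "top_drivers": ["postcode"]}

for record in (good, score_only, mismatch):
    print(check_explanation(record))
```

A record that passes this filter still needs the human judgement step: only a domain expert can confirm the stated reason is the genuine one.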

8 Drift and Continuous Validation

Drift is the Meridian Digital failure: a model that was accurate at go-live degrades over time because the world changed while the model stayed the same. It is the single most important concept in this lesson, because it is invisible to every traditional testing instinct.

Why drift happens

The inputs change (data drift): the kinds of customers, transactions, or documents coming in shift away from what the model trained on — new products, new behaviours, new fraud patterns.
The relationship changes (concept drift): the link between input and correct answer moves. In a calm market, low engagement did not mean churn; in a price war, it does.

How to test for it: continuous validation

Continuous validation is a scheduled, automated re-test of the live model against fresh, labelled data. A drift test specifies four things:

What is measured — e.g. churn-prediction recall on the last 30 days of customers whose outcome is now known.
How often — e.g. monthly, automatically.
The intervention threshold — e.g. if recall drops below 80%, raise an alert and trigger a retraining review.
The evidence artefact — e.g. a dated metric record showing the measurement, the threshold, and the pass/fail, retained for audit.
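
The four parts of a drift test map naturally onto a small data structure. The following is a minimal sketch, assuming predictions and known outcomes arrive as parallel lists of booleans (True means churn); the class and field names are invented for illustration, and the scheduling itself would be handled by whatever job runner the team already uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DriftTest:
    metric_name: str   # what is measured
    schedule: str      # how often (recorded here, enforced by the scheduler)
    threshold: float   # intervention threshold

    def run(self, predictions, outcomes, run_date):
        """Measure recall on fresh labelled outcomes; return the evidence record."""
        pairs = list(zip(predictions, outcomes))
        tp = sum(1 for predicted, actual in pairs if predicted and actual)
        fn = sum(1 for predicted, actual in pairs if not predicted and actual)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return {
            "date": run_date.isoformat(),
            "metric": self.metric_name,
            "schedule": self.schedule,
            "value": round(recall, 3),
            "threshold": self.threshold,
            "sample_size": len(pairs),
            "passed": recall >= self.threshold,  # False means raise the alert
        }


test = DriftTest("churn recall, last 30 days", "monthly", 0.80)

# Hypothetical month: 10 customers churned and the model had flagged 7 of them.
predictions = [True] * 7 + [False] * 3 + [False] * 10
outcomes = [True] * 10 + [False] * 10
record = test.run(predictions, outcomes, date(2025, 6, 1))
print(record)
```

The returned dictionary is the evidence artefact: it carries the date, the measurement, the threshold, and the pass/fail in one record that can be stored for audit.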

Had Meridian Digital run a monthly drift test with an alert at, say, an 8-point accuracy drop, the fall from 91% to 67% would have tripped it long before the six-month mark: at a steady rate of decline that is about 4 points a month, so the alert would fire around the second monthly run. The cost of the drift test is trivial next to the cost of a retention team chasing the wrong customers for half a year.

Pro tip: The hardest part of drift testing is usually getting fresh labelled data — you need to know the real outcomes to measure the model against. Build the label-collection mechanism at design time (Lesson 1’s point about testing decisions being made early). Bolting it on after go-live is how teams end up, like Meridian, with no way to see drift until it has done its damage.

9 Test Levels for AI, Mapped to 29119

29119 gives you familiar test levels — component, integration, system, acceptance. 42119 keeps them and adds AI-specific levels underneath. It helps to think of four layers for an AI system:

Data testing — the dataset as the unit under test (Lesson 2). Roughly the AI equivalent of component-level testing: you test the foundational unit before integrating.
Model testing — the trained model in isolation: performance, adversarial, explainability (this lesson). The model is the component.
System testing — the model embedded in the real application, with its UI, business rules, and fallbacks. Does the whole system behave correctly when the model is one part of it?
Integration testing — the model’s interfaces with everything around it: the data pipeline feeding it, the systems consuming its output, the monitoring watching it.

The mental model: data and model testing are new and AI-specific; system and integration testing are your existing 29119 skills applied to a system that happens to contain a model. A complete test approach covers all four — and a common failure is to test the model brilliantly in isolation while never testing how the wider system behaves when the model is wrong (its fallback, its human-in-the-loop, its error handling).

From the field

A government agency (not named, but comparable to CoverNZ's injury-classification work) deployed a triage model that assigned incoming claims to fast-track or standard review. Pre-deployment performance looked solid — 88% accuracy, which the project team presented to the governance board as the sign-off metric. What the board was not shown was that the fast-track recall for Māori claimants was 0.44 versus 0.71 for the overall population: the model was systematically routing Māori claims to the slower, more burdensome standard track at nearly twice the rate of the rest. Nobody had asked for a demographic breakdown because "accuracy" felt like a complete answer. The fix required not just a model retrain but a full explainability audit — because once the disparity was surfaced, the agency also could not demonstrate that the decisions were defensible under the Privacy Act. The generalised lesson: a single headline metric on a model that makes decisions about people is never a complete test result.

10 Common Mistakes

🚫 Treating a single accuracy number as the model’s grade

Why it happens: One headline figure is easy to report and easy to celebrate.
The fix: Accuracy hides everything on imbalanced data — a model can score 95% and catch none of the cases that matter. Insist on precision and recall, chosen for the cost of each error type, and broken down by the groups the model decides about.

🚫 Stopping model testing at go-live

Why it happens: In traditional projects, deployment is the end of testing.
The fix: Models drift. Without continuous validation, the model silently degrades and no test ever fails — the Meridian Digital story. Schedule a drift test with a measured threshold and an alert, and build the label-collection it needs at design time.

🚫 Writing adversarial tests that are too vague to repeat

Why it happens: “Try some unusual inputs” feels like adversarial testing.
The fix: An adversarial test must be specific and repeatable — a defined input transformation and a defined acceptance threshold — or it cannot be run twice, handed over, or used as evidence. Name the perturbation and the expected bound.

🚫 Accepting a model score as an explanation

Why it happens: The model outputs a number, and a number feels like a reason.
The fix: “Scored 0.31” is true but is not an explanation a customer or auditor can use. Explainability testing checks that the stated reason is present, accurate to what actually drove the decision, and defensible — which the Privacy Act 2020 may require.

Senior engineer insight

The drift conversation always goes the same way: the team says "we'll know if it's going wrong because someone will tell us." No one tells them. What changed my thinking was watching a churn model quietly degrade for four months while the business kept acting on its outputs — the retention team's conversion rate had dropped, but everyone assumed that was a sales problem. The model was the problem and there was no test running that could see it.

Continuous validation is not a nice-to-have you wire up after launch. The label-collection pipeline — gathering real outcomes to measure the model against — has to be designed before go-live, because once you're live and skipping it, you're already blind.

The most common mistake: teams run model testing once at deployment, tick it as done, then report the initial recall figure to the steering committee for the next two years as if it still means something.

11 Now You Try

Three graded exercises across performance, adversarial, and drift testing. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Read the Metrics

Below are performance metrics for a fictional Benefits NZ benefits eligibility AI that flags applications as “likely eligible” or “needs review”. Identify which numbers suggest a potential fairness or accuracy problem, and say what further testing is needed.

Group	Accuracy	Precision	Recall
Overall	93%	0.71	0.58
Region: Auckland	95%	0.74	0.64
Region: Northland	88%	0.69	0.41
Age: under 25	90%	0.55	0.49
Age: 25 to 64	94%	0.73	0.61

Class balance: 82% of applications are “likely eligible”, 18% “needs review”

Identify the problems and the further testing needed:

Model answer

1. Low overall recall (0.58) — the model misses 42% of applications that genuinely need review. On a benefits system that means problem cases slip through as "likely eligible". The 93% accuracy headline hides this completely because the classes are imbalanced (82/18). Further testing: confirm recall against the cost of a missed review; consider whether recall is the priority metric here.

2. Northland recall (0.41) far below Auckland (0.64) — the model catches far fewer of the real "needs review" cases in Northland. This is both a performance problem and a fairness red flag: applicants in one region get systematically different treatment. Further testing: demographic parity / equal-opportunity testing across regions (Lesson 4), and a data representativeness check on Northland coverage (Lesson 2).

3. Under-25 precision (0.55) — when the model flags an under-25 application, it is wrong nearly half the time, well below the 25–64 group (0.73). Younger applicants are disproportionately sent to review unnecessarily. Further testing: fairness testing across age bands; investigate whether under-25 cases were under-represented or mislabelled in training.

The accuracy column is the trap: every group looks "fine" on accuracy (88–95%) while recall and precision reveal serious, unequal failures. A tester who reported only accuracy would have signed this off.
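
The subgroup comparison in this answer can be run as a simple check rather than done by eye. The sketch below uses the exercise's own figures; the 0.10 tolerance is an assumption chosen for illustration, and in practice the acceptable gap would be agreed with the business and recorded in the test plan.

```python
overall = {"precision": 0.71, "recall": 0.58}
groups = {
    "Auckland": {"precision": 0.74, "recall": 0.64},
    "Northland": {"precision": 0.69, "recall": 0.41},
    "Under 25": {"precision": 0.55, "recall": 0.49},
    "25 to 64": {"precision": 0.73, "recall": 0.61},
}
MAX_GAP = 0.10  # assumed tolerance below the overall figure


def flag_gaps(groups, overall, max_gap):
    """List (group, metric, gap) where a group trails the overall figure."""
    flags = []
    for name, group_metrics in groups.items():
        for metric, value in group_metrics.items():
            gap = overall[metric] - value
            if gap > max_gap:
                flags.append((name, metric, round(gap, 2)))
    return flags


flags = flag_gaps(groups, overall, MAX_GAP)
print(flags)
# [('Northland', 'recall', 0.17), ('Under 25', 'precision', 0.16)]
```

Note that under-25 recall (0.49) trails the overall figure by 0.09 and slips just inside the assumed tolerance, which shows how much the choice of threshold matters.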

🔧 Exercise 2 of 3 — Make the Adversarial Test Repeatable

The adversarial test case below is too vague to run twice. Rewrite it as a specific, repeatable, 42119-aligned adversarial test case for a fictional RealMe-style identity verification model that matches a live selfie to a reference ID photo. Include the input transformation, the acceptance criterion, and the evidence required.

Original (too vague): “Test the model with unusual inputs to make sure it can’t be fooled.”

Rewrite as a specific, repeatable adversarial test case:

Model answer

Test ID: ADV-IDV-009

Risk category: Model — adversarial robustness (security risk)

Test type: Adversarial

Adversarial technique: Perturbation

Input transformation (exact, repeatable): Take the 50 reference selfie/ID pairs in the adversarial test set. For each genuine match, apply these perturbations one at a time: (a) reduce brightness by 40%; (b) apply JPEG compression at quality 30; (c) rotate the selfie 8 degrees; (d) crop 10% from each edge. Each produces a defined, reproducible variant.

Acceptance criteria: For genuine matches, the match score must remain above the 0.85 acceptance threshold for perturbations (a)–(d) in at least 95% of the 50 pairs — the model must not wrongly REJECT a genuine person because of minor image variation. Separately, for 50 known non-matches with the same perturbations, the score must stay below 0.85 in 100% of cases — perturbation must not be usable to force a false ACCEPT.

Evidence required: Score table for all 50 genuine and 50 non-match pairs across all four perturbations; the script that applied the transformations; the reference image set ID; pass/fail per perturbation.

Traceability: AI risk register risk R-11 (identity model can be evaded or wrongly rejects genuine users under minor image variation).

What makes it repeatable: exact, named transformations with parameters (40%, quality 30, 8 degrees, 10%); a fixed test set; numeric thresholds; and separate criteria for false-reject and false-accept. "Unusual inputs" had none of these.
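
The acceptance criteria above reduce to two pass rates per perturbation, which can be evaluated from the score table. This is a sketch with invented scores, assuming the table has already been produced by the transformation script; the function name is hypothetical.

```python
def evaluate_perturbation(genuine_scores, non_match_scores,
                          threshold=0.85, min_genuine_rate=0.95):
    """Check both criteria for one perturbation's score table."""
    # False-reject side: genuine pairs must stay above the threshold.
    genuine_rate = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
    # False-accept side: every non-match must stay below the threshold.
    non_match_rate = (sum(s < threshold for s in non_match_scores)
                      / len(non_match_scores))
    return {
        "genuine_pass_rate": genuine_rate,
        "non_match_pass_rate": non_match_rate,
        "passed": genuine_rate >= min_genuine_rate and non_match_rate == 1.0,
    }


# Hypothetical results for perturbation (a): 48 of 50 genuine pairs hold up.
result = evaluate_perturbation([0.90] * 48 + [0.80] * 2, [0.30] * 50)
print(result["passed"])  # True: 96% of genuine pairs pass, no false accepts

# A single non-match pushed over the threshold fails the whole perturbation.
forced_accept = evaluate_perturbation([0.90] * 50, [0.30] * 49 + [0.86])
print(forced_accept["passed"])  # False
```

Running this once per perturbation gives the pass/fail per perturbation that the evidence section asks for.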

🏗️ Exercise 3 of 3 — Design Drift Tests

Design 4 drift detection test cases for the Meridian Digital churn prediction model from the Hook. For each, specify: what is being measured, how often, the threshold for intervention, and the expected evidence artefact.

Model answer

Drift test 1 (performance / concept drift) | Measures: churn-prediction recall and precision on the last 30 days of customers whose churn outcome is now known | Frequency: monthly, automated | Intervention threshold: recall drops below 80% OR more than 8 points below the go-live baseline → alert + retraining review | Evidence artefact: dated metric record (recall, precision, sample size, baseline, pass/fail)

Drift test 2 (input / data drift) | Measures: distribution of key input features (tariff type, usage, tenure, engagement) in the last 30 days vs the training distribution | Frequency: monthly, automated | Intervention threshold: any feature's distribution shifts beyond an agreed statistical distance from training → alert | Evidence artefact: distribution-comparison report per feature with the distance metric and threshold

Drift test 3 (prediction drift) | Measures: the proportion of customers the model flags as "will churn" each month | Frequency: monthly, automated | Intervention threshold: predicted churn rate moves more than X points from the rolling 6-month norm without a known cause → investigate | Evidence artefact: time series of predicted churn rate with the control band marked

Drift test 4 (external-context check) | Measures: correlation between predicted churn and actual churn against known market events (price changes, competitor campaigns) | Frequency: quarterly, or on a major market event | Intervention threshold: actual churn diverges from predicted during a market shift → trigger concept-drift review | Evidence artefact: dated review note linking market events to model performance

Strong answers: each test names a real drift type (data, concept, prediction), has a measurable threshold with an action attached, and produces a retainable, dated evidence artefact. Critically, all require fresh labelled outcomes — the thing Meridian never set up. A weak answer says "check accuracy sometimes" with no threshold or artefact.
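
Drift test 2 leaves the "agreed statistical distance" open. One common choice is the Population Stability Index (PSI), sketched below for a single categorical feature. The tariff counts and the 0.2 alert level are assumptions for illustration; the lesson does not prescribe a particular distance metric, and the threshold should be whatever the team has agreed and documented.

```python
import math


def psi(training_counts, recent_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are counts per bin, in the same bin order.
    eps avoids log(0) when a bin is empty in one of the samples.
    """
    training_total = sum(training_counts)
    recent_total = sum(recent_counts)
    total = 0.0
    for trained, recent in zip(training_counts, recent_counts):
        expected = max(trained / training_total, eps)
        actual = max(recent / recent_total, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


PSI_ALERT = 0.2  # assumed threshold, to be agreed by the team

# Hypothetical tariff-type counts: fixed, variable, prepay.
training = [500, 300, 200]
last_30_days = [300, 300, 400]

value = psi(training, last_30_days)
print(round(value, 3), "ALERT" if value > PSI_ALERT else "ok")
```

Unlike drift test 1, this check needs no labelled outcomes, so it can warn of input drift before the real churn results arrive. It cannot say whether the model's predictions have become worse, which is why it complements the performance test rather than replacing it.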

Why teams fail here

Treating deployment as the end of model testing — the team files the pre-deployment test report and never schedules a retest. Drift accrues invisibly because no test is running to catch it.
Designing drift tests without a label-collection mechanism: the team agrees to "monitor model performance monthly" but has no pipeline to collect real outcomes; they end up measuring prediction drift only, which tells you the model's outputs are shifting but not whether its predictions are getting worse.
Reporting a single accuracy figure for an imbalanced classification model — on a 90/10 class split, a model that predicts the majority class every time scores 90% accuracy. The team ships it. Real recall is zero.
Writing adversarial tests too vague to repeat — "test with edge cases" appears in the test plan, no one can reconstruct what was actually tested six months later, and the test cannot serve as evidence for an audit or a DIA AI assurance review.
Accepting a raw model score as the explanation for an adverse decision — "the model scored you 0.31" satisfies no one: not the caseworker, not the customer, not a Privacy Commissioner inquiry. Explainability testing is skipped because teams assume the model's output is self-explanatory.
Never checking metrics by subgroup — the overall recall looks acceptable, so no one looks at Northland vs Auckland or under-25 vs the wider population. A fairness failure and a performance failure can hide inside a passing headline number indefinitely.

12 Self-Check

Q1: Why can a model pass every pre-deployment test and still fail in production months later?

Drift. The world the model operates in changes — inputs shift (data drift) or the input-to-answer relationship moves (concept drift) — while the model still reflects its training. No code changes and no test fails, so without continuous validation the degradation is invisible.

Q2: Why is a single accuracy figure dangerous for a model that makes decisions about people?

On imbalanced data, accuracy can be high while the model misses the cases that matter: if 95% of applications are eligible, a model that says “eligible” to everything scores 95% while catching no problem cases. You need precision and recall, chosen for the cost of each error type, and broken down by the groups the model decides about.

Q3: What are the four things a drift test must specify?

What is measured (a metric on fresh labelled data), how often (the schedule), the intervention threshold (the value that triggers an alert or retraining), and the evidence artefact (the dated record retained for audit).

Q4: What makes an adversarial test valid rather than just “trying weird inputs”?

It must be specific and repeatable: a defined input transformation (named technique, exact parameters) and a defined acceptance threshold, so it can be run identically twice, handed to someone else, and used as evidence.

Q5: Why is “the model scored 0.31” a failed explainability result?

Because it is true but not an explanation a person can use. Explainability testing requires that the stated reason is present, accurate to what actually drove the decision, and defensible to a customer or auditor — which the Privacy Act 2020 right to decision information may require for adverse outcomes.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“A team tells you their model is 91% accurate and ready to ship. What do you ask next?”

I’d ask what the class balance is, because on imbalanced data a high accuracy can hide a model that misses the cases that matter. Then I’d ask for precision and recall — and which one matters more given what a false positive versus a false negative actually costs here. And I’d ask to see those metrics broken down by the groups the model makes decisions about, because a strong overall number can hide a weak result for one region or age band, which is both a performance and a fairness problem. “91% accurate” on its own doesn’t tell me whether it’s ready.

“How would you set up testing so we catch model drift before customers do?”

Continuous validation: a scheduled, automated re-test of the live model against fresh labelled data. I’d define what we measure — usually recall or precision on recent cases whose real outcome we now know — run it on a schedule like monthly, set an intervention threshold that triggers an alert and a retraining review, and retain a dated evidence record each run. The piece teams forget is the label-collection mechanism: you need the real outcomes to measure against, and that has to be designed in from the start, not bolted on. Done right, you catch a drop in weeks instead of discovering it at six months by accident.

“What is explainability testing and why would a tester care about it?”

It verifies that the model can give a defensible, accurate reason for its decisions — especially adverse ones like a decline or a flag. I care because in NZ the Privacy Act 2020 can give a person the right to information about a decision made about them, so a model that can’t explain a decline is a compliance risk, not just a UX gap. A real explainability test takes a sample of decisions and checks the reason is present, consistent with the input, and genuinely reflects what drove the decision — not a generic template and not just a raw score, which is true but useless as an explanation.

Key takeaway

A model test suite that ends at go-live is not a test suite — it is a one-time snapshot of a system that will keep changing long after you stop looking.
