Test with AI · ISO/IEC 42119

Drift, Monitoring & Ongoing Testing

Testing doesn’t stop at release. A model that was good on launch day will not stay good on its own — detecting and responding to degradation in production is part of the testing programme, not an optional extra.

ISO/IEC TS 42119-2:2025 · ~15 min read · ~50 min with exercises

1 The Hook

A fictional NZ insurer, Aroha Insurance, built an AI to detect fraudulent claims. At release it was excellent — 91% precision on its test set, meaning when it flagged a claim as likely fraud it was right nine times in ten. The team celebrated, signed it off, and moved on to the next project. The model went into production and quietly did its job.

Then the world changed. Through 2023 and 2024, a run of severe weather events — the Auckland Anniversary floods, Cyclone Gabrielle, and the storms that followed — produced a surge of genuine claims that looked nothing like the patterns the model had learned from. Roof damage, surface flooding, and silt clean-ups arrived in volumes and combinations the pre-2023 training data had never seen. The model, trained on a calmer world, started mistaking unusual-but-genuine claims for fraud, and missing the new patterns of opportunistic fraud that rode in on the surge.

Precision slid from 91% to 74%. And here is the part that should worry every tester: nobody noticed for three months. There was no monitoring. The model had passed its tests at release, so everyone assumed it was still passing. It was investigators wondering why so many honest claimants were being flagged that finally surfaced the problem — long after the damage to those claimants, and to the insurer’s reputation, was done.

The model had no new bug. It was trained on pre-2023 patterns, the world moved, and the model did not. That is drift — and the failure was not the model’s, it was the absence of any testing after release. ISO/IEC 42119 treats post-deployment monitoring and ongoing testing as part of the testing programme. This lesson teaches what drift is, how to watch for it, and what to do when you find it.

2 The Rule

A model that is good at release will not necessarily stay good. The world it was trained on keeps moving, and the model does not move with it. Post-deployment monitoring and ongoing testing are part of the testing programme, not an optional extra — a model signed off and never watched again is an untested model from the day the world starts to change.

3 The Analogy

A Warrant of Fitness.

Your car passes its WOF in January. That tells you it was safe to drive on the day it was checked. It does not tell you the car is safe in December — tyres wear, brake pads thin, a light fails. A WOF is a point-in-time assessment, and the law knows it: that is exactly why it expires and has to be redone.

A model’s release sign-off is its January WOF. It certifies the model was fit on the day it was tested, against the world as it was then. As conditions change — new weather patterns, new customer behaviour, a pandemic — the model wears just like brake pads. Monitoring is the dashboard warning light; ongoing testing is the next WOF. Treating release sign-off as permanent is like never getting your car re-checked because it passed once.

4 What Drift Is

Drift is the gradual (or sudden) divergence between the world the model learned and the world it now operates in. 42119 and the AI vocabulary standard (ISO/IEC 22989) distinguish three kinds, and the difference matters because they call for different responses:

Data drift — the inputs change. The kinds of claims, customers, or transactions coming in look different from the training data, even if the underlying relationship has not changed. The flood surge sending Aroha unfamiliar claim types is data drift.
Concept drift — the relationship the model learned changes. What “normal” or “fraud” means in the real world shifts, so the same input now maps to a different correct answer. COVID-19 is the textbook case: overnight, “a sudden drop in card spending” stopped meaning “possible stolen card” and started meaning “lockdown.”
Model drift — the umbrella term for the model’s performance degrading over time, whether driven by data drift, concept drift, or both. It is the symptom; the other two are the causes.
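
The distinction between the three terms can be made concrete with a toy example. This is a minimal sketch, not a real fraud model: the 50% rule, the `flag` function, and every number in the sample data are invented for illustration.

```python
# Toy data: each case is (spend_drop_pct, is_fraud). All numbers are invented.

def flag(spend_drop_pct):
    """Frozen 'model': flag a card when spending drops by 50% or more."""
    return spend_drop_pct >= 50


def precision(cases):
    """Share of flagged cases that really were fraud."""
    flagged = [is_fraud for drop, is_fraud in cases if flag(drop)]
    return sum(flagged) / len(flagged)


def share_of_big_drops(cases):
    """Input-side view: how much of the traffic is a big spending drop?"""
    return sum(1 for drop, _ in cases if drop >= 50) / len(cases)


# Training-time world: big drops are rare and usually mean a stolen card.
training = [(10, False)] * 90 + [(70, True)] * 9 + [(70, False)] * 1

# Data drift: far more big drops arrive, but a big drop still means fraud
# as often as it did in training.
data_drift = [(10, False)] * 60 + [(70, True)] * 36 + [(70, False)] * 4

# Concept drift: the input mix is unchanged, but a big drop now usually
# means "lockdown" rather than "stolen card".
concept_drift = [(10, False)] * 90 + [(70, True)] * 1 + [(70, False)] * 9

for name, cases in [("training", training),
                    ("data drift", data_drift),
                    ("concept drift", concept_drift)]:
    print(f"{name:14} big-drop share={share_of_big_drops(cases):.2f} "
          f"precision={precision(cases):.2f}")
```

In the data-drift sample the big-drop share moves from 0.10 to 0.40 while precision stays at 0.90; in the concept-drift sample the share stays at 0.10 and precision falls to 0.10. A falling precision is the model drift either cause can produce. Real models usually do lose performance under data drift as well, because unfamiliar inputs fall where the model learned least; the single-rule toy is deliberately too simple to show that.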

Drift looks different across model types: a classification model (fraud / not fraud) drifts as its precision or recall falls; a regression model (predicted house price) drifts as its error grows; a ranking model (search or recommendation order) drifts as the items it ranks highly stop being the ones people choose; a generative model drifts as the inputs people send it move away from what it was tuned on, or as the facts it states go stale.

5 How Drift Shows Up in Production

Drift rarely announces itself. It shows up as a slow erosion that is easy to miss without something watching for it:

A quiet slide in the headline metric — precision from 91% to 74% over months, with no single day where it obviously breaks.
A shift in the input distribution — the average claim value, the mix of regions, or the proportion of a customer type moves away from what the model trained on, before performance has even dropped. This is the early-warning signal.
A rise in human overrides — caseworkers increasingly disagreeing with the model’s output is often the first human-visible sign that the model has fallen out of step with reality.
Complaints clustering — a sudden pattern of complaints from one group or about one decision type, as at Aroha, is drift surfacing the hard way, after harm.

The lesson from the hook is that none of these is visible unless someone is measuring. A model with no monitoring drifts in the dark.

6 Monitoring Strategies

Monitoring is how you turn drift from an invisible slide into a measured signal. 42119-aligned programmes use several strategies together:

Performance-metric monitoring: track the model’s accuracy, precision, recall, or error against a stream of ground-truth labels as they arrive, and alert when a metric crosses a threshold. The catch is that ground truth often arrives late (was that flagged claim really fraud?), so this signal can lag the problem by weeks or months.
Data-distribution monitoring: watch the inputs, not the outputs. The input distribution usually shifts before performance visibly drops, so this is typically the earliest warning available, and it does not wait for ground truth. It can miss pure concept drift, where the inputs look unchanged but their meaning has moved.
Shadow deployment: run a new or candidate model alongside the live one on real traffic, without acting on its output, to compare them safely before switching.
A/B testing: route a slice of traffic to a different model version and compare outcomes, to confirm a change actually helps before rolling it out fully.
Canary release: expose a new model to a small share of traffic first, watch the metrics, and widen only if they hold, so that a bad model harms few people before it is caught.
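
Performance-metric monitoring with late ground truth might look roughly like the sketch below. The `Decision` record, the `PRECISION_FLOOR` value, and the sample counts are all assumptions made up for illustration; the point is only that precision can be computed on resolved flags while the number still awaiting an outcome is reported next to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    flagged: bool
    confirmed_fraud: Optional[bool]  # None = investigation still open


PRECISION_FLOOR = 0.85  # hypothetical threshold agreed by the tester


def precision_on_resolved(decisions):
    """Return (precision over resolved flags, number of flags still pending).

    Precision is None when no flagged claim has an outcome yet.
    """
    resolved = [d for d in decisions
                if d.flagged and d.confirmed_fraud is not None]
    pending = sum(1 for d in decisions
                  if d.flagged and d.confirmed_fraud is None)
    if not resolved:
        return None, pending
    true_positives = sum(1 for d in resolved if d.confirmed_fraud)
    return true_positives / len(resolved), pending


def status(decisions):
    """Turn the measurement into a pass/alert message."""
    value, pending = precision_on_resolved(decisions)
    if value is None:
        return f"no ground truth yet ({pending} flags pending)"
    verdict = "ALERT" if value < PRECISION_FLOOR else "ok"
    return f"{verdict}: precision {value:.2f}, {pending} flags pending"


# Invented month of traffic: 140 flags, 100 resolved, 40 still open.
month = ([Decision(True, True)] * 74 + [Decision(True, False)] * 26
         + [Decision(True, None)] * 40 + [Decision(False, None)] * 860)
print(status(month))
```

Reporting the pending count alongside the metric keeps the lag visible: a healthy-looking precision on very few resolved flags is weak evidence.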

Pro tip: Data-distribution monitoring is the strategy testers most often miss. Because it watches inputs, it can warn you a model is about to degrade before performance drops and before any harm is done — whereas performance monitoring only tells you after ground truth confirms the damage. Watch the inputs, not just the outputs.

7 Drift-Detection Techniques — in Plain Terms

You do not need heavy maths to understand what the common drift detectors do. Each one answers the same question — “has this distribution moved?” — in a slightly different way:

Technique	What it does, in plain terms
Population Stability Index (PSI)	Compares the spread of a value now against the spread at training time and gives a single number for how far it has moved. Small number = stable; larger number = the inputs have shifted enough to investigate. The most common first detector.
KL divergence	Another way to put a number on how different “today’s” distribution is from the “training” distribution: larger means more different. Used the same way as PSI, which is closely related (PSI is the sum of the KL divergence measured in both directions).
Chi-square test	For categories (region, claim type), checks whether the mix today differs from the mix the model trained on by more than chance — a statistical “is this shift real or just noise?”
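
For the chi-square row of the table, a minimal sketch of the calculation follows. It treats the training mix as the expected proportions (a goodness-of-fit test); the claim categories and counts are invented, and the 5.991 cut-off is the standard table value for 2 degrees of freedom at the 5% level. A production pipeline would more likely call a statistics library than hand-roll this.

```python
def chi_square_statistic(training_counts, live_counts):
    """Pearson chi-square statistic for the live category mix,
    using the training mix as the expected proportions."""
    unseen = set(live_counts) - set(training_counts)
    if unseen:
        # A brand-new category is itself a drift signal worth raising.
        raise ValueError(f"categories never seen in training: {sorted(unseen)}")
    train_total = sum(training_counts.values())
    live_total = sum(live_counts.values())
    statistic = 0.0
    for category, train_n in training_counts.items():
        expected = live_total * train_n / train_total
        observed = live_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts of claims by type.
training_mix = {"theft": 500, "collision": 300, "weather": 200}
live_mix = {"theft": 40, "collision": 30, "weather": 30}

statistic = chi_square_statistic(training_mix, live_mix)
degrees_of_freedom = len(training_mix) - 1   # 3 categories -> 2
CRITICAL_5_PERCENT_DF2 = 5.991               # chi-square table, df=2, alpha=0.05

print(f"chi-square = {statistic:.1f}, df = {degrees_of_freedom}")
print("shift beyond chance" if statistic > CRITICAL_5_PERCENT_DF2
      else "within normal noise")
```

Here the statistic is 7.0, above the 5.991 cut-off, so the rise in weather claims is unlikely to be noise. With very large live samples even trivial shifts become statistically significant, so the result is best read together with an effect-size measure such as PSI.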

A simple way to picture PSI: imagine the claim values the model trained on sorted into ten buckets, low to high, each holding 10% of claims. Months later you sort the live claims into the same ten buckets. If they still land roughly 10% each, nothing has moved — PSI is near zero. If the surge has stuffed the “high value” buckets and emptied the low ones, the buckets no longer match, and PSI rises to flag it. You are not computing it by hand — you are reading the number a tool produces and knowing what a rising one means: the inputs have moved away from what the model expects.
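
The ten-bucket picture translates almost directly into code. The sketch below is one common way to compute PSI, not a method prescribed by the standard: the function names, the `floor` used to avoid taking the log of an empty bucket, and the sample claim values are all assumptions. It also computes KL divergence in both directions to show that PSI is their sum.

```python
import math


def bucket_edges(training_values, n_buckets=10):
    """Cut points that split the training values into equal-sized buckets."""
    ordered = sorted(training_values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]


def bucket_shares(values, edges, floor=1e-4):
    """Share of values landing in each bucket.

    Shares are floored at a tiny value so an empty bucket does not
    produce log(0); this is a practical convention, not part of PSI itself.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        index = sum(1 for edge in edges if v >= edge)  # edges passed = bucket
        counts[index] += 1
    return [max(c / len(values), floor) for c in counts]


def psi(expected, actual):
    """Population Stability Index between two lists of bucket shares."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def kl(p, q):
    """KL divergence of p from q, over the same buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Invented data: training claim values 1..1000, live values shifted up by 300.
training_values = list(range(1, 1001))
live_values = [v + 300 for v in training_values]

edges = bucket_edges(training_values)
expected = bucket_shares(training_values, edges)   # about 10% per bucket
actual = bucket_shares(live_values, edges)         # low buckets emptied

print(f"PSI vs itself:   {psi(expected, expected):.3f}")
print(f"PSI after shift: {psi(expected, actual):.3f}")
print(f"KL both ways:    {kl(actual, expected) + kl(expected, actual):.3f}")
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift worth investigating, and above 0.25 as a major shift. Those bands are industry convention rather than anything set by the standard, so the thresholds a programme actually uses should be agreed and written down.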

8 Ongoing-Testing Obligations

Monitoring tells you something has moved. Ongoing testing is what you do about it. 42119 frames a set of obligations that continue for the life of the model:

Scheduled re-evaluation — re-test the model against fresh ground-truth data on a set cadence (say quarterly), not only when an alert fires. The WOF rule: a regular re-check, on the calendar.
Retraining triggers — define, in advance, the conditions that mean “retrain now”: a PSI above a threshold, precision below a floor, or a scheduled interval reached. A trigger turns a vague “keep an eye on it” into an agreed, testable rule.
Regression testing after retraining — a retrained model is a new model. It must be re-tested for everything the original was — fairness, data quality, performance, edge cases — before it replaces the old one. Fixing drift on one metric must not silently reintroduce bias on another.
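
Retraining triggers become testable once they are written as an explicit rule. The sketch below shows one way to do that; the `Triggers` class and all three default thresholds are hypothetical values chosen for illustration, not figures from the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triggers:
    """Agreed in advance by the tester. All defaults are hypothetical."""
    psi_ceiling: float = 0.25
    precision_floor: float = 0.85
    max_days_since_evaluation: int = 90


def retrain_reasons(psi, precision, days_since_evaluation, triggers=Triggers()):
    """Return every trigger that has fired; an empty list means no action."""
    reasons = []
    if psi > triggers.psi_ceiling:
        reasons.append(f"PSI {psi:.2f} above ceiling {triggers.psi_ceiling}")
    if precision < triggers.precision_floor:
        reasons.append(
            f"precision {precision:.2f} below floor {triggers.precision_floor}")
    if days_since_evaluation >= triggers.max_days_since_evaluation:
        reasons.append(
            f"{days_since_evaluation} days since last scheduled re-evaluation")
    return reasons


print(retrain_reasons(psi=0.05, precision=0.91, days_since_evaluation=30))
print(retrain_reasons(psi=0.31, precision=0.74, days_since_evaluation=95))
```

Returning the list of reasons, rather than a bare yes or no, leaves an auditable record of why a re-evaluation or retrain was started.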

NZ context. Aotearoa is a live source of concept drift. The 2023–24 climate events — the Auckland Anniversary floods and Cyclone Gabrielle — shifted what “normal” looks like for insurance, infrastructure, and emergency-response models almost overnight. COVID-19 was a nationwide concept-drift event: in the 2020 lockdowns, spending, travel, and employment patterns broke so sharply that models trained on 2019 data were briefly worse than useless. The NZ Algorithm Charter’s commitment to ongoing review of how algorithms inform decisions — not a one-off pre-launch check — is exactly this obligation written into government practice. A NZ tester should treat concept drift not as a rare edge case but as a near-certainty.

9 Tester vs MLOps Engineer

Ongoing testing is a shared job, and confusion about who owns what is how monitoring falls through the cracks — the Aroha failure. A clean split:

The MLOps engineer	The tester
Builds and runs the monitoring infrastructure — the pipelines that compute metrics and PSI, the dashboards, the alerting.	Decides what to monitor, which thresholds count as a fail, and what evidence a re-evaluation must produce.
Operates retraining and deployment — shadow, canary, A/B mechanics.	Defines the retraining triggers and designs the regression test suite the retrained model must pass before release.
Keeps the system running and the data flowing.	Judges whether the model is still fit for purpose and makes that judgement auditable.

Put simply: the MLOps engineer provides the instruments; the tester decides what the readings have to be, what counts as failure, and what gets re-tested when the alarm sounds. Monitoring without a tester defining thresholds and triggers is a dashboard nobody is reading against a standard — which is how a model slides from 91% to 74% with the lights on and no one watching.

10 Common Mistakes

🚫 Treating release sign-off as the end of testing

I used to think… once a model passes its tests and ships, testing is done — the job moves to the next project.
Actually… release sign-off is a point-in-time WOF. The world keeps moving and the model does not. Without monitoring and scheduled re-evaluation, the model becomes untested again the moment conditions start to change — exactly how Aroha slid from 91% to 74% unnoticed.

🚫 Watching only the output metric and waiting for it to drop

I used to think… the way to catch drift is to watch accuracy or precision and react when it falls.
Actually… output metrics need ground truth, which often arrives late, so by the time the number drops the harm is done. The input distribution usually shifts first, so data-distribution monitoring (and a detector like PSI) can warn you before performance visibly degrades. Watch the inputs, not just the outputs.

🚫 Shipping a retrained model without regression testing it

I used to think… retraining is just an update to fix the drift, so it can go straight out.
Actually… a retrained model is a new model. Restoring precision can silently reintroduce bias or break edge cases the original handled. It must pass the full regression suite — fairness, data quality, performance — before it replaces the live model.

Senior engineer insight

The thing that genuinely changed how I think about drift monitoring is that the gap between "model starts degrading" and "someone notices" is almost never a technical failure; it is an organisational one. In every case I have seen where a model quietly slid for months, there was monitoring infrastructure that worked fine. What was missing was a named person whose job it was to read the dashboard against a written standard and escalate when numbers moved. Aroha, with no monitoring at all, is the extreme case; the more common version is monitoring that gets built while ownership is never assigned, and the outcome is the same.

The most common mistake is treating "we have dashboards" as equivalent to "we have monitoring" — dashboards nobody is accountable for reading are decoration.

From the field

A NZ health insurer ran an AI that triaged incoming claims by injury type to route them to the right assessor. At release it was solid, with accuracy above 90% on the held-out set. The team monitored precision quarterly, which felt reasonable. What nobody monitored was the input distribution: over eighteen months the proportion of mental-health-related claims grew substantially, driven by post-COVID waiting lists, while musculoskeletal claims (which dominated the training data) shrank. By the time the quarterly report flagged a precision dip, the model had been systematically misrouting mental-health claims for two quarters, not because it broke, but because it had never seen that mix before. The trigger for the investigation was a caseworker asking why override rates on one claim category had quietly tripled. The lesson: when you are only watching the output metric and checking it infrequently, you are always the last to know, and the harm is already done.

11 Now You Try

Three graded exercises on drift and ongoing testing. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Drift

Read the production scenario for a fictional Kiwi retail-bank card-fraud model below. Identify which type of drift is occurring (data, concept, or model), explain why, and name the monitoring strategy that would have caught it earliest.

Scenario: The model flags card transactions as possible fraud. It learned that a sudden sharp drop in a customer’s spending often signals a stolen or compromised card. During a nationwide lockdown, spending drops sharply for almost everyone at once — people are at home, shops are shut. The model begins flagging huge numbers of ordinary customers as fraud. The transactions themselves look normal in size and type; what has changed is what “a sudden drop in spending” now means.

Identify the drift type, explain why, and name the earliest-warning monitoring strategy:

Model answer

Drift type: Concept drift.

Why this type and not the others: The relationship the model learned has changed meaning. "A sudden drop in spending" used to map to "possible stolen card"; during lockdown the same input maps to "normal lockdown behaviour." The inputs themselves still look ordinary in size and type — so this is not primarily data drift (the input distribution of transaction values has not changed shape the way the meaning has). It is concept drift: the same input now has a different correct answer. "Model drift" is the umbrella symptom (performance falling), not the specific cause; the cause here is concept drift. COVID-19 is the classic real-world example.

Monitoring strategy that catches it earliest: A rise in human overrides / a spike in flag rate would surface it fast, but the strongest early signal is monitoring the model's output behaviour and ground-truth confirmation together — flag rate jumping while confirmed-fraud rate does not. Data-distribution monitoring of inputs alone may NOT catch pure concept drift, because the inputs can look unchanged while their meaning shifts — a key subtlety. So performance/flag-rate monitoring with a fast feedback loop (and watching override rates) is what catches this one earliest.

The teaching point: concept drift is the trap where inputs look fine but the relationship has moved — input monitoring alone can miss it, which is why you monitor outputs and overrides too.
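
The signal described in this answer, a flag rate that jumps while the confirmed-fraud rate does not, can be written as a simple check. This is a sketch under stated assumptions: the function name, the two ratio thresholds, and the weekly figures are invented, and it presumes a non-zero baseline and a feedback loop fast enough to supply confirmed-fraud counts.

```python
def concept_drift_suspected(baseline, current,
                            flag_jump=2.0, fraud_tolerance=1.2):
    """True when flags have surged but confirmed fraud has not followed.

    Both arguments are dicts of rates per transaction. The thresholds are
    hypothetical: flags at least doubling while confirmed fraud rises by
    no more than 20%.
    """
    flag_ratio = current["flag_rate"] / baseline["flag_rate"]
    fraud_ratio = (current["confirmed_fraud_rate"]
                   / baseline["confirmed_fraud_rate"])
    return flag_ratio >= flag_jump and fraud_ratio <= fraud_tolerance


# Invented weekly figures.
normal_week = {"flag_rate": 0.010, "confirmed_fraud_rate": 0.009}
lockdown_week = {"flag_rate": 0.120, "confirmed_fraud_rate": 0.009}
fraud_wave_week = {"flag_rate": 0.030, "confirmed_fraud_rate": 0.027}

print(concept_drift_suspected(normal_week, lockdown_week))    # flags surge alone
print(concept_drift_suspected(normal_week, fraud_wave_week))  # fraud rose too
```

The lockdown week trips the check because flags rose twelvefold with no extra confirmed fraud; the fraud-wave week does not, because confirmed fraud rose in step with the flags, which is the model doing its job.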

🔧 Exercise 2 of 3 — Fix the Monitoring Plan

The post-deployment monitoring plan below is too vague to detect or respond to drift. Rewrite it as a proper model-monitoring checklist with these columns for at least 3 items: Metric, Frequency, Alert threshold, Escalation path. Use a fictional CoverNZ injury-claim triage model as the context.

Original (too vague):
“We will keep an eye on the model after launch and check it from time to time. If it seems to be getting worse we will look into it.”

Rewrite as a monitoring checklist (Metric | Frequency | Alert threshold | Escalation path):

Model answer

Item 1 | Metric: Triage precision/recall against confirmed clinical outcomes | Frequency: Monthly, plus a quarterly scheduled re-evaluation | Alert threshold: Precision drops more than 5 percentage points below the release baseline, or below an agreed floor | Escalation path: Alert to QA lead → review with clinical owner → decide retrain/hold

Item 2 | Metric: Input distribution shift (PSI) on key features — injury type, region, claimant age band | Frequency: Weekly | Alert threshold: PSI on any monitored feature rises above an agreed value (e.g. moderate-shift band) | Escalation path: Auto-alert to MLOps + QA → investigate cause → data-quality re-check

Item 3 | Metric: Human override rate (how often assessors overturn the model) | Frequency: Weekly | Alert threshold: Override rate rises more than X% above its rolling baseline | Escalation path: Alert to QA lead → sample overridden cases → flag possible concept drift

Item 4 (bonus) | Metric: Fairness — flag/triage-rate parity across groups | Frequency: Quarterly | Alert threshold: Any group's rate outside the agreed parity band | Escalation path: Fairness re-test → governance owner

What makes this a real plan vs the original: every item has a NAMED metric, a SET frequency, a MEASURABLE threshold (not "seems worse"), and a NAMED escalation path. It also mixes output monitoring (precision), input monitoring (PSI — the early warning), and human-signal monitoring (overrides) so concept drift is not missed.

🏗️ Exercise 3 of 3 — Build an Ongoing-Testing Plan

You are the tester for a fictional CityTransit bus-arrival-time prediction model. Design an ongoing-testing plan covering: (a) the metrics you would monitor and at what frequency, (b) two plausible drift types with their likely causes for this system, (c) the trigger conditions that would mean “re-evaluate or retrain now”, and (d) the regression tests you would run on a retrained model before it goes live.

Model answer

(a) Metrics monitored + frequency: Prediction error — mean absolute error between predicted and actual arrival time — tracked daily, with a weekly rollup. Input-distribution monitoring (PSI) on traffic volume, route, time-of-day, and weather inputs — weekly. Error broken down by route and by time-of-day so a problem on one route is not hidden in the average.

(b) Two drift types + likely causes:
  - Data drift: roadworks, a new bus route, or a motorway closure changes the travel patterns feeding the model — the inputs no longer look like training data.
  - Concept drift: a timetable change, a major event (concert, sports fixture), or a sustained shift in commuting after a behaviour change means the relationship between inputs and actual arrival time has moved — the same conditions now produce different real arrival times.

(c) Trigger conditions: mean absolute error rises above an agreed threshold (e.g. predictions off by more than N minutes on a rolling week); OR PSI on a key input crosses its threshold; OR a scheduled quarterly re-evaluation falls due; OR a known step-change event (timetable overhaul) occurs.

(d) Regression tests before release: re-run the full release test suite on the retrained model — overall and per-route accuracy must meet or beat the previous model; no route or time band may regress beyond tolerance; edge cases (late-night, public holidays, disruption events) still handled; and a shadow or canary period comparing the new model against the live one on real traffic before full switchover.

Strong plans separate input monitoring from output monitoring, name BOTH a data-drift and a concept-drift cause specific to buses, give MEASURABLE triggers, and insist the retrained model is regression-tested — not shipped because "it should be better".
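
Parts (a) and (d) of this answer can be sketched in a few lines: mean absolute error as the metric, and a per-route gate that a retrained model must clear. The route names, error figures, and the 0.5-minute tolerance are invented for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average size of the miss, in the same unit as the inputs (minutes)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def regressed_routes(live_mae, candidate_mae, tolerance_minutes=0.5):
    """Routes where the candidate is worse than the live model by more
    than the agreed tolerance. Both arguments map route -> MAE in minutes."""
    return sorted(route for route in live_mae
                  if candidate_mae[route] - live_mae[route] > tolerance_minutes)


# Invented per-route errors for the live and the retrained model.
live = {"Route 1": 2.0, "Route 2": 3.0, "Route 3": 2.5}
candidate = {"Route 1": 1.5, "Route 2": 1.8, "Route 3": 3.6}

overall_live = sum(live.values()) / len(live)
overall_candidate = sum(candidate.values()) / len(candidate)

print(f"overall MAE: live {overall_live:.2f}, candidate {overall_candidate:.2f}")
print("blocked by:", regressed_routes(live, candidate))
```

The candidate improves the overall average (2.50 down to 2.30 minutes) yet is blocked because Route 3 got worse by more than the tolerance. That is the reason for breaking error down by route instead of trusting the headline number.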

Why teams fail here

Monitoring only outputs, not inputs. Precision and recall need ground truth to compute, and ground truth arrives late — often weeks or months after the decision was made. By the time the output metric visibly drops, the damage is already done. Input-distribution monitoring (PSI, chi-square) is the early-warning layer that most teams skip.
No named owner for the monitoring cadence. Dashboards get built, alerts get configured, and then nobody is explicitly responsible for reviewing them on a schedule. Alerts go to a shared Slack channel that everyone assumes someone else is watching.
Setting alert thresholds from intuition rather than baseline data. A team picks "alert if precision drops below 80%" because it sounds reasonable — not because they measured what normal variance looks like in production. The result is either constant false alarms (desensitising the team) or a threshold so loose it never fires until the model is badly wrong.
Shipping a retrained model without running the full regression suite. When drift is confirmed and retraining fixes it, the team is relieved and wants to ship fast. The retrained model goes out with "smoke tests only." Three months later a fairness issue the original handled has silently reappeared, because the retraining data had different representation.
Missing concept drift because input monitoring looks clean. Concept drift is the trap: the inputs themselves look normal (same transaction sizes, same claim types) while their meaning has shifted. Teams relying purely on PSI or data-distribution checks will see stable numbers right up until performance collapses — you need to watch override rates and output-behaviour change too.
Treating the initial monitoring plan as permanent. The thresholds, cadence, and metrics that made sense at launch stop making sense when the business context changes — a new product line, a regulatory update, a major external event. Ongoing testing plans need their own review cycle, not just the model.
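
On the point about thresholds set from intuition: one alternative is to derive the alert floor from measured baseline variation. The sketch below uses the mean minus three standard deviations, which is only one common convention and assumes the baseline period was itself stable; the weekly precision figures are invented.

```python
import statistics


def alert_floor(baseline_values, n_sigma=3.0):
    """Alert threshold derived from observed variation rather than a guess:
    the baseline mean minus n_sigma sample standard deviations."""
    mean = statistics.mean(baseline_values)
    spread = statistics.stdev(baseline_values)
    return mean - n_sigma * spread


# Invented weekly precision from the first eight weeks in production.
baseline_precision = [0.91, 0.90, 0.92, 0.89, 0.91, 0.93, 0.90, 0.92]

floor = alert_floor(baseline_precision)
print(f"alert if weekly precision falls below {floor:.3f}")
```

With this baseline the floor lands at about 0.87, so an intuition-based "alert below 80%" would have stayed silent through a drop several times larger than normal week-to-week variation.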

12 Self-Check

Q1: Why is release sign-off not the end of testing for an AI model?

Because sign-off is a point-in-time assessment — a January WOF. It certifies the model was fit against the world as it was then. The world keeps moving and the model does not, so without monitoring and scheduled re-evaluation the model becomes untested again as soon as conditions change. That is how Aroha slid from 91% to 74% unnoticed.

Q2: What is the difference between data drift and concept drift?

Data drift is when the inputs change — the data coming in looks different from training data. Concept drift is when the relationship changes — the same input now maps to a different correct answer (COVID turning “spending drop” from “stolen card” into “lockdown”). Model drift is the umbrella term for performance degrading from either cause.

Q3: Why watch the input distribution and not just the output metric?

Because output metrics need ground truth, which often arrives late. By the time accuracy visibly drops, the harm is done. The input distribution usually shifts first, so data-distribution monitoring (with a detector like PSI) typically gives the earliest warning, before performance degrades. Watch the inputs, not just the outputs.

Q4: In plain terms, what does a rising Population Stability Index tell you?

That the live inputs have moved away from the distribution the model trained on. Picture training data sorted into ten equal buckets; if today’s data no longer falls roughly evenly into those buckets, PSI rises. A small PSI means stable; a larger one means the inputs have shifted enough to investigate. You read the number a tool produces — you do not compute it by hand.

Q5: Why must a retrained model be regression-tested before it replaces the live one?

Because a retrained model is a new model. Restoring precision on the drifted metric can silently reintroduce bias or break edge cases the original handled. It must pass the full suite — fairness, data quality, performance, edge cases — before going live, so that fixing one problem does not quietly create another.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“Our model passed all its tests and is live. Why would you keep testing it?”

Because passing at release only certifies the model against the world as it was on that day — it is a point-in-time WOF. Conditions change: new customer behaviour, weather events, a policy change, and the model that was good last quarter quietly degrades. Under 42119, post-deployment monitoring and scheduled re-evaluation are part of the testing programme. I would set up monitoring of both inputs and outputs, define thresholds and retraining triggers, and re-test on a cadence — so we catch drift while it is a dashboard signal, not after complaints. A model nobody re-tests is untested from the day the world starts to move.

“What is concept drift, and can you give a NZ example?”

Concept drift is when the relationship the model learned changes meaning — the same input now maps to a different correct answer. The clearest NZ example is COVID-19: in the 2020 lockdowns a sudden drop in card spending stopped meaning “possible stolen card” and started meaning “everyone is at home,” so fraud models trained on 2019 data flagged ordinary customers in droves. The 2023–24 climate events did the same to insurance models — the patterns of a “normal” claim shifted overnight. It is dangerous because the inputs can look unchanged while their meaning has moved, so input monitoring alone can miss it and you have to watch outputs and override rates too.

“Where does the tester’s role end and the MLOps engineer’s begin in monitoring?”

The MLOps engineer builds and runs the instruments — the pipelines that compute metrics and PSI, the dashboards, the alerting, the deployment mechanics for shadow and canary releases. The tester decides what those instruments should measure: which metrics matter, what thresholds count as a fail, what the retraining triggers are, and what regression suite a retrained model must pass before it goes live. In short, MLOps provides the readings; I decide what the readings have to be and what counts as failure. Monitoring without a tester defining that is a dashboard nobody is judging against a standard — which is exactly how a model degrades with the lights on and no one watching.

Key takeaway

A model that is monitored but not owned is a model that will eventually drift in the dark — the difference between "we have dashboards" and "we catch drift" is a named person reading those dashboards against a written standard on a fixed schedule.
