Drift, Monitoring & Ongoing Testing
Testing doesn’t stop at release. A model that was good on launch day will not stay good on its own — detecting and responding to degradation in production is part of the testing programme, not an optional extra.
1 The Hook
A fictional NZ insurer, Aroha Insurance, built an AI to detect fraudulent claims. At release it was excellent — 91% precision on its test set, meaning when it flagged a claim as likely fraud it was right nine times in ten. The team celebrated, signed it off, and moved on to the next project. The model went into production and quietly did its job.
Then the world changed. Through 2023 and 2024, a run of severe weather events — the Auckland Anniversary floods, Cyclone Gabrielle, and the storms that followed — produced a surge of genuine claims that looked nothing like the patterns the model had learned from. Roof damage, surface flooding, and silt clean-ups arrived in volumes and combinations the pre-2023 training data had never seen. The model, trained on a calmer world, started mistaking unusual-but-genuine claims for fraud, and missing the new patterns of opportunistic fraud that rode in on the surge.
Precision slid from 91% to 74%. And here is the part that should worry every tester: nobody noticed for three months. There was no monitoring. The model had passed its tests at release, so everyone assumed it was still passing. It was investigators wondering why so many honest claimants were being flagged that finally surfaced the problem — long after the damage to those claimants, and to the insurer’s reputation, was done.
The model had no new bug. It was trained on pre-2023 patterns, the world moved, and the model did not. That is drift — and the failure was not the model’s, it was the absence of any testing after release. ISO/IEC 42119 treats post-deployment monitoring and ongoing testing as part of the testing programme. This lesson teaches what drift is, how to watch for it, and what to do when you find it.
2 The Rule
A model that is good at release will not necessarily stay good. The world it was trained on keeps moving, and the model does not move with it. Post-deployment monitoring and ongoing testing are part of the testing programme, not an optional extra — a model signed off and never watched again is an untested model from the day the world starts to change.
3 The Analogy
A Warrant of Fitness.
Your car passes its WOF in January. That tells you it was safe to drive on the day it was checked. It does not tell you the car is safe in December — tyres wear, brake pads thin, a light fails. A WOF is a point-in-time assessment, and the law knows it: that is exactly why it expires and has to be redone.
A model’s release sign-off is its January WOF. It certifies the model was fit on the day it was tested, against the world as it was then. As conditions change — new weather patterns, new customer behaviour, a pandemic — the model wears just like brake pads. Monitoring is the dashboard warning light; ongoing testing is the next WOF. Treating release sign-off as permanent is like never getting your car re-checked because it passed once.
4 What Drift Is
Drift is the gradual (or sudden) divergence between the world the model learned and the world it now operates in. 42119 and the AI vocabulary standard (ISO/IEC 22989) distinguish three kinds, and the difference matters because they call for different responses:
- Data drift — the inputs change. The kinds of claims, customers, or transactions coming in look different from the training data, even if the underlying relationship has not changed. The flood surge sending Aroha unfamiliar claim types is data drift.
- Concept drift — the relationship the model learned changes. What “normal” or “fraud” means in the real world shifts, so the same input now maps to a different correct answer. COVID-19 is the textbook case: overnight, “a sudden drop in card spending” stopped meaning “possible stolen card” and started meaning “lockdown.”
- Model drift — the umbrella term for the model’s performance degrading over time, whether driven by data drift, concept drift, or both. It is the symptom; the other two are the causes.
Drift looks different across model types:
- A classification model (fraud / not fraud) drifts as its precision or recall falls.
- A regression model (predicted house price) drifts as its error grows.
- A ranking model (search or recommendation order) drifts as the items it ranks highly stop being the ones people choose.
- A generative model drifts as the inputs people send it move away from what it was tuned on, or as the facts it states go stale.
5 How Drift Shows Up in Production
Drift rarely announces itself. It shows up as a slow erosion that is easy to miss without something watching for it:
- A quiet slide in the headline metric — precision from 91% to 74% over months, with no single day where it obviously breaks.
- A shift in the input distribution — the average claim value, the mix of regions, or the proportion of a customer type moves away from what the model trained on, before performance has even dropped. This is the early-warning signal.
- A rise in human overrides — caseworkers increasingly disagreeing with the model’s output is often the first human-visible sign that the model has fallen out of step with reality.
- Complaints clustering — a sudden pattern of complaints from one group or about one decision type, as at Aroha, is drift surfacing the hard way, after harm.
The lesson from the hook is that none of these is visible unless someone is measuring. A model with no monitoring drifts in the dark.
6 Monitoring Strategies
Monitoring is how you turn drift from an invisible slide into a measured signal. 42119-aligned programmes use several strategies together:
- Performance-metric monitoring — track the model’s accuracy, precision, recall, or error against a stream of ground-truth labels as they arrive, and alert when a metric crosses a threshold. The catch: ground truth often arrives late (was that flagged claim really fraud?), so this can lag.
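- Data-distribution monitoring — watch the inputs, not the outputs. Because the input distribution often shifts before performance visibly drops, this is usually the earliest warning you can get, and it does not wait for ground truth. Its limit: pure concept drift can leave the inputs looking unchanged, so pair it with output and override monitoring.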
- Shadow deployment — run a new or candidate model alongside the live one on real traffic, without acting on its output, to compare them safely before switching.
- A/B testing — route a slice of traffic to a different model version and compare outcomes, to confirm a change actually helps before rolling it out fully.
- Canary release — expose a new model to a small share of traffic first, watch the metrics, and widen only if they hold — so a bad model harms few people before it is caught.
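Performance-metric monitoring, the first strategy in the list above, can be sketched in a few lines. This is a minimal illustration, not a production monitor: the `ResolvedClaim` type, the function names, and the 0.85 floor are all assumptions made for the example. Only the 74% figure comes from the Aroha story. It counts only claims whose ground truth has already arrived, which is exactly why this kind of monitoring lags.

```python
from dataclasses import dataclass

@dataclass
class ResolvedClaim:
    flagged: bool          # the model flagged the claim as likely fraud
    confirmed_fraud: bool  # investigator outcome, which may arrive weeks later

def precision(claims):
    """Share of flagged claims that were confirmed as fraud."""
    flagged = [c for c in claims if c.flagged]
    if not flagged:
        return None  # nothing flagged yet, so precision is undefined
    return sum(c.confirmed_fraud for c in flagged) / len(flagged)

def check_precision(claims, floor):
    """Return 'alert' when precision falls below the agreed floor."""
    p = precision(claims)
    if p is None:
        return "no-data"
    return "alert" if p < floor else "ok"

# Hypothetical month of resolved claims: 100 flagged, 74 confirmed as fraud.
month = [ResolvedClaim(True, True)] * 74 + [ResolvedClaim(True, False)] * 26
print(precision(month), check_precision(month, floor=0.85))
```

The floor itself is a testing decision rather than an engineering one: someone has to agree in advance what number counts as a fail.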
7 Drift-Detection Techniques — in Plain Terms
You do not need heavy maths to understand what the common drift detectors do. Each one answers the same question — “has this distribution moved?” — in a slightly different way:
| Technique | What it does, in plain terms |
|---|---|
| Population Stability Index (PSI) | Compares the spread of a value now against the spread at training time and gives a single number for how far it has moved. Small number = stable; larger number = the inputs have shifted enough to investigate. The most common first detector. |
| KL divergence | Another way to put a number on how different “today’s” distribution is from the “training” distribution — larger means more different. Used the same way as PSI, with different maths underneath. |
| Chi-square test | For categories (region, claim type), checks whether the mix today differs from the mix the model trained on by more than chance — a statistical “is this shift real or just noise?” |
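
For readers who want to see what sits behind the KL divergence and chi-square rows, here is a rough sketch using only the standard library. The claim-type counts are invented for illustration, the function names are hypothetical, and the chi-square check is the simple goodness-of-fit form that treats the training mix as the expected mix. The critical value 7.815 is the standard table value for 3 degrees of freedom at the 5% level.

```python
import math

def kl_divergence(live_fractions, training_fractions):
    """KL(live || training): larger means the live mix differs more."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(live_fractions, training_fractions)
        if p > 0  # a category with no live cases contributes nothing
    )

def chi_square_statistic(training_counts, live_counts):
    """Pearson chi-square comparing live category counts to the training mix."""
    train_total = sum(training_counts.values())
    live_total = sum(live_counts.values())
    stat = 0.0
    for category, train_n in training_counts.items():
        expected = live_total * train_n / train_total
        observed = live_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Invented claim-type counts: training data versus a post-flood month.
training = {"theft": 400, "vehicle": 300, "weather": 200, "other": 100}
live = {"theft": 150, "vehicle": 120, "weather": 200, "other": 30}

stat = chi_square_statistic(training, live)
CRITICAL_DF3_ALPHA_05 = 7.815  # 4 categories -> 3 degrees of freedom
print(stat, "shift is more than chance" if stat > CRITICAL_DF3_ALPHA_05 else "within noise")
```

In practice a statistics library would return a p-value directly; the hand-rolled version is only meant to show that the test compares observed counts with the counts the training mix would predict.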
A simple way to picture PSI: imagine the claim values the model trained on sorted into ten buckets, low to high, each holding 10% of claims. Months later you sort the live claims into the same ten buckets. If they still land roughly 10% each, nothing has moved — PSI is near zero. If the surge has stuffed the “high value” buckets and emptied the low ones, the buckets no longer match, and PSI rises to flag it. You are not computing it by hand — you are reading the number a tool produces and knowing what a rising one means: the inputs have moved away from what the model expects.
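The ten-bucket picture maps directly onto the usual PSI formula: for each bucket, take the live share minus the training share, multiply by the natural log of their ratio, and add the results up. The sketch below is illustrative only. The function names and the surge figures are made up, and the commonly quoted reading of roughly 0.1 as "worth a look" and 0.25 as "significant shift" is an industry rule of thumb, not something this lesson or 42119 prescribes.

```python
import bisect
import math

def bucket_fractions(values, edges):
    """Return the share of values landing in each bucket.

    `edges` are the sorted upper bounds of every bucket except the last,
    taken from the training data so both samples use the same buckets.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_left(edges, v)] += 1
    return [c / len(values) for c in counts]

def psi(training_fractions, live_fractions, eps=1e-6):
    """Population Stability Index over matching buckets."""
    total = 0.0
    for e, a in zip(training_fractions, live_fractions):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty buckets
        total += (a - e) * math.log(a / e)
    return total

training = [0.10] * 10     # ten buckets holding 10% of claims each
stable_live = [0.10] * 10  # live claims still land evenly
# Invented surge: low-value buckets emptied, high-value buckets stuffed.
surge_live = [0.02, 0.03, 0.05, 0.05, 0.05, 0.10, 0.10, 0.15, 0.20, 0.25]

print(psi(training, stable_live))
print(round(psi(training, surge_live), 3))
```

The first call prints zero because nothing has moved; the second comes out well above the 0.25 rule-of-thumb level, which is the "rising number" the paragraph above describes.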
8 Ongoing-Testing Obligations
Monitoring tells you something has moved. Ongoing testing is what you do about it. 42119 frames a set of obligations that continue for the life of the model:
- Scheduled re-evaluation — re-test the model against fresh ground-truth data on a set cadence (say quarterly), not only when an alert fires. The WOF rule: a regular re-check, on the calendar.
- Retraining triggers — define, in advance, the conditions that mean “retrain now”: a PSI above a threshold, precision below a floor, or a scheduled interval reached. A trigger turns a vague “keep an eye on it” into an agreed, testable rule.
- Regression testing after retraining — a retrained model is a new model. It must be re-tested for everything the original was — fairness, data quality, performance, edge cases — before it replaces the old one. Fixing drift on one metric must not silently reintroduce bias on another.
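One way to make retraining triggers "agreed and testable" is to write them down as data and evaluate them mechanically. The sketch below assumes three triggers matching the examples above (a PSI threshold, a precision floor, and a scheduled interval). The class name, field names, and specific numbers are hypothetical placeholders that a real team would agree for itself.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RetrainingTriggers:
    psi_threshold: float               # input shift above this fires a trigger
    precision_floor: float             # precision below this fires a trigger
    max_days_between_evaluations: int  # scheduled re-evaluation interval

def fired_triggers(rules, psi_value, precision_value, last_evaluated, today):
    """Return the names of every trigger condition that is currently met."""
    fired = []
    if psi_value > rules.psi_threshold:
        fired.append("psi")
    if precision_value < rules.precision_floor:
        fired.append("precision")
    if (today - last_evaluated).days >= rules.max_days_between_evaluations:
        fired.append("schedule")
    return fired

# Hypothetical agreed values: quarterly cadence, PSI 0.25, precision floor 0.85.
rules = RetrainingTriggers(psi_threshold=0.25, precision_floor=0.85,
                           max_days_between_evaluations=90)
print(fired_triggers(rules, psi_value=0.30, precision_value=0.90,
                     last_evaluated=date(2024, 1, 1), today=date(2024, 4, 15)))
```

Because the rules are explicit, they can be reviewed, versioned, and tested like any other acceptance criterion.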
NZ context. Aotearoa is a live source of concept drift. The 2023–24 climate events — the Auckland Anniversary floods and Cyclone Gabrielle — shifted what “normal” looks like for insurance, infrastructure, and emergency-response models almost overnight. COVID-19 was a nationwide concept-drift event: in the 2020 lockdowns, spending, travel, and employment patterns broke so sharply that models trained on 2019 data were briefly worse than useless. The NZ Algorithm Charter’s commitment to ongoing review of how algorithms inform decisions — not a one-off pre-launch check — is exactly this obligation written into government practice. An NZ tester should treat concept drift not as a rare edge case but as a near-certainty.
9 Tester vs MLOps Engineer
Ongoing testing is a shared job, and confusion about who owns what is how monitoring falls through the cracks — the Aroha failure. A clean split:
| The MLOps engineer | The tester |
|---|---|
| Builds and runs the monitoring infrastructure — the pipelines that compute metrics and PSI, the dashboards, the alerting. | Decides what to monitor, which thresholds count as a fail, and what evidence a re-evaluation must produce. |
| Operates retraining and deployment — shadow, canary, A/B mechanics. | Defines the retraining triggers and designs the regression test suite the retrained model must pass before release. |
| Keeps the system running and the data flowing. | Judges whether the model is still fit for purpose and makes that judgement auditable. |
Put simply: the MLOps engineer provides the instruments; the tester decides what the readings have to be, what counts as failure, and what gets re-tested when the alarm sounds. Monitoring without a tester defining thresholds and triggers is a dashboard nobody is reading against a standard — which is how a model slides from 91% to 74% with the lights on and no one watching.
10 Common Mistakes
🚫 Treating release sign-off as the end of testing
I used to think… once a model passes its tests and ships, testing is done — the job moves to the next project.
Actually… release sign-off is a point-in-time WOF. The world keeps moving and the model does not. Without monitoring and scheduled re-evaluation, the model becomes untested again the moment conditions start to change — exactly how Aroha slid from 91% to 74% unnoticed.
🚫 Watching only the output metric and waiting for it to drop
I used to think… the way to catch drift is to watch accuracy or precision and react when it falls.
Actually… output metrics need ground truth, which often arrives late, so by the time the number drops the harm is done. The input distribution usually shifts first, so data-distribution monitoring (with a detector like PSI) can warn you before performance visibly degrades. The exception is pure concept drift, where the inputs can look unchanged while their meaning moves. Watch the inputs as well as the outputs and override rates.
🚫 Shipping a retrained model without regression testing it
I used to think… retraining is just an update to fix the drift, so it can go straight out.
Actually… a retrained model is a new model. Restoring precision can silently reintroduce bias or break edge cases the original handled. It must pass the full regression suite — fairness, data quality, performance — before it replaces the live model.
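A regression gate for a retrained model can be as simple as comparing every check against the release baseline and refusing to ship if any of them slips. The sketch below assumes all scores are "higher is better" and uses invented check names and numbers (only the 0.91 precision baseline comes from the Aroha story). It shows the failure mode described above: precision recovers while fairness quietly regresses.

```python
def regression_failures(baseline, candidate, tolerance=0.0):
    """Return the checks where the candidate is worse than the baseline.

    Scores are assumed to be higher-is-better. A check missing from the
    candidate results counts as a failure, because it was not re-tested.
    """
    failures = []
    for check, baseline_score in baseline.items():
        candidate_score = candidate.get(check)
        if candidate_score is None or candidate_score < baseline_score - tolerance:
            failures.append(check)
    return failures

# Hypothetical release baseline versus a retrained candidate.
baseline = {"precision": 0.91, "recall": 0.80,
            "fairness_parity": 0.95, "edge_case_pass_rate": 1.0}
candidate = {"precision": 0.92, "recall": 0.81,
             "fairness_parity": 0.88, "edge_case_pass_rate": 1.0}

failures = regression_failures(baseline, candidate, tolerance=0.01)
print("block release:" if failures else "ok to proceed", failures)
```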
11 Now You Try
Three graded exercises on drift and ongoing testing. Write your answer, then compare it to the model answer.
Read the production scenario for a fictional Kiwi retail-bank card-fraud model below. Identify which type of drift is occurring (data, concept, or model), explain why, and name the monitoring strategy that would have caught it earliest.
Model answer
Drift type: Concept drift.
Why this type and not the others: The relationship the model learned has changed meaning. "A sudden drop in spending" used to map to "possible stolen card"; during lockdown the same input maps to "normal lockdown behaviour." The inputs themselves still look ordinary in size and type — so this is not primarily data drift (the input distribution of transaction values has not changed shape the way the meaning has). It is concept drift: the same input now has a different correct answer. "Model drift" is the umbrella symptom (performance falling), not the specific cause; the cause here is concept drift. COVID-19 is the classic real-world example.
Monitoring strategy that catches it earliest: A rise in human overrides / a spike in flag rate would surface it fast, but the strongest early signal is monitoring the model's output behaviour and ground-truth confirmation together — flag rate jumping while confirmed-fraud rate does not. Data-distribution monitoring of inputs alone may NOT catch pure concept drift, because the inputs can look unchanged while their meaning shifts — a key subtlety. So performance/flag-rate monitoring with a fast feedback loop (and watching override rates) is what catches this one earliest.
The teaching point: concept drift is the trap where inputs look fine but the relationship has moved — input monitoring alone can miss it, which is why you monitor outputs and overrides too.
The post-deployment monitoring plan below is too vague to detect or respond to drift. Rewrite it as a proper model-monitoring checklist with these columns for at least 3 items: Metric, Frequency, Alert threshold, Escalation path. Use a fictional ACC injury-claim triage model as the context.
“We will keep an eye on the model after launch and check it from time to time. If it seems to be getting worse we will look into it.”
Model answer
| Item | Metric | Frequency | Alert threshold | Escalation path |
|---|---|---|---|---|
| 1 | Triage precision/recall against confirmed clinical outcomes | Monthly, plus a quarterly scheduled re-evaluation | Precision drops more than 5 percentage points below the release baseline, or below an agreed floor | Alert to QA lead → review with clinical owner → decide retrain/hold |
| 2 | Input distribution shift (PSI) on key features — injury type, region, claimant age band | Weekly | PSI on any monitored feature rises above an agreed value (e.g. moderate-shift band) | Auto-alert to MLOps + QA → investigate cause → data-quality re-check |
| 3 | Human override rate (how often assessors overturn the model) | Weekly | Override rate rises more than X% above its rolling baseline | Alert to QA lead → sample overridden cases → flag possible concept drift |
| 4 (bonus) | Fairness — flag/triage-rate parity across groups | Quarterly | Any group's rate outside the agreed parity band | Fairness re-test → governance owner |
What makes this a real plan vs the original: every item has a NAMED metric, a SET frequency, a MEASURABLE threshold (not "seems worse"), and a NAMED escalation path. It also mixes output monitoring (precision), input monitoring (PSI — the early warning), and human-signal monitoring (overrides) so concept drift is not missed.
You are the tester for a fictional Auckland Transport bus-arrival-time prediction model. Design an ongoing-testing plan covering: (a) the metrics you would monitor and at what frequency, (b) two plausible drift types with their likely causes for this system, (c) the trigger conditions that would mean “re-evaluate or retrain now”, and (d) the regression tests you would run on a retrained model before it goes live.
Model answer
(a) Metrics monitored + frequency:
- Prediction error — mean absolute error between predicted and actual arrival time — tracked daily, with a weekly rollup.
- Input-distribution monitoring (PSI) on traffic volume, route, time-of-day, and weather inputs — weekly.
- Error broken down by route and by time-of-day so a problem on one route is not hidden in the average.
(b) Two drift types + likely causes:
- Data drift: roadworks, a new bus route, or a motorway closure changes the travel patterns feeding the model — the inputs no longer look like training data.
- Concept drift: a timetable change, a major event (concert, sports fixture), or a sustained shift in commuting after a behaviour change means the relationship between inputs and actual arrival time has moved — the same conditions now produce different real arrival times.
(c) Trigger conditions (any one is enough):
- Mean absolute error rises above an agreed threshold (e.g. predictions off by more than N minutes on a rolling week).
- PSI on a key input crosses its threshold.
- A scheduled quarterly re-evaluation falls due.
- A known step-change event (timetable overhaul) occurs.
(d) Regression tests before release: re-run the full release test suite on the retrained model — overall and per-route accuracy must meet or beat the previous model; no route or time band may regress beyond tolerance; edge cases (late-night, public holidays, disruption events) still handled; and a shadow or canary period comparing the new model against the live one on real traffic before full switchover.
Strong plans separate input monitoring from output monitoring, name BOTH a data-drift and a concept-drift cause specific to buses, give MEASURABLE triggers, and insist the retrained model is regression-tested — not shipped because "it should be better".
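The point in (a) about breaking error down by route is easy to see with a small sketch. The route names and minute values below are invented, and `mae_by_route` is a hypothetical helper, not part of any particular monitoring tool.

```python
from collections import defaultdict

def mae_by_route(records):
    """Mean absolute error per route.

    `records` is an iterable of (route, predicted_minutes, actual_minutes).
    """
    errors = defaultdict(list)
    for route, predicted, actual in records:
        errors[route].append(abs(predicted - actual))
    return {route: sum(errs) / len(errs) for route, errs in errors.items()}

def overall_mae(records):
    """Mean absolute error across every record, ignoring route."""
    errs = [abs(predicted - actual) for _, predicted, actual in records]
    return sum(errs) / len(errs)

# Invented data: route A is fine, route B has drifted badly.
records = [
    ("A", 10, 11), ("A", 5, 4), ("A", 8, 9),
    ("B", 12, 18), ("B", 7, 15),
]
print(overall_mae(records))
print(mae_by_route(records))
```

The overall figure sits between the two routes and looks tolerable, while the per-route view shows that one route carries almost all of the error.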
12 Self-Check
Q1: Why is release sign-off not the end of testing for an AI model?
Because sign-off is a point-in-time assessment — a January WOF. It certifies the model was fit against the world as it was then. The world keeps moving and the model does not, so without monitoring and scheduled re-evaluation the model becomes untested again as soon as conditions change. That is how Aroha slid from 91% to 74% unnoticed.
Q2: What is the difference between data drift and concept drift?
Data drift is when the inputs change — the data coming in looks different from training data. Concept drift is when the relationship changes — the same input now maps to a different correct answer (COVID turning “spending drop” from “stolen card” into “lockdown”). Model drift is the umbrella term for performance degrading from either cause.
Q3: Why watch the input distribution and not just the output metric?
Because output metrics need ground truth, which often arrives late, so by the time accuracy visibly drops the harm is done. The input distribution usually shifts first, so data-distribution monitoring (with a detector like PSI) tends to give the earliest warning, before performance degrades. The exception is pure concept drift, where inputs can look unchanged while their meaning moves, so watch the inputs as well as the outputs and override rates.
Q4: In plain terms, what does a rising Population Stability Index tell you?
That the live inputs have moved away from the distribution the model trained on. Picture training data sorted into ten equal buckets; if today’s data no longer falls roughly evenly into those buckets, PSI rises. A small PSI means stable; a larger one means the inputs have shifted enough to investigate. You read the number a tool produces — you do not compute it by hand.
Q5: Why must a retrained model be regression-tested before it replaces the live one?
Because a retrained model is a new model. Restoring precision on the drifted metric can silently reintroduce bias or break edge cases the original handled. It must pass the full suite — fairness, data quality, performance, edge cases — before going live, so that fixing one problem does not quietly create another.
13 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“Our model passed all its tests and is live. Why would you keep testing it?”
Because passing at release only certifies the model against the world as it was on that day — it is a point-in-time WOF. Conditions change: new customer behaviour, weather events, a policy change, and the model that was good last quarter quietly degrades. Under 42119, post-deployment monitoring and scheduled re-evaluation are part of the testing programme. I would set up monitoring of both inputs and outputs, define thresholds and retraining triggers, and re-test on a cadence — so we catch drift while it is a dashboard signal, not after complaints. A model nobody re-tests is untested from the day the world starts to move.
“What is concept drift, and can you give a NZ example?”
Concept drift is when the relationship the model learned changes meaning — the same input now maps to a different correct answer. The clearest NZ example is COVID-19: in the 2020 lockdowns a sudden drop in card spending stopped meaning “possible stolen card” and started meaning “everyone is at home,” so fraud models trained on 2019 data flagged ordinary customers in droves. The 2023–24 climate events did the same to insurance models — the patterns of a “normal” claim shifted overnight. It is dangerous because the inputs can look unchanged while their meaning has moved, so input monitoring alone can miss it and you have to watch outputs and override rates too.
“Where does the tester’s role end and the MLOps engineer’s begin in monitoring?”
The MLOps engineer builds and runs the instruments — the pipelines that compute metrics and PSI, the dashboards, the alerting, the deployment mechanics for shadow and canary releases. The tester decides what those instruments should measure: which metrics matter, what thresholds count as a fail, what the retraining triggers are, and what regression suite a retrained model must pass before it goes live. In short, MLOps provides the readings; I decide what the readings have to be and what counts as failure. Monitoring without a tester defining that is a dashboard nobody is judging against a standard — which is exactly how a model degrades with the lights on and no one watching.