Why AI Testing Is Different
Traditional software testing was built for systems that do the same thing every time. AI systems do not. Until you understand why, your test suite will pass while the system fails.
1 The Hook
The NZ Revenue Analytics Unit, a Wellington government agency, deployed an AI-assisted document classification system to route incoming correspondence to the right team. The build was clean. The test team ran their standard suite — functional tests, boundary tests, integration tests against the case management system. Everything passed. The system went live.
Six weeks later, a team lead noticed a backlog building in the manual review queue. Correspondence from Māori land trusts was being misclassified and misrouted at a much higher rate than other correspondence types — letters about succession, partition, and trust administration were being sent to the wrong teams, delayed, and in some cases lost in the wrong workflow. No code had changed. No defect had ever been raised. Every test still passed.
The investigation found the cause in the training data. Māori land correspondence uses distinct legal terminology, references the Māori Land Court, and follows document structures that were under-represented in the historical data the model learned from. The model had simply never seen enough of these documents to classify them reliably. It was not broken. It had been trained on a dataset that did not represent the full population it would serve.
Here is the uncomfortable part: no traditional test could have caught this. The functional tests checked that a document went somewhere. The integration tests checked that the routing API returned a 200. None of them asked the question that mattered — does this model perform equally well across every type of correspondence it will actually receive? That question is not in a traditional test plan. Under ISO/IEC 42119, it is.
This lesson is about the gap that scenario exposes: the categories of failure that AI systems introduce, and why the testing you already know — good as it is — was never designed to find them.
2 The Rule
Traditional software testing was designed for deterministic systems — same input, same output, every time. AI systems are probabilistic, data-driven, and can degrade without a single line of code changing. That means traditional test approaches will pass while entire categories of AI failure go undetected. ISO/IEC 42119 exists to close that gap.
3 The Analogy
Testing a vending machine versus testing a sommelier.
A vending machine at a Z petrol station is deterministic. Press B4, you get the same bag of chips every time. You can test it exhaustively: every button, every coin, every edge case. Pass the tests once and you know it works, because nothing about it changes between yesterday and tomorrow.
A sommelier is probabilistic. Hand them the same glass of Marlborough sauvignon blanc twice and you may get two different descriptions — because their answer depends on context, on what they tasted before, on conditions you cannot fully control. You cannot test a sommelier by checking that button B4 always returns chips. You have to test them differently: across many wines, looking for consistency and fairness in their judgement, watching whether their palate drifts over a long evening. An AI system is the sommelier, not the vending machine. Testing it like a vending machine is exactly how the NZ Revenue Analytics Unit’s suite passed while the system failed.
4 Deterministic vs Probabilistic Systems
The single most important shift for a tester moving into AI is this: you are no longer testing a system that behaves the same way every time.
A deterministic system maps each input to exactly one output. A PAYE calculator given an income of $70,000 produces one correct answer, today and forever, until someone changes the code. This is the world traditional testing was built for. Boundary value analysis, equivalence partitioning, expected-result assertions — they all assume that if you know the input, you can state the one correct output in advance.
A probabilistic system maps each input to a distribution of possible outputs, and selects from it. An AI fraud model given the same transaction might score it 0.82 today and 0.79 tomorrow after a retrain. A large language model given the same prompt twice can produce two different answers. There is often no single “correct” output to assert against — only outputs that are more or less acceptable, more or less likely, more or less fair.
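A small sketch may make the contrast concrete. Everything in it is an illustrative assumption rather than part of any real system: the function names, the routing table, and the 0.8 / 0.2 distribution are hypothetical. It shows that a deterministic function supports an exact assertion, while a sampled output can only be checked in aggregate and within a tolerance.

```python
import random
from collections import Counter


def route_by_rule(doc_type: str) -> str:
    """Deterministic: the same input always maps to the same team."""
    routing_table = {"invoice": "finance", "complaint": "customer-care"}
    return routing_table.get(doc_type, "manual-review")


def sample_label(probabilities: dict[str, float], rng: random.Random) -> str:
    """Probabilistic: draw one label from a distribution over labels."""
    labels = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(labels, weights=weights, k=1)[0]


# Deterministic behaviour supports an exact expected-result assertion.
assert route_by_rule("invoice") == "finance"

# Probabilistic behaviour is checked across many draws, with a tolerance.
rng = random.Random(42)  # fixed seed only so this sketch is repeatable
distribution = {"finance": 0.8, "customer-care": 0.2}  # hypothetical values
draws = Counter(sample_label(distribution, rng) for _ in range(10_000))
finance_share = draws["finance"] / 10_000
assert abs(finance_share - 0.8) < 0.03
```

Any single draw from `sample_label` tells you very little. Only the share across thousands of draws says something about the system's behaviour.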
What this changes for test design
- You test against thresholds and ranges, not exact values. “The model must score known-fraud transactions above 0.7 at least 95% of the time” replaces “the output equals X”.
- You test statistical behaviour across a population, not single cases. One passing test case proves almost nothing about a probabilistic system; you need a representative test dataset and aggregate metrics.
- You re-test over time, not just once. A deterministic system that passes keeps passing until the code changes. A probabilistic system can pass at go-live and fail three months later with no code change at all.
- You test the data and the model, not just the code. In a deterministic system the code is the product. In an AI system, the data shapes the behaviour as much as the code does — so untested data is an untested system.
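The first two points in the list above can be sketched as a single check. This is a minimal illustration, assuming the model's scores for a set of known-fraud transactions are already available as a list. The 0.7 floor and 95% fraction come from the example in the list; the function names and the score values are hypothetical.

```python
def fraction_above(scores: list[float], score_floor: float) -> float:
    """Share of scores strictly above score_floor."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s > score_floor) / len(scores)


def meets_threshold(
    scores: list[float], score_floor: float = 0.7, min_fraction: float = 0.95
) -> bool:
    """Aggregate pass/fail: enough of the population clears the floor."""
    return fraction_above(scores, score_floor) >= min_fraction


# Hypothetical model scores for 20 transactions already known to be fraud.
known_fraud_scores = [
    0.91, 0.88, 0.95, 0.79, 0.83, 0.97, 0.74, 0.86, 0.92, 0.81,
    0.77, 0.99, 0.85, 0.90, 0.72, 0.94, 0.89, 0.78, 0.96, 0.64,
]

# 19 of the 20 scores clear 0.7, so the 95% criterion is just met.
suite_passes = meets_threshold(known_fraud_scores)
```

Note that the assertion is on the population, not on any one transaction. The single score of 0.64 does not fail the test, and no individual score is compared with an exact expected value.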
5 The Five AI-Specific Failure Modes
ISO/IEC 42119 is organised around the categories of failure that are unique to AI systems — the ones that traditional 29119 testing does not address. There are five you need to know by name, because they map directly to the AI-specific test types in the lessons that follow.
1. Data quality failures
The system behaves badly because the data it learned from was incomplete, unrepresentative, mislabelled, or of unknown origin. The NZ Revenue Analytics Unit failure was a data quality failure: the training data under-represented Māori land correspondence. The model was fine. The data was not. (Lesson 2.)
2. Model performance degradation
The model is measurably worse than required — too many false positives, too many missed cases — either at launch or developing over time. A fraud model that catches only 60% of fraud, or floods analysts with false alarms, is failing on performance even if every API call succeeds. (Lesson 3.)
3. Bias and fairness failures
The system performs unequally across groups of people — by ethnicity, age, gender, location, or any protected characteristic. This is not an ethical opinion bolted on at the end; under 42119 it is a measurable, testable quality characteristic with its own test types. (Lesson 4.)
4. Explainability failures
The system produces a decision that no one can explain or justify. When an AI declines a Kiwibank loan application or flags an ACC claim for review, someone — the customer, an auditor, the FMA — may be legally entitled to know why. A model that cannot supply a defensible reason has failed an explainability requirement. (Lesson 3.)
5. Drift
The system was fine at go-live and silently got worse. The world changed — customer behaviour shifted, prices moved, fraud patterns evolved — but the model kept making decisions based on the world it was trained on. Drift is the failure mode that most surprises teams from a traditional background, because nothing “broke.” It just quietly stopped being right. (Lesson 3.)
Why these five matter to you as a tester
A traditional regression suite checks none of these. It checks that the code does what the code is supposed to do. All five AI failure modes can be present in a system whose code is flawless. That is the core reason 42119 exists — and the core reason “all tests pass” means much less for an AI system than it does for a billing engine.
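To see how "all tests pass" and "the model is failing" can both be true, consider this sketch based on the fraud example in item 2. It assumes a small labelled evaluation set is available. The data, the `MIN_RECALL` value, and the function name are hypothetical, chosen only to illustrate the point.

```python
VALID_LABELS = {0, 1}  # 1 = flagged as fraud, 0 = not flagged


def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (precision, recall) for the positive class, 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation set: 10 fraud cases, then 10 legitimate ones.
y_true = [1] * 10 + [0] * 10
# The model catches 6 of the 10 fraud cases and raises 2 false alarms.
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

# Traditional-style check: every output is a valid label. This passes.
outputs_valid = all(p in VALID_LABELS for p in y_pred)

# AI-specific check: recall against an agreed minimum. This fails.
MIN_RECALL = 0.85  # assumed threshold, set at design time
precision, recall = precision_recall(y_true, y_pred)
performance_ok = recall >= MIN_RECALL
```

Here `outputs_valid` is true while `performance_ok` is false: the model returns only well-formed answers, yet it catches just 60% of fraud.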
6 The AI System Lifecycle and Where Testing Fits
ISO/IEC 42119 treats testing as something that runs across the whole AI system lifecycle, not a phase at the end. The standard (drawing on the concepts in ISO/IEC 22989) describes four broad phases, and there is testing to do in each.
Design
Before any model is trained, you build the AI risk register, define the quality characteristics that matter, and set the acceptance thresholds. The most valuable testing decision — what to test and how hard — is made here. A tester who arrives only at the end has already lost the most important argument.
Development
This is where data quality testing and model testing live. You test the training data for representativeness, provenance, and label correctness. You test model performance against the thresholds set in design. You run adversarial tests. This happens alongside model development, not after it.
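One simple form of representativeness check is to compare each category's share of the training data with an agreed minimum. The sketch below assumes the training labels are available as a plain list. The category names, counts, and 5% minimum are hypothetical, and a real check would usually compare against the expected production mix rather than a flat floor.

```python
from collections import Counter


def category_shares(labels: list[str], expected: list[str]) -> dict[str, float]:
    """Share of the training set held by each expected category."""
    if not labels:
        raise ValueError("training set is empty")
    counts = Counter(labels)  # a missing category counts as 0
    return {category: counts[category] / len(labels) for category in expected}


def under_represented(
    labels: list[str], expected: list[str], min_share: float
) -> list[str]:
    """Expected categories whose share falls below min_share."""
    shares = category_shares(labels, expected)
    return sorted(c for c, share in shares.items() if share < min_share)


# Hypothetical training set of 1,000 labelled documents.
training_labels = (
    ["tax-return"] * 600 + ["payment-query"] * 380 + ["land-trust"] * 20
)
# Categories the live system is expected to receive, including one
# that never appears in the training data at all.
expected_categories = ["tax-return", "payment-query", "land-trust", "objection"]

MIN_SHARE = 0.05  # assumed minimum, agreed at design time
flagged = under_represented(training_labels, expected_categories, MIN_SHARE)
```

Passing the list of expected categories in explicitly matters: a category that is entirely absent from the training data would otherwise never show up in the counts.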
Deployment
This phase covers fairness testing before go-live, system and integration testing of the model inside the real application, and setting up the continuous validation that will run from then on. Deployment is not the finish line for AI testing; it is where the longest phase begins.
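As an illustration of a pre-go-live fairness check, the sketch below compares approval rates between groups and reports the largest gap. It is a simplified take on demographic parity. The decisions, the region names used as groups, and the 0.05 tolerance are all hypothetical.

```python
from collections import defaultdict


def approval_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Approval rate per group from (group, approved) pairs."""
    totals: dict[str, int] = defaultdict(int)
    approvals: dict[str, int] = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        if approved:
            approvals[group] += 1
    return {group: approvals[group] / totals[group] for group in totals}


def parity_gap(rates: dict[str, float]) -> float:
    """Largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical model decisions: 100 applications from each region.
decisions = (
    [("Auckland", True)] * 70 + [("Auckland", False)] * 30
    + [("Northland", True)] * 55 + [("Northland", False)] * 45
)

MAX_GAP = 0.05  # assumed tolerance, agreed at design time
rates = approval_rates(decisions)
gap = parity_gap(rates)
parity_ok = gap <= MAX_GAP
```

A raw rate comparison like this does not control for legitimate differences between the groups. In practice you would compare applications that are equivalent on the factors that should drive the decision.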
Retirement
When a model is decommissioned or replaced, there is testing to confirm it has been cleanly removed, that its data obligations are met, and that any system depending on it has a safe fallback. AI systems hold data and make decisions; switching one off is itself a tested event.
The headline: in traditional testing the centre of gravity is before release. In AI testing, a large share of the work — drift detection, continuous validation — happens after release, for as long as the system is live.
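A post-release drift check can be as simple as re-measuring accuracy on a fresh labelled sample and comparing it with the go-live baseline. The sketch below assumes such samples exist. The data and the 0.05 maximum drop are hypothetical, and the drop is treated as an absolute difference in accuracy, which is itself an assumption about how the threshold is defined.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def has_drifted(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """True when accuracy has fallen by more than max_drop (absolute)."""
    return baseline - current > max_drop


# Hypothetical labelled sample evaluated at go-live: 9 of 10 correct.
go_live_truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
go_live_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical fresh sample three months later: 8 of 10 correct.
fresh_truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
fresh_preds = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

baseline_accuracy = accuracy(go_live_truth, go_live_preds)
current_accuracy = accuracy(fresh_truth, fresh_preds)
drift_alert = has_drifted(baseline_accuracy, current_accuracy)
```

The model code is identical in both measurements. Only the data has changed, which is why this check has to run on a schedule after release rather than once before it.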
7 How 42119 Extends 29119 — What Stays, What Changes, What Is New
A common misconception is that AI testing replaces everything you know. It does not. ISO/IEC TS 42119-2:2025 is explicitly built on top of the ISO/IEC/IEEE 29119 software testing series. Most of your skills carry straight over.
What stays the same
Test planning, test design techniques, test levels (component, integration, system, acceptance), risk-based prioritisation, defect management, traceability, and the discipline of documented, repeatable test cases. The 29119 backbone is intact. An AI system still has a user interface, APIs, a database, and integrations — all of which you test exactly as you always have.
What changes
Expected results become thresholds and statistical criteria instead of exact values. Test data stops being a supporting input and becomes a primary test target in its own right. Testing extends past deployment into continuous validation. And the quality characteristics you test against expand — 42119 maps to the ISO/IEC 25059 AI quality model, which breaks functional suitability into functional correctness, functional appropriateness, functional completeness, and functional adaptability.
What is entirely new
Test types that have no equivalent in traditional testing: data representativeness, data provenance, and label correctness testing (data layer); model performance, adversarial, and explainability testing (model layer); counterfactual fairness and demographic parity testing (fairness layer); and drift testing with continuous validation (post-deployment). These are the subjects of Lessons 2 through 4.
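To give a feel for one of these new test types, here is a minimal sketch of a counterfactual fairness check: change only one attribute on each record and see whether the model's decision changes. The two model functions are hypothetical stand-ins written for this example, not a real API, and the applicant fields and revenue cut-off are invented.

```python
from typing import Callable

Applicant = dict  # e.g. {"revenue": 80_000, "region": "Auckland"}


def counterfactual_flips(
    model: Callable[[Applicant], str],
    applicants: list[Applicant],
    attribute: str,
    alternatives: list[str],
) -> list[Applicant]:
    """Applicants whose decision changes when only `attribute` changes."""
    flipped = []
    for applicant in applicants:
        original = model(applicant)
        for value in alternatives:
            if value == applicant[attribute]:
                continue
            variant = {**applicant, attribute: value}  # copy, one field changed
            if model(variant) != original:
                flipped.append(applicant)
                break
    return flipped


def fair_model(a: Applicant) -> str:
    """Hypothetical stand-in: decides on revenue alone."""
    return "approve" if a["revenue"] >= 50_000 else "decline"


def biased_model(a: Applicant) -> str:
    """Hypothetical stand-in: declines one region regardless of revenue."""
    return "decline" if a["region"] == "Northland" else fair_model(a)


applicants = [
    {"revenue": 80_000, "region": "Auckland"},
    {"revenue": 80_000, "region": "Northland"},
    {"revenue": 20_000, "region": "Auckland"},
]
regions = ["Auckland", "Northland"]

fair_flips = counterfactual_flips(fair_model, applicants, "region", regions)
biased_flips = counterfactual_flips(biased_model, applicants, "region", regions)
```

With these stand-ins, `fair_flips` is empty and `biased_flips` contains the two high-revenue applicants, whose outcome depends on region alone.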
The standards around it
42119 does not stand alone. It connects to a small family of ISO/IEC standards that a senior tester should be able to name:
- ISO/IEC/IEEE 29119 — the software testing series 42119 extends.
- ISO/IEC 25059 — the AI quality model; the quality characteristics you test against.
- ISO/IEC 23894 — AI risk management; where the AI risk register comes from.
- ISO/IEC 22989 — AI concepts and terminology; scope, system boundaries, and stakeholder definitions.
- ISO/IEC 42001 — AI management systems; the governance standard that 42119 operationalises through testing.
8 The Risk-Based Approach and the AI Risk Register
42119 is risk-based from end to end. You do not test every AI-specific characteristic to the same depth on every system. You identify the AI risks for this system, and you let those risks drive what you test, which techniques you use, and how deep you go. A churn model that recommends marketing offers carries very different risk from a model that decides ACC claim eligibility — and the test effort should reflect that.
The instrument that captures this is the AI risk register, built using the approach in ISO/IEC 23894. It is the single most important artefact in AI testing, because every test case ultimately traces back to a risk in it. At minimum, each row records the risk, its category, how likely and how serious it is, and the test approach that addresses it.
A worked example
For a fictional Toka Tū Ake EQC damage-assessment AI that estimates repair cost from claim photos, a few rows might look like this:
Risk | Category | Likelihood | Impact | Test approach
Model under-estimates damage for older Canterbury housing stock under-represented in training data | Data | Medium | High | Data representativeness testing across property age bands
Estimate accuracy degrades as building costs inflate post-training | Drift | High | High | Continuous validation; monthly drift test on a fresh sample
Assessment differs by region for visually identical damage | Fairness | Medium | High | Demographic parity testing across regions
Assessor cannot explain why a claim was flagged for manual review | Explainability | Medium | Medium | Explainability test cases on a sample of flagged claims
Notice what the register does. It turns vague anxiety (“is this AI any good?”) into a finite, prioritised list of testable questions — each tagged with the failure-mode category from Section 5, each pointing at a specific 42119 test type. That mapping, from risk to category to test approach, is the spine of everything in this module.
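If it helps to think of the register as data, the four rows above could be represented as follows. This is only a sketch. The `Risk` class is hypothetical, and scoring each risk as likelihood multiplied by impact on a 1 to 3 scale is a common convention assumed here for illustration, not something this lesson attributes to 42119 or 23894.

```python
from dataclasses import dataclass

LEVEL = {"Low": 1, "Medium": 2, "High": 3}  # assumed ordinal scale
FAILURE_MODES = {"Data", "Model performance", "Fairness", "Explainability", "Drift"}


@dataclass(frozen=True)
class Risk:
    description: str
    category: str
    likelihood: str
    impact: str
    test_approach: str

    @property
    def score(self) -> int:
        """Assumed priority score: likelihood times impact."""
        return LEVEL[self.likelihood] * LEVEL[self.impact]


register = [
    Risk("Under-estimates damage for older Canterbury housing stock",
         "Data", "Medium", "High",
         "Data representativeness testing across property age bands"),
    Risk("Estimate accuracy degrades as building costs inflate",
         "Drift", "High", "High",
         "Continuous validation; monthly drift test on a fresh sample"),
    Risk("Assessment differs by region for visually identical damage",
         "Fairness", "Medium", "High",
         "Demographic parity testing across regions"),
    Risk("Assessor cannot explain why a claim was flagged",
         "Explainability", "Medium", "Medium",
         "Explainability test cases on a sample of flagged claims"),
]

# Every row must carry a known category and a concrete test approach.
register_valid = all(r.category in FAILURE_MODES and r.test_approach for r in register)

# Highest-scoring risks first; ties keep their original order.
prioritised = sorted(register, key=lambda r: r.score, reverse=True)
```

Under this assumed scoring, the drift risk comes first, followed by the data and fairness risks, then explainability.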
9 Common Mistakes
🚫 Treating “all tests pass” as proof an AI system is ready
Why it happens: A green test suite has meant “ready to ship” for an entire career. The instinct carries over unchanged to AI work.
The fix: A passing traditional suite proves the code does what the code intends — nothing about data quality, fairness, or drift. Ask which of the five AI failure modes each test actually covers. Usually the answer is none.
🚫 Writing AI test cases with exact expected-result assertions
Why it happens: Boundary value analysis and equivalence partitioning trained you to state the one correct output for a given input.
The fix: Probabilistic systems do not have one correct output. Test against thresholds, ranges, and aggregate metrics over a representative dataset — “at least 95% above 0.7”, not “equals 0.82”.
🚫 Treating testing as finished at go-live
Why it happens: In traditional projects, deployment is the end of the test effort.
The fix: AI systems drift. A model can pass every pre-deployment test and silently degrade over the following months. Continuous validation after release is not optional under 42119 — it is a core part of the test approach.
🚫 Starting test design without an AI risk register
Why it happens: The team is keen to write test cases and a risk register feels like governance overhead.
The fix: 42119 is risk-based by design. Without a register you cannot justify your test scope, prioritise sensibly, or answer an auditor. Build it first — it is the artefact every test case traces back to.
10 Now You Try
Three graded exercises. Each builds on the one before. Write your answer, then compare it to the model answer.
Below are 8 test scenarios for a fictional MBIE business grant eligibility AI that recommends whether an application should be approved, declined, or sent for manual review. For each, decide whether it is covered by traditional testing or requires AI-specific testing per 42119, and say why in one line.
S1 | The application form rejects an NZBN that is not exactly 13 digits
S2 | The model approves and declines applications at the same rate for businesses in Auckland as for businesses in Northland, given equivalent financials
S3 | The “submit” button is disabled until all mandatory fields are complete
S4 | The model’s approval accuracy has not dropped more than 5% three months after go-live
S5 | The recommendation API returns within 800ms under 100 concurrent requests
S6 | When the model declines an application, it returns a reason a caseworker can defend to the applicant
S7 | The training data includes enough sole-trader applications, not just companies, to classify them reliably
S8 | A declined application can be escalated to manual review via the case management screen
Classify each scenario and give a one-line reason:
Model answer
- S1: Traditional. Field-format validation (13-digit NZBN) is deterministic input validation. Same input, same result.
- S2: AI-specific (fairness). This is a demographic parity question across regions, a fairness test type that traditional testing never addresses.
- S3: Traditional. UI state logic; deterministic and unrelated to model behaviour.
- S4: AI-specific (drift). Tracking accuracy decay over time after go-live is drift testing / continuous validation. No traditional test runs post-deployment to watch for silent degradation.
- S5: Traditional. Performance/load testing of the API. The model is a black box here; you are testing response time, not model quality. (Watch the wording: this is system performance, not model performance.)
- S6: AI-specific (explainability). Whether a decline carries a defensible reason is an explainability test, a quality characteristic with no traditional equivalent.
- S7: AI-specific (data quality). Whether the training data represents sole traders is data representativeness testing. The NZ Revenue Analytics Unit failure was exactly this.
- S8: Traditional. Workflow/integration test of the escalation path. Deterministic application behaviour.
Pattern: 4 traditional (S1, S3, S5, S8), 4 AI-specific (S2 fairness, S4 drift, S6 explainability, S7 data quality). Note S5 is the trap: “performance” here means API speed, not model performance.
A team has written the test plan below for a fictional KiwiSaver fund recommendation engine — an AI that recommends a fund (conservative, balanced, growth) based on a member’s age, balance, risk answers, and time to retirement. The plan has no AI-specific coverage at all. Identify the 4 most critical gaps against 42119, naming the failure-mode category for each.
1. Verify the recommendation form validates age (18–100) and balance (≥ $0).
2. Verify the recommendation API returns a valid fund type for every complete submission.
3. Verify the recommendation displays correctly on mobile and desktop.
4. Verify the recommendation is saved to the member’s account record.
5. Verify response time is under 1 second for 200 concurrent users.
6. Regression: re-run all of the above on each release.
List the 4 most critical AI-specific gaps and name the category for each:
Model answer
The plan tests the plumbing (form, API, UI, persistence, performance) but never tests the recommendation itself. The four most critical gaps:
- Gap 1: Model performance. Nothing checks whether the recommendations are actually any good. There is no test that, for members with known-appropriate fund types, the model recommends correctly above an agreed threshold. “Returns a valid fund type” (test 2) only checks the output is one of three allowed values, not that it is the right one.
- Gap 2: Fairness. Nothing checks whether recommendations differ unfairly across groups, for example whether women and men with identical age, balance, and risk answers receive systematically different fund recommendations. Demographic parity testing is required and absent.
- Gap 3: Drift. The regression (test 6) re-runs the same checks on each release, but nothing monitors whether recommendation quality degrades over time as markets move and member behaviour changes. Continuous validation / drift testing is missing entirely.
- Gap 4: Data quality / representativeness. Nothing checks whether the training data represented the full membership, e.g. members near retirement, very low balances, or those who skipped risk questions. An under-represented group will get unreliable recommendations, exactly as in the Hook scenario.
(Explainability is a strong fifth gap: a member declined a growth fund may be entitled to know why. Naming it as well is a bonus, not a miss.)
Draft a 5-row AI risk register for a fictional Waka Kotahi traffic incident detection AI — a system that analyses motorway camera feeds and automatically alerts the traffic operations centre when it detects a crash, breakdown, or debris. Use the columns: Risk | Category | Likelihood | Impact | Test approach. Cover at least three different categories from the five failure modes.
Model answer
Risk | Category | Likelihood | Impact | Test approach
1. Model misses incidents at night or in heavy rain, conditions under-represented in training footage | Data | High | High | Data representativeness testing across lighting and weather conditions; targeted test footage sets
2. Detection accuracy degrades as new vehicle types, road layouts, or camera angles appear after training | Drift | High | High | Continuous validation; scheduled drift test on fresh footage; alert on accuracy drop below threshold
3. High false-alarm rate floods the operations centre and trains operators to ignore alerts | Model performance | Medium | High | Model performance testing on precision/recall against a labelled incident set; agreed false-positive ceiling
4. Detection performs worse on rural state highways than on urban motorways with denser camera coverage | Fairness | Medium | Medium | Demographic-style parity testing across road types and regions
5. Operators cannot tell why the model raised (or failed to raise) an alert for a given clip | Explainability | Medium | Medium | Explainability test cases on a sample of alerts and misses
What good looks like: each risk names a real failure mode, likelihood and impact justify the test depth, and every test approach points at a specific 42119 test type. A weak register lists generic risks (“the model might be wrong”) with no category or concrete approach. That is the difference being marked here.
11 Self-Check
Q1: In one sentence, why can a traditional test suite pass while an AI system fails?
Because a traditional suite checks that the code does what the code intends — and all five AI-specific failure modes (data quality, model performance, bias/fairness, explainability, drift) can be present in a system whose code is flawless.
Q2: What is the practical difference between testing a deterministic and a probabilistic system?
A deterministic system maps each input to exactly one output, so you assert exact expected results and test once. A probabilistic system produces a distribution of outputs, so you test against thresholds and aggregate metrics over a representative dataset, and you re-test over time because behaviour can change with no code change.
Q3: Name the five AI-specific failure modes 42119 is organised around.
Data quality failures, model performance degradation, bias and fairness failures, explainability failures, and drift. Each maps to specific AI test types covered later in the module.
Q4: Does 42119 replace ISO/IEC/IEEE 29119? What is the relationship?
No. 42119 extends 29119. The 29119 backbone — test planning, design techniques, test levels, traceability — stays intact. 42119 adds AI-specific test types (data, model, fairness, drift) and changes some practices (thresholds instead of exact results, testing past deployment).
Q5: Why must you build an AI risk register before designing tests, and what does it connect to?
Because 42119 is risk-based: the register turns vague concern into a prioritised list of testable questions, and every test case traces back to a risk in it — which is also how you answer an auditor’s “which risks did your testing address?”. It is built using the approach in ISO/IEC 23894 (AI risk management).
12 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“We’re building our first AI feature. Our regression suite is green. Why isn’t that enough?”
A green regression suite tells you the code behaves as written — it says nothing about whether the model is accurate, fair, explainable, or stable over time. AI systems are probabilistic and data-driven, so they can fail in ways the code never expresses: trained on unrepresentative data, biased across groups, or quietly drifting after go-live. I’d want AI-specific coverage on top of the regression suite — data quality, model performance, fairness, and post-deployment drift — mapped to a risk register so we can show what we actually tested and why.
“What is model drift, and how would you test for it?”
Drift is when a model that was accurate at go-live silently degrades because the world it operates in has changed — new customer behaviour, new pricing, new patterns — while the model still reflects its training data. No code changes, nothing throws an error, so traditional testing never sees it. I’d test for it with continuous validation: take a fresh, labelled sample on a schedule, measure the model’s current accuracy against the threshold we agreed at design, and alert when it drops below it. The key shift is that this testing runs after deployment, for the life of the system.
“How does ISO/IEC 42119 relate to the 29119 testing you already do?”
42119 builds on 29119 rather than replacing it. All the 29119 fundamentals — test planning, design techniques, test levels, traceability — still apply, because an AI system still has a UI, APIs, and integrations to test conventionally. What 42119 adds is the AI layer: new test types for data, model, fairness, and drift, plus changes like testing against thresholds instead of exact results and extending testing past deployment. It also ties testing to the ISO/IEC 25059 quality model and the ISO/IEC 23894 risk approach. I’d describe it as my existing testing plus a new, risk-driven AI test layer.