Test with AI · ISO/IEC 42119

Why AI Testing Is Different

Traditional software testing was built for systems that do the same thing every time. AI systems do not. Until you understand why, your test suite will pass while the system fails.

Test with AI · ISO/IEC TS 42119-2:2025 · Lesson 1 of 6 · ~30 min read · ~70 min with exercises

1 The Hook

The NZ Revenue Analytics Unit, a Wellington government agency, deployed an AI-assisted document classification system to route incoming correspondence to the right team. The build was clean. The test team ran their standard suite — functional tests, boundary tests, integration tests against the case management system. Everything passed. The system went live.

Six weeks later, a team lead noticed a backlog building in the manual review queue. Correspondence from Māori land trusts was being misclassified and misrouted at a much higher rate than other correspondence types — letters about succession, partition, and trust administration were being sent to the wrong teams, delayed, and in some cases lost in the wrong workflow. No code had changed. No defect had ever been raised. Every test still passed.

The investigation found the cause in the training data. Māori land correspondence uses distinct legal terminology, references the Māori Land Court, and follows document structures that were under-represented in the historical data the model learned from. The model had simply never seen enough of these documents to classify them reliably. It was not broken. It had been trained on a dataset that did not represent the full population it would serve.

Here is the uncomfortable part: no traditional test could have caught this. The functional tests checked that a document went somewhere. The integration tests checked that the routing API returned a 200. None of them asked the question that mattered — does this model perform equally well across every type of correspondence it will actually receive? That question is not in a traditional test plan. Under ISO/IEC 42119, it is.

This lesson is about the gap that scenario exposes: the categories of failure that AI systems introduce, and why the testing you already know — good as it is — was never designed to find them.

2 The Rule

Traditional software testing was designed for deterministic systems — same input, same output, every time. AI systems are probabilistic, data-driven, and can degrade without a single line of code changing. That means traditional test approaches will pass while entire categories of AI failure go undetected. ISO/IEC 42119 exists to close that gap.

3 The Analogy

Testing a vending machine versus testing a sommelier.

A vending machine at a Z petrol station is deterministic. Press B4, you get the same bag of chips every time. You can test it exhaustively: every button, every coin, every edge case. Pass the tests once and you know it works, because nothing about it changes between yesterday and tomorrow.

A sommelier is probabilistic. Hand them the same glass of Marlborough sauvignon blanc twice and you may get two different descriptions — because their answer depends on context, on what they tasted before, on conditions you cannot fully control. You cannot test a sommelier by checking that button B4 always returns chips. You have to test them differently: across many wines, looking for consistency and fairness in their judgement, watching whether their palate drifts over a long evening. An AI system is the sommelier, not the vending machine. Testing it like a vending machine is exactly how the NZ Revenue Analytics Unit’s suite passed while the system failed.

4 Deterministic vs Probabilistic Systems

The single most important shift for a tester moving into AI is this: you are no longer testing a system that behaves the same way every time.

A deterministic system maps each input to exactly one output. A PAYE calculator given an income of $70,000 produces one correct answer, today and forever, until someone changes the code. This is the world traditional testing was built for. Boundary value analysis, equivalence partitioning, expected-result assertions — they all assume that if you know the input, you can state the one correct output in advance.

A probabilistic system maps each input to a distribution of possible outputs, and selects from it. An AI fraud model given the same transaction might score it 0.82 today and 0.79 tomorrow after a retrain. A large language model given the same prompt twice can produce two different answers. There is often no single “correct” output to assert against — only outputs that are more or less acceptable, more or less likely, more or less fair.

What this changes for test design

You test against thresholds and ranges, not exact values. “The model must score known-fraud transactions above 0.7 at least 95% of the time” replaces “the output equals X”.
You test statistical behaviour across a population, not single cases. One passing test case proves almost nothing about a probabilistic system; you need a representative test dataset and aggregate metrics.
You re-test over time, not just once. A deterministic system that passes stays passed until the code changes. A probabilistic system can pass at go-live and fail three months later with no code change at all.
You test the data and the model, not just the code. In a deterministic system the code is the product. In an AI system, the data shapes the behaviour as much as the code does — so untested data is an untested system.

Pro tip: If a test plan for an AI system contains only exact expected-result assertions, that is your first red flag. It was almost certainly written by someone treating a probabilistic system as if it were deterministic — which means it will pass while missing every AI-specific failure mode below.
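
To make the threshold idea concrete, here is a minimal sketch of the "above 0.7 at least 95% of the time" criterion written as a test over a population rather than a single case. The scores are invented for illustration, and the function names are hypothetical rather than anything defined by 42119.

```python
def fraction_above(scores, cutoff):
    """Share of scores strictly above the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s > cutoff) / len(scores)


def meets_threshold(scores, cutoff=0.7, min_share=0.95):
    """True when enough known-fraud transactions score above the cutoff."""
    return fraction_above(scores, cutoff) >= min_share


# Hypothetical model scores for 20 transactions already known to be fraud.
known_fraud_scores = [
    0.82, 0.79, 0.91, 0.88, 0.74, 0.95, 0.81, 0.77, 0.93, 0.86,
    0.72, 0.90, 0.84, 0.78, 0.89, 0.97, 0.75, 0.83, 0.92, 0.64,
]

share = fraction_above(known_fraud_scores, 0.7)
print(f"{share:.0%} of known-fraud cases scored above 0.7")
print("PASS" if meets_threshold(known_fraud_scores) else "FAIL")
```

Note that no individual score is asserted. One low score (0.64) is tolerated because the criterion is about the aggregate, which is the point of the shift described above.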

5 The Five AI-Specific Failure Modes

ISO/IEC 42119 is organised around the categories of failure that are unique to AI systems — the ones that traditional 29119 testing does not address. There are five you need to know by name, because they map directly to the AI-specific test types in the lessons that follow.

1. Data quality failures

The system behaves badly because the data it learned from was incomplete, unrepresentative, mislabelled, or of unknown origin. The NZ Revenue Analytics Unit failure was a data quality failure: the training data under-represented Māori land correspondence. The model was fine. The data was not. (Lesson 2.)

2. Model performance degradation

The model is measurably worse than required — too many false positives, too many missed cases — either at launch or developing over time. A fraud model that catches only 60% of fraud, or floods analysts with false alarms, is failing on performance even if every API call succeeds. (Lesson 3.)
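
As a rough illustration of what "measurably worse than required" can look like, the sketch below computes precision and recall for the fraud example and compares them with acceptance thresholds. The labelled data and the threshold values are assumptions made for the example, not figures from the specification.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labelled set: 10 real frauds among 100 transactions.
y_true = [1] * 10 + [0] * 90
# The model catches 6 of the 10 frauds and raises 4 false alarms.
y_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

MIN_PRECISION, MIN_RECALL = 0.5, 0.8  # assumed acceptance thresholds

precision, recall = precision_recall(y_true, y_pred)
passed = precision >= MIN_PRECISION and recall >= MIN_RECALL
print(f"precision={precision:.2f} recall={recall:.2f} passed={passed}")
```

Here every prediction was returned successfully, yet the model fails on recall: it catches only 60% of fraud against an assumed requirement of 80%.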

3. Bias and fairness failures

The system performs unequally across groups of people — by ethnicity, age, gender, location, or any protected characteristic. This is not an ethical opinion bolted on at the end; under 42119 it is a measurable, testable quality characteristic with its own test types. (Lesson 4.)
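
One simple way to make "performs unequally across groups" measurable is to compare the rate of favourable decisions per group. The sketch below is an assumption-laden illustration: the group names, decisions, and the maximum allowed gap are all invented, and a real fairness test would choose its metric and tolerance from the risk register.

```python
from collections import defaultdict


def positive_rates(groups, decisions):
    """Rate of favourable decisions (1) per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Difference between the best-treated and worst-treated group."""
    return max(rates.values()) - min(rates.values())


# Hypothetical decisions: 1 = approved, 0 = not approved.
groups = ["urban"] * 10 + ["rural"] * 10
decisions = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6

MAX_GAP = 0.1  # assumed tolerance

rates = positive_rates(groups, decisions)
gap = parity_gap(rates)
print(rates, f"gap={gap:.2f}", "PASS" if gap <= MAX_GAP else "FAIL")
```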

4. Explainability failures

The system produces a decision that no one can explain or justify. When an AI declines a Fern Bank loan application or flags a CoverNZ claim for review, someone — the customer, an auditor, the FMA — may be legally entitled to know why. A model that cannot supply a defensible reason has failed an explainability requirement. (Lesson 3.)

5. Drift

The system was fine at go-live and silently got worse. The world changed — customer behaviour shifted, prices moved, fraud patterns evolved — but the model kept making decisions based on the world it was trained on. Drift is the failure mode that most surprises teams from a traditional background, because nothing “broke.” It just quietly stopped being right. (Lesson 3.)
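
A drift check can be sketched as a scheduled comparison between the accuracy agreed at go-live and the accuracy on a fresh labelled sample. The baseline, the sample, and the allowed drop below are hypothetical values chosen for illustration; this is one simple approach, not a method prescribed by the specification.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def drift_alert(baseline_accuracy, current_accuracy, max_drop=0.05):
    """True when accuracy has fallen by more than the allowed drop."""
    return (baseline_accuracy - current_accuracy) > max_drop


BASELINE = 0.9  # assumed accuracy measured at go-live

# Hypothetical fresh sample, labelled by hand this month.
fresh_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
fresh_preds = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

current = accuracy(fresh_labels, fresh_preds)
print(f"current accuracy={current:.2f}, alert={drift_alert(BASELINE, current)}")
```

Nothing in this check depends on a code change. It is designed to run on a schedule for as long as the model is live.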

Why these five matter to you as a tester

A traditional regression suite checks none of these. It checks that the code does what the code is supposed to do. All five AI failure modes can be present in a system whose code is flawless. That is the core reason 42119 exists — and the core reason “all tests pass” means much less for an AI system than it does for a billing engine.

6 The AI System Lifecycle and Where Testing Fits

ISO/IEC 42119 treats testing as something that runs across the whole AI system lifecycle, not a phase at the end. The specification (drawing on the concepts in ISO/IEC 22989) describes four broad phases, and there is testing to do in each.

Design

Before any model is trained, you build the AI risk register, define the quality characteristics that matter, and set the acceptance thresholds. The most valuable testing decision — what to test and how hard — is made here. A tester who arrives only at the end has already lost the most important argument.

Development

This is where data quality testing and model testing live. You test the training data for representativeness, provenance, and label correctness. You test model performance against the thresholds set in design. You run adversarial tests. This happens alongside model development, not after it.
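
A representativeness check on training data can start very simply: count how much of the data belongs to each category the system is expected to handle, and flag any category that falls below a minimum share. The category names and the 5% floor below are assumptions for illustration. In practice the comparison would more likely be against the expected production mix.

```python
from collections import Counter


def under_represented(labels, expected_categories, min_share):
    """Return categories whose share of the training data is below min_share.

    Categories that never appear in the data are reported with a share of 0.0.
    """
    counts = Counter(labels)
    total = len(labels)
    shares = {cat: counts[cat] / total for cat in expected_categories}
    return {cat: share for cat, share in shares.items() if share < min_share}


# Hypothetical training labels for a correspondence classifier.
training_labels = (
    ["tax_return"] * 600 + ["payment_query"] * 380 + ["maori_land_trust"] * 20
)
expected = ["tax_return", "payment_query", "maori_land_trust", "objection"]

flagged = under_represented(training_labels, expected, min_share=0.05)
print(flagged)
```

Passing the expected categories in explicitly matters: a category with zero training examples would otherwise never show up in the counts at all.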

Deployment

Fairness testing before go-live, system and integration testing of the model inside the real application, and the establishment of the continuous validation that will run from here on. Deployment is not the finish line for AI testing — it is where the longest phase begins.

Retirement

When a model is decommissioned or replaced, there is testing to confirm it has been cleanly removed, that its data obligations are met, and that any system depending on it has a safe fallback. AI systems hold data and make decisions; switching one off is itself a tested event.

The headline: in traditional testing the centre of gravity is before release. In AI testing, a large share of the work — drift detection, continuous validation — happens after release, for as long as the system is live.

7 How 42119 Extends 29119 — What Stays, What Changes, What Is New

A common misconception is that AI testing replaces everything you know. It does not. ISO/IEC TS 42119-2:2025 is explicitly built on top of the ISO/IEC/IEEE 29119 software testing series. Most of your skills carry straight over.

What stays the same

Test planning, test design techniques, test levels (component, integration, system, acceptance), risk-based prioritisation, defect management, traceability, and the discipline of documented, repeatable test cases. The 29119 backbone is intact. An AI system still has a user interface, APIs, a database, and integrations — all of which you test exactly as you always have.

What changes

Expected results become thresholds and statistical criteria instead of exact values. Test data stops being a supporting input and becomes a primary test target in its own right. Testing extends past deployment into continuous validation. And the quality characteristics you test against expand — 42119 draws on the ISO/IEC 25010 and 25059 quality models, which add AI-specific quality characteristics on top of the familiar ones.

What is entirely new

Test types that have no equivalent in traditional testing: data representativeness, data provenance, and label correctness testing (data layer); model performance, adversarial, and explainability testing (model layer); counterfactual fairness and demographic parity testing (fairness layer); and drift testing with continuous validation (post-deployment). These are the subjects of Lessons 2 through 4.
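
To give a feel for one of these new test types, here is a minimal sketch of a counterfactual fairness check: change only a sensitive attribute on each record and see whether the model's output changes. The `toy_model` function is a deliberately biased stand-in written for this example, and all field names and values are hypothetical.

```python
def counterfactual_flips(model, records, attribute, values):
    """Return the records whose prediction changes when only `attribute` varies."""
    flipped = []
    for record in records:
        outcomes = {model({**record, attribute: value}) for value in values}
        if len(outcomes) > 1:
            flipped.append(record)
    return flipped


def toy_model(application):
    """Stand-in model that (wrongly) lets region influence the outcome."""
    if application["revenue"] >= 100_000 and application["region"] != "Northland":
        return "approve"
    return "review"


applications = [
    {"revenue": 150_000, "region": "Auckland"},
    {"revenue": 50_000, "region": "Northland"},
]

flips = counterfactual_flips(
    toy_model, applications, "region", ["Auckland", "Northland"]
)
print(f"{len(flips)} of {len(applications)} decisions depend on region")
```

A test built this way would typically pass only when the list of flips is empty, or stays below a tolerance agreed in the risk register.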

The standards around it

42119 does not stand alone. It connects to a small family of ISO/IEC standards that a senior tester should be able to name:

ISO/IEC/IEEE 29119 — the software testing series 42119 extends.
ISO/IEC 25059 — the AI quality model; the quality characteristics you test against.
ISO/IEC 23894 — AI risk management; where the AI risk register comes from.
ISO/IEC 22989 — AI concepts and terminology; scope, system boundaries, and stakeholder definitions.
ISO/IEC 42001 — AI management systems; the governance standard that 42119 operationalises through testing.

Pro tip: 42119-2 is a Technical Specification (TS), not a full International Standard (IS). It is stable enough to adopt now as best practice, but it will evolve — further parts of the 42119 series are in development under ISO/IEC JTC 1/SC 42. Frame organisational claims as “aligned to ISO/IEC TS 42119-2:2025”, never “certified against” it.

8 The Risk-Based Approach and the AI Risk Register

42119 is risk-based from end to end. You do not test every AI-specific characteristic to the same depth on every system. You identify the AI risks for this system, and you let those risks drive what you test, which techniques you use, and how deep you go. A churn model that recommends marketing offers carries very different risk from a model that decides CoverNZ claim eligibility — and the test effort should reflect that.

The instrument that captures this is the AI risk register, built using the approach in ISO/IEC 23894. It is the single most important artefact in AI testing, because every test case ultimately traces back to a risk in it. At minimum, each row records the risk, its category, how likely and how serious it is, and the test approach that addresses it.

A worked example

For a fictional Toka Tū Ake EQC damage-assessment AI that estimates repair cost from claim photos, a few rows might look like this:

Risk | Category | Likelihood | Impact | Test approach

Model under-estimates damage for older Canterbury housing stock under-represented in training data | Data | Medium | High | Data representativeness testing across property age bands

Estimate accuracy degrades as building costs inflate post-training | Drift | High | High | Continuous validation; monthly drift test on a fresh sample

Assessment differs by region for visually identical damage | Fairness | Medium | High | Demographic parity testing across regions

Assessor cannot explain why a claim was flagged for manual review | Explainability | Medium | Medium | Explainability test cases on a sample of flagged claims

Notice what the register does. It turns vague anxiety (“is this AI any good?”) into a finite, prioritised list of testable questions — each tagged with the failure-mode category from Section 5, each pointing at a specific 42119 test type. That mapping, from risk to category to test approach, is the spine of everything in this module.
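
The register is easy to hold as structured data, which makes the mapping from risk to test approach something you can sort and query. The sketch below encodes the four example rows. The scoring rule (likelihood multiplied by impact on a 1 to 3 scale) is a common risk-matrix convention assumed here for illustration, not something taken from 42119 or 23894.

```python
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}  # assumed scoring scale


@dataclass(frozen=True)
class Risk:
    description: str
    category: str
    likelihood: str
    impact: str
    test_approach: str

    @property
    def score(self) -> int:
        """Likelihood times impact; higher means test earlier and deeper."""
        return LEVELS[self.likelihood] * LEVELS[self.impact]


register = [
    Risk("Under-estimates damage for older Canterbury housing stock",
         "Data", "Medium", "High",
         "Data representativeness testing across property age bands"),
    Risk("Accuracy degrades as building costs inflate post-training",
         "Drift", "High", "High",
         "Continuous validation; monthly drift test on a fresh sample"),
    Risk("Assessment differs by region for visually identical damage",
         "Fairness", "Medium", "High",
         "Demographic parity testing across regions"),
    Risk("Assessor cannot explain why a claim was flagged",
         "Explainability", "Medium", "Medium",
         "Explainability test cases on a sample of flagged claims"),
]

# Highest-scoring risks first; ties keep their original order.
prioritised = sorted(register, key=lambda r: r.score, reverse=True)
for risk in prioritised:
    print(risk.score, risk.category, "->", risk.test_approach)
```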

Pro tip: If you join an AI project and there is no AI risk register, that is the first thing to raise — before you write a single test case. Without it, you have no defensible basis for what you chose to test, and no answer when a regulator asks “which risks did your testing address?” (the exact question Lesson 5 is built around).

From the field

A Wellington social services team building an AI triage system assumed their testing was solid because they had inherited a mature 29119 test suite from a previous project. The system — intended to prioritise referrals for at-risk whānau — passed UAT, went live, and immediately started producing outputs that compliance reviewers could not explain to affected families when challenged under the Privacy Act. What the team had not realised was that their Benefits NZ-adjacent accountability obligations (essentially the DCAT principles applied internally) required not just that the model worked, but that every triage decision could be traced back to specific, auditable factors. The test suite had no explainability test cases at all. They had to retrofit an explanation layer post-go-live under time pressure, which cost three times what it would have cost to design for it upfront. The lesson that generalises: explainability is an architectural decision, not a testing afterthought — and the time to raise it is when you are reading the requirements, not when the auditor is in the room.

9 Common Mistakes

🚫 Treating “all tests pass” as proof an AI system is ready

Why it happens: A green test suite has meant “ready to ship” for an entire career. The instinct carries over unchanged to AI work.
The fix: A passing traditional suite proves the code does what the code intends — nothing about data quality, fairness, or drift. Ask which of the five AI failure modes each test actually covers. Usually the answer is none.

🚫 Writing AI test cases with exact expected-result assertions

Why it happens: Boundary value analysis and equivalence partitioning trained you to state the one correct output for a given input.
The fix: Probabilistic systems do not have one correct output. Test against thresholds, ranges, and aggregate metrics over a representative dataset — “at least 95% above 0.7”, not “equals 0.82”.

🚫 Treating testing as finished at go-live

Why it happens: In traditional projects, deployment is the end of the test effort.
The fix: AI systems drift. A model can pass every pre-deployment test and silently degrade over the following months. Continuous validation after release is not optional under 42119 — it is a core part of the test approach.

🚫 Starting test design without an AI risk register

Why it happens: The team is keen to write test cases and a risk register feels like governance overhead.
The fix: 42119 is risk-based by design. Without a register you cannot justify your test scope, prioritise sensibly, or answer an auditor. Build it first — it is the artefact every test case traces back to.

From the field

A Wellington team building a risk-scoring tool for Benefits NZ supplementary benefit applications ran a complete 29119-compliant test suite before go-live—every functional path, every API integration, every boundary check. The product owner signed off. Three months after launch, a caseworker noticed that applicants from rural Northland were being flagged for manual review at nearly twice the rate of equivalent Wellington applicants. The team dug in and found the training data had been sourced primarily from urban case files; rural patterns were statistically thin. The model was not broken—it had never been asked whether it treated all populations the same. Under the NZ Algorithm Charter’s transparency and accountability obligations, this differential outcome needed a defensible explanation, and there was none. The lesson: a test suite that never asks a fairness question will never find a fairness failure.

Senior engineer insight

The most dangerous project I worked on passed every functional test beautifully: 400 test cases, all green. The AI recommendation engine was routing benefit applicants to the right team with 94% accuracy on our evaluation set. What we had not tested was which 6% it got wrong: it was disproportionately misrouting applicants with non-Anglicised names and rural RD addresses. That pattern never appeared in any functional test because functional tests check that routing happens, not whose applications get misrouted. Discovering it required building a demographic breakdown of errors, something that did not exist in our test plan until an auditor asked for it.

The most common mistake: teams celebrate aggregate accuracy figures without ever slicing that accuracy by the subpopulations that matter most — which is precisely where AI systems tend to fail silently.
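
The demographic breakdown described above can be sketched in a few lines. The data below is invented so that the aggregate matches the 94% figure in the story while one subgroup does far worse. The slice names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, was_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for name, ok in records:
        totals[name] += 1
        correct[name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation results: 100 routed applications.
records = (
    [("urban", True)] * 88 + [("urban", False)] * 2
    + [("rural_rd", True)] * 6 + [("rural_rd", False)] * 4
)

overall = sum(ok for _, ok in records) / len(records)
by_slice = accuracy_by_slice(records)

print(f"overall accuracy: {overall:.0%}")
for name, acc in by_slice.items():
    print(f"  {name}: {acc:.0%}")
```

The aggregate looks healthy, but the rural slice is correct only 60% of the time. A test plan that reports only the first number would never surface the second.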

10 Now You Try

Three graded exercises. Each builds on the one before. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot What Needs AI Testing

Below are 8 test scenarios for a fictional MBIE business grant eligibility AI that recommends whether an application should be approved, declined, or sent for manual review. For each, decide whether it is covered by traditional testing or requires AI-specific testing per 42119, and say why in one line.

S1 | The application form rejects an NZBN that is not 13 digits

S2 | The model approves and declines applications at the same rate for businesses in Auckland as for businesses in Northland, given equivalent financials

S3 | The “submit” button is disabled until all mandatory fields are complete

S4 | The model’s approval accuracy has not dropped more than 5% three months after go-live

S5 | The recommendation API returns within 800ms under 100 concurrent requests

S6 | When the model declines an application, it returns a reason a caseworker can defend to the applicant

S7 | The training data includes enough sole-trader applications, not just companies, to classify them reliably

S8 | A declined application can be escalated to manual review via the case management screen

Classify each scenario and give a one-line reason:

Model answer

S1 — Traditional. Field-format validation (13-digit NZBN) is deterministic input validation. Same input, same result.

S2 — AI-specific (fairness). This is a demographic parity question across regions — a fairness test type that traditional testing never addresses.

S3 — Traditional. UI state logic; deterministic and unrelated to model behaviour.

S4 — AI-specific (drift). Tracking accuracy decay over time after go-live is drift testing / continuous validation. No traditional test runs post-deployment to watch for silent degradation.

S5 — Traditional. Performance/load testing of the API. The model is a black box here; you are testing response time, not model quality. (Watch the wording: this is system performance, not model performance.)

S6 — AI-specific (explainability). Whether a decline carries a defensible reason is an explainability test — a quality characteristic with no traditional equivalent.

S7 — AI-specific (data quality). Whether the training data represents sole traders is data representativeness testing. The NZ Revenue Analytics Unit failure was exactly this.

S8 — Traditional. Workflow/integration test of the escalation path. Deterministic application behaviour.

Pattern: 4 traditional (S1, S3, S5, S8), 4 AI-specific (S2 fairness, S4 drift, S6 explainability, S7 data quality). Note S5 is the trap — “performance” here means API speed, not model performance.

🔧 Exercise 2 of 3 — Find the Gaps

A team has written the test plan below for a fictional KiwiSaver fund recommendation engine — an AI that recommends a fund (conservative, balanced, growth) based on a member’s age, balance, risk answers, and time to retirement. The plan has no AI-specific coverage at all. Identify the 4 most critical gaps against 42119, naming the failure-mode category for each.

Current test plan (traditional only):
1. Verify the recommendation form validates age (18–100) and balance (≥ $0).
2. Verify the recommendation API returns a valid fund type for every complete submission.
3. Verify the recommendation displays correctly on mobile and desktop.
4. Verify the recommendation is saved to the member’s account record.
5. Verify response time is under 1 second for 200 concurrent users.
6. Regression: re-run all of the above on each release.

List the 4 most critical AI-specific gaps and name the category for each:

Model answer

The plan tests the plumbing (form, API, UI, persistence, performance) but never tests the recommendation itself. The four most critical gaps:

Gap 1 — Model performance. Nothing checks whether the recommendations are actually any good. There is no test that, for members with known-appropriate fund types, the model recommends correctly above an agreed threshold. “Returns a valid fund type” (test 2) only checks the output is one of three allowed values — not that it is the right one.

Gap 2 — Fairness. Nothing checks whether recommendations differ unfairly across groups, for example whether women and men with identical age, balance, and risk answers receive systematically different fund recommendations. Counterfactual fairness and demographic parity testing are both required and both absent.

Gap 3 — Drift. The regression (test 6) re-runs the same checks on each release, but nothing monitors whether recommendation quality degrades over time as markets move and member behaviour changes. Continuous validation / drift testing is missing entirely.

Gap 4 — Data quality / representativeness. Nothing checks whether the training data represented the full membership — e.g. members near retirement, very low balances, or those who skipped risk questions. An under-represented group will get unreliable recommendations, exactly as in the Hook scenario.

(Explainability is a strong fifth gap: a member who is not recommended a growth fund may be entitled to know why. Naming it as well is a bonus, not a miss.)

🏗️ Exercise 3 of 3 — Build a Risk Register

Draft a 5-row AI risk register for a fictional TransitNZ traffic incident detection AI — a system that analyses motorway camera feeds and automatically alerts the traffic operations centre when it detects a crash, breakdown, or debris. Use the columns: Risk | Category | Likelihood | Impact | Test approach. Cover at least three different categories from the five failure modes.

Model answer

Risk | Category | Likelihood | Impact | Test approach

1. Model misses incidents at night or in heavy rain, conditions under-represented in training footage | Data | High | High | Data representativeness testing across lighting and weather conditions; targeted test footage sets

2. Detection accuracy degrades as new vehicle types, road layouts, or camera angles appear after training | Drift | High | High | Continuous validation; scheduled drift test on fresh footage; alert on accuracy drop below threshold

3. High false-alarm rate floods the operations centre and trains operators to ignore alerts | Model performance | Medium | High | Model performance testing on precision/recall against a labelled incident set; agreed false-positive ceiling

4. Detection performs worse on rural state highways than on urban motorways with denser camera coverage | Fairness | Medium | Medium | Demographic-style parity testing across road types and regions

5. Operators cannot tell why the model raised (or failed to raise) an alert for a given clip | Explainability | Medium | Medium | Explainability test cases on a sample of alerts and misses

What good looks like: each risk names a real failure mode, likelihood and impact justify the test depth, and every test approach points at a specific 42119 test type. A weak register lists generic risks (“the model might be wrong”) with no category or concrete approach — that is the difference being marked here.

Why teams fail here

They treat a green regression suite as a safety signal—it only proves the code does what the code says; it says nothing about whether the model is accurate, fair, or stable.
They write AI test cases with exact expected-result assertions, then wonder why 90% of them are marked as “passed with deviation”—probabilistic systems require thresholds and aggregate metrics, not point values.
They close the test plan at go-live, not realising that drift—the failure mode most invisible to traditional QA—starts accumulating the day the model stops seeing new training data.
They skip the AI risk register because it feels like governance overhead, then cannot answer an auditor, an FMA reviewer, or a DCAT-aligned agency asking “which risks did your testing actually address?”

Key takeaway

“If your AI test plan would pass a PAYE calculator unchanged, you haven’t started testing the AI yet.”

Why teams fail here

Porting deterministic test thinking directly to AI: writing test cases with exact expected-result assertions against a probabilistic model, then wondering why coverage feels hollow. The assertion technique was never wrong; it was applied to the wrong kind of system.
Treating aggregate accuracy as a pass: an overall 93% accuracy figure can hide a 40% error rate on the subgroup that matters most. Teams report the aggregate; regulators ask about the subgroup.
Stopping testing at deployment: in traditional projects, green at release means done. AI systems drift — and the team that disbanded its test capability at go-live has no mechanism to detect it when accuracy silently drops three months later.
Skipping the AI risk register as governance overhead: without it, test scope is whatever the team feels like testing, there is no audit trail from risk to test case, and the first regulator question — which risks did your testing address? — has no defensible answer.
Ignoring data as a test target: teams test the application thoroughly and hand-wave over the training data with “the data science team handled that”. Under 42119, untested data is an untested system; the NZ Revenue Analytics Unit failure was a data failure, not a code failure.
Deferring fairness and explainability to the ethics team: both are measurable, testable quality characteristics with defined test types in 42119. Treating them as philosophical questions rather than test conditions means they never get covered — until an affected party challenges a decision under the Human Rights Act or the Privacy Act.

11 Self-Check

Q1: In one sentence, why can a traditional test suite pass while an AI system fails?

Because a traditional suite checks that the code does what the code intends — and all five AI-specific failure modes (data quality, model performance, bias/fairness, explainability, drift) can be present in a system whose code is flawless.

Q2: What is the practical difference between testing a deterministic and a probabilistic system?

A deterministic system maps each input to exactly one output, so you assert exact expected results and test once. A probabilistic system produces a distribution of outputs, so you test against thresholds and aggregate metrics over a representative dataset, and you re-test over time because behaviour can change with no code change.

Q3: Name the five AI-specific failure modes 42119 is organised around.

Data quality failures, model performance degradation, bias and fairness failures, explainability failures, and drift. Each maps to specific AI test types covered later in the module.

Q4: Does 42119 replace ISO/IEC/IEEE 29119? What is the relationship?

No. 42119 extends 29119. The 29119 backbone — test planning, design techniques, test levels, traceability — stays intact. 42119 adds AI-specific test types (data, model, fairness, drift) and changes some practices (thresholds instead of exact results, testing past deployment).

Q5: Why must you build an AI risk register before designing tests, and what does it connect to?

Because 42119 is risk-based: the register turns vague concern into a prioritised list of testable questions, and every test case traces back to a risk in it — which is also how you answer an auditor’s “which risks did your testing address?”. It is built using the approach in ISO/IEC 23894 (AI risk management).

12 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“We’re building our first AI feature. Our regression suite is green. Why isn’t that enough?”

A green regression suite tells you the code behaves as written — it says nothing about whether the model is accurate, fair, explainable, or stable over time. AI systems are probabilistic and data-driven, so they can fail in ways the code never expresses: trained on unrepresentative data, biased across groups, or quietly drifting after go-live. I’d want AI-specific coverage on top of the regression suite — data quality, model performance, fairness, and post-deployment drift — mapped to a risk register so we can show what we actually tested and why.

“What is model drift, and how would you test for it?”

Drift is when a model that was accurate at go-live silently degrades because the world it operates in has changed — new customer behaviour, new pricing, new patterns — while the model still reflects its training data. No code changes, nothing throws an error, so traditional testing never sees it. I’d test for it with continuous validation: take a fresh, labelled sample on a schedule, measure the model’s current accuracy against the threshold we agreed at design, and alert when it drops below it. The key shift is that this testing runs after deployment, for the life of the system.

“How does ISO/IEC 42119 relate to the 29119 testing you already do?”

42119 builds on 29119 rather than replacing it. All the 29119 fundamentals — test planning, design techniques, test levels, traceability — still apply, because an AI system still has a UI, APIs, and integrations to test conventionally. What 42119 adds is the AI layer: new test types for data, model, fairness, and drift, plus changes like testing against thresholds instead of exact results and extending testing past deployment. It also ties testing to the ISO/IEC 25059 quality model and the ISO/IEC 23894 risk approach. I’d describe it as my existing testing plus a new, risk-driven AI test layer.

Key takeaway

A green test suite tells you the code works — and with AI systems, the code is the least of your problems: the data, the model, the fairness profile, and the passage of time are where failures live, and none of them appear in a traditional test result.
