Test with AI · ISO/IEC 42119

Risk-Based AI Testing

How to calibrate test depth to the consequences of getting it wrong. Two AI systems can both be wrong — but if you test them the same way, you have missed the point of testing.

Test with AI · ISO/IEC TS 42119-2:2025 · ~15 min read · ~50 min with exercises

1 The Hook

Imagine two AI systems built by the same team in the same month. The first is a streaming-service recommender that suggests which film a viewer might watch next. The second is a model that scores benefit applications and routes them towards approval or decline. Both are machine-learning systems. Both will be wrong some of the time — no model is perfect.

Now ask what happens when each one is wrong. If the recommender suggests a film you do not fancy, you scroll past it. Mild annoyance, no harm, fully reversible in one click. If the benefit-application model wrongly steers a struggling family towards decline, someone who needed support may not get it — or gets it weeks late, after a stressful review. The harm is real, it falls on people who can least absorb it, and it is hard to undo.

A tester who writes the same test plan for both has not understood what testing is for. The recommender deserves a light touch: does it return sensible suggestions, does it respond quickly, does it avoid obviously broken output? The benefit model deserves deep fairness testing, careful data-quality checks, explainability evidence, and documentation an auditor can read. The same effort spread evenly across both is wasted on one and dangerously thin on the other.

ISO/IEC 42119 builds this idea in. It describes AI testing as risk-based: the depth, the coverage, and the documentation you apply to a system should be proportional to the risk that system carries. This lesson teaches you how to judge that risk and turn the judgement into a test plan.

2 The Rule

AI testing depth must be proportional to risk. Risk is not one number — it is the likelihood of failure, multiplied by the severity of the harm, multiplied by the number of people affected, multiplied by how hard the harm is to reverse. A high-stakes model gets deep, documented testing; a low-stakes one gets a light touch. Spreading the same effort evenly across both under-tests the dangerous system and over-tests the harmless one.

3 The Analogy

Emergency-services triage.

One paramedic arrives at a scene with two patients: one has stubbed a toe, the other is in cardiac arrest. The paramedic does not give each of them an equal share of attention. That would be fair-sounding and fatal. Triage means matching the depth of the response to the seriousness of the case — the cardiac arrest gets everything, the stubbed toe gets a quick check and a plaster.

AI testing under finite time and budget is the same. You always have more systems and features than testing capacity. Risk-based testing is the triage rule: put your depth where the consequences of failure are greatest. Treating a recommender and a benefits model identically is the testing equivalent of splitting your attention evenly between the toe and the heart.

4 The AI Risk Equation

“Risk” on its own is too vague to plan with. 42119 and its parent risk standards break it into factors you can reason about one at a time. For AI testing, four factors matter:

Risk ≈ Likelihood × Severity × Breadth × (how hard to reverse)
Likelihood — how often is this system likely to be wrong, given the data and the difficulty of the task?
Severity — how bad is the harm when it is wrong? A wrong film suggestion versus a wrong medical-priority decision.
Breadth — how many people are affected? A model used on every applicant in the country carries more risk than one used by a single back-office team.
Reversibility — if it gets it wrong, can the harm be undone, and how easily? A reversible error (resend the email) is far less risky than an irreversible one (a missed cancer-screening referral).

You do not compute a precise product — these are not exact numbers. You use the four factors as a structured way to rank systems and features against each other. A model that is often wrong but always reversible and affects few people is lower risk than one that is rarely wrong but, when it is, causes severe and irreversible harm to many. The four factors stop you fixating on accuracy alone, which is the most common risk-assessment mistake in AI.

Pro tip: Reversibility is the factor teams forget. A model can have a modest error rate and still be very high risk if its mistakes cannot be walked back. Always ask “and if it is wrong, what then?” before you decide how hard to test it.
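
The ranking idea can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from 42119: the 1-to-3 ordinal scale, the `RiskProfile` class, and the ratings given to the two systems are all assumptions made for the example. The product is used only to order systems against each other, never as a precise measurement.

```python
from dataclasses import dataclass


# Assumed ordinal scale: 1 = low, 2 = medium, 3 = high.
# The ratings are judgement calls, not measurements.
@dataclass(frozen=True)
class RiskProfile:
    name: str
    likelihood: int       # how often the system is likely to be wrong
    severity: int         # how bad the harm is when it is wrong
    breadth: int          # how many people are affected
    irreversibility: int  # how hard the harm is to undo (3 = cannot be undone)

    def rank_score(self) -> int:
        # Product of the four factors, used only for relative ordering.
        return self.likelihood * self.severity * self.breadth * self.irreversibility


# Hypothetical ratings for the two systems from the hook.
systems = [
    RiskProfile("film recommender", likelihood=3, severity=1, breadth=3, irreversibility=1),
    RiskProfile("benefit-application scorer", likelihood=1, severity=3, breadth=3, irreversibility=3),
]

ranked = sorted(systems, key=lambda s: s.rank_score(), reverse=True)
for s in ranked:
    print(f"{s.name}: {s.rank_score()}")
```

With these assumed ratings the benefit scorer comes out at 27 and the recommender at 9, even though the recommender is rated as wrong far more often. Note that the fourth factor is stored as irreversibility, so that a higher number always means more risk.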

5 AI-Specific Risk Dimensions

The four factors tell you how much risk a system carries. The next question is what kind. AI systems fail in ways traditional software does not, and each kind of risk points to a different sort of test. 42119 and its companions (ISO/IEC 25010 for quality characteristics, ISO/IEC 23894 for AI risk management) recognise these AI-specific dimensions:

Fairness risk: the system treats groups of people unequally. Highest where the model makes decisions about people — eligibility, prioritisation, scoring.
Explainability risk: no one can say why the model produced a given output. Highest where a person has a right to an explanation of a decision that affects them.
Data risk: the training or input data is unrepresentative, mislabelled, or of unknown origin (the subject of the data-quality lesson). Highest where the data is old, narrow, or hard to verify.
Drift risk: the model degrades over time as the world changes. Highest where the environment is volatile — markets, weather, behaviour (the subject of the next lesson).
Adversarial risk: someone deliberately feeds the model crafted inputs to make it behave badly. Highest where there is an incentive to game the system — fraud, security, content moderation.

Naming the dimension is what turns “this is risky” into a testable plan. A model is not just “high risk” — it is high fairness risk and high explainability risk, which tells you precisely which test types to deepen.

6 Building an AI Test Risk Register

A risk register is the document that records each identified risk, scores it, and links it to the testing that addresses it. It is the spine of a 42119 test plan: every serious AI test case traces back to a numbered risk in the register. Each row captures the factors from section 4 plus the risk type from section 5.

Here is a worked register fragment for a fictional HealthNZ elective-surgery waitlist prioritisation model that scores patients to help order a surgical waiting list:

Risk ID:        R-03

Risk type:      Fairness

Component:      Patient prioritisation score

Likelihood:     Medium — training data under-represents rural and Māori patients

Severity:       High — wrongly de-prioritised patient waits longer for surgery

Breadth:        High — applies to every patient on the elective waitlist

Reversibility:  Low — time spent waiting cannot be given back

Score:          HIGH

Test types:     Bias/fairness testing; data representativeness testing; explainability testing

Mitigation:     Group-level fairness thresholds; clinician override; quarterly fairness re-test

The value of the register is not the score in isolation — it is the link from a named, scored risk to specific test types and a mitigation. When an auditor or a manager asks “why did you fairness-test this model so heavily?”, the answer is R-03, not a hunch. And when budget is tight, the register tells you which risks you are choosing to test lightly, on the record.
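
If your team keeps the register in code or exports it from a spreadsheet, a row might be modelled roughly as below. The `RiskRegisterRow` class and its `is_actionable` check are hypothetical names invented for this sketch (Python 3.9 or later); the field values are copied from R-03 above.

```python
from dataclasses import dataclass, field


@dataclass
class RiskRegisterRow:
    risk_id: str
    risk_type: str
    component: str
    likelihood: str
    severity: str
    breadth: str
    reversibility: str
    score: str
    test_types: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        # A row only drives testing if it names test types and mitigations.
        return bool(self.test_types) and bool(self.mitigations)


r03 = RiskRegisterRow(
    risk_id="R-03",
    risk_type="Fairness",
    component="Patient prioritisation score",
    likelihood="Medium: training data under-represents rural and Māori patients",
    severity="High: wrongly de-prioritised patient waits longer for surgery",
    breadth="High: applies to every patient on the elective waitlist",
    reversibility="Low: time spent waiting cannot be given back",
    score="HIGH",
    test_types=[
        "bias/fairness testing",
        "data representativeness testing",
        "explainability testing",
    ],
    mitigations=[
        "group-level fairness thresholds",
        "clinician override",
        "quarterly fairness re-test",
    ],
)

print(r03.risk_id, r03.score, r03.is_actionable())
```

Keeping test types and mitigations as lists on the row is one way to make the link from risk to testing explicit rather than implied.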

7 Mapping Risk Types to Test Types

Each AI risk dimension is addressed by particular test types from the 42119 set. You do not invent tests per system — you read the risk type and reach for the matching tests. This table is the bridge between risk assessment and test design:

Risk type	What it threatens	Test types that address it
Fairness	Equal treatment of groups	Bias and fairness testing; data representativeness testing
Explainability	A person’s right to know why	Explainability / transparency testing; model testing for interpretability
Data	Validity of what the model learned	Data representativeness, provenance, and label-correctness testing
Drift	Continued correctness over time	Drift detection; monitoring; scheduled re-evaluation (next lesson)
Adversarial	Resistance to deliberate misuse	Adversarial / robustness testing; security testing of inputs
Performance (general)	Accuracy and reliability of output	Model testing against held-out and edge-case sets

The discipline is to score the risk first, then let the mapping choose the tests — not to start from a favourite test type and look for a reason to run it. A high fairness-risk model with no fairness testing is a gap an auditor will find in minutes.
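
The table can be treated as a simple lookup. The sketch below is one possible encoding, assuming lower-case risk-type keys and a hypothetical `select_test_types` helper; the test-type names come from the table above.

```python
# Risk type -> test types, transcribed from the mapping table.
TEST_TYPES_BY_RISK = {
    "fairness": ["bias and fairness testing", "data representativeness testing"],
    "explainability": ["explainability / transparency testing", "model testing for interpretability"],
    "data": ["data representativeness testing", "data provenance testing", "label-correctness testing"],
    "drift": ["drift detection", "monitoring", "scheduled re-evaluation"],
    "adversarial": ["adversarial / robustness testing", "security testing of inputs"],
    "performance": ["model testing against held-out and edge-case sets"],
}


def select_test_types(risk_types):
    """Return the test types for the given risk types, de-duplicated, in table order."""
    selected = []
    for risk_type in risk_types:
        # An unknown risk type raises KeyError rather than being silently skipped.
        for test_type in TEST_TYPES_BY_RISK[risk_type.lower()]:
            if test_type not in selected:
                selected.append(test_type)
    return selected


print(select_test_types(["Fairness", "Data"]))
```

Failing loudly on an unrecognised risk type is a deliberate choice here: a risk that maps to no tests is exactly the gap the register is meant to expose.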

8 High-Risk vs Low-Risk: What Actually Changes

“Test it more” is not a plan. Risk-based testing changes three concrete things as risk rises: depth, coverage, and documentation.

Aspect	Low-risk system (e.g. film recommender)	High-risk system (e.g. waitlist prioritiser)
Depth	Sanity checks: sensible output, acceptable latency, no broken responses	Deep fairness, data-quality, explainability, and edge-case testing with measured thresholds
Coverage	Common cases and a few obvious edges	Every group the system decides about, rare-but-serious cases, adversarial inputs
Documentation	Light — a short note that checks passed	Audit-ready evidence: measurements, queries, snapshot dates, reviewer sign-off, traceability to the register

Notice that the high-risk column is not just “the same tests, run longer.” It is different test types (fairness, explainability) and a different evidence standard. The recommender does not need an audit trail; the waitlist model is unusable without one. Matching that to the consequence of failure is the whole job.
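
The documentation row in particular lends itself to an automated check. As a rough sketch, assuming evidence items are tracked as short string keys (the key names and the `missing_evidence` helper are hypothetical), you could list what an evidence pack still lacks:

```python
# Evidence expected at each risk level, paraphrased from the documentation row.
REQUIRED_EVIDENCE = {
    "LOW": {"checks_passed_note"},
    "HIGH": {
        "measurements",
        "queries",
        "snapshot_dates",
        "reviewer_sign_off",
        "register_traceability",
    },
}


def missing_evidence(risk_level, provided):
    """Return the evidence items still missing for this risk level, sorted by name."""
    return sorted(REQUIRED_EVIDENCE[risk_level] - set(provided))


print(missing_evidence("HIGH", ["measurements", "queries"]))
print(missing_evidence("LOW", ["checks_passed_note"]))
```

For the high-risk call, the result lists the traceability link, the reviewer sign-off, and the snapshot dates as outstanding; the low-risk call returns an empty list.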

9 The Tester’s Role in Risk Assessment

A common confusion: does the tester decide how much risk the organisation is willing to accept? No. Setting risk appetite — the line between acceptable and unacceptable risk — is a business and governance decision, made by the people accountable for the system. A tester who quietly decides “this is fine” or “this must not ship” has stepped outside their role.

What the tester does is design and run tests that match the agreed risk appetite, and surface the evidence so the accountable people can decide with their eyes open. The tester’s contribution is: identify the risks, score them honestly, map them to test types, run the tests, and report what was found — including the risks the team has chosen to test lightly. The decision to accept a residual risk belongs to the business; the duty to make that risk visible belongs to the tester.

NZ context. The NZ Algorithm Charter, signed by most major government agencies, commits them to assess the risk of unintended consequences and to be transparent about how algorithms inform decisions, so a risk assessment is expected before deployment, not after a complaint. The Public Service Act 2020 requires public decisions to be lawful, reasonable, and fair, which raises the severity factor for any model touching a public decision. A worked example: when risk-assessing a fictional Benefits NZ benefit-fraud-flagging model, fairness risk scores HIGH because the model decides about people (severity and breadth both high) and a wrong flag is slow to reverse (low reversibility). The Charter and the Act together push it firmly into deep, documented testing.

10 Common Mistakes

🚫 Treating every AI system to the same test plan

I used to think… a thorough tester applies the same rigorous process to everything — that consistency is professionalism.
Actually… equal effort across unequal risks is the triage mistake. It over-tests the harmless system and leaves the dangerous one thin. Rigour means putting depth where the consequences of failure are greatest, not spreading it evenly.

🚫 Judging risk by accuracy alone

I used to think… the riskiest model is the least accurate one, so I rank systems by error rate.
Actually… a rarely-wrong model can be the highest risk of all if its mistakes are severe, affect many people, and cannot be reversed. Likelihood is only one of four factors. A model that is wrong often but harmlessly is lower risk than one that is wrong rarely but catastrophically.

🚫 The tester deciding what risk is acceptable

I used to think… if I judge a model too risky, it is my job to block it — or to wave it through if I think it is fine.
Actually… setting risk appetite is a governance decision for the people accountable for the system. The tester’s job is to make the risk visible with honest scoring and evidence, and to test in line with the appetite the business sets — not to set it.

Senior engineer insight

The most dangerous project I worked on was a welfare-eligibility decision tool that the team confidently rated “low risk” because it had 97% accuracy on the test set. What they hadn’t asked was: who are the 3% and what happens to them? Those cases were clustered in a specific demographic, the harm was a delayed benefit during a hardship period, and there was no appeal path that didn’t take weeks. Severity plus irreversibility pushed that 3% into territory far worse than a system wrong 30% of the time in trivially recoverable ways.

The most common mistake: teams conflate risk classification with risk management — they score the risk correctly, then continue testing exactly the same way they always did.
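
The "who are the 3%?" question can be asked of any evaluation set by breaking the error rate down by group. The records below are invented for illustration (they are not the data from the project described above); the point is only that a 97% headline can hide a 30% error rate in one group.

```python
from collections import Counter

# Hypothetical evaluation records: (group, prediction_was_correct).
records = [("A", True)] * 90 + [("B", True)] * 7 + [("B", False)] * 3

totals = Counter(group for group, _ in records)
errors = Counter(group for group, correct in records if not correct)

overall_accuracy = sum(correct for _, correct in records) / len(records)
error_rate_by_group = {group: errors[group] / totals[group] for group in totals}

print(f"overall accuracy: {overall_accuracy:.2f}")
for group, rate in sorted(error_rate_by_group.items()):
    print(f"group {group}: error rate {rate:.2f}")
```

Here all three errors fall on group B, which has only ten records, so its error rate is 0.30 while group A's is 0.00.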

From the field

A central-government team in Wellington was deploying a document-triage model to route incoming correspondence — the assumption was that it was a workflow efficiency tool, low consequence, maybe a medium risk on the register. Halfway through testing, a QA engineer noticed that some correspondence types being routed away from human review were Official Information Act requests. Under the AoG AI risk framework, anything touching citizens’ statutory rights escalates automatically to high risk, regardless of how routine the routing looks operationally. The team had classified the risk by looking at the model’s function (sorting emails) rather than the downstream consequence of mis-sorting. Reclassifying to high risk added explainability requirements, mandatory human review for the affected category, and a logging standard that hadn’t been planned. The lesson: always ask what the routed output is used for, not just what the model does with the input.

11 Now You Try

Three graded exercises on risk-based AI testing. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Risk Misjudgement

Below is a tester’s risk note for four NZ AI systems. The ranking is wrong. Identify which system is mis-ranked, explain the error using the four risk factors (likelihood, severity, breadth, reversibility), and give the corrected order from highest to lowest testing risk.

Tester’s ranking (highest to lowest risk):
1. Music-playlist recommender — “it is wrong most often, so highest risk”
2. AI sentencing-range suggestion tool for a court
3. HealthNZ cardiac-surgery waitlist prioritiser
4. Bank FAQ chatbot

Identify the mis-ranking and give the corrected order with reasons:

Model answer

Mis-ranked system: the music-playlist recommender, ranked #1 (highest risk).

Why it is mis-ranked: the tester judged risk by likelihood alone ("it is wrong most often"). But likelihood is only one of four factors. A wrong playlist suggestion is low severity (mild annoyance), trivial in impact however many listeners it reaches, and fully reversible (skip the track). High error rate × negligible harm = low risk. The tester fixated on accuracy and ignored severity, breadth, and reversibility.

Corrected order (highest to lowest testing risk):
1. HealthNZ cardiac-surgery waitlist prioritiser — severe harm (delayed life-saving surgery), broad (every cardiac patient), very low reversibility (lost time cannot be returned). Fairness + data + explainability risk.
2. AI sentencing-range suggestion tool — severe harm (liberty), affects defendants, hard to reverse; very high explainability and fairness risk, but a human judge decides, slightly lowering breadth of automated impact.
3. Bank FAQ chatbot — moderate: a wrong answer can mislead a customer, but is usually reversible and rarely life-altering.
4. Music-playlist recommender — frequent but harmless and fully reversible errors. Lowest risk.

The lesson: rank by likelihood × severity × breadth × difficulty of reversal, not by error rate.

🔧 Exercise 2 of 3 — Fix the Risk Register Row

The risk register row below is too vague to drive testing. Rewrite it as a complete 42119 row with these fields: Risk ID, Risk type, Component, Likelihood, Severity, Breadth, Reversibility, Score, Test types, Mitigation. Use a fictional Benefits NZ benefit-fraud-flagging model as the context.

Original (too vague):
“Risk: the model might be biased. This is risky. We should test it. Mitigation: be careful.”

Rewrite as a complete 42119 risk register row:

Model answer

Risk ID: R-05

Risk type: Fairness

Component: Fraud-likelihood score applied to benefit applications

Likelihood: Medium — historical investigation data over-represents certain regions and benefit types, so the model may flag some groups more often.

Severity: High — a wrong fraud flag subjects a vulnerable applicant to investigation, payment holds, and stress.

Breadth: High — the model scores every incoming benefit application.

Reversibility: Low — once a person is investigated, the time, stress, and any payment delay cannot be undone, even if cleared.

Score: HIGH (severe, broad, hard to reverse; likelihood medium).

Test types: Bias and fairness testing (group-level flag-rate parity); data representativeness testing; explainability testing so a flagged person can be given a reason.

Mitigation: Group-level fairness thresholds with a fail gate before release; mandatory human review before any flag leads to action; quarterly fairness re-test; documented appeal path.

What makes this 42119-compliant: each factor is reasoned separately, the score follows from the factors (not just "it's risky"), the test types match the fairness risk type, and the mitigation is specific and checkable — unlike "be careful".
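
A "fail gate" on group-level flag-rate parity could look something like the sketch below. The metric (absolute gap between the highest and lowest group flag rates), the 0.05 threshold, the group names, and the counts are all assumptions chosen for the example; the right metric and threshold are decisions for the accountable owner, not the tester.

```python
def flag_rates(flags_by_group):
    """Share of applications flagged in each group (flags are 0 or 1)."""
    return {group: sum(flags) / len(flags) for group, flags in flags_by_group.items()}


def parity_gate(flags_by_group, max_gap):
    """Return (passed, gap), where gap is the highest minus the lowest group flag rate."""
    rates = flag_rates(flags_by_group)
    gap = max(rates.values()) - min(rates.values())
    return gap <= max_gap, gap


# Hypothetical model outputs: 1 = flagged for investigation, 0 = not flagged.
flags_by_group = {
    "region_north": [1] * 10 + [0] * 90,
    "region_south": [1] * 25 + [0] * 75,
}

passed, gap = parity_gate(flags_by_group, max_gap=0.05)
print(f"gap={gap:.2f} passed={passed}")
```

With these invented counts the flag rates are 0.10 and 0.25, so the gap of 0.15 fails a 0.05 gate and the release would be held for review.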

🏗️ Exercise 3 of 3 — Build a Risk-Ranked Test Plan

Rank these four NZ AI systems from highest to lowest testing risk, and for each pick the top one or two test types it most needs, justified by its strongest risk dimension. Systems: playlist recommender, AI sentencing-range tool, bank FAQ chatbot, HealthNZ cardiac-surgery waitlist prioritiser.

Model answer

Rank 1 (highest risk): HealthNZ cardiac-surgery waitlist prioritiser — strongest dimension: fairness + data (severe, broad, very low reversibility) — top test types: bias/fairness testing and data representativeness testing, plus explainability so clinicians can see why a patient was scored.

Rank 2: AI sentencing-range tool — strongest dimension: explainability (a defendant has a right to know the basis) and fairness (equal treatment across groups) — top test types: explainability/transparency testing and fairness testing. Slightly below #1 because a human judge remains the decision-maker.

Rank 3: Bank FAQ chatbot — strongest dimension: adversarial/performance (people may try to extract wrong or harmful answers) — top test types: adversarial/robustness testing and model testing for response correctness. Harm is usually reversible.

Rank 4 (lowest risk): Playlist recommender — strongest dimension: none material; errors are frequent but trivial and fully reversible — top test type: light model testing for sensible output and latency.

Strong answers justify the ranking with the four factors (not error rate) and match each test type to the named risk dimension. The cardiac prioritiser must outrank the recommender even though the recommender is "wrong" more often — that is the whole point.

Why teams fail here

Classifying by model function, not downstream consequence. Teams assess what the model does (sorts, scores, predicts) rather than what happens to a person when it gets it wrong. A scoring model with a single downstream action — like a payment hold — is high risk regardless of how mechanical the scoring looks.
Treating the risk register as a compliance checkbox, not a test driver. The register gets filled in, signed off, and filed. Test cases are then written from functional requirements as usual. The link between risk ID and test design never forms — so the register scores the risk HIGH and the test plan covers only happy-path output correctness.
Reversibility assessed at the model level, not the human level. A model output can be “corrected in the next run” — technically reversible. But if a person has already been denied, investigated, or excluded in the interim, the human harm is irreversible. Teams score reversibility on the system, not on the affected person.
Risk scores that don’t change test effort. A model is rated HIGH risk in the register, but the team has the same two-week test window and the same test types as the LOW-risk system before it. Risk-based testing only works if the score actually drives resourcing and test-type selection — not just documentation.
Ignoring aggregate breadth on narrow-seeming features. A feature that affects “only” grant applicants in one region feels narrow. But if that model processes every application — two thousand people a month — the breadth factor is high. Teams undercount breadth when the population has a qualifier that sounds small.
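
The second and fourth failures above (a register that never drives test design) can be caught with a simple traceability check. This is a sketch under assumed data shapes: the register as a dictionary of risk ID to score, test cases as dictionaries with a `risk_id` key, and all IDs other than R-03 invented for the example.

```python
# Hypothetical register scores and test cases.
register = {"R-03": "HIGH", "R-07": "LOW", "R-09": "HIGH"}
test_cases = [
    {"id": "TC-101", "risk_id": "R-03", "test_type": "bias and fairness testing"},
    {"id": "TC-102", "risk_id": "R-07", "test_type": "model testing"},
]


def untested_high_risks(register, test_cases):
    """Return HIGH-scored risk IDs that no test case traces back to."""
    covered = {tc["risk_id"] for tc in test_cases}
    return sorted(
        risk_id
        for risk_id, score in register.items()
        if score == "HIGH" and risk_id not in covered
    )


print(untested_high_risks(register, test_cases))
```

In this example R-09 is scored HIGH but has no linked test case, which is the kind of gap an auditor would look for first.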

Enterprise reality

Enterprise AI portfolios with dozens of models across regulated use cases

Risk classification must cover the entire model portfolio, not just new deployments. At scale, organisations maintain AI inventories — registers of every model in production, each with its own risk score — so that a change in regulatory environment or a new use case triggers a re-score across the board, not just for the system that changed.
High-risk AI systems require independent validation, not just internal testing. In regulated sectors — banking, insurance, health — enterprise governance frameworks mandate that models above a risk threshold are reviewed by a team that did not build or test them, often a dedicated AI risk or model risk function. The tester’s job shifts from sole assessor to evidence packager: produce artefacts an independent reviewer can audit without asking questions.
Risk re-assessment is triggered, not scheduled. Individual teams test before release; enterprises test on triggers — a model update, detected data drift, a new use case added to an existing model, or a regulatory change. The risk register is a living document queried by automated monitoring pipelines, not a one-time pre-release document.
Risk decisions are signed off by named accountable executives. At organisational scale, a HIGH-risk finding does not stay with the test team. It escalates to a model risk committee or a named risk owner — a Chief Risk Officer, a Chief Data Officer — who signs the residual-risk acceptance. That signature is the audit trail. The tester’s role is to make the evidence so clear that a senior executive who does not read code can still make an informed decision.
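
Trigger-based re-assessment can be pictured as a filter over an event stream. The event names, the `Event` class, and the model IDs below are hypothetical; the four trigger kinds mirror the list above (model update, detected data drift, new use case, regulatory change).

```python
from dataclasses import dataclass

# Assumed event kinds that should prompt a risk re-score.
RESCORE_TRIGGERS = {"model_update", "data_drift_detected", "new_use_case", "regulatory_change"}


@dataclass(frozen=True)
class Event:
    model_id: str
    kind: str


def models_to_rescore(events):
    """Return the IDs of models with at least one re-score trigger, sorted by name."""
    return sorted({e.model_id for e in events if e.kind in RESCORE_TRIGGERS})


events = [
    Event("waitlist-prioritiser", "data_drift_detected"),
    Event("film-recommender", "dashboard_viewed"),
    Event("fraud-flagger", "regulatory_change"),
    Event("waitlist-prioritiser", "model_update"),
]

print(models_to_rescore(events))
```

Only the fraud flagger and the waitlist prioritiser are returned; the recommender's event is not a trigger, so its existing score stands.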

How this has changed

The field moved fast. Here is what the evolution looked like for Risk-Based AI Testing.

Pre-2023

AI risk assessment is ad hoc. ML teams use accuracy metrics; no formal test risk framework exists for AI systems.

2024

EU AI Act passes — first major regulation requiring formal risk classification for AI systems. High-risk AI (healthcare, employment, law enforcement) must be auditable.

2025

ISO/IEC TS 42119-2 published with explicit risk categorisation guidance for AI testing. NZ organisations with EU customers must align. NZISM guidance on AI security risk begins circulating.

Now

Risk-based AI testing is expected by auditors and procurement teams. Organisations without documented risk assessments face compliance exposure.

12 Self-Check

Q1: Why is it a mistake to give a film recommender and a benefits model the same test plan?

Because the consequences of failure differ enormously. A wrong recommendation is mild and reversible; a wrong benefits decision is severe, falls on vulnerable people, and is hard to undo. Equal effort over-tests the harmless system and dangerously under-tests the high-stakes one. Risk-based testing puts depth where the consequences are greatest — the triage rule.

Q2: Name the four factors in the AI risk equation.

Likelihood (how often it is wrong), severity (how bad the harm is), breadth (how many people are affected), and reversibility (how hard the harm is to undo). Risk is roughly the product of all four — not accuracy alone.

Q3: A model is rarely wrong. Can it still be the highest-risk system you test?

Yes. Low likelihood is only one factor. If the rare errors are severe, affect many people, and cannot be reversed — a missed cancer referral, say — the system can be the highest risk of all. Judging risk by error rate alone is the classic mistake.

Q4: What does the risk register link a scored risk to, and why does that matter?

It links each named, scored risk to specific test types and a mitigation. That matters because every serious AI test case can then trace back to a numbered risk — so when someone asks why a model was tested so heavily (or so lightly), the answer is a register entry, not a hunch. It also records, on paper, which risks the team chose to test lightly.

Q5: Does the tester decide what level of risk is acceptable?

No. Setting risk appetite is a governance decision for the people accountable for the system. The tester identifies and scores risks honestly, maps them to test types, runs the tests, and makes the findings — including residual risk — visible. The decision to accept a residual risk belongs to the business; making it visible belongs to the tester.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“We have limited time to test three AI features. How would you decide where to focus?”

I would risk-rank them rather than split time evenly. For each feature I would judge four factors: how often it is likely to be wrong, how severe the harm is when it is, how many people it affects, and how reversible the harm is. The feature scoring highest across those — usually one that makes decisions about people and is hard to undo — gets the deepest, best-documented testing; the lowest-risk feature gets sanity checks. I would record that ranking in a risk register so the decision is visible and defensible, not just my preference.

“A stakeholder says our model is 99% accurate, so it is low risk. How do you respond?”

I would agree that accuracy is good news and then point out it is only one of four risk factors. The question I would raise is what happens in the 1% — how severe is the harm, how many people does it touch, and can it be reversed? A 1% error rate on a model that delays surgery or wrongly flags someone for fraud is very high risk despite the headline number, because those errors are severe and hard to undo. I would want to assess severity, breadth, and reversibility before agreeing on the testing depth.

“If you think a model is too risky to release, do you block it?”

No — that is not the tester’s call. Deciding what risk the organisation will accept is a governance decision for the people accountable for the system. My job is to make the risk impossible to miss: score it honestly, show the evidence, and state plainly what I found and what residual risk remains. I would escalate clearly and make sure the decision-maker is choosing with full information. The accountable owner accepts or rejects the residual risk; I make sure they are not doing it blind.

Key takeaway

Risk-based AI testing is not about testing more — it is about testing differently: the risk score tells you which test types to run, which populations to cover, and what evidence standard an auditor will expect, and none of that follows from accuracy metrics alone.
