Test with AI · ISO/IEC 42119

Risk-Based AI Testing

How to calibrate test depth to the consequences of getting it wrong. Two AI systems can both be wrong — but if you test them the same way, you have missed the point of testing.

ISO/IEC TS 42119-2:2025 · ~15 min read · ~50 min with exercises

1 The Hook

Imagine two AI systems built by the same team in the same month. The first is a streaming-service recommender that suggests which film a viewer might watch next. The second is a model that scores benefit applications and routes them towards approval or decline. Both are machine-learning systems. Both will be wrong some of the time — no model is perfect.

Now ask what happens when each one is wrong. If the recommender suggests a film you do not fancy, you scroll past it. Mild annoyance, no harm, fully reversible in one click. If the benefit-application model wrongly steers a struggling family towards decline, someone who needed support may not get it — or gets it weeks late, after a stressful review. The harm is real, it falls on people who can least absorb it, and it is hard to undo.

A tester who writes the same test plan for both has not understood what testing is for. The recommender deserves a light touch: does it return sensible suggestions, does it respond quickly, does it avoid obviously broken output. The benefit model deserves deep fairness testing, careful data-quality checks, explainability evidence, and documentation an auditor can read. Same effort spread evenly across both is effort wasted on one and dangerously thin on the other.

ISO/IEC 42119 builds this idea in. It describes AI testing as risk-based: the depth, the coverage, and the documentation you apply to a system should be proportional to the risk that system carries. This lesson teaches you how to judge that risk and turn the judgement into a test plan.

2 The Rule

AI testing depth must be proportional to risk. Risk is not one number — it is the likelihood of failure, multiplied by the severity of the harm, multiplied by the number of people affected, multiplied by how hard the harm is to reverse. A high-stakes model gets deep, documented testing; a low-stakes one gets a light touch. Spreading the same effort evenly across both under-tests the dangerous system and over-tests the harmless one.

3 The Analogy

Emergency-services triage.

One paramedic arrives at a scene with two patients: one has stubbed a toe, the other is in cardiac arrest. The paramedic does not give each of them an equal share of attention. That would be fair-sounding and fatal. Triage means matching the depth of the response to the seriousness of the case — the cardiac arrest gets everything, the stubbed toe gets a quick check and a plaster.

AI testing under finite time and budget is the same. You always have more systems and features than testing capacity. Risk-based testing is the triage rule: put your depth where the consequences of failure are greatest. Treating a recommender and a benefits model identically is the testing equivalent of splitting your attention evenly between the toe and the heart.

4 The AI Risk Equation

“Risk” on its own is too vague to plan with. 42119 and its parent risk standards break it into factors you can reason about one at a time. For AI testing, four factors matter:

Risk ≈ Likelihood × Severity × Breadth × (how hard to reverse)

  • Likelihood: how often is this system likely to be wrong, given the data and the difficulty of the task?
  • Severity: how bad is the harm when it is wrong? A wrong film suggestion versus a wrong medical-priority decision.
  • Breadth: how many people are affected? A model used on every applicant in the country carries more risk than one used by a single back-office team.
  • Reversibility: if it gets it wrong, can the harm be undone, and how easily? A reversible error (resend the email) is far less risky than an irreversible one (a missed cancer-screening referral).

You do not compute a precise product — these are not exact numbers. You use the four factors as a structured way to rank systems and features against each other. A model that is often wrong but always reversible and affects few people is lower risk than one that is rarely wrong but, when it is, causes severe and irreversible harm to many. The four factors stop you fixating on accuracy alone, which is the most common risk-assessment mistake in AI.
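A minimal Python sketch of using the four factors to rank rather than to measure. The 1-to-3 ordinal scale, the `RiskProfile` class, and the levels assigned to the two example systems are illustrative assumptions, not values taken from ISO/IEC 42119. The fourth factor is stored as irreversibility so that a higher level always means more risk.

```python
from dataclasses import dataclass

# Ordinal levels only: 1 = low, 2 = medium, 3 = high.
# Hypothetical scale used for ranking; these are not measurements.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class RiskProfile:
    name: str
    likelihood: str       # how often the system is likely to be wrong
    severity: str         # how bad the harm is when it is wrong
    breadth: str          # how many people are affected
    irreversibility: str  # how hard the harm is to undo ("high" = hard)

    def rank_score(self) -> int:
        """Product of the four ordinal levels, used only to order systems."""
        return (LEVELS[self.likelihood] * LEVELS[self.severity]
                * LEVELS[self.breadth] * LEVELS[self.irreversibility])


def rank_by_risk(profiles):
    """Return profiles ordered from highest to lowest rank score."""
    return sorted(profiles, key=lambda p: p.rank_score(), reverse=True)


# Assumed levels for the two systems from section 1.
recommender = RiskProfile("film recommender", "high", "low", "high", "low")
benefits = RiskProfile("benefit-application scorer", "low", "high", "high", "high")

for profile in rank_by_risk([recommender, benefits]):
    print(f"{profile.name}: {profile.rank_score()}")
```

With these assumed levels the benefit scorer ranks above the recommender (27 against 9) even though it is wrong less often. The scores have no meaning beyond the ordering they produce.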

Pro tip: Reversibility is the factor teams forget. A model can have a modest error rate and still be very high risk if its mistakes cannot be walked back. Always ask “and if it is wrong, what then?” before you decide how hard to test it.

5 AI-Specific Risk Dimensions

The four factors tell you how much risk a system carries. The next question is what kind. AI systems fail in ways traditional software does not, and each kind of risk points to a different sort of test. 42119 and its companions (ISO/IEC 25010 for quality characteristics, ISO/IEC 23894 for AI risk management) recognise these AI-specific dimensions:

  • Fairness risk: the system treats groups of people unequally. Highest where the model makes decisions about people — eligibility, prioritisation, scoring.
  • Explainability risk: no one can say why the model produced a given output. Highest where a person has a right to an explanation of a decision that affects them.
  • Data risk: the training or input data is unrepresentative, mislabelled, or of unknown origin (the subject of the data-quality lesson). Highest where the data is old, narrow, or hard to verify.
  • Drift risk: the model degrades over time as the world changes. Highest where the environment is volatile — markets, weather, behaviour (the subject of the next lesson).
  • Adversarial risk: someone deliberately feeds the model crafted inputs to make it behave badly. Highest where there is an incentive to game the system — fraud, security, content moderation.

Naming the dimension is what turns “this is risky” into a testable plan. A model is not just “high risk” — it is high fairness risk and high explainability risk, which tells you precisely which test types to deepen.

6 Building an AI Test Risk Register

A risk register is the document that records each identified risk, scores it, and links it to the testing that addresses it. It is the spine of a 42119 test plan: every serious AI test case traces back to a numbered risk in it. Each row captures the factors from section 4 plus the risk type from section 5.

Here is a worked register fragment for a fictional Te Whatu Ora elective-surgery waitlist prioritisation model that scores patients to help order a surgical waiting list:

Risk ID: R-03
Risk type: Fairness
Component: Patient prioritisation score
Likelihood: Medium — training data under-represents rural and Māori patients
Severity: High — wrongly de-prioritised patient waits longer for surgery
Breadth: High — applies to every patient on the elective waitlist
Reversibility: Low — time spent waiting cannot be given back
Score: HIGH
Test types: Bias/fairness testing; data representativeness testing; explainability testing
Mitigation: Group-level fairness thresholds; clinician override; quarterly fairness re-test

The value of the register is not the score in isolation — it is the link from a named, scored risk to specific test types and a mitigation. When an auditor or a manager asks “why did you fairness-test this model so heavily?”, the answer is R-03, not a hunch. And when budget is tight, the register tells you which risks you are choosing to test lightly, on the record.
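One way to keep that link checkable is to hold the register as structured data. The sketch below is an assumption about how a team might do this, not a format prescribed by the standard: the `RiskEntry` class, the test-case IDs (`TC-101` and so on), and the risk ID `R-99` are hypothetical, while the `R-03` values are copied from the fragment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEntry:
    risk_id: str
    risk_type: str
    component: str
    likelihood: str
    severity: str
    breadth: str
    reversibility: str
    score: str
    test_types: tuple
    mitigation: tuple


r03 = RiskEntry(
    risk_id="R-03",
    risk_type="Fairness",
    component="Patient prioritisation score",
    likelihood="Medium",
    severity="High",
    breadth="High",
    reversibility="Low",
    score="HIGH",
    test_types=("Bias/fairness testing",
                "Data representativeness testing",
                "Explainability testing"),
    mitigation=("Group-level fairness thresholds",
                "Clinician override",
                "Quarterly fairness re-test"),
)

register = {r03.risk_id: r03}

# Hypothetical test cases, each naming the risk it claims to address.
test_cases = {"TC-101": "R-03", "TC-102": "R-03", "TC-200": "R-99"}


def untraceable(test_cases, register):
    """Test-case IDs whose risk ID does not appear in the register."""
    return sorted(tc for tc, rid in test_cases.items() if rid not in register)


def risks_without_tests(test_cases, register):
    """Register risk IDs that no test case points at."""
    covered = set(test_cases.values())
    return sorted(rid for rid in register if rid not in covered)


print(untraceable(test_cases, register))          # ['TC-200']
print(risks_without_tests(test_cases, register))  # []
```

The two checks run in opposite directions: the first finds tests with no recorded reason to exist, the second finds recorded risks that nothing is testing.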

7 Mapping Risk Types to Test Types

Each AI risk dimension is addressed by particular test types from the 42119 set. You do not invent tests per system — you read the risk type and reach for the matching tests. This table is the bridge between risk assessment and test design:

Risk type | What it threatens | Test types that address it
Fairness | Equal treatment of groups | Bias and fairness testing; data representativeness testing
Explainability | A person’s right to know why | Explainability / transparency testing; model testing for interpretability
Data | Validity of what the model learned | Data representativeness, provenance, and label-correctness testing
Drift | Continued correctness over time | Drift detection; monitoring; scheduled re-evaluation (next lesson)
Adversarial | Resistance to deliberate misuse | Adversarial / robustness testing; security testing of inputs
Performance (general) | Accuracy and reliability of output | Model testing against held-out and edge-case sets

The discipline is to score the risk first, then let the mapping choose the tests — not to start from a favourite test type and look for a reason to run it. A high fairness-risk model with no fairness testing is a gap an auditor will find in minutes.
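The "score first, then let the mapping choose" discipline can be sketched as a lookup plus a gap check. The dictionary below transcribes the table in this section. Treating every listed test type as required for a risk type is a simplifying assumption, and the function names are hypothetical.

```python
# Risk type -> test types that address it, transcribed from the table.
TESTS_FOR_RISK = {
    "Fairness": {"bias and fairness testing",
                 "data representativeness testing"},
    "Explainability": {"explainability / transparency testing",
                       "model testing for interpretability"},
    "Data": {"data representativeness testing",
             "data provenance testing",
             "label-correctness testing"},
    "Drift": {"drift detection", "monitoring", "scheduled re-evaluation"},
    "Adversarial": {"adversarial / robustness testing",
                    "security testing of inputs"},
    "Performance": {"model testing against held-out and edge-case sets"},
}


def required_tests(risk_types):
    """Union of the test types mapped to the given risk types."""
    required = set()
    for risk_type in risk_types:
        required |= TESTS_FOR_RISK[risk_type]
    return required


def coverage_gaps(risk_types, planned_tests):
    """Mapped test types that are missing from the planned tests."""
    return sorted(required_tests(risk_types) - set(planned_tests))


# A model scored as high fairness and explainability risk, with a thin plan.
planned = {"explainability / transparency testing",
           "model testing against held-out and edge-case sets"}
for gap in coverage_gaps(["Fairness", "Explainability"], planned):
    print("missing:", gap)
```

Here the plan starts from a general model test and one explainability test, so the check reports the fairness and interpretability tests as missing. That is the gap an auditor would otherwise find.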

8 High-Risk vs Low-Risk: What Actually Changes

“Test it more” is not a plan. Risk-based testing changes three concrete things as risk rises: depth, coverage, and documentation.

Aspect | Low-risk system (e.g. film recommender) | High-risk system (e.g. waitlist prioritiser)
Depth | Sanity checks: sensible output, acceptable latency, no broken responses | Deep fairness, data-quality, explainability, and edge-case testing with measured thresholds
Coverage | Common cases and a few obvious edges | Every group the system decides about, rare-but-serious cases, adversarial inputs
Documentation | Light: a short note that checks passed | Audit-ready evidence: measurements, queries, snapshot dates, reviewer sign-off, traceability to the register

Notice that the high-risk column is not just “the same tests, run longer.” It is different test types (fairness, explainability) and a different evidence standard. The recommender does not need an audit trail; the waitlist model is unusable without one. Matching that to the consequence of failure is the whole job.
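To make "measured thresholds" in the Depth row concrete, here is a hedged sketch of one possible fairness check: comparing how often each group receives an adverse outcome (a flag) and failing the gate when the gap exceeds an agreed limit. The function names, the synthetic records, and the 0.1 limit are all assumptions for illustration. Flag-rate parity is only one of several fairness measures, and the limit itself would come from the organisation's agreed risk appetite.

```python
def flag_rates(records):
    """records: iterable of (group, flagged) pairs -> {group: flag rate}."""
    totals, flagged = {}, {}
    for group, was_flagged in records:
        totals[group] = totals.get(group, 0) + 1
        flagged[group] = flagged.get(group, 0) + (1 if was_flagged else 0)
    return {group: flagged[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Largest difference in flag rate between any two groups."""
    return max(rates.values()) - min(rates.values())


def fairness_gate(records, max_gap):
    """Measure the gap and report whether it is within the agreed limit."""
    rates = flag_rates(records)
    gap = parity_gap(rates)
    return {"rates": rates, "gap": gap, "passed": gap <= max_gap}


# Synthetic data: 2 of 10 flagged in group_a, 5 of 10 flagged in group_b.
records = ([("group_a", True)] * 2 + [("group_a", False)] * 8
           + [("group_b", True)] * 5 + [("group_b", False)] * 5)

result = fairness_gate(records, max_gap=0.1)  # 0.1 is a hypothetical limit
print(result["rates"])
print("passed:", result["passed"])
```

The returned dictionary keeps the measured rates alongside the pass/fail result, which is the kind of evidence the Documentation row asks for in a high-risk system.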

9 The Tester’s Role in Risk Assessment

A common confusion: does the tester decide how much risk the organisation is willing to accept? No. Setting risk appetite — the line between acceptable and unacceptable risk — is a business and governance decision, made by the people accountable for the system. A tester who quietly decides “this is fine” or “this must not ship” has stepped outside their role.

What the tester does is design and run tests that match the agreed risk appetite, and surface the evidence so the accountable people can decide with their eyes open. The tester’s contribution is: identify the risks, score them honestly, map them to test types, run the tests, and report what was found — including the risks the team has chosen to test lightly. The decision to accept a residual risk belongs to the business; the duty to make that risk visible belongs to the tester.

NZ context. The NZ Algorithm Charter, signed by most major government agencies, commits them to assess the risk of unintended consequences and to be transparent about how algorithms inform decisions. A risk assessment is expected before deployment, not after a complaint. The Public Service Act 2020 requires public decisions to be lawful, reasonable, and fair, which raises the severity factor for any model touching a public decision. A worked example: when risk-assessing a fictional MSD benefit-fraud-flagging model, fairness risk scores HIGH because the model decides about people (severity and breadth both high) and a wrong flag is slow to reverse (low reversibility). The Charter and the Act together therefore push it firmly into deep, documented testing.

10 Common Mistakes

🚫 Treating every AI system to the same test plan

I used to think… a thorough tester applies the same rigorous process to everything — that consistency is professionalism.
Actually… equal effort across unequal risks is the triage mistake. It over-tests the harmless system and leaves the dangerous one thin. Rigour means putting depth where the consequences of failure are greatest, not spreading it evenly.

🚫 Judging risk by accuracy alone

I used to think… the riskiest model is the least accurate one, so I rank systems by error rate.
Actually… a rarely-wrong model can be the highest risk of all if its mistakes are severe, affect many people, and cannot be reversed. Likelihood is only one of four factors. A model that is wrong often but harmlessly is lower risk than one that is wrong rarely but catastrophically.

🚫 The tester deciding what risk is acceptable

I used to think… if I judge a model too risky, it is my job to block it — or to wave it through if I think it is fine.
Actually… setting risk appetite is a governance decision for the people accountable for the system. The tester’s job is to make the risk visible with honest scoring and evidence, and to test in line with the appetite the business sets — not to set it.

11 Now You Try

Three graded exercises on risk-based AI testing. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Risk Misjudgement

Below is a tester’s risk note for four NZ AI systems. The ranking is wrong. Identify which system is mis-ranked, explain the error using the four risk factors (likelihood, severity, breadth, reversibility), and give the corrected order from highest to lowest testing risk.

Tester’s ranking (highest to lowest risk):
1. Music-playlist recommender — “it is wrong most often, so highest risk”
2. AI sentencing-range suggestion tool for a court
3. Te Whatu Ora cardiac-surgery waitlist prioritiser
4. Bank FAQ chatbot

Identify the mis-ranking and give the corrected order with reasons:

Model answer:
Mis-ranked system: the music-playlist recommender, ranked #1 (highest risk).

Why it is mis-ranked: the tester judged risk by likelihood alone ("it is wrong most often"). But likelihood is only one of four factors. A wrong playlist suggestion is low severity (mild annoyance), may reach many listeners but with trivial impact on each, and is fully reversible (skip the track). A high error rate multiplied by negligible harm still gives low risk. The tester fixated on accuracy and ignored severity, breadth, and reversibility.

Corrected order (highest to lowest testing risk):
1. Te Whatu Ora cardiac-surgery waitlist prioritiser: severe harm (delayed life-saving surgery), broad (every cardiac patient), very low reversibility (lost time cannot be returned). Fairness + data + explainability risk.
2. AI sentencing-range suggestion tool: severe harm (liberty), affects defendants, hard to reverse; very high explainability and fairness risk, but a human judge decides, which slightly lowers the likelihood that a wrong suggestion becomes a wrong sentence.
3. Bank FAQ chatbot: moderate. A wrong answer can mislead a customer, but is usually reversible and rarely life-altering.
4. Music-playlist recommender: frequent but harmless and fully reversible errors. Lowest risk.

The lesson: rank by likelihood × severity × breadth × how hard the harm is to reverse, not by error rate.

🔧 Exercise 2 of 3 — Fix the Risk Register Row

The risk register row below is too vague to drive testing. Rewrite it as a complete 42119 row with these fields: Risk ID, Risk type, Component, Likelihood, Severity, Breadth, Reversibility, Score, Test types, Mitigation. Use a fictional MSD benefit-fraud-flagging model as the context.

Original (too vague):
“Risk: the model might be biased. This is risky. We should test it. Mitigation: be careful.”

Rewrite as a complete 42119 risk register row:

Model answer:
Risk ID: R-05

Risk type: Fairness

Component: Fraud-likelihood score applied to benefit applications

Likelihood: Medium — historical investigation data over-represents certain regions and benefit types, so the model may flag some groups more often.

Severity: High — a wrong fraud flag subjects a vulnerable applicant to investigation, payment holds, and stress.

Breadth: High — the model scores every incoming benefit application.

Reversibility: Low — once a person is investigated, the time, stress, and any payment delay cannot be undone, even if cleared.

Score: HIGH (severe, broad, hard to reverse; likelihood medium).

Test types: Bias and fairness testing (group-level flag-rate parity); data representativeness testing; explainability testing so a flagged person can be given a reason.

Mitigation: Group-level fairness thresholds with a fail gate before release; mandatory human review before any flag leads to action; quarterly fairness re-test; documented appeal path.

What makes this 42119-compliant: each factor is reasoned separately, the score follows from the factors (not just "it's risky"), the test types match the fairness risk type, and the mitigation is specific and checkable, unlike "be careful".

🏗️ Exercise 3 of 3 — Build a Risk-Ranked Test Plan

Rank these four NZ AI systems from highest to lowest testing risk, and for each pick the top one or two test types it most needs, justified by its strongest risk dimension. Systems: playlist recommender, AI sentencing-range tool, bank FAQ chatbot, Te Whatu Ora cardiac-surgery waitlist prioritiser.

Model answer:
Rank 1 (highest risk): Te Whatu Ora cardiac-surgery waitlist prioritiser — strongest dimension: fairness + data (severe, broad, very low reversibility) — top test types: bias/fairness testing and data representativeness testing, plus explainability so clinicians can see why a patient was scored.

Rank 2: AI sentencing-range tool — strongest dimension: explainability (a defendant has a right to know the basis) and fairness (equal treatment across groups) — top test types: explainability/transparency testing and fairness testing. Slightly below #1 because a human judge remains the decision-maker.

Rank 3: Bank FAQ chatbot — strongest dimension: adversarial/performance (people may try to extract wrong or harmful answers) — top test types: adversarial/robustness testing and model testing for response correctness. Harm is usually reversible.

Rank 4 (lowest risk): Playlist recommender — strongest dimension: none material; errors are frequent but trivial and fully reversible — top test type: light model testing for sensible output and latency.

Strong answers justify the ranking with the four factors (not error rate) and match each test type to the named risk dimension. The cardiac prioritiser must outrank the recommender even though the recommender is "wrong" more often — that is the whole point.

12 Self-Check

Q1: Why is it a mistake to give a film recommender and a benefits model the same test plan?

Because the consequences of failure differ enormously. A wrong recommendation is mild and reversible; a wrong benefits decision is severe, falls on vulnerable people, and is hard to undo. Equal effort over-tests the harmless system and dangerously under-tests the high-stakes one. Risk-based testing puts depth where the consequences are greatest — the triage rule.

Q2: Name the four factors in the AI risk equation.

Likelihood (how often it is wrong), severity (how bad the harm is), breadth (how many people are affected), and reversibility (how hard the harm is to undo). Risk is roughly the product of all four — not accuracy alone.

Q3: A model is rarely wrong. Can it still be the highest-risk system you test?

Yes. Low likelihood is only one factor. If the rare errors are severe, affect many people, and cannot be reversed — a missed cancer referral, say — the system can be the highest risk of all. Judging risk by error rate alone is the classic mistake.

Q4: What does the risk register link a scored risk to, and why does that matter?

It links each named, scored risk to specific test types and a mitigation. That matters because every serious AI test case can then trace back to a numbered risk — so when someone asks why a model was tested so heavily (or so lightly), the answer is a register entry, not a hunch. It also records, on paper, which risks the team chose to test lightly.

Q5: Does the tester decide what level of risk is acceptable?

No. Setting risk appetite is a governance decision for the people accountable for the system. The tester identifies and scores risks honestly, maps them to test types, runs the tests, and makes the findings — including residual risk — visible. The decision to accept a residual risk belongs to the business; making it visible belongs to the tester.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“We have limited time to test three AI features. How would you decide where to focus?”

I would risk-rank them rather than split time evenly. For each feature I would judge four factors: how often it is likely to be wrong, how severe the harm is when it is, how many people it affects, and how reversible the harm is. The feature scoring highest across those — usually one that makes decisions about people and is hard to undo — gets the deepest, best-documented testing; the lowest-risk feature gets sanity checks. I would record that ranking in a risk register so the decision is visible and defensible, not just my preference.

“A stakeholder says our model is 99% accurate, so it is low risk. How do you respond?”

I would agree that accuracy is good news and then point out it is only one of four risk factors. The question I would raise is what happens in the 1% — how severe is the harm, how many people does it touch, and can it be reversed? A 1% error rate on a model that delays surgery or wrongly flags someone for fraud is very high risk despite the headline number, because those errors are severe and hard to undo. I would want to assess severity, breadth, and reversibility before agreeing on the testing depth.

“If you think a model is too risky to release, do you block it?”

No — that is not the tester’s call. Deciding what risk the organisation will accept is a governance decision for the people accountable for the system. My job is to make the risk impossible to miss: score it honestly, show the evidence, and state plainly what I found and what residual risk remains. I would escalate clearly and make sure the decision-maker is choosing with full information. The accountable owner accepts or rejects the residual risk; I make sure they are not doing it blind.