Risk-Based AI Testing
How to calibrate test depth to the consequences of getting it wrong. Two AI systems can both be wrong — but if you test them the same way, you have missed the point of testing.
1 The Hook
Imagine two AI systems built by the same team in the same month. The first is a streaming-service recommender that suggests which film a viewer might watch next. The second is a model that scores benefit applications and routes them towards approval or decline. Both are machine-learning systems. Both will be wrong some of the time — no model is perfect.
Now ask what happens when each one is wrong. If the recommender suggests a film you do not fancy, you scroll past it. Mild annoyance, no harm, fully reversible in one click. If the benefit-application model wrongly steers a struggling family towards decline, someone who needed support may not get it — or gets it weeks late, after a stressful review. The harm is real, it falls on people who can least absorb it, and it is hard to undo.
A tester who writes the same test plan for both has not understood what testing is for. The recommender deserves a light touch: does it return sensible suggestions, does it respond quickly, does it avoid obviously broken output. The benefit model deserves deep fairness testing, careful data-quality checks, explainability evidence, and documentation an auditor can read. The same effort spread evenly across both is wasted on one and dangerously thin on the other.
ISO/IEC 42119 builds this idea in. It describes AI testing as risk-based: the depth, the coverage, and the documentation you apply to a system should be proportional to the risk that system carries. This lesson teaches you how to judge that risk and turn the judgement into a test plan.
2 The Rule
AI testing depth must be proportional to risk. Risk is not one number — it is the likelihood of failure, multiplied by the severity of the harm, multiplied by the number of people affected, multiplied by how hard the harm is to reverse. A high-stakes model gets deep, documented testing; a low-stakes one gets a light touch. Spreading the same effort evenly across both under-tests the dangerous system and over-tests the harmless one.
3 The Analogy
Emergency-services triage.
One paramedic arrives at a scene with two patients: one has stubbed a toe, the other is in cardiac arrest. The paramedic does not give each of them an equal share of attention. That would be fair-sounding and fatal. Triage means matching the depth of the response to the seriousness of the case — the cardiac arrest gets everything, the stubbed toe gets a quick check and a plaster.
AI testing under finite time and budget is the same. You always have more systems and features than testing capacity. Risk-based testing is the triage rule: put your depth where the consequences of failure are greatest. Treating a recommender and a benefits model identically is the testing equivalent of splitting your attention evenly between the toe and the heart.
4 The AI Risk Equation
“Risk” on its own is too vague to plan with. 42119 and its parent risk standards break it into factors you can reason about one at a time. For AI testing, four factors matter:
Likelihood — how often is this system likely to be wrong, given the data and the difficulty of the task?
Severity — how bad is the harm when it is wrong? A wrong film suggestion versus a wrong medical-priority decision.
Breadth — how many people are affected? A model used on every applicant in the country carries more risk than one used by a single back-office team.
Reversibility — if it gets it wrong, can the harm be undone, and how easily? A reversible error (resend the email) is far less risky than an irreversible one (a missed cancer-screening referral).
You do not compute a precise product, because these are not exact numbers. You use the four factors as a structured way to rank systems and features against each other. A model that is often wrong, but whose errors are always reversible and affect few people, is lower risk than one that is rarely wrong but, when it is, causes severe and irreversible harm to many. The four factors stop you fixating on accuracy alone, which is the most common risk-assessment mistake in AI.
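One way to make the ranking concrete is to give each factor an ordinal score. The sketch below is illustrative only: the 1-to-3 scale, the `RiskFactors` name, and the example scores are assumptions rather than anything 42119 prescribes. Reversibility is stored as `irreversibility` so that a higher number always means more risk, and the product is used purely to order systems, not as a measurement.

```python
from dataclasses import dataclass


# Ordinal scale assumed for illustration: 1 = low, 2 = medium, 3 = high.
@dataclass(frozen=True)
class RiskFactors:
    name: str
    likelihood: int       # how often the system is likely to be wrong
    severity: int         # how bad the harm is when it is wrong
    breadth: int          # how many people are affected
    irreversibility: int  # how hard the harm is to undo (3 = cannot be undone)

    def rank_key(self) -> int:
        # A product of ordinal scores is a ranking aid, not a measurement.
        return self.likelihood * self.severity * self.breadth * self.irreversibility


def rank_by_risk(systems):
    """Return systems ordered from highest to lowest relative risk."""
    return sorted(systems, key=lambda s: s.rank_key(), reverse=True)


# Hypothetical scores for the two systems from the opening example.
recommender = RiskFactors(
    "film recommender", likelihood=3, severity=1, breadth=3, irreversibility=1
)
benefits_model = RiskFactors(
    "benefit-application scorer", likelihood=1, severity=3, breadth=3, irreversibility=3
)

ranking = [s.name for s in rank_by_risk([recommender, benefits_model])]
print(ranking)
```

With these assumed scores the rarely-wrong benefits model ranks above the often-wrong recommender, because severity and irreversibility outweigh its low likelihood.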
5 AI-Specific Risk Dimensions
The four factors tell you how much risk a system carries. The next question is what kind. AI systems fail in ways traditional software does not, and each kind of risk points to a different sort of test. 42119 and its companions (ISO/IEC 25010 for quality characteristics, ISO/IEC 23894 for AI risk management) recognise these AI-specific dimensions:
- Fairness risk: the system treats groups of people unequally. Highest where the model makes decisions about people — eligibility, prioritisation, scoring.
- Explainability risk: no one can say why the model produced a given output. Highest where a person has a right to an explanation of a decision that affects them.
- Data risk: the training or input data is unrepresentative, mislabelled, or of unknown origin (the subject of the data-quality lesson). Highest where the data is old, narrow, or hard to verify.
- Drift risk: the model degrades over time as the world changes. Highest where the environment is volatile — markets, weather, behaviour (the subject of the next lesson).
- Adversarial risk: someone deliberately feeds the model crafted inputs to make it behave badly. Highest where there is an incentive to game the system — fraud, security, content moderation.
Naming the dimension is what turns “this is risky” into a testable plan. A model is not just “high risk” — it is high fairness risk and high explainability risk, which tells you precisely which test types to deepen.
6 Building an AI Test Risk Register
A risk register is the document that records each identified risk, scores it, and links it to the testing that addresses it. It is the spine of a 42119 test plan — every serious AI test case traces back to a numbered risk in here. Each row captures the factors from section 4 plus the risk type from section 5.
Here is a worked register fragment for a fictional Te Whatu Ora elective-surgery waitlist prioritisation model that scores patients to help order a surgical waiting list:
Risk ID: R-03
Risk type: Fairness
Component: Patient prioritisation score
Likelihood: Medium — training data under-represents rural and Māori patients
Severity: High — wrongly de-prioritised patient waits longer for surgery
Breadth: High — applies to every patient on the elective waitlist
Reversibility: Low — time spent waiting cannot be given back
Score: HIGH
Test types: Bias/fairness testing; data representativeness testing; explainability testing
Mitigation: Group-level fairness thresholds; clinician override; quarterly fairness re-test
The value of the register is not the score in isolation — it is the link from a named, scored risk to specific test types and a mitigation. When an auditor or a manager asks “why did you fairness-test this model so heavily?”, the answer is R-03, not a hunch. And when budget is tight, the register tells you which risks you are choosing to test lightly, on the record.
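The traceability idea can be sketched in a few lines. This is a minimal illustration, not a prescribed 42119 schema: the `RiskRow` class, the dictionary layout, and the test-case IDs `TC-101` and `TC-102` are hypothetical. The register row reuses the waitlist example, stored under the ID R-03 that the paragraph above refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskRow:
    """One row of an AI test risk register."""
    risk_id: str
    risk_type: str
    component: str
    likelihood: str
    severity: str
    breadth: str
    reversibility: str
    score: str
    test_types: tuple
    mitigation: tuple


# The worked waitlist row, keyed by its risk ID.
register = {
    "R-03": RiskRow(
        risk_id="R-03",
        risk_type="Fairness",
        component="Patient prioritisation score",
        likelihood="Medium",
        severity="High",
        breadth="High",
        reversibility="Low",
        score="HIGH",
        test_types=(
            "Bias/fairness testing",
            "Data representativeness testing",
            "Explainability testing",
        ),
        mitigation=(
            "Group-level fairness thresholds",
            "Clinician override",
            "Quarterly fairness re-test",
        ),
    ),
}

# Hypothetical test cases; each names the risk it is meant to address.
test_cases = [
    {"id": "TC-101", "risk_id": "R-03", "title": "Score parity across rural and urban patients"},
    {"id": "TC-102", "risk_id": "R-09", "title": "Latency under load"},
]


def untraceable(cases, rows):
    """Return IDs of test cases whose risk_id is not in the register."""
    return [case["id"] for case in cases if case["risk_id"] not in rows]


orphans = untraceable(test_cases, register)
print(orphans)
```

A test case that points at a risk ID missing from the register is either testing something nobody scored or has lost its justification. Either way it is worth surfacing before an auditor does.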
7 Mapping Risk Types to Test Types
Each AI risk dimension is addressed by particular test types from the 42119 set. You do not invent tests per system — you read the risk type and reach for the matching tests. This table is the bridge between risk assessment and test design:
| Risk type | What it threatens | Test types that address it |
|---|---|---|
| Fairness | Equal treatment of groups | Bias and fairness testing; data representativeness testing |
| Explainability | A person’s right to know why | Explainability / transparency testing; model testing for interpretability |
| Data | Validity of what the model learned | Data representativeness, provenance, and label-correctness testing |
| Drift | Continued correctness over time | Drift detection; monitoring; scheduled re-evaluation (next lesson) |
| Adversarial | Resistance to deliberate misuse | Adversarial / robustness testing; security testing of inputs |
| Performance (general) | Accuracy and reliability of output | Model testing against held-out and edge-case sets |
The discipline is to score the risk first, then let the mapping choose the tests — not to start from a favourite test type and look for a reason to run it. A high fairness-risk model with no fairness testing is a gap an auditor will find in minutes.
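The "score first, then let the mapping choose" discipline can be sketched as a lookup plus a gap check. The dictionary below transcribes the table above; the function names and the example plan are assumptions made for illustration.

```python
# Risk type -> test types, transcribed from the mapping table.
# The table's combined "Data" entry is split into three separate tests here.
TEST_TYPES_BY_RISK = {
    "fairness": ["bias and fairness testing", "data representativeness testing"],
    "explainability": [
        "explainability / transparency testing",
        "model testing for interpretability",
    ],
    "data": [
        "data representativeness testing",
        "data provenance testing",
        "label-correctness testing",
    ],
    "drift": ["drift detection", "monitoring", "scheduled re-evaluation"],
    "adversarial": ["adversarial / robustness testing", "security testing of inputs"],
    "performance": ["model testing against held-out and edge-case sets"],
}


def required_tests(high_risk_types):
    """Collect the test types the mapping selects, without duplicates."""
    selected = []
    for risk_type in high_risk_types:
        for test in TEST_TYPES_BY_RISK[risk_type]:
            if test not in selected:
                selected.append(test)
    return selected


def coverage_gaps(high_risk_types, planned_tests):
    """Return required test types that are missing from the plan."""
    planned = set(planned_tests)
    return [test for test in required_tests(high_risk_types) if test not in planned]


# Hypothetical plan for a model scored high on fairness and explainability risk.
plan = [
    "model testing against held-out and edge-case sets",
    "explainability / transparency testing",
]
gaps = coverage_gaps(["fairness", "explainability"], plan)
print(gaps)
```

Here the plan started from general model testing and never reached for the fairness tests, which is exactly the gap the paragraph above warns about.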
8 High-Risk vs Low-Risk: What Actually Changes
“Test it more” is not a plan. Risk-based testing changes three concrete things as risk rises: depth, coverage, and documentation.
| What changes | Low-risk system (e.g. film recommender) | High-risk system (e.g. waitlist prioritiser) |
|---|---|---|
| Depth | Sanity checks: sensible output, acceptable latency, no broken responses | Deep fairness, data-quality, explainability, and edge-case testing with measured thresholds |
| Coverage | Common cases and a few obvious edges | Every group the system decides about, rare-but-serious cases, adversarial inputs |
| Documentation | Light — a short note that checks passed | Audit-ready evidence: measurements, queries, snapshot dates, reviewer sign-off, traceability to the register |
Notice that the high-risk column is not just “the same tests, run longer.” It is different test types (fairness, explainability) and a different evidence standard. The recommender does not need an audit trail; the waitlist model is unusable without one. Matching that to the consequence of failure is the whole job.
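To show what "measured thresholds" and "audit-ready evidence" can look like in practice, here is a minimal sketch of one group-level fairness check. Everything specific in it is an assumption: the metric (the gap in adverse-outcome rate between groups), the 0.10 threshold, the group names, and the data are made up, and a real threshold would come from the agreed risk appetite.

```python
from collections import defaultdict


def adverse_rates(records):
    """Return the share of adverse outcomes per group.

    records: iterable of (group, adverse) pairs, where adverse is a bool.
    """
    totals = defaultdict(int)
    adverse = defaultdict(int)
    for group, is_adverse in records:
        totals[group] += 1
        if is_adverse:
            adverse[group] += 1
    return {group: adverse[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Largest difference in adverse-outcome rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical threshold; the real value comes from the agreed risk appetite.
MAX_PARITY_GAP = 0.10

# Made-up evaluation data: 100 decisions per group.
records = (
    [("group_a", True)] * 10 + [("group_a", False)] * 90
    + [("group_b", True)] * 25 + [("group_b", False)] * 75
)

rates = adverse_rates(records)
gap = parity_gap(rates)

# Evidence record: the measurement, the threshold, and the register link.
evidence = {
    "risk_id": "R-03",
    "metric": "adverse-outcome rate gap between groups",
    "rates": rates,
    "gap": round(gap, 4),
    "threshold": MAX_PARITY_GAP,
    "passed": gap <= MAX_PARITY_GAP,
}
print(evidence)
```

The low-risk recommender would stop at "it returned something sensible". The high-risk system records the number, the threshold it was judged against, and the register entry that justified running the check.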
9 The Tester’s Role in Risk Assessment
A common confusion: does the tester decide how much risk the organisation is willing to accept? No. Setting risk appetite — the line between acceptable and unacceptable risk — is a business and governance decision, made by the people accountable for the system. A tester who quietly decides “this is fine” or “this must not ship” has stepped outside their role.
What the tester does is design and run tests that match the agreed risk appetite, and surface the evidence so the accountable people can decide with their eyes open. The tester’s contribution is: identify the risks, score them honestly, map them to test types, run the tests, and report what was found — including the risks the team has chosen to test lightly. The decision to accept a residual risk belongs to the business; the duty to make that risk visible belongs to the tester.
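The split of responsibilities can be reflected in how findings are reported. The sketch below is a hypothetical reporting helper: the row layout, the risk IDs R-07 and R-11, and the wording of the flag are all assumptions. It lists every risk with its score and test depth and marks high risks that were tested lightly, but it produces no release verdict.

```python
# Hypothetical register summary after a test cycle.
rows = [
    {"risk_id": "R-03", "score": "HIGH", "depth": "deep",
     "finding": "fairness gap above threshold"},
    {"risk_id": "R-07", "score": "HIGH", "depth": "light",
     "finding": "no adversarial testing run"},
    {"risk_id": "R-11", "score": "LOW", "depth": "light",
     "finding": "sanity checks passed"},
]

SCORE_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def residual_risk_report(rows):
    """Return report lines, most serious risks first.

    The report states what was tested and how deeply. It deliberately
    contains no ship / no-ship verdict: that belongs to the accountable owner.
    """
    lines = []
    for row in sorted(rows, key=lambda r: SCORE_ORDER[r["score"]]):
        line = f'{row["risk_id"]} [{row["score"]}] tested {row["depth"]}: {row["finding"]}'
        if row["score"] == "HIGH" and row["depth"] == "light":
            line += " (high risk tested lightly; needs an explicit decision)"
        lines.append(line)
    return lines


report = residual_risk_report(rows)
for line in report:
    print(line)
```

The helper makes the lightly tested high risk impossible to miss and then stops. Accepting or rejecting that residual risk stays with the business.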
NZ context. The NZ Algorithm Charter, signed by most major government agencies, commits them to assess the risk of unintended consequences and to be transparent about how algorithms inform decisions: a risk assessment is expected before deployment, not after a complaint. The Public Service Act 2020 requires public decisions to be lawful, reasonable, and fair, which raises the severity factor for any model touching a public decision. As a worked example, take a fictional MSD benefit-fraud-flagging model. Its fairness risk scores HIGH because the model decides about people (severity and breadth both high) and a wrong flag is slow to reverse (low reversibility), so the Charter and the Act together push it firmly into deep, documented testing.
10 Common Mistakes
🚫 Treating every AI system to the same test plan
I used to think… a thorough tester applies the same rigorous process to everything — that consistency is professionalism.
Actually… equal effort across unequal risks is the triage mistake. It over-tests the harmless system and leaves the dangerous one thin. Rigour means putting depth where the consequences of failure are greatest, not spreading it evenly.
🚫 Judging risk by accuracy alone
I used to think… the riskiest model is the least accurate one, so I rank systems by error rate.
Actually… a rarely-wrong model can be the highest risk of all if its mistakes are severe, affect many people, and cannot be reversed. Likelihood is only one of four factors. A model that is wrong often but harmlessly is lower risk than one that is wrong rarely but catastrophically.
🚫 The tester deciding what risk is acceptable
I used to think… if I judge a model too risky, it is my job to block it — or to wave it through if I think it is fine.
Actually… setting risk appetite is a governance decision for the people accountable for the system. The tester’s job is to make the risk visible with honest scoring and evidence, and to test in line with the appetite the business sets — not to set it.
11 Now You Try
Three graded exercises on risk-based AI testing. Write your answer, then compare it to the model answer.
Below is a tester’s risk note for four NZ AI systems. The ranking is wrong. Identify which system is mis-ranked, explain the error using the four risk factors (likelihood, severity, breadth, reversibility), and give the corrected order from highest to lowest testing risk.
1. Music-playlist recommender — “it is wrong most often, so highest risk”
2. AI sentencing-range suggestion tool for a court
3. Te Whatu Ora cardiac-surgery waitlist prioritiser
4. Bank FAQ chatbot
Identify the mis-ranking and give the corrected order with reasons:
Model answer
Mis-ranked system: the music-playlist recommender, ranked #1 (highest risk).
Why it is mis-ranked: the tester judged risk by likelihood alone ("it is wrong most often"). But likelihood is only one of four factors. A wrong playlist suggestion is low severity (mild annoyance), has breadth that may be narrow or broad but trivial impact either way, and is fully reversible (skip the track). High error rate × negligible harm = low risk. The tester fixated on accuracy and ignored severity, breadth, and reversibility.
Corrected order (highest to lowest testing risk):
1. Te Whatu Ora cardiac-surgery waitlist prioritiser: severe harm (delayed life-saving surgery), broad (every cardiac patient), very low reversibility (lost time cannot be returned). Fairness, data, and explainability risk.
2. AI sentencing-range suggestion tool: severe harm (liberty), affects defendants, hard to reverse; very high explainability and fairness risk, but a human judge decides, which slightly lowers the breadth of automated impact.
3. Bank FAQ chatbot: moderate. A wrong answer can mislead a customer, but is usually reversible and rarely life-altering.
4. Music-playlist recommender: frequent but harmless and fully reversible errors. Lowest risk.
The lesson: rank by likelihood × severity × breadth × difficulty of reversal, not by error rate.
The risk register row below is too vague to drive testing. Rewrite it as a complete 42119 row with these fields: Risk ID, Risk type, Component, Likelihood, Severity, Breadth, Reversibility, Score, Test types, Mitigation. Use a fictional MSD benefit-fraud-flagging model as the context.
“Risk: the model might be biased. This is risky. We should test it. Mitigation: be careful.”
Rewrite as a complete 42119 risk register row:
Model answer
Risk ID: R-05
Risk type: Fairness
Component: Fraud-likelihood score applied to benefit applications
Likelihood: Medium; historical investigation data over-represents certain regions and benefit types, so the model may flag some groups more often.
Severity: High; a wrong fraud flag subjects a vulnerable applicant to investigation, payment holds, and stress.
Breadth: High; the model scores every incoming benefit application.
Reversibility: Low; once a person is investigated, the time, stress, and any payment delay cannot be undone, even if cleared.
Score: HIGH (severe, broad, hard to reverse; likelihood medium).
Test types: Bias and fairness testing (group-level flag-rate parity); data representativeness testing; explainability testing so a flagged person can be given a reason.
Mitigation: Group-level fairness thresholds with a fail gate before release; mandatory human review before any flag leads to action; quarterly fairness re-test; documented appeal path.
What makes this 42119-compliant: each factor is reasoned separately, the score follows from the factors (not just "it's risky"), the test types match the fairness risk type, and the mitigation is specific and checkable, unlike "be careful".
Rank these four NZ AI systems from highest to lowest testing risk, and for each pick the top one or two test types it most needs, justified by its strongest risk dimension. Systems: playlist recommender, AI sentencing-range tool, bank FAQ chatbot, Te Whatu Ora cardiac-surgery waitlist prioritiser.
Model answer
Rank 1 (highest risk): Te Whatu Ora cardiac-surgery waitlist prioritiser. Strongest dimension: fairness and data (severe, broad, very low reversibility). Top test types: bias/fairness testing and data representativeness testing, plus explainability so clinicians can see why a patient was scored.
Rank 2: AI sentencing-range tool. Strongest dimension: explainability (a defendant has a right to know the basis) and fairness (equal treatment across groups). Top test types: explainability/transparency testing and fairness testing. Slightly below #1 because a human judge remains the decision-maker.
Rank 3: Bank FAQ chatbot. Strongest dimension: adversarial/performance (people may try to extract wrong or harmful answers). Top test types: adversarial/robustness testing and model testing for response correctness. Harm is usually reversible.
Rank 4 (lowest risk): Playlist recommender. Strongest dimension: none material; errors are frequent but trivial and fully reversible. Top test type: light model testing for sensible output and latency.
Strong answers justify the ranking with the four factors (not error rate) and match each test type to the named risk dimension. The cardiac prioritiser must outrank the recommender even though the recommender is "wrong" more often. That is the whole point.
12 Self-Check
Q1: Why is it a mistake to give a film recommender and a benefits model the same test plan?
Because the consequences of failure differ enormously. A wrong recommendation is mild and reversible; a wrong benefits decision is severe, falls on vulnerable people, and is hard to undo. Equal effort over-tests the harmless system and dangerously under-tests the high-stakes one. Risk-based testing puts depth where the consequences are greatest — the triage rule.
Q2: Name the four factors in the AI risk equation.
Likelihood (how often it is wrong), severity (how bad the harm is), breadth (how many people are affected), and reversibility (how hard the harm is to undo). Risk is roughly the product of all four — not accuracy alone.
Q3: A model is rarely wrong. Can it still be the highest-risk system you test?
Yes. Low likelihood is only one factor. If the rare errors are severe, affect many people, and cannot be reversed — a missed cancer referral, say — the system can be the highest risk of all. Judging risk by error rate alone is the classic mistake.
Q4: What does the risk register link a scored risk to, and why does that matter?
It links each named, scored risk to specific test types and a mitigation. That matters because every serious AI test case can then trace back to a numbered risk — so when someone asks why a model was tested so heavily (or so lightly), the answer is a register entry, not a hunch. It also records, on paper, which risks the team chose to test lightly.
Q5: Does the tester decide what level of risk is acceptable?
No. Setting risk appetite is a governance decision for the people accountable for the system. The tester identifies and scores risks honestly, maps them to test types, runs the tests, and makes the findings — including residual risk — visible. The decision to accept a residual risk belongs to the business; making it visible belongs to the tester.
13 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“We have limited time to test three AI features. How would you decide where to focus?”
I would risk-rank them rather than split time evenly. For each feature I would judge four factors: how often it is likely to be wrong, how severe the harm is when it is, how many people it affects, and how reversible the harm is. The feature scoring highest across those — usually one that makes decisions about people and is hard to undo — gets the deepest, best-documented testing; the lowest-risk feature gets sanity checks. I would record that ranking in a risk register so the decision is visible and defensible, not just my preference.
“A stakeholder says our model is 99% accurate, so it is low risk. How do you respond?”
I would agree that accuracy is good news and then point out it is only one of four risk factors. The question I would raise is what happens in the 1% — how severe is the harm, how many people does it touch, and can it be reversed? A 1% error rate on a model that delays surgery or wrongly flags someone for fraud is very high risk despite the headline number, because those errors are severe and hard to undo. I would want to assess severity, breadth, and reversibility before agreeing on the testing depth.
“If you think a model is too risky to release, do you block it?”
No — that is not the tester’s call. Deciding what risk the organisation will accept is a governance decision for the people accountable for the system. My job is to make the risk impossible to miss: score it honestly, show the evidence, and state plainly what I found and what residual risk remains. I would escalate clearly and make sure the decision-maker is choosing with full information. The accountable owner accepts or rejects the residual risk; I make sure they are not doing it blind.