Audit-Ready Test Artefacts

42119 does not just tell you what to test. It tells you how to document it — so that when a regulator asks “what did you test, and why?”, the answer is in your artefacts, not in someone’s memory.

Test with AI · ISO/IEC TS 42119-2:2025 · Lesson 5 of 6 · ~30 min read · ~70 min with exercises

1 The Hook

Taurangi Wealth, a fictional NZ financial services firm, ran an AI that recommended investments to retail customers. A customer complained to the FMA that the recommendations were unsuitable, and the FMA opened a review. They asked the firm to demonstrate how the system had been tested.

The QA lead was not worried at first. Testing had been done — there were hundreds of test results sitting in Jira, all marked passed. She pulled them up and shared them.

Then the auditor started asking questions. Which AI risks did your test plan address? Fair question — she had no risk register to point to. What test level was each of these? They were just labelled “test 1”, “test 2”. What technique did this one use — was it a fairness test, a performance test, a data test? The Jira tickets did not say. Can you show me the traceability from each test case to a specific risk in the system? There was no traceability, because there were no documented risks. Who reviewed this result, and on what basis did they sign it off? The ticket just said “Pass”.

The testing was real. The documentation was not audit-ready. From the regulator’s seat, a pile of passed tickets with no risk linkage, no test levels, no technique tags, and no decision rationale is almost indistinguishable from no testing at all — because there is no way to verify that what was tested actually covered the risks that mattered.

This lesson is about the half of 42119 that teams forget: the standard specifies not only what to test but how to document it. An audit-ready artefact carries enough metadata for a regulator to understand what was tested, why, against which risk, and what was decided. This lesson shows you exactly what those artefacts look like.

2 The Rule

ISO/IEC 42119 specifies not just what to test but how testing must be documented. An audit-ready test artefact is traceable, risk-tagged, and carries enough metadata for a regulator to understand what was tested, why, and what was found. Testing that happened but cannot be evidenced this way is, to an auditor, almost the same as testing that never happened.

3 The Analogy

A great builder with no producer statements.

Picture a builder who does genuinely excellent work — the framing is true, the bracing is right, the waterproofing is perfect. But they kept no records. No producer statements, no inspection sign-offs, no photos of the bracing before the linings went on, nothing tied back to the building consent. When the council comes to issue the Code Compliance Certificate, “trust me, it’s a good build” is not enough. Without the paper trail, the council cannot certify it, and the house cannot legally be occupied — no matter how good the actual work was.

AI testing is the same. The work can be excellent, but if it is not documented so an auditor can trace each test back to a risk and see who signed off on what, it does not count when it matters. 42119 artefacts are the producer statements of AI testing — the evidence that turns “we tested it” into “here is exactly what we tested, against which risks, and what we decided.”

4 The Mandatory Test Case Fields

A 42119 test case carries metadata a traditional test case does not. These are the fields that let an auditor read a single test case and understand its place in the whole effort:

Field	What it records	Why an auditor needs it
Test ID	Unique identifier.	To reference and cross-link the case.
Risk category	Which AI failure mode it addresses (data / model / fairness / explainability / drift).	To group coverage by the risks that matter.
Test level	Data, model, system, or integration (Lesson 3).	To confirm testing happened at the right layer.
AI technique type	The specific 42119 test type (e.g. demographic parity, drift, label correctness).	To verify the right technique was used for the risk.
Traceability reference	The numbered risk (and/or requirement) this case traces to.	To prove the test addressed a real, identified risk.
Decision rationale	Why this test, at this depth, with this criterion.	To show the test choices were deliberate, not arbitrary.
Audit timestamp	When it was run and by/for which model version.	To tie the result to a specific point in the lifecycle.
Evidence pointer	Where the actual evidence (metrics, tables, logs) lives.	To let the auditor inspect the underlying proof.

Compare a traditional test case — ID, description, steps, expected result, pass/fail — against that list. The traditional case answers “did it pass?” The 42119 case also answers “which risk did this protect against, at which level, using which technique, decided by whom, and where is the proof?” Those extra fields are exactly the questions the Taurangi Wealth auditor asked and the QA lead could not answer.
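
As a minimal sketch of how the fields in the table above could be enforced in tooling, the following Python check lists which of the eight fields are missing from a test case record. The field keys (`test_id`, `risk_category`, and so on) and the function `missing_fields` are illustrative names chosen for this example, not identifiers defined by 42119.

```python
# Hypothetical field keys mirroring the table above; 42119 does not define these names.
REQUIRED_FIELDS = (
    "test_id",
    "risk_category",
    "test_level",
    "technique_type",
    "traceability_ref",
    "decision_rationale",
    "audit_timestamp",
    "evidence_pointer",
)


def missing_fields(test_case: dict) -> list[str]:
    """Return the required fields that are absent or blank in a test case."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = test_case.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# A traditional-style ticket: an ID and a result, nothing else.
traditional = {"test_id": "TC-118", "result": "Pass"}
print(missing_fields(traditional))
```

A check like this could run when a ticket is closed, so a case cannot be marked passed while the metadata an auditor needs is still blank. It only confirms the fields are filled in; whether the content is meaningful still needs a human reviewer.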

5 What 42119 Requires in a Test Plan

The test plan is where coverage is justified. Under 42119 an AI test plan must make explicit:

A risk register reference: the plan points to the AI risk register (Lesson 1) and shows that every significant risk has test coverage.
The AI-specific test types included: which of data, model, fairness, drift, and explainability testing are in scope — and, importantly, which are deliberately out of scope and why.
Coverage rationale: why this set of tests, at these depths, adequately addresses the risks — the link from risk severity to test effort.
Lifecycle phase covered: which phases (design, development, deployment, post-deployment) the plan addresses, so it is clear that, for example, drift testing is scheduled for production and not assumed done at go-live.

The thread running through all of this is the risk register. A 42119 test plan is, in essence, an argument that the testing covers the registered risks adequately — and the artefacts are the evidence backing the argument.
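
The coverage argument can be checked mechanically. The sketch below assumes a simple in-memory risk register and flags any significant risk that has neither a planned test case nor a documented out-of-scope reason. The risk IDs R-12 and R-15, the severity labels, and the rule that "high" and "medium" count as significant are all assumptions made for illustration.

```python
# Illustrative risk register: IDs, severities, and the "significant" rule are assumptions.
risk_register = {
    "R-03": "high",
    "R-08": "medium",
    "R-12": "high",
    "R-15": "low",
}

# Test cases in the plan, each traced to one or more registered risks.
planned_tests = {
    "TC-118-R2": ["R-03", "R-08"],
}

# Risks deliberately left out of scope, each with a stated reason.
out_of_scope = {
    "R-15": "Low severity; accepted by risk owner for this release.",
}


def uncovered_risks(register, tests, exclusions, significant=("high", "medium")):
    """List significant risks with neither a test case nor a documented exclusion."""
    covered = {risk for refs in tests.values() for risk in refs}
    return sorted(
        risk
        for risk, severity in register.items()
        if severity in significant and risk not in covered and risk not in exclusions
    )


print(uncovered_risks(risk_register, planned_tests, out_of_scope))
```

An empty list is the state the plan is arguing for. Anything printed here is a gap to close or to justify explicitly before the plan is signed off.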

6 The Test Summary Report Under 42119

The summary report is what an auditor or executive actually reads. It translates a cycle of AI testing into a verifiable account. A 42119-aligned summary report section covers:

Test scope: what was tested this cycle — which model version, which lifecycle phase, which data.
AI test types executed: the specific 42119 techniques run, not just “testing was done”.
Risks addressed: which numbered risks from the register this cycle covered.
Coverage achieved: what proportion of in-scope risks were tested, and to what depth.
Open risks: what was not covered, or covered with reservations — stated plainly, because hiding open risk is what gets organisations in trouble with regulators.
Sign-off recommendation: a clear, evidenced recommendation — proceed, proceed with mitigations, or do not proceed — with the rationale.

Pro tip: The “open risks” section is the one inexperienced testers want to leave out, and the one that earns an auditor's trust in the rest of the report. A summary that honestly states “drift testing not yet established; recommend go-live conditional on continuous validation being in place within 30 days” is far stronger than one that implies everything is perfect. Regulators are reassured by candour about residual risk, not by a report that claims there is none.

7 Traceability in AI Testing — To Risks, Not Just Requirements

In traditional testing, traceability usually runs test case → requirement: every requirement has a test, every test traces to a requirement. That still applies. But AI testing adds a second, more important thread: test case → risk.

Why the shift matters: many AI failures are not violations of a written requirement. No requirement said “the model must not under-perform for South Auckland” — yet that was the failure. The Revenue Analytics Unit had no requirement that was breached; it had an unmanaged risk. Requirements-only traceability would have shown full coverage while the real exposure went untested.

So 42119 traceability connects each test case to the AI risk register. When the auditor asks “which risks did your testing address?” — the Taurangi Wealth question — risk-based traceability is the answer. You can show, risk by risk, which test cases covered it, at which level, with which technique, and what the result was. Requirements traceability alone cannot answer that question, because the most dangerous AI failures live in the gap between “what we specified” and “what could go wrong.”
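
To make the gap concrete, here is a small sketch that builds both traceability threads from the same set of test cases. All requirement, risk, and test case IDs are made up for illustration, and `trace_matrix` and `coverage` are hypothetical helper names.

```python
# All IDs below are made up for illustration.
requirements = ["REQ-1", "REQ-2"]
risks = ["R-01", "R-02", "R-03"]

# Each test case records both threads of traceability.
test_cases = [
    {"id": "TC-01", "requirements": ["REQ-1"], "risks": ["R-01"]},
    {"id": "TC-02", "requirements": ["REQ-2"], "risks": []},
]


def trace_matrix(cases, key, targets):
    """Map each target (requirement or risk) to the test cases that trace to it."""
    return {t: [c["id"] for c in cases if t in c[key]] for t in targets}


def coverage(matrix):
    """Fraction of targets that have at least one test case."""
    return sum(1 for ids in matrix.values() if ids) / len(matrix)


req_matrix = trace_matrix(test_cases, "requirements", requirements)
risk_matrix = trace_matrix(test_cases, "risks", risks)

print(f"Requirement coverage: {coverage(req_matrix):.0%}")
print(f"Risk coverage: {coverage(risk_matrix):.0%}")
print("Untested risks:", [r for r, ids in risk_matrix.items() if not ids])
```

In this toy data every requirement has a test, yet two of the three registered risks have none. That is the pattern the section describes: the requirements view looks complete while the risk view exposes what was never tested.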

8 Evidence Requirements by Test Type

“Evidence” means different things for different AI test types. What you point the auditor at depends on what you tested:

Test type	What counts as evidence
Data tests	Distribution comparison tables (training vs live), provenance/lineage records, label-accuracy and inter-labeller agreement scores, the query and dataset snapshot used.
Model performance tests	Precision/recall/F1 results against the agreed threshold, broken down by group; the labelled test set ID; the model version.
Adversarial tests	The exact input transformations applied, per-case scores, and pass/fail against the robustness threshold; the script used.
Fairness tests	Per-group outcome tables (demographic parity), the matched counterfactual pairs and their results, the chosen fairness metric and tolerance.
Drift tests	The dated time series of the monitored metric, the intervention threshold, alert records, and the fresh-labelled-data snapshots each run used.
Explainability tests	The sampled decisions, the explanations produced, and the expert/reviewer confirmation that each explanation was accurate and defensible.

The common thread: evidence is the underlying measurement, reproducible and dated — never a sentence asserting that something was fine. “Fairness was checked” is not evidence; the per-group table with the tolerance and the reviewer sign-off is.
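
As one illustration of "the underlying measurement", the sketch below produces a per-group approval-rate table for a demographic parity check and records it alongside the tolerance and run date. The decisions, the group labels, the 0.10 tolerance, and the run date are invented for the example; a real artefact would reference a versioned dataset snapshot and carry the reviewer sign-off as well.

```python
from collections import defaultdict

# Made-up decisions: (group, approved). Real evidence would come from a
# versioned dataset snapshot, not inline literals.
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", True),
]

TOLERANCE = 0.10  # assumed maximum allowed gap in approval rate between groups


def per_group_rates(rows):
    """Approval rate per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, approved in rows:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {g: a / n for g, (a, n) in counts.items()}


rates = per_group_rates(decisions)
gap = max(rates.values()) - min(rates.values())

# The evidence record keeps the measurement, not just the verdict.
evidence = {
    "metric": "demographic parity (approval-rate gap)",
    "run_date": "2026-05-14",  # illustrative date
    "per_group_rates": rates,
    "gap": gap,
    "tolerance": TOLERANCE,
    "within_tolerance": gap <= TOLERANCE,
}
print(evidence)
```

The point of storing the whole record is that an auditor can recompute the verdict from the rates and the tolerance, rather than having to take a "pass" on trust.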

9 The NZ Audit Context

This is not abstract for NZ organisations. Several regulators and frameworks make 42119-style documentation a practical necessity, not a nice-to-have.

FMA (Financial Markets Authority)

Oversees fair conduct in financial services. An AI giving investment, lending, or insurance recommendations sits squarely in scope — the Taurangi Wealth scenario. The FMA can and does ask how systems affecting customer outcomes were tested.

RBNZ (Reserve Bank of New Zealand)

Prudential regulator for banks and insurers. Models affecting credit, capital, or solvency decisions attract scrutiny of their governance and validation — exactly the territory of 42119 model and drift testing with documented evidence.

Public sector — Algorithm Charter & OAG

Government agencies that have signed the Algorithm Charter for Aotearoa New Zealand commit to transparency and review of algorithmic decision-making, and the Office of the Auditor-General can examine how public systems are assured. Audit-ready AI test artefacts are how an agency demonstrates it met those commitments.

The pattern across all three: if an AI system makes or shapes a consequential decision about a person, someone with authority can ask you to show how you tested it — and “the tests passed in Jira” is not an answer. 42119 artefacts are.

10 Common Mistakes

🚫 Believing “the tests passed” is the same as being audit-ready

Why it happens: A wall of green tickets feels like proof.
The fix: Passed tickets with no risk linkage, test levels, technique tags, or decision rationale tell an auditor nothing about whether the right things were tested. Audit-readiness is about traceable, risk-tagged metadata — the Taurangi Wealth lesson.

🚫 Tracing test cases only to requirements, not to risks

Why it happens: Requirements traceability is the habit from traditional testing.
The fix: The most dangerous AI failures break no written requirement — they are unmanaged risks. Trace each test case to the AI risk register as well, or requirements-only traceability will show full coverage while the real exposure is untested.

🚫 Writing an assertion instead of attaching evidence

Why it happens: “Fairness was checked and is fine” is quick to write.
The fix: An assertion is not evidence. The per-group table, the drift time series, the counterfactual pairs, the query and snapshot — the reproducible, dated measurement — is what an auditor can verify. Point to the proof, do not summarise it away.

🚫 Hiding open risks to make the report look clean

Why it happens: A report with no open risks feels safer to present.
The fix: Regulators trust candour about residual risk, not its absence. State open risks plainly with a recommendation (e.g. conditional go-live). A hidden risk that later surfaces is far worse for the organisation than one disclosed and managed.

11 Now You Try

Three graded exercises on building audit-ready artefacts. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Missing Fields

Below is a test case written in a traditional format. List every 42119-required field that is missing, and say why each one matters to an auditor.

ID:               TC-118
Description:      Check the model gives sensible claim decisions
Steps:            Submit 50 sample claims through the model
Expected result:  Most decisions look reasonable
Result:           Pass

List the missing 42119 fields and why each matters:

Model answer

Missing 42119-required fields:

1. Risk category — which AI failure mode (data/model/fairness/explainability/drift) does this address? Without it, the auditor can't group coverage by risk.

2. Test level — is this data, model, system, or integration testing? Confirms testing happened at the right layer.

3. AI technique type — which specific 42119 test type? "Check it gives sensible decisions" names no technique (is it performance? fairness? explainability?).

4. Traceability reference — which numbered risk in the register does this protect against? There's no link to a real risk, so coverage can't be proven.

5. Decision rationale — why these 50 claims, at this depth, with this criterion? The choices look arbitrary.

6. Audit timestamp / model version — when, and against which model version? The result isn't tied to a point in the lifecycle.

7. Evidence pointer — where are the actual 50 decisions and their assessment? "Pass" points to nothing.

What's wrong with the acceptance criterion: "Most decisions look reasonable" is subjective and unmeasurable — it has no threshold, no ground truth, and no definition of "reasonable". A 42119 criterion would be measurable (e.g. "decisions match the adjudicated ground-truth label in at least 90% of the 50 cases, with precision and recall reported by claimant group"). As written, two testers could reach opposite conclusions and both be "right".

🔧 Exercise 2 of 3 — Make It Audit-Ready

Rewrite the test case from Exercise 1 as a fully 42119-compliant artefact with all required fields, using realistic values for a fictional NZ insurance claim-decision AI.

Write the complete artefact:

Model answer

Test ID: TC-118-R2

Risk category: Model — performance (with fairness breakdown)

Test level: Model

AI technique type: Model performance testing (precision/recall against ground truth, segmented by claimant group)

Description: Submit the 500-claim adjudicated benchmark set through claim-decision model v2.3 and compare the model's accept/decline/refer decisions against the human-adjudicated ground-truth labels, overall and broken down by region and claim type.

Acceptance criteria: Overall recall for "should refer" claims ≥ 0.85; precision for "decline" ≥ 0.90; and no region's recall more than 8 percentage points below the overall figure. Any breach is a fail with the affected segment listed.

Traceability reference: AI risk register risk R-03 (model wrongly auto-declines valid claims) and R-08 (decision quality varies by region).

Decision rationale: The 500-claim benchmark is the adjudicated reference set; recall on "should refer" is prioritised because a missed referral means a valid claim is wrongly auto-declined — the highest-harm error for a claimant. Regional breakdown included because R-08 flags geographic disparity risk.

Audit timestamp / model version: Run 2026-05-14, model v2.3, benchmark set BM-CLAIM-500 (snapshot 2026-05-01).

Evidence pointer: /qa-evidence/TC-118-R2/ — confusion matrix, per-segment precision/recall table, benchmark set manifest, reviewer sign-off (J. Patel, 2026-05-15).

Result: [Pass / Fail] with any breaching segments listed.

What makes it audit-ready: every mandatory field present, a measurable criterion tied to claimant harm, traceability to two numbered risks, a rationale for the choices, a dated model version, and a pointer to reproducible evidence — versus the original's "most decisions look reasonable / Pass".
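
A measurable criterion like the one in this artefact can be evaluated by a script, which is part of what makes the result reproducible. The following is a sketch only: the three thresholds come from the acceptance criteria above, but the row format, the region names, the sample data, and the helper names `recall`, `precision`, and `evaluate` are assumptions for illustration.

```python
def recall(rows, label):
    """Share of cases truly `label` that the model also labelled `label`."""
    relevant = [r for r in rows if r["truth"] == label]
    if not relevant:
        return None
    return sum(r["predicted"] == label for r in relevant) / len(relevant)


def precision(rows, label):
    """Share of cases the model labelled `label` that truly are `label`."""
    predicted = [r for r in rows if r["predicted"] == label]
    if not predicted:
        return None
    return sum(r["truth"] == label for r in predicted) / len(predicted)


def evaluate(rows, min_refer_recall=0.85, min_decline_precision=0.90, max_region_gap=0.08):
    """Apply the three acceptance criteria and list any breaches."""
    overall = recall(rows, "refer")
    decline_precision = precision(rows, "decline")
    breaches = []
    if overall is None or overall < min_refer_recall:
        breaches.append("overall refer recall")
    if decline_precision is None or decline_precision < min_decline_precision:
        breaches.append("decline precision")
    for region in sorted({r["region"] for r in rows}):
        regional = recall([r for r in rows if r["region"] == region], "refer")
        if overall is not None and regional is not None and overall - regional > max_region_gap:
            breaches.append(f"refer recall in {region}")
    return {
        "refer_recall": overall,
        "decline_precision": decline_precision,
        "breaches": breaches,
        "result": "Fail" if breaches else "Pass",
    }


def row(truth, predicted, region):
    return {"truth": truth, "predicted": predicted, "region": region}


# Made-up results: both overall figures meet their thresholds, but one region lags.
rows = (
    [row("refer", "refer", "North")] * 10
    + [row("refer", "refer", "South")] * 8
    + [row("refer", "decline", "South")] * 2
    + [row("decline", "decline", "North")] * 20
)

report = evaluate(rows)
print(report["result"], report["breaches"])
```

In this sample the overall recall and precision both meet their thresholds, yet the run still fails because one region falls more than 8 percentage points below the overall recall. That is the kind of segment-level breach a single headline number would hide.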

🏗️ Exercise 3 of 3 — Write a Test Summary Report Section

Write a one-page 42119-aligned test summary report section for a data quality testing cycle on a fictional MSD benefits eligibility AI. Cover: test scope, AI test types executed, risks addressed, coverage achieved, open risks, and sign-off recommendation.

Model answer

TEST SUMMARY REPORT — Data Quality Testing Cycle
System: MSD Benefits Eligibility AI v1.4 | Cycle: 2026-05 | Author: [QA Lead]

1. Test scope: Data quality testing of the training and evaluation datasets for model v1.4, prior to UAT. Covers the applicant dataset snapshot DS-2026-04 (320,000 records). Lifecycle phase: development. Out of scope this cycle: model performance, fairness, and drift testing (scheduled separately).

2. AI test types executed: Data representativeness testing (coverage by age band, region, benefit type); data provenance testing (lineage and Privacy Act lawful-basis verification); label correctness testing (sample re-check and inter-labeller agreement on 1,000 records).

3. Risks addressed: R-04 (under-represented applicant groups receive less reliable decisions); R-06 (training data lacks lawful basis for AI use); R-11 (eligibility labels inconsistent across assessors).

4. Coverage achieved: Representativeness — all age bands and benefit types within ±3pp tolerance; 2 of 16 regions outside tolerance (rural South Island under-represented). Provenance — lawful basis confirmed for all fields; lineage documented. Label correctness — inter-labeller agreement 0.81; 4% of sampled labels corrected.

5. Open risks: (a) Two rural regions remain under-represented (R-04 partially open) — fairness impact to be confirmed in fairness testing cycle. (b) Inter-labeller agreement of 0.81 is acceptable but not strong; recommend a labelling-guideline review before the next training round.

6. Sign-off recommendation: PROCEED TO UAT WITH CONDITIONS — data is fit for development-stage use, conditional on (i) the two under-represented regions being flagged as known limitations into fairness testing, and (ii) a labelling-guideline review scheduled before the next retrain. Do not proceed to production sign-off until fairness and drift cycles are complete.

What makes it strong: it names specific test types and numbered risks, gives measured coverage (not "we tested the data"), states open risks honestly, and gives a clear conditional recommendation rather than a bare "pass". The candour in section 5 is what an auditor trusts.
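
The "within ±3pp tolerance" finding in section 4 of the report implies a comparison between each segment's share of the dataset and some reference share. The sketch below assumes the reference is the share of each region in the applicant population, which the report does not state. The region names and percentages are invented for illustration.

```python
TOLERANCE_PP = 3.0  # plus or minus 3 percentage points, as in the report above

# Made-up shares (percent of records) for illustration only.
reference_share = {"Region A": 40.0, "Region B": 35.0, "Region C": 25.0}
dataset_share = {"Region A": 44.0, "Region B": 36.0, "Region C": 20.0}


def outside_tolerance(reference, dataset, tolerance_pp):
    """Return segments whose dataset share differs from the reference by more than the tolerance."""
    flagged = {}
    for segment, expected in reference.items():
        diff = dataset.get(segment, 0.0) - expected
        if abs(diff) > tolerance_pp:
            flagged[segment] = diff
    return flagged


flagged = outside_tolerance(reference_share, dataset_share, TOLERANCE_PP)
for segment, diff in flagged.items():
    print(f"{segment}: {diff:+.1f}pp vs reference")
```

The signed difference matters for the report: a negative value marks an under-represented segment, which is the case that feeds an open risk such as R-04.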

12 Self-Check

Q1: Why is a wall of passed Jira tickets not audit-ready under 42119?

Because it carries no risk linkage, no test levels, no technique tags, no decision rationale, and no evidence pointers. The auditor cannot verify that the right things were tested against the risks that mattered — the Taurangi Wealth failure. Audit-readiness is about traceable, risk-tagged metadata, not pass counts.

Q2: Name four of the mandatory fields a 42119 test case carries beyond a traditional one.

Any four of: Risk category, Test level, AI technique type, Traceability reference, Decision rationale, Audit timestamp, Evidence pointer. These answer “which risk, at which level, using which technique, decided by whom, with proof where?” — not just “did it pass?”

Q3: Why does AI testing trace to risks, not just requirements?

Because the most dangerous AI failures break no written requirement — they are unmanaged risks (the Revenue Analytics Unit had no breached requirement, just an untested risk). Requirements-only traceability shows full coverage while the real exposure goes untested. Risk-based traceability answers “which risks did your testing address?”

Q4: What counts as evidence for a fairness test versus a drift test?

Fairness: per-group outcome tables, the matched counterfactual pairs and results, the chosen metric and tolerance. Drift: the dated time series of the monitored metric, the intervention threshold, alert records, and the fresh-labelled-data snapshots. In both cases evidence is the reproducible, dated measurement — not an assertion that it was fine.

Q5: Why should a test summary report state open risks plainly rather than hide them?

Because regulators trust candour about residual risk, not its absence. A clear statement like “drift testing not yet established — conditional go-live recommended” is stronger than implying perfection. A hidden risk that later surfaces is far worse for the organisation than one disclosed and managed.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“A regulator asks how our AI system was tested. What do you need to be able to show them?”

Risk-based traceability, basically. For each significant AI risk in our register, I’d want to show which test cases covered it, at which test level, using which technique, when they ran and against which model version, and a pointer to the actual evidence — the per-group tables, drift series, or provenance records. Plus a test summary report that states coverage achieved and any open risks honestly. “The tests passed in Jira” doesn’t survive that conversation, because passed tickets with no risk linkage or technique tags don’t tell the regulator whether we tested the things that actually mattered.

“What’s the difference between requirements traceability and risk-based traceability, and why does AI need both?”

Requirements traceability links each test to a written requirement — that still matters. But many AI failures don’t break any requirement; they’re unmanaged risks. No requirement said “the model mustn’t under-perform for one region,” yet that’s a real failure. If I only trace to requirements, I can show 100% coverage while the most dangerous exposure is completely untested. So in AI work I trace each test case to the risk register as well, which is also exactly what an auditor asks for: which risks did your testing address?

“Our test report has no open risks listed — isn’t that a good thing?”

Usually it’s a red flag, not a good sign. Real AI testing almost always leaves some residual risk — drift monitoring still being set up, a group that’s under-represented, a metric that’s acceptable but not strong. A report with zero open risks usually means they weren’t looked for or weren’t disclosed. I’d rather state them plainly with a recommendation — like conditional go-live pending continuous validation — because regulators trust candour about residual risk. A hidden risk that surfaces later does far more damage than one we disclosed and managed.
