Test with AI · ISO/IEC 42119

Data Quality Testing

In an AI system, the data is the product. A model trained on flawed data is a flawed system — no matter how clean the code. Data quality testing is not a pre-condition for AI testing. It is AI testing.

Test with AI · ISO/IEC TS 42119-2:2025 · Lesson 2 of 6 · ~30 min read · ~70 min with exercises

1 The Hook

Awa Mutual, a fictional NZ health insurer, built an AI to triage incoming claims — sorting them into fast-track, standard, and manual-review queues so urgent claims got seen first. They trained it on five years of historical claim data. In testing it looked excellent: high accuracy on the held-out test set, fast, consistent. It went live.

Within a few months, complaints surfaced. Claims from Pasifika customers were being routed to the slow manual-review queue at a noticeably higher rate than claims from other customers — the same kinds of claims, treated differently. People who needed a fast decision were waiting longest.

The model had no bug. The investigation traced the problem to the training data. Historically, Awa Mutual’s book had been concentrated in central and eastern Auckland suburbs. Claims from South Auckland and Manukau — where much of the region’s Pasifika population lives — were under-represented in the five years of history the model learned from. The model had simply seen fewer examples of these claims, so it was less confident classifying them, so it dumped them into manual review. The data taught it to.

Here is the lesson hidden in that story: the team tested the model and never tested the data. They checked that the model performed well on a test set — but that test set was drawn from the same skewed history, so it carried the same blind spot. A model and its test set can agree perfectly and both be wrong, because they inherited the same flawed data.

Under ISO/IEC 42119, the data is a first-class test target. You test it for representativeness, for provenance, and for label correctness — before and alongside testing the model. That is what this lesson teaches.

2 The Rule

In an AI system the data is the product, so untested data is an untested system. Testing the model on data drawn from the same flawed source will not save you — the model and its test set inherit the same blind spots. Data quality testing is not a setup step before the real testing. It is the real testing.

3 The Analogy

A MasterChef NZ contestant given off ingredients.

Put the best cook in the country in the kitchen and hand them spoiled fish, bruised produce, and salt mislabelled as sugar, and they will plate up a bad dish. Not because they cannot cook — because you cannot cook your way out of bad ingredients. The model is the contestant; the training data is the ingredients. A brilliant model architecture trained on unrepresentative, mislabelled, or dubious data produces a confidently bad system.

And here is the part that catches teams out: if you taste the dish using a fork made from the same spoiled batch — if your test set comes from the same flawed data — everything seems fine. Data quality testing is checking the ingredients before they go in the pot, not just tasting the dish at the end.

4 The Three Data Test Types

ISO/IEC 42119 defines three distinct kinds of data testing. They answer three different questions, and a serious AI test plan covers all three.

Data representativeness testing

The question: does the training and test data reflect the full population the system will actually serve? This is the Awa Mutual failure. Representativeness testing checks that every group, condition, and scenario the live system will encounter is present in the data in sufficient quantity — across geography, demographics, edge conditions, and rare-but-important cases. A dataset can be enormous and still unrepresentative if it over-samples one slice of reality.

Data provenance testing

The question: where did this data come from, and are we allowed to use it this way? Provenance testing checks the origin, lineage, and permitted use of every dataset feeding the model. Was it collected lawfully? Does its consent basis cover training an AI? Has it been altered since collection, and by whom? For NZ systems this is where the Privacy Act 2020 lives — data collected for one purpose cannot simply be repurposed to train a model without a lawful basis. Provenance is also a security concern: data of unknown origin can be poisoned.

NZ Regulatory Checkpoint — Privacy Act 2020 & OPC AI Guidance

In New Zealand, data quality testing for AI systems must satisfy the Privacy Act 2020 alongside 42119. Three information privacy principles bear directly on training data:

IPP 2 (Source of personal information): personal information in training sets should have been collected directly from the individual concerned unless an exception applies. Scraped or third-party data often fails this test.
IPP 8 (Accuracy of personal information): before using personal information, agencies must take reasonable steps to ensure it is accurate, up to date, complete, relevant, and not misleading. Stale training data is a Privacy Act risk, not just a quality risk.
IPP 10 (Limits on use): personal information collected for one purpose may not be used for another unless an IPP 10 exception applies, such as a directly related purpose or the individual's authorisation. An NZ insurer cannot simply retrain a fraud model on health data collected for claims without establishing one of those grounds.

The Office of the Privacy Commissioner has published an AI guidance document specifically addressing generative AI and automated decision-making under the 2020 Act.

Label correctness testing

The question: are the “right answers” the model learned from actually right? Supervised models learn from labelled examples — this claim was fraud, that one was not; this document is a succession application, that one is a rates query. If the labels are wrong, the model faithfully learns the wrong thing. Label correctness testing samples the labels and checks them against ground truth, looks for inconsistency between labellers, and flags systematic labelling errors. Garbage labels produce a garbage model that tests perfectly against its own garbage.

Pro tip: The three map cleanly to three questions you can ask in any AI project review: “Does the data cover everyone we serve?” (representativeness), “Are we allowed to use it?” (provenance), and “Are the answers it learned from correct?” (label correctness). If a team cannot answer all three, the data has not been tested.

5 What to Test For in Each Type

Representativeness — what to check

Demographic coverage: for a Revenue NZ income-assessment model, are all income bands, employment types (salary, wage, self-employed, contractor), and regions present in realistic proportions?
Geographic coverage: the Awa Mutual case — is every region and urban/rural split represented, not just the head-office catchment?
Edge and rare cases: for a Toka Tū Ake EQC damage model, are rare-but-critical events (major liquefaction, multi-storey damage) present, or only common minor claims?
Temporal coverage: does the data span enough time to capture seasonal and cyclical patterns, or only one unusual year?

Provenance — what to check

Source and lineage: for a Fern Bank credit model, can every field be traced to a documented source system, with a record of any transformations applied?
Lawful basis and consent: does the consent under which the data was collected cover its use for training an AI model under the Privacy Act 2020?
Data residency: for government and health data, was it stored and processed within the jurisdictions the contract requires?
Integrity: is there evidence the data has not been tampered with or silently corrupted between collection and training?
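
The integrity check above can be sketched as a row-count and checksum reconciliation between the source extract and the training set. This is a minimal illustration in Python, assuming both datasets are small CSV extracts that fit in memory. The function names `fingerprint` and `reconcile` and the sample claim rows are hypothetical, not part of 42119 or any Fern Bank system.

```python
import csv
import hashlib
import io


def fingerprint(csv_text):
    """Return (row_count, sha256 hex digest) for a CSV extract, ignoring row order."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("extract is empty")
    header, body = rows[0], sorted(rows[1:])
    digest = hashlib.sha256()
    for row in [header] + body:
        # The unit separator avoids ambiguity between ["a,b"] and ["a", "b"].
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\n")
    return len(body), digest.hexdigest()


def reconcile(source_csv, training_csv):
    """Compare a source extract with a training set on row count and checksum."""
    src_rows, src_hash = fingerprint(source_csv)
    trn_rows, trn_hash = fingerprint(training_csv)
    return {
        "source_rows": src_rows,
        "training_rows": trn_rows,
        "row_count_match": src_rows == trn_rows,
        "checksum_match": src_hash == trn_hash,
    }


# Hypothetical extracts, invented for illustration only.
source_extract = "claim_id,amount\n1001,250.00\n1002,80.50\n1003,1200.00\n"
training_set = "claim_id,amount\n1003,1200.00\n1001,250.00\n1002,80.50\n"
tampered_set = "claim_id,amount\n1001,250.00\n1002,80.50\n1003,120.00\n"

print(reconcile(source_extract, training_set))
print(reconcile(source_extract, tampered_set))
```

Sorting the rows before hashing makes the check insensitive to row order, which suits pipelines that shuffle data. If row order is itself meaningful, the sort should be dropped. The tampered example keeps the same row count, which is why a row count alone is not enough.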

Label correctness — what to check

Accuracy against ground truth: sample labels and re-check them — for a fraud model, were the “fraud” labels confirmed fraud, or just suspected?
Inter-labeller agreement: when two people labelled the same item, did they agree? Low agreement means the labelling task itself is ambiguous.
Systematic bias in labelling: were certain groups or case types labelled more harshly or more leniently by the humans who created the training data?
Class balance: are there enough examples of each label? A fraud dataset that is 99.9% “not fraud” needs careful handling or the model learns to always say “not fraud” and scores 99.9% accuracy while catching nothing.
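
The class-balance trap is easy to reproduce with arithmetic. The sketch below uses a hypothetical dataset of 10,000 returns with 10 fraud cases and a stand-in "model" that always answers "not fraud". The data and helper names are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive="fraud"):
    """Fraction of true positive-class cases that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("no positive cases in y_true")
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical dataset: 10,000 returns, 10 of them fraud (0.1%).
labels = ["fraud"] * 10 + ["not fraud"] * 9990
# A stand-in "model" that has learned to always answer "not fraud".
predictions = ["not fraud"] * len(labels)

print(f"accuracy: {accuracy(labels, predictions):.1%}")
print(f"fraud recall: {recall(labels, predictions):.1%}")
```

Accuracy rewards the majority class, so a per-class measure such as recall on the "fraud" label is the number that exposes the problem.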

6 Data Quality Dimensions and the Failure Modes They Cause

42119 draws on a set of data quality dimensions. You do not need to memorise a taxonomy — you need to know how each dimension, when it fails, shows up as an AI failure mode.

Dimension | Question | What goes wrong when it fails | AI failure mode
Completeness | Are values missing? | Missing data for a group means the model guesses for that group. | Data quality / fairness failure
Accuracy | Are the values correct? | Wrong inputs or wrong labels teach the model wrong behaviour. | Model performance failure
Consistency | Do values agree across sources? | The same customer recorded two ways confuses the model. | Model performance failure
Timeliness | Is the data current? | Stale training data is the seed of drift: the model learns a world that no longer exists. | Drift
Representativeness | Does it reflect the real population? | Over- or under-sampled groups get unequal treatment. | Bias / fairness failure

This table is the bridge between Lesson 1 and this one: every data quality dimension that fails surfaces as one of the five AI failure modes. Timeliness in particular is worth holding on to — it is where data quality testing (this lesson) and drift testing (Lesson 3) meet.
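
The completeness dimension is a good example of why a single headline number can mislead: an overall missing-value rate can look small while the gaps are concentrated in one group. The sketch below computes the missing rate per group. It assumes records are plain dictionaries, and the field names and values are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_field, value_field):
    """Return {group: fraction of that group's records where value_field is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_field]
        totals[group] += 1
        # Treat None and empty string as missing; a legitimate 0 is not missing.
        if record.get(value_field) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records, invented for illustration only.
records = [
    {"employment_type": "salary", "prior_year_income": 61000},
    {"employment_type": "salary", "prior_year_income": 58000},
    {"employment_type": "salary", "prior_year_income": None},
    {"employment_type": "salary", "prior_year_income": 72000},
    {"employment_type": "self-employed", "prior_year_income": None},
    {"employment_type": "self-employed", "prior_year_income": 45000},
]

for group, rate in missing_rate_by_group(
    records, "employment_type", "prior_year_income"
).items():
    print(f"{group}: {rate:.0%} missing")
```

What counts as an acceptable per-group rate is a project decision that belongs in the test case's acceptance criteria. The code only produces the measurement.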

7 Building Data Test Cases

A data test case looks different from a functional test case. There are no UI steps and no “click submit.” The system under test is a dataset, the acceptance criterion is a statistical or documentary threshold, and the evidence is a measurement, not a screenshot.

Here is a 42119-aligned data representativeness test case for the Awa Mutual triage model:

Test ID:             DQ-REP-014
Risk category:       Data — representativeness (fairness risk)
Data test type:      Representativeness
Description:         Verify the training dataset represents claimants from all Auckland
                     territorial areas in proportion to the live claim population, with
                     particular attention to South Auckland and Manukau.
Acceptance criteria: For each territorial area, training-data claim share is within
                     ±3 percentage points of that area’s share of live claims (rolling 12 months).
Evidence required:   Distribution comparison table (training vs live) by area;
                     query used to produce it; date of the live-population snapshot.
Traceability:        Risk R-07 (under-served regions misrouted) in AI risk register.
Result:              [Pass / Fail] — areas outside tolerance listed.

Notice the shape: the acceptance criterion is a tolerance band, not an exact value (Lesson 1’s probabilistic principle); the evidence is a reproducible measurement with the query and snapshot date recorded; and the case traces back to a numbered risk in the register. Those three properties are what make it a 42119 data test case rather than an informal data check.
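
The tolerance-band criterion in DQ-REP-014 can be evaluated mechanically. The sketch below is one possible way to do it in Python, assuming claim counts per territorial area are already available as dictionaries. The function names, area names, and counts are hypothetical and are not Awa Mutual data or anything prescribed by 42119.

```python
def shares(counts):
    """Convert raw counts per group into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to compute shares from")
    return {group: 100.0 * n / total for group, n in counts.items()}


def representativeness_check(training_counts, live_counts, tolerance_pp=3.0):
    """Return (group, training %, live %, gap) for every group whose training
    share is more than tolerance_pp percentage points from its live share."""
    train = shares(training_counts)
    live = shares(live_counts)
    failures = []
    # Use the union of groups so one missing from either side is still compared.
    for group in sorted(set(train) | set(live)):
        train_pct = train.get(group, 0.0)
        live_pct = live.get(group, 0.0)
        gap = train_pct - live_pct
        if abs(gap) > tolerance_pp:
            failures.append(
                (group, round(train_pct, 1), round(live_pct, 1), round(gap, 1))
            )
    return failures


# Hypothetical counts, invented for illustration only.
training = {"Central": 5200, "East": 3100, "South": 900, "Manukau": 800}
live = {"Central": 3800, "East": 2600, "South": 1900, "Manukau": 1700}

failures = representativeness_check(training, live)
print("PASS" if not failures else "FAIL")
for group, train_pct, live_pct, gap in failures:
    print(f"  {group}: training {train_pct}% vs live {live_pct}% (gap {gap:+} pp)")
```

An empty list means every area is within tolerance. The list of failures is the "areas outside tolerance listed" part of the Result field, and the counts and snapshot date that fed the function are the evidence to keep with it.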

8 Audit-Ready Evidence for Data Testing

42119 expects data testing to leave evidence a third party can inspect. “We checked the data” is not evidence. For data testing specifically, audit-ready documentation includes:

The measurement, not the assertion: the actual distribution tables, agreement scores, or provenance records — not a sentence saying they were fine.
Reproducibility: the query, script, or method used, and the dataset snapshot it ran against, so the measurement can be re-run and confirmed.
Traceability to a risk: which numbered AI risk this data test addresses (the register from Lesson 1).
A dated decision: who reviewed the result, when, and what they decided — pass, fail, or accept-with-mitigation.

This is a preview of Lesson 5, which covers audit-ready artefacts across all AI test types. The point to take now: data testing without recorded evidence is invisible to an auditor, and “the model was accurate in testing” is not an answer to “show me that your training data represented South Auckland.”
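
One way to make the four evidence properties hard to skip is to record each data test result as a structured record that refuses empty fields. The sketch below is an assumption about how a team might do this, not a format defined by 42119. The class name, field names, file path, snapshot identifier, and reviewer are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DataTestEvidence:
    test_id: str
    risk_id: str           # numbered risk in the AI risk register
    measurement: dict      # the actual numbers, not a summary sentence
    method: str            # query or script used to produce the measurement
    dataset_snapshot: str  # snapshot the method ran against
    reviewer: str
    reviewed_on: str       # ISO 8601 date
    decision: str          # "pass", "fail", or "accept-with-mitigation"

    def __post_init__(self):
        allowed = {"pass", "fail", "accept-with-mitigation"}
        if self.decision not in allowed:
            raise ValueError(f"decision must be one of {sorted(allowed)}")
        # Reject empty fields so an assertion without a measurement cannot be filed.
        for name, value in asdict(self).items():
            if value in ("", None, {}):
                raise ValueError(f"{name} is required")


# Hypothetical record, invented for illustration only.
record = DataTestEvidence(
    test_id="DQ-REP-014",
    risk_id="R-07",
    measurement={"South": {"training_pct": 9.0, "live_pct": 19.0, "gap_pp": -10.0}},
    method="sql/dq_rep_014.sql",
    dataset_snapshot="live-claims-2025-06-30",
    reviewer="A. Reviewer",
    reviewed_on="2025-07-02",
    decision="fail",
)
print(json.dumps(asdict(record), indent=2))
```

Serialising to JSON keeps the record diffable and easy to store next to the query it references.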

Pro tip: The single highest-value data test for most NZ systems is a representativeness comparison of the training data against a recent snapshot of the live population, broken down by the groups the system makes decisions about. It is cheap to run, it directly addresses the most common and most damaging data failure, and it produces exactly the evidence a regulator asks for.

Senior engineer insight

The most dangerous moment in an AI project is when the model test results come back clean and everyone exhales. In every project I have worked on where data was the real problem, the model tests looked fine right up until production — because the test set came from the same warehouse as the training data and inherited its exact blind spots. The shift that changed how I approach this work: I now treat the data audit as a separate deliverable from the model test plan, with its own sign-off, before any model testing begins.

The most common mistake: teams run a train/test split, see good accuracy, and call that “data validation” — it is model validation on a sample of the same data, which is a completely different thing.

9 Common Mistakes

🚫 Testing the model on a test set drawn from the same flawed data

Why it happens: Splitting one historical dataset into train and test is the standard workflow, and it feels rigorous.
The fix: If the source data is unrepresentative, the test set inherits the same blind spot and the model scores well against its own gap — exactly the Awa Mutual trap. Test the data for representativeness independently, against the real live population, not just the model against a held-out slice.

🚫 Treating “more data” as the same thing as “representative data”

Why it happens: Bigger datasets intuitively feel safer and teams chase volume.
The fix: A million records that all come from one region are still unrepresentative. Volume does not fix skew. Representativeness testing measures coverage of the groups that matter, not row count.

🚫 Assuming the labels are correct because they came from an expert

Why it happens: Labels often come from caseworkers or specialists, so they feel authoritative.
The fix: Experts disagree, get tired, and carry their own biases into labels. Sample and re-check labels against ground truth, and measure inter-labeller agreement. The model can only ever be as right as the labels it learned from.
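
Inter-labeller agreement is usually reported with a chance-corrected statistic rather than raw percentage agreement. The sketch below uses Cohen's kappa, a standard choice for two labellers, chosen here for illustration rather than because the lesson prescribes it. The label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers rating the same items.
    1.0 is perfect agreement; 0.0 is the agreement chance alone would produce."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each labeller assigned labels independently
    # at their own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical double-labelled sample of 10 cases, invented for illustration.
labeller_1 = ["fraud"] * 4 + ["not fraud"] * 6
labeller_2 = ["fraud"] * 3 + ["not fraud"] * 6 + ["fraud"]

raw = sum(a == b for a, b in zip(labeller_1, labeller_2)) / len(labeller_1)
print(f"raw agreement: {raw:.0%}")
print(f"Cohen's kappa: {cohens_kappa(labeller_1, labeller_2):.2f}")
```

Raw agreement overstates consistency when one label dominates, because two labellers who both say "not fraud" most of the time agree often by chance. The kappa value that counts as acceptable is a project decision to set in the test case's acceptance criteria.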

🚫 Ignoring provenance because “the data was already in our warehouse”

Why it happens: Data that is already on hand feels free of legal questions.
The fix: Data collected for one purpose may not be lawfully usable to train an AI under the Privacy Act 2020. Provenance testing confirms the lawful basis and lineage before training — not after a complaint.

From the field

A NZ public health agency built a model to prioritise outreach for a chronic disease management programme, trained on several years of PHO enrolment and consultation records. The team assumed the data was solid: it came from a mature clinical system, it was large, and it had been used for reporting for years. Provenance testing found that a third of the records had been migrated from a legacy system without a documented field-mapping. Derived fields like “last contact date” were therefore calculated differently for pre- and post-migration records, and the model had quietly learned two different definitions of recency. The fix was straightforward once found (re-derive the field consistently and retrain), but the lesson generalises: data that is old enough to be “trusted” is often old enough to have accumulated undocumented transformations that only surface when you trace lineage end to end.

10 Now You Try

Three graded exercises across the three data test types. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Data Risks

Read the description of a training dataset for a fictional Revenue NZ fraud detection AI below. Identify 3 data quality risks that could cause the model to produce biased or unreliable outputs, and name the data test type that addresses each.

Dataset: Revenue NZ return-fraud detection model
The model flags income tax returns for investigation. It was trained on 80,000 returns from the 2021–2022 tax year that were manually reviewed by investigators. “Fraud” labels were applied by whichever investigator handled the case; there was no second reviewer. The dataset is drawn almost entirely from salary-and-wage earners, because that is where the review team focused that year — self-employed and contractor returns make up under 2% of the data. About 6% of records are missing the “prior-year income” field. The data was exported from the case management system; no one recorded which fields were derived or transformed during export.

List 3 data quality risks and the test type for each:

Model answer

There are at least five real risks in this dataset; any three, well explained, earn full marks.

1. Representativeness — Self-employed and contractor returns are under 2% of the data, but they are a real and higher-risk part of the live population. The model will be unreliable on exactly the returns where fraud is often more complex. Test type: data representativeness testing.

2. Label correctness — Fraud labels were applied by a single investigator with no second reviewer. There is no inter-labeller agreement check, so labelling bias and inconsistency are baked in. Test type: label correctness testing.

3. Provenance — The data was exported with no record of which fields were derived or transformed. Lineage is unknown, so you cannot trust that fields mean what you think, or reproduce the dataset. Test type: data provenance testing.

Bonus risks: Completeness — 6% missing prior-year income; the model will guess or drop these records. Timeliness/representativeness — a single tax year (2021–22) may not represent normal patterns and seeds drift.

The trap: a tester focused only on the model would never see any of this, because the held-out test set is drawn from the same 80,000 salary-and-wage records and would look fine.

🔧 Exercise 2 of 3 — Fix the Test Case

The data representativeness test case below is too vague to be useful or auditable. Rewrite it to be 42119-compliant, with these fields: Test ID, Risk category, Data test type, Description, Acceptance criteria, Evidence required, Traceability. Use a fictional Benefits NZ benefits eligibility AI as the context.

Original (too vague):
“Check that the training data is representative and not biased. Make sure all groups are included. Pass if the data looks balanced.”

Rewrite as a complete 42119 data test case:

Model answer

Test ID: DQ-REP-021

Risk category: Data — representativeness (fairness risk)

Data test type: Representativeness

Description: Verify the training dataset represents benefit applicants across all age bands, regions, and benefit types in proportion to the live applicant population, with specific attention to under-25 applicants and rural regions, which are suspected of being under-represented.

Acceptance criteria: For each age band (under-25, 25–39, 40–54, 55–64, 65+), each region, and each benefit type, the training-data share is within ±3 percentage points of that group's share of live applications over the most recent rolling 12 months. Any group outside tolerance is a fail and must be listed.

Evidence required: Distribution comparison table (training data vs live population) for each dimension; the SQL/query used to generate it; the date and source of the live-population snapshot; reviewer sign-off.

Traceability: AI risk register risk R-04 (under-represented applicant groups receive less reliable eligibility decisions).

What makes this 42119-compliant: a measurable tolerance instead of "looks balanced", named groups instead of "all groups", reproducible evidence instead of an assertion, and an explicit link to a numbered risk. The original had none of these.

🏗️ Exercise 3 of 3 — Build a Provenance Test Plan

Design a data provenance test plan of 5 test cases for a fictional credit risk model trained on Fern Bank lending data. Each test case should have at least: an ID, what it verifies, an acceptance criterion, and the evidence required. Cover lawful basis, lineage, residency, and integrity.

Model answer

PROV-01 | Verifies: every field in the training data traces to a documented source system | Acceptance criteria: 100% of fields mapped to a named source system in a data lineage document; 0 unmapped fields | Evidence required: field-to-source lineage map; reviewer sign-off

PROV-02 | Verifies: lawful basis for using customer lending data to train an AI model | Acceptance criteria: a documented Privacy Act 2020 basis (consent or other lawful ground) covering AI training exists for every customer-data field used | Evidence required: privacy impact assessment (PIA) reference; legal sign-off; list of fields and their basis

PROV-03 | Verifies: data residency complies with Fern Bank's data handling requirements | Acceptance criteria: all training data was stored and processed within approved jurisdictions; no field processed offshore without approval | Evidence required: data location attestation from the platform team; processing-location log

PROV-04 | Verifies: data integrity between source extract and training set | Acceptance criteria: row counts and checksums match between the source extract and the training dataset; 0 silent drops or duplications | Evidence required: checksum/row-count reconciliation report; extract and training snapshot IDs

PROV-05 | Verifies: transformations applied during preparation are documented and reproducible from source | Acceptance criteria: every derived or transformed field has a documented transformation rule; spot-check of 20 records reproduces the derived values from source | Evidence required: transformation specification; spot-check results with the 20 record IDs

Strong plans: each case is specific, has a measurable criterion, names concrete evidence, and together they cover lawful basis (PROV-02), lineage (PROV-01, PROV-05), residency (PROV-03), and integrity (PROV-04). Weak plans restate "check the data source is good" five times — that is the difference being marked.

Why teams fail here

Conflating model accuracy with data quality — a model that scores 95% on a held-out test set drawn from the same skewed source has proven nothing about the underlying data; it has only proven it learned the skew consistently.
No independent representativeness baseline — teams compare training data to the test set rather than to the live population; without an external reference like Stats NZ population data or a rolling snapshot of real transactions, you cannot know what you are missing.
Assuming warehouse data is clean data — data that has been in a system for years has often survived migrations, schema changes, and undocumented transformations; provenance testing traces lineage end to end, not just to the warehouse boundary.
Single-labeller label sets with no agreement check — when labels come from one expert or one team with a shared bias, the model learns that bias perfectly; inter-labeller agreement on a sample is the only way to catch this before training.
Treating Privacy Act compliance as a legal team problem — under NZ Privacy Act 2020 IPP 10, data collected for one purpose cannot be used for AI training without a fresh lawful basis; by the time legal raises this after training, you may need to discard the model entirely.
No timeliness check on training data vintage — a model trained on pre-COVID lending patterns, pre-restructure claims data, or pre-census Stats NZ population files is learning a world that no longer exists; timeliness is where data quality failure becomes drift before the model is even deployed.

11 Self-Check

Q1: Why does testing the model on a held-out test set not prove the data is sound?

Because the test set is usually drawn from the same source as the training data, so it inherits the same blind spots. The model can score perfectly against a test set that shares its gaps — the Awa Mutual trap. You must test the data for representativeness independently, against the real live population.

Q2: Name the three data test types in 42119 and the question each answers.

Representativeness — does the data reflect the full population we serve? Provenance — where did it come from and are we allowed to use it this way? Label correctness — are the answers the model learned from actually right?

Q3: Why is “more data” not the same as “representative data”?

Volume does not fix skew. A million records all drawn from one region or one customer type is still unrepresentative of a population that includes others. Representativeness is about coverage of the groups that matter, measured against the live population — not row count.

Q4: Which data quality dimension is the seed of drift, and why?

Timeliness. If the training data is stale, the model has learned a world that no longer exists — and as the real world keeps moving, the gap widens. That is drift. It is where data quality testing and the drift testing in Lesson 3 meet.

Q5: What makes a data test case “42119-compliant” rather than an informal check?

A measurable acceptance criterion (a tolerance or threshold, not “looks fine”), reproducible evidence (the query/method and the dataset snapshot it ran against), and traceability to a numbered risk in the AI risk register. Plus a dated reviewer decision.

12 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“Our model scores 94% accuracy on the test set. Why would you still want to test the training data?”

Because the test set is almost certainly drawn from the same source as the training data, so it shares the same blind spots — the model can score 94% against a test set that under-represents the same groups the training data did. I’d test the data independently: a representativeness comparison against a recent snapshot of the live population broken down by the groups we make decisions about, a label-correctness sample, and a provenance check. A high accuracy number on a skewed test set is exactly how systems pass testing and then fail in production.

“What is label correctness testing, and how would you do it on a fraud model?”

It checks whether the “right answers” the model learned from are actually right. On a fraud model I’d sample the records labelled “fraud” and re-check them against ground truth — were these confirmed fraud, or just suspected? I’d measure inter-labeller agreement where two people labelled the same case, because low agreement means the task is ambiguous and the labels are noisy. And I’d look for systematic bias — whether certain groups were labelled more harshly. If the labels are wrong, the model faithfully learns the wrong thing and tests perfectly against its own bad labels.

“How does the Privacy Act 2020 come into data testing for an AI model?”

It lands squarely in provenance testing. Data collected for one purpose can’t simply be repurposed to train an AI without a lawful basis — so before training, I’d confirm that the consent or other lawful ground under which each field was collected actually covers AI training, and that residency requirements were met for any government or health data. I’d want that documented as a provenance test result with legal sign-off, not assumed because the data was already in our warehouse. Doing it after a complaint is too late.

Key takeaway

A model is only as honest as the data it learned from — so testing the model without testing the data is like proofreading a translation without checking the original.
