Data Quality Testing
In an AI system, the data is the product. A model trained on flawed data is a flawed system — no matter how clean the code. Data quality testing is not a pre-condition for AI testing. It is AI testing.
1 The Hook
Awa Mutual, a fictional NZ health insurer, built an AI to triage incoming claims — sorting them into fast-track, standard, and manual-review queues so urgent claims got seen first. They trained it on five years of historical claim data. In testing it looked excellent: high accuracy on the held-out test set, fast, consistent. It went live.
Within a few months, complaints surfaced. Claims from Pasifika customers were being routed to the slow manual-review queue at a noticeably higher rate than claims from other customers — the same kinds of claims, treated differently. People who needed a fast decision were waiting longest.
The model had no bug. The investigation traced the problem to the training data. Historically, Awa Mutual’s book had been concentrated in central and eastern Auckland suburbs. Claims from South Auckland and Manukau, where much of the region’s Pasifika population lives, were under-represented in the five years of history the model learned from. The model had seen fewer examples of these claims, so it classified them with less confidence, and it sent them to manual review. The data taught it to.
Here is the lesson hidden in that story: the team tested the model and never tested the data. They checked that the model performed well on a test set — but that test set was drawn from the same skewed history, so it carried the same blind spot. A model and its test set can agree perfectly and both be wrong, because they inherited the same flawed data.
Under ISO/IEC 42119, the data is a first-class test target. You test it for representativeness, for provenance, and for label correctness — before and alongside testing the model. That is what this lesson teaches.
2 The Rule
In an AI system the data is the product, so untested data is an untested system. Testing the model on data drawn from the same flawed source will not save you — the model and its test set inherit the same blind spots. Data quality testing is not a setup step before the real testing. It is the real testing.
3 The Analogy
A MasterChef NZ contestant given off ingredients.
Put the best cook in the country in the kitchen and hand them spoiled fish, bruised produce, and salt mislabelled as sugar, and they will plate up a bad dish. Not because they cannot cook — because you cannot cook your way out of bad ingredients. The model is the contestant; the training data is the ingredients. A brilliant model architecture trained on unrepresentative, mislabelled, or dubious data produces a confidently bad system.
And here is the part that catches teams out: if you judge the dish by comparing it with a sample from the same spoiled batch (that is, if your test set comes from the same flawed data), everything seems fine. Data quality testing is checking the ingredients before they go in the pot, not just tasting the dish at the end.
4 The Three Data Test Types
ISO/IEC 42119 defines three distinct kinds of data testing. They answer three different questions, and a serious AI test plan covers all three.
Data representativeness testing
The question: does the training and test data reflect the full population the system will actually serve? This is the Awa Mutual failure. Representativeness testing checks that every group, condition, and scenario the live system will encounter is present in the data in sufficient quantity — across geography, demographics, edge conditions, and rare-but-important cases. A dataset can be enormous and still unrepresentative if it over-samples one slice of reality.
Data provenance testing
The question: where did this data come from, and are we allowed to use it this way? Provenance testing checks the origin, lineage, and permitted use of every dataset feeding the model. Was it collected lawfully? Does its consent basis cover training an AI? Has it been altered since collection, and by whom? For NZ systems this is where the Privacy Act 2020 lives — data collected for one purpose cannot simply be repurposed to train a model without a lawful basis. Provenance is also a security concern: data of unknown origin can be poisoned.
Label correctness testing
The question: are the “right answers” the model learned from actually right? Supervised models learn from labelled examples — this claim was fraud, that one was not; this document is a succession application, that one is a rates query. If the labels are wrong, the model faithfully learns the wrong thing. Label correctness testing samples the labels and checks them against ground truth, looks for inconsistency between labellers, and flags systematic labelling errors. Garbage labels produce a garbage model that tests perfectly against its own garbage.
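One common way to put a number on inconsistency between labellers is Cohen's kappa, which corrects raw agreement for the agreement two labellers would reach by chance. The sketch below is a minimal illustration only: the function name `cohens_kappa` and the ten double-labelled claims are made up for the example, and nothing here is prescribed by 42119.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each labeller's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical double-labelled sample of ten claims.
investigator_1 = ["fraud", "ok", "ok", "fraud", "ok", "ok", "ok", "fraud", "ok", "ok"]
investigator_2 = ["fraud", "ok", "fraud", "fraud", "ok", "ok", "ok", "ok", "ok", "ok"]

kappa = cohens_kappa(investigator_1, investigator_2)
print(f"kappa = {kappa:.2f}")
```

Here the two investigators agree on 8 of 10 claims, yet kappa comes out at roughly 0.52, because much of that raw agreement is what chance alone would produce when most claims are "ok". What counts as an acceptable kappa is a judgement for the test plan to state as its acceptance criterion.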
5 What to Test For in Each Type
Representativeness — what to check
- Demographic coverage: for an IRD income-assessment model, are all income bands, employment types (salary, wage, self-employed, contractor), and regions present in realistic proportions?
- Geographic coverage: the Awa Mutual case — is every region and urban/rural split represented, not just the head-office catchment?
- Edge and rare cases: for a Toka Tū Ake EQC damage model, are rare-but-critical events (major liquefaction, multi-storey damage) present, or only common minor claims?
- Temporal coverage: does the data span enough time to capture seasonal and cyclical patterns, or only one unusual year?
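A simple first pass over the checks above is to count how many examples each required group or case type has and flag those below a floor. The sketch below assumes records are plain dictionaries; the `coverage_gaps` helper, the damage-type names, and the floor of 30 are all hypothetical values chosen for illustration.

```python
from collections import Counter

def coverage_gaps(records, field, required_values, min_count):
    """Return {value: count} for required values seen fewer than min_count times."""
    counts = Counter(r[field] for r in records)
    # Counter returns 0 for values that never appear, so missing groups are flagged too.
    return {v: counts[v] for v in required_values if counts[v] < min_count}

# Hypothetical damage-claim records; real data would come from the training set.
training = (
    [{"damage_type": "minor_cracking"}] * 950
    + [{"damage_type": "chimney_collapse"}] * 45
    + [{"damage_type": "major_liquefaction"}] * 5
)
required = ["minor_cracking", "chimney_collapse", "major_liquefaction", "multi_storey"]

gaps = coverage_gaps(training, "damage_type", required, min_count=30)
print(gaps)
```

The list of required values is the important design choice: it has to come from what the live system will encounter, not from what happens to be in the dataset, otherwise a case type that is missing entirely can never be flagged.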
Provenance — what to check
- Source and lineage: for a Kiwibank credit model, can every field be traced to a documented source system, with a record of any transformations applied?
- Lawful basis and consent: does the consent under which the data was collected cover its use for training an AI model under the Privacy Act 2020?
- Data residency: for government and health data, was it stored and processed within the jurisdictions the contract requires?
- Integrity: is there evidence the data has not been tampered with or silently corrupted between collection and training?
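One way to gather integrity evidence is to reconcile the row count and a content hash between the source extract and the training set. The sketch below uses Python's standard `hashlib`; the `fingerprint` helper and the sample rows are hypothetical, and it assumes each row has already been serialised to a canonical string with no embedded newlines.

```python
import hashlib

def fingerprint(rows):
    """Return (row count, SHA-256 hex digest) for a list of serialised rows.

    Rows are sorted first, so the digest does not depend on row order.
    """
    digest = hashlib.sha256()
    for line in sorted(rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot blur together
    return len(rows), digest.hexdigest()

# Hypothetical rows: id, date, amount.
source_extract = ["1001,2021-03-04,450.00", "1002,2021-03-05,1200.00", "1003,2021-03-09,80.50"]
reordered_copy = ["1003,2021-03-09,80.50", "1001,2021-03-04,450.00", "1002,2021-03-05,1200.00"]
altered_copy = ["1001,2021-03-04,450.00", "1002,2021-03-05,1250.00", "1003,2021-03-09,80.50"]

print(fingerprint(source_extract) == fingerprint(reordered_copy))  # same content, new order
print(fingerprint(source_extract) == fingerprint(altered_copy))    # one amount changed
```

A matching fingerprint shows the two snapshots hold the same rows; it says nothing about whether the source itself was correct, which is why integrity sits alongside the lineage and lawful-basis checks rather than replacing them.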
Label correctness — what to check
- Accuracy against ground truth: sample labels and re-check them — for a fraud model, were the “fraud” labels confirmed fraud, or just suspected?
- Inter-labeller agreement: when two people labelled the same item, did they agree? Low agreement means the labelling task itself is ambiguous.
- Systematic bias in labelling: were certain groups or case types labelled more harshly or more leniently by the humans who created the training data?
- Class balance: are there enough examples of each label? A fraud dataset that is 99.9% “not fraud” needs careful handling or the model learns to always say “not fraud” and scores 99.9% accuracy while catching nothing.
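The class-balance trap in the last bullet can be shown with a few lines of arithmetic. In this sketch a stand-in "model" predicts "not fraud" for every record in a made-up dataset that is 99.9% "not fraud"; the `accuracy_and_recall` helper and the counts are illustrative assumptions.

```python
def accuracy_and_recall(y_true, y_pred, positive="fraud"):
    """Overall accuracy, plus recall on the positive class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else float("nan")
    return accuracy, recall

# Hypothetical dataset: 10 fraud cases in 10,000 records.
y_true = ["fraud"] * 10 + ["not fraud"] * 9_990
# A "model" that has learned to always answer "not fraud".
y_pred = ["not fraud"] * 10_000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, fraud recall = {recall:.3f}")
```

Accuracy is 0.999 while recall on the fraud class is 0.000: every one of the ten fraud cases is missed. This is why a class-balance check belongs with a per-class metric, not with overall accuracy alone.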
6 Data Quality Dimensions and the Failure Modes They Cause
42119 draws on a set of data quality dimensions. You do not need to memorise a taxonomy — you need to know how each dimension, when it fails, shows up as an AI failure mode.
Dimension | Question | How it fails | AI failure mode
Accuracy | Are the values correct? | Wrong inputs or wrong labels teach the model wrong behaviour. | Model performance failure
Consistency | Do values agree across sources? | The same customer recorded two ways confuses the model. | Model performance failure
Timeliness | Is the data current? | Stale training data is the seed of drift: the model learns a world that no longer exists. | Drift
Representativeness | Does it reflect the real population? | Over- or under-sampled groups get unequal treatment. | Bias / fairness failure
This table is the bridge between Lesson 1 and this one: every data quality dimension that fails surfaces as one of the five AI failure modes. Timeliness in particular is worth holding on to — it is where data quality testing (this lesson) and drift testing (Lesson 3) meet.
7 Building Data Test Cases
A data test case looks different from a functional test case. There are no UI steps and no “click submit.” The system under test is a dataset, the acceptance criterion is a statistical or documentary threshold, and the evidence is a measurement, not a screenshot.
Here is a 42119-aligned data representativeness test case for the Awa Mutual triage model:
Risk category: Data — representativeness (fairness risk)
Data test type: Representativeness
Description: Verify the training dataset represents claimants from all Auckland
territorial areas in proportion to the live claim population, with
particular attention to South Auckland and Manukau.
Acceptance criteria: For each territorial area, training-data claim share is within
±3 percentage points of that area’s share of live claims (rolling 12 months).
Evidence required: Distribution comparison table (training vs live) by area;
query used to produce it; date of the live-population snapshot.
Traceability: Risk R-07 (under-served regions misrouted) in AI risk register.
Result: [Pass / Fail] — areas outside tolerance listed.
Notice the shape: the acceptance criterion is a tolerance band, not an exact value (Lesson 1’s probabilistic principle); the evidence is a reproducible measurement with the query and snapshot date recorded; and the case traces back to a numbered risk in the register. Those three properties are what make it a 42119 data test case rather than an informal data check.
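The tolerance-band criterion in this test case could be automated along the following lines. This is a sketch under stated assumptions, not a prescribed implementation: the area names and claim counts are invented, and `representativeness_check` is a hypothetical helper that expects non-empty count tables.

```python
def share_by_group(counts):
    """Convert raw counts per group into shares that sum to 1."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def representativeness_check(training_counts, live_counts, tolerance_pp=3.0):
    """Compare training vs live share per group.

    Returns (passed, rows), where each row is
    (group, difference in percentage points, within tolerance).
    """
    train = share_by_group(training_counts)
    live = share_by_group(live_counts)
    rows = []
    for group in sorted(set(train) | set(live)):
        diff_pp = (train.get(group, 0.0) - live.get(group, 0.0)) * 100
        rows.append((group, diff_pp, abs(diff_pp) <= tolerance_pp))
    return all(ok for _, _, ok in rows), rows

# Hypothetical claim counts by area (training history vs rolling 12 months live).
training_counts = {"Central": 5000, "East": 3000, "North": 1000, "South": 600, "Manukau": 400}
live_counts = {"Central": 3800, "East": 2400, "North": 1100, "South": 1500, "Manukau": 1200}

passed, rows = representativeness_check(training_counts, live_counts)
for group, diff_pp, ok in rows:
    print(f"{group:<8} {diff_pp:+6.1f} pp  {'ok' if ok else 'OUTSIDE TOLERANCE'}")
print("Result:", "Pass" if passed else "Fail")
```

With these made-up numbers only North falls inside the ±3 point band, so the result is Fail with four areas listed. The printed table is the kind of distribution comparison the "Evidence required" field asks for; the query that produced the counts and the snapshot date would still need to be recorded next to it.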
8 Audit-Ready Evidence for Data Testing
42119 expects data testing to leave evidence a third party can inspect. “We checked the data” is not evidence. For data testing specifically, audit-ready documentation includes:
- The measurement, not the assertion: the actual distribution tables, agreement scores, or provenance records — not a sentence saying they were fine.
- Reproducibility: the query, script, or method used, and the dataset snapshot it ran against, so the measurement can be re-run and confirmed.
- Traceability to a risk: which numbered AI risk this data test addresses (the register from Lesson 1).
- A dated decision: who reviewed the result, when, and what they decided — pass, fail, or accept-with-mitigation.
This is a preview of Lesson 5, which covers audit-ready artefacts across all AI test types. The point to take now: data testing without recorded evidence is invisible to an auditor, and “the model was accurate in testing” is not an answer to “show me that your training data represented South Auckland.”
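The four items above map naturally onto a small structured record that can be stored with the test results. The sketch below is one possible shape, not a 42119-mandated schema: the class name `DataTestEvidence`, its field names, the test ID, the snapshot ID, the reviewer, and the date are all hypothetical; only risk R-07 comes from the test case in section 7.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date

ALLOWED_DECISIONS = {"pass", "fail", "accept-with-mitigation"}

@dataclass(frozen=True)
class DataTestEvidence:
    test_id: str        # hypothetical ID scheme
    risk_id: str        # traceability to the AI risk register
    measurement: dict   # the measured values, not an assertion
    method: str         # query or script used, for reproducibility
    snapshot_id: str    # dataset snapshot the method ran against
    reviewer: str
    decided_on: date
    decision: str

    def __post_init__(self):
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")

    def to_json(self):
        """Serialise the record so it can be filed with the test results."""
        record = asdict(self)
        record["decided_on"] = self.decided_on.isoformat()
        return json.dumps(record, indent=2, sort_keys=True)

evidence = DataTestEvidence(
    test_id="DQ-REP-001",
    risk_id="R-07",
    measurement={"South": -9.0, "Manukau": -8.0},  # made-up differences, in pp
    method="sql/representativeness_by_area.sql",   # made-up path
    snapshot_id="live-claims-snapshot-A",          # made-up snapshot label
    reviewer="Test lead",
    decided_on=date(2025, 1, 15),
    decision="fail",
)
print(evidence.to_json())
```

Restricting `decision` to the three outcomes named above means a vague verdict such as "looks fine" cannot be recorded at all.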
9 Common Mistakes
🚫 Testing the model on a test set drawn from the same flawed data
Why it happens: Splitting one historical dataset into train and test is the standard workflow, and it feels rigorous.
The fix: If the source data is unrepresentative, the test set inherits the same blind spot and the model scores well against its own gap — exactly the Awa Mutual trap. Test the data for representativeness independently, against the real live population, not just the model against a held-out slice.
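A short simulation shows why the held-out slice cannot reveal the gap. The sketch below splits a skewed, made-up claim history 80/20 and compares the share of one under-represented group in the training split, the test split, and an assumed live population; every number in it is illustrative.

```python
import random

def share(records, group):
    """Fraction of records belonging to the given group."""
    return records.count(group) / len(records)

# Hypothetical history: 95% of claims from central suburbs, 5% from the south.
history = ["central"] * 9_500 + ["south"] * 500
random.Random(42).shuffle(history)  # fixed seed so the split is repeatable

# Standard 80/20 hold-out split of the same history.
cut = int(len(history) * 0.8)
train, test = history[:cut], history[cut:]

live_south_share = 0.20  # assumed share of live claims, for illustration only

print(f"train south share: {share(train, 'south'):.3f}")
print(f"test south share:  {share(test, 'south'):.3f}")
print(f"live south share:  {live_south_share:.3f}")
```

The train and test shares land close to each other, both near 5%, because they are random slices of the same history. Neither is anywhere near the assumed 20% live share, and only a comparison against the live population exposes that.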
🚫 Treating “more data” as the same thing as “representative data”
Why it happens: Bigger datasets intuitively feel safer and teams chase volume.
The fix: A million records that all come from one region are still unrepresentative. Volume does not fix skew. Representativeness testing measures coverage of the groups that matter, not row count.
🚫 Assuming the labels are correct because they came from an expert
Why it happens: Labels often come from caseworkers or specialists, so they feel authoritative.
The fix: Experts disagree, get tired, and carry their own biases into labels. Sample and re-check labels against ground truth, and measure inter-labeller agreement. The model can only ever be as right as the labels it learned from.
🚫 Ignoring provenance because “the data was already in our warehouse”
Why it happens: Data that is already on hand feels free of legal questions.
The fix: Data collected for one purpose may not be lawfully usable to train an AI under the Privacy Act 2020. Provenance testing confirms the lawful basis and lineage before training — not after a complaint.
10 Now You Try
Three graded exercises across the three data test types. Write your answer, then compare it to the model answer.
Read the description of a training dataset for a fictional IRD fraud detection AI below. Identify 3 data quality risks that could cause the model to produce biased or unreliable outputs, and name the data test type that addresses each.
The model flags income tax returns for investigation. It was trained on 80,000 returns from the 2021–2022 tax year that were manually reviewed by investigators. “Fraud” labels were applied by whichever investigator handled the case; there was no second reviewer. The dataset is drawn almost entirely from salary-and-wage earners, because that is where the review team focused that year — self-employed and contractor returns make up under 2% of the data. About 6% of records are missing the “prior-year income” field. The data was exported from the case management system; no one recorded which fields were derived or transformed during export.
List 3 data quality risks and the test type for each:
Model answer
There are at least five real risks in this dataset; any three, well explained, earn full marks.
1. Representativeness: Self-employed and contractor returns are under 2% of the data, but they are a real and higher-risk part of the live population. The model will be unreliable on exactly the returns where fraud is often more complex. Test type: data representativeness testing.
2. Label correctness: Fraud labels were applied by a single investigator with no second reviewer. There is no inter-labeller agreement check, so labelling bias and inconsistency are baked in. Test type: label correctness testing.
3. Provenance: The data was exported with no record of which fields were derived or transformed. Lineage is unknown, so you cannot trust that fields mean what you think, or reproduce the dataset. Test type: data provenance testing.
Bonus risks:
- Completeness: 6% of records are missing prior-year income, so the training pipeline must either impute the value or drop the record, and either choice can distort what the model learns.
- Timeliness/representativeness: a single tax year (2021–22) may not represent normal patterns and seeds drift.
The trap: a tester focused only on the model would never see any of this, because the held-out test set is drawn from the same 80,000 salary-and-wage records and would look fine.
The data representativeness test case below is too vague to be useful or auditable. Rewrite it to be 42119-compliant, with these fields: Test ID, Risk category, Data test type, Description, Acceptance criteria, Evidence required, Traceability. Use a fictional MSD benefits eligibility AI as the context.
“Check that the training data is representative and not biased. Make sure all groups are included. Pass if the data looks balanced.”
Rewrite as a complete 42119 data test case:
Model answer
Test ID: DQ-REP-021
Risk category: Data - representativeness (fairness risk)
Data test type: Representativeness
Description: Verify the training dataset represents benefit applicants across all age bands, regions, and benefit types in proportion to the live applicant population, with specific attention to under-25 applicants and rural regions, which are suspected of being under-represented.
Acceptance criteria: For each age band (under-25, 25–39, 40–54, 55–64, 65+), each region, and each benefit type, the training-data share is within ±3 percentage points of that group's share of live applications over the most recent rolling 12 months. Any group outside tolerance is a fail and must be listed.
Evidence required: Distribution comparison table (training data vs live population) for each dimension; the SQL/query used to generate it; the date and source of the live-population snapshot; reviewer sign-off.
Traceability: AI risk register risk R-04 (under-represented applicant groups receive less reliable eligibility decisions).
What makes this 42119-compliant: a measurable tolerance instead of "looks balanced", named groups instead of "all groups", reproducible evidence instead of an assertion, and an explicit link to a numbered risk. The original had none of these.
Design a data provenance test plan of 5 test cases for a fictional credit risk model trained on Kiwibank lending data. Each test case should have at least: an ID, what it verifies, an acceptance criterion, and the evidence required. Cover lawful basis, lineage, residency, and integrity.
Model answer
PROV-01 | Verifies: every field in the training data traces to a documented source system | Acceptance criteria: 100% of fields mapped to a named source system in a data lineage document; 0 unmapped fields | Evidence required: field-to-source lineage map; reviewer sign-off
PROV-02 | Verifies: lawful basis for using customer lending data to train an AI model | Acceptance criteria: a documented Privacy Act 2020 basis (consent or other lawful ground) covering AI training exists for every customer-data field used | Evidence required: privacy assessment / DPIA reference; legal sign-off; list of fields and their basis
PROV-03 | Verifies: data residency complies with Kiwibank's data handling requirements | Acceptance criteria: all training data was stored and processed within approved jurisdictions; no field processed offshore without approval | Evidence required: data location attestation from the platform team; processing-location log
PROV-04 | Verifies: data integrity between source extract and training set | Acceptance criteria: row counts and checksums match between the source extract and the training dataset; 0 silent drops or duplications | Evidence required: checksum/row-count reconciliation report; extract and training snapshot IDs
PROV-05 | Verifies: transformations applied during preparation are documented and reversible to source | Acceptance criteria: every derived or transformed field has a documented transformation rule; spot-check of 20 records reproduces the derived values from source | Evidence required: transformation specification; spot-check results with the 20 record IDs
Strong plans: each case is specific, has a measurable criterion, names concrete evidence, and together they cover lawful basis (PROV-02), lineage (PROV-01, PROV-05), residency (PROV-03), and integrity (PROV-04). Weak plans restate "check the data source is good" five times. That is the difference being marked.
11 Self-Check
Q1: Why does testing the model on a held-out test set not prove the data is sound?
Because the test set is usually drawn from the same source as the training data, so it inherits the same blind spots. The model can score perfectly against a test set that shares its gaps — the Awa Mutual trap. You must test the data for representativeness independently, against the real live population.
Q2: Name the three data test types in 42119 and the question each answers.
Representativeness — does the data reflect the full population we serve? Provenance — where did it come from and are we allowed to use it this way? Label correctness — are the answers the model learned from actually right?
Q3: Why is “more data” not the same as “representative data”?
Volume does not fix skew. A million records all drawn from one region or one customer type is still unrepresentative of a population that includes others. Representativeness is about coverage of the groups that matter, measured against the live population — not row count.
Q4: Which data quality dimension is the seed of drift, and why?
Timeliness. If the training data is stale, the model has learned a world that no longer exists — and as the real world keeps moving, the gap widens. That is drift. It is where data quality testing and the drift testing in Lesson 3 meet.
Q5: What makes a data test case “42119-compliant” rather than an informal check?
A measurable acceptance criterion (a tolerance or threshold, not “looks fine”), reproducible evidence (the query/method and the dataset snapshot it ran against), and traceability to a numbered risk in the AI risk register. Plus a dated reviewer decision.
12 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“Our model scores 94% accuracy on the test set. Why would you still want to test the training data?”
Because the test set is almost certainly drawn from the same source as the training data, so it shares the same blind spots — the model can score 94% against a test set that under-represents the same groups the training data did. I’d test the data independently: a representativeness comparison against a recent snapshot of the live population broken down by the groups we make decisions about, a label-correctness sample, and a provenance check. A high accuracy number on a skewed test set is exactly how systems pass testing and then fail in production.
“What is label correctness testing, and how would you do it on a fraud model?”
It checks whether the “right answers” the model learned from are actually right. On a fraud model I’d sample the records labelled “fraud” and re-check them against ground truth — were these confirmed fraud, or just suspected? I’d measure inter-labeller agreement where two people labelled the same case, because low agreement means the task is ambiguous and the labels are noisy. And I’d look for systematic bias — whether certain groups were labelled more harshly. If the labels are wrong, the model faithfully learns the wrong thing and tests perfectly against its own bad labels.
“How does the Privacy Act 2020 come into data testing for an AI model?”
It lands squarely in provenance testing. Data collected for one purpose can’t simply be repurposed to train an AI without a lawful basis — so before training, I’d confirm that the consent or other lawful ground under which each field was collected actually covers AI training, and that residency requirements were met for any government or health data. I’d want that documented as a provenance test result with legal sign-off, not assumed because the data was already in our warehouse. Doing it after a complaint is too late.