Test with AI · ISO/IEC 42119

Bias and Fairness Testing

Fairness is not an ethical opinion you hold about a system. It is a quality characteristic you can measure, with specific techniques, before the system ever goes live.

Test with AI · ISO/IEC TS 42119-2:2025 · Lesson 4 of 6 · ~30 min read · ~75 min with exercises

1 The Hook

Back to the NZ Revenue Analytics Unit from Lesson 1 — the agency whose document classification AI misrouted Māori land trust correspondence. In Lesson 1 we used it to show a data quality failure. Now look at the same failure through the fairness lens, because that is where the test gap really sat.

The test team had done real work. They ran functional tests — documents went somewhere. They ran performance tests — the API was fast and accurate on the test set. What they never ran was a single test that asked: does this model perform equally well across the different groups of people it serves? Not one test case compared classification accuracy for Māori land correspondence against the rest. The question was simply not in the test plan.

It was not in the plan because the team thought of fairness as an ethics topic — something for a governance committee, a values statement, a tick-box at the end. It did not occur to them that fairness was a thing you test, with a test case, an acceptance criterion, and a pass/fail, exactly like performance.

That is the misconception this lesson removes. Under ISO/IEC 42119, fairness is a measurable quality characteristic with its own test types. The disparity in the Revenue Analytics Unit model was detectable before go-live — a demographic parity test comparing accuracy across correspondence types would have failed loudly in the test environment. Nobody had to wait for a complaint. They just had to test for it.

This lesson gives you the concrete techniques: where bias comes from, the two 42119 fairness test types, the fairness metrics, how the NZ Human Rights Act 1993 tells you which groups to test, and how to build a fairness test plan that would have caught the Revenue Analytics Unit failure in week one.

2 The Rule

Bias in an AI system is detectable before deployment. ISO/IEC 42119 treats fairness as a measurable quality characteristic with concrete test types — not an ethics exercise done by a committee at the end. If “does this perform equally across groups?” is not a test case in your plan, you are not testing for fairness.

3 The Analogy

Two identical loan applications with different names.

Imagine two people walk into a bank with applications that are identical in every way that should matter — same income, same deposit, same debts, same job — and the only difference is their name and the suburb they live in. If one is approved and the other declined, you do not need an ethics debate to know something is wrong. You can see it, because you held everything else constant and changed only the thing that should not count.

That is exactly what fairness testing does to a model. You feed it cases that are identical except for one protected characteristic, and you watch whether the decision changes. Fairness is not a feeling about the model — it is the answer to a controlled experiment you can run, repeat, and put a pass/fail on. The two-identical-applications test even has a formal name in 42119: counterfactual fairness testing.

4 What AI Bias Is and Where It Comes From

AI bias is a systematic, unfair difference in how a model treats different groups of people. “Systematic” matters: a model that is occasionally wrong is not biased; a model that is reliably more wrong for one group is. In practice, bias enters in three places.

1. Training data

The most common source, and the one Lesson 2 covered. If a group is under-represented (as Māori land trust correspondence was at the Revenue Analytics Unit) or was historically treated unfairly in the data the model learned from, the model reproduces and often amplifies that pattern. A model trained on past lending decisions learns the past’s biases as if they were rules.

2. Model and feature design

Bias can come from what the model is allowed to look at. A feature that seems neutral can be a stand-in for a protected characteristic — a postcode can proxy for ethnicity, a school name can proxy for age or wealth. The model never “sees” ethnicity, yet decides as if it did, because the proxy carries the signal.

3. Feedback loops

Bias can grow after deployment. If a model sends fewer South Auckland claims to fast-track, fewer get resolved quickly, that outcome feeds the next round of training data, and the bias deepens. Feedback loops turn a small initial skew into an entrenched one — another reason continuous validation (Lesson 3) matters for fairness, not just performance.

5 The Two Fairness Test Types

42119 defines two fairness test types. They are complementary — one tests individuals, one tests groups — and a thorough fairness test plan uses both.

Counterfactual fairness testing

Tests individuals. Take a real or synthetic case, change only a protected characteristic, hold everything else identical, and check the decision does not change. This is the two-identical-applications analogy made formal. It is powerful because it isolates cause: if flipping ethnicity (and nothing else) flips the decision, the model is using ethnicity, directly or through a proxy. You build it as matched test pairs: the original and its counterfactual twin.
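A matched-pair check can be sketched in a few lines. This is a minimal illustration, not a 42119-prescribed implementation: the `Application` fields, the deliberately biased `toy_model`, and the suburb labels are all hypothetical stand-ins for the real model under test. Suburb is used here as the proxy for a protected characteristic.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Application:
    """Hypothetical applicant record; field names are illustrative only."""
    income: int
    deposit: int
    debts: int
    suburb: str


def toy_model(app: Application) -> str:
    """Stand-in for the model under test. It is deliberately biased on suburb."""
    affordable = app.income * 5 + app.deposit - app.debts
    penalty = 50_000 if app.suburb == "Suburb B" else 0
    return "APPROVE" if affordable - penalty >= 400_000 else "DECLINE"


def counterfactual_flips(
    model: Callable[[Application], str],
    cases: Iterable[Application],
    attribute: str,
    alternative: str,
) -> list[tuple[Application, str, str]]:
    """Return every case whose decision changes when only `attribute` changes."""
    flips = []
    for original in cases:
        # The counterfactual twin: identical except for the one attribute.
        twin = replace(original, **{attribute: alternative})
        before, after = model(original), model(twin)
        if before != after:
            flips.append((original, before, after))
    return flips


cases = [
    Application(income=90_000, deposit=60_000, debts=10_000, suburb="Suburb A"),
    Application(income=70_000, deposit=80_000, debts=5_000, suburb="Suburb A"),
    Application(income=40_000, deposit=20_000, debts=30_000, suburb="Suburb A"),
]
flips = counterfactual_flips(toy_model, cases, attribute="suburb", alternative="Suburb B")
verdict = "PASS" if not flips else "FAIL"
print(f"{verdict}: {len(flips)} of {len(cases)} pairs flipped")
```

With this toy model the second application flips from APPROVE to DECLINE, so the test fails. The clear approval and the clear decline do not flip, which is why pairs near the decision boundary are usually the most informative ones to construct.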

Demographic parity testing

Tests groups. Run the model across a representative population and compare outcomes between groups — is the approval rate, or the “needs review” rate, or the accuracy, materially different for one group versus another? This is what would have caught the Revenue Analytics Unit failure: compare classification accuracy for Māori land correspondence against all other types, and the gap shows up as a failed test. Demographic parity testing works on aggregates, so it catches patterns that individual cases might not reveal.
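The group comparison described above might look like the following sketch. The counts, the queue names, and the 5-percentage-point tolerance are invented for illustration; they are not figures from the Revenue Analytics Unit case.

```python
from collections import defaultdict

# Hypothetical labelled test results:
# (correspondence_type, expected_queue, predicted_queue)
results = (
    [("general", "tax", "tax")] * 188
    + [("general", "tax", "other")] * 12
    + [("maori_land_trust", "trust", "trust")] * 31
    + [("maori_land_trust", "trust", "tax")] * 19
)

TOLERANCE = 0.05  # assumed acceptance criterion: 5 percentage points


def accuracy_by_group(rows):
    """Classification accuracy for each group in the test set."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, expected, predicted in rows:
        total[group] += 1
        correct[group] += expected == predicted  # True counts as 1
    return {g: correct[g] / total[g] for g in total}


per_group = accuracy_by_group(results)
gap = max(per_group.values()) - min(per_group.values())
passed = gap <= TOLERANCE

for group, acc in per_group.items():
    print(f"{group:>18}: {acc:.1%}")
print(f"gap = {gap:.1%} -> {'PASS' if passed else 'FAIL'}")
```

On this invented data the overall accuracy is about 88%, which would look healthy on a dashboard, while the per-group split shows 94% against 62%. The test only fails because the results are broken out by group.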

Pro tip: Use both, because each catches what the other misses. A model can pass demographic parity (equal approval rates overall) while failing counterfactual fairness (it flips individual decisions on a protected attribute but the effects happen to cancel out in aggregate). And it can pass counterfactual checks on the cases you tried while failing demographic parity on a group you did not think to construct a pair for. Groups and individuals are different questions.

6 Fairness Metrics a Tester Needs

There is no single “fairness number.” Different definitions of fair can conflict, and part of the tester’s job is knowing which one the system needs. Three you must be able to name:

Metric	Plain meaning	When it applies
Demographic parity	Each group gets a positive outcome at the same rate (e.g. equal approval rates across ethnicities).	When equal access to the outcome is the goal — e.g. who gets shortlisted, who gets an offer shown.
Equal opportunity	Of the people who genuinely should get a positive outcome, each group is caught at the same rate (equal true-positive rate).	When missing a genuine case is the harm — e.g. a fraud or eligibility model should catch real cases equally across groups.
Predictive parity	When the model gives a positive decision, it is correct at the same rate for each group (equal precision).	When a wrong positive harms the person — e.g. being wrongly flagged for review or investigation should be equally rare across groups.
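All three metrics fall out of the same per-group confusion counts, which a short sketch makes concrete. The `Confusion` class and the counts below are assumptions made up for this example; "positive" means the favourable decision.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one group (positive = favourable decision)."""
    tp: int  # deserved a positive, got one
    fp: int  # did not deserve a positive, got one
    fn: int  # deserved a positive, was refused
    tn: int  # did not deserve a positive, was refused

    @property
    def selection_rate(self) -> float:
        """Share of the group given a positive; demographic parity compares this."""
        return (self.tp + self.fp) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def true_positive_rate(self) -> float:
        """Share of deserving cases given a positive; equal opportunity compares this."""
        return self.tp / (self.tp + self.fn)

    @property
    def precision(self) -> float:
        """Share of positives that were deserved; predictive parity compares this."""
        return self.tp / (self.tp + self.fp)


# Invented counts for two groups of 1,000 people each.
groups = {
    "Group A": Confusion(tp=400, fp=100, fn=100, tn=400),
    "Group B": Confusion(tp=300, fp=200, fn=200, tn=300),
}


def gap(metric: str) -> float:
    """Largest difference between groups on the named metric."""
    values = [getattr(c, metric) for c in groups.values()]
    return max(values) - min(values)


for name in ("selection_rate", "true_positive_rate", "precision"):
    print(f"{name:>18}: gap = {gap(name):.1%}")
```

Here both groups receive positives at the same 50% rate, so demographic parity holds exactly, yet the true-positive rate and the precision each differ by 20 percentage points. Which gap matters depends on the harm the system can cause.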

These can pull against each other: it is mathematically impossible to satisfy all three at once except in special cases, such as when every group has the same underlying rate of genuinely positive cases. That is not a loophole. It is a decision the team must make explicitly: which definition of fairness does this system owe its users, given what a wrong decision costs them? The tester’s contribution is to force that decision into the open and then test against the chosen definition, rather than letting “it’s fair” stay undefined.
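A short derivation shows where the conflict comes from. The notation is introduced here for illustration and is not taken from the standard: for a group $g$, let $p_g$ be the share of people who genuinely deserve a positive outcome, $T_g$ the true-positive rate, $F_g$ the false-positive rate, $S_g$ the rate of positive decisions, and $V_g$ the precision.

```latex
\begin{align*}
S_g &= T_g\,p_g + F_g\,(1 - p_g) && \text{(rate compared by demographic parity)}\\
V_g &= \frac{T_g\,p_g}{S_g}      && \text{(rate compared by predictive parity)}
\end{align*}
Suppose all three criteria hold for groups $a$ and $b$:
$S_a = S_b = S$, $T_a = T_b = T$, $V_a = V_b = V$, with $S > 0$ and $T > 0$. Then
\[
  V = \frac{T\,p_g}{S}
  \quad\Longrightarrow\quad
  p_g = \frac{V S}{T} \ \text{for both groups}
  \quad\Longrightarrow\quad
  p_a = p_b .
\]
```

So, apart from degenerate models, the three criteria can hold together only when the groups have the same underlying rate of genuine positives. When those rates differ, at least one criterion has to give.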

7 NZ Protected Characteristics — Human Rights Act 1993

Which groups do you test across? In NZ, the starting list is the prohibited grounds of discrimination in the Human Rights Act 1993. If your AI makes or shapes a decision about a person, differential treatment on these grounds is a legal risk, not just a quality one:

sex · marital status · religious belief · ethical belief · colour · race · ethnic or national origins · disability · age · political opinion · employment status · family status · sexual orientation

Two practical notes for the NZ context. First, you rarely test all thirteen on every system. The risk register (Lesson 1) tells you which grounds are plausibly in play for this model, given its inputs and decisions. A licence-renewal model has obvious exposure on age and disability; a marketing model may have different exposure. Second, te Tiriti o Waitangi obligations and the Crown’s responsibilities to Māori mean ethnicity, and specifically outcomes for Māori, deserve explicit attention in public-sector AI, beyond the bare minimum of the Act. The Algorithm Charter for Aotearoa New Zealand reflects this.

Pro tip: You often cannot test a protected characteristic directly because you (rightly) do not collect it. That is where proxies and counterfactual pairs earn their keep: you construct synthetic counterfactual cases that differ only on the characteristic, and you test whether seemingly-neutral features like postcode are acting as proxies. “We don’t collect ethnicity so we can’t be biased on it” is one of the most common and most wrong things a team will tell you.
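One simple way to check whether a feature is acting as a proxy is to ask how well that feature alone predicts group membership on a separate audit sample. The sketch below assumes such a sample exists (for example from a voluntary survey held outside the model); the postcode labels and counts are invented, and a real proxy analysis would usually go further than this single comparison.

```python
from collections import Counter, defaultdict

# Hypothetical audit sample: (postcode, self-identified group).
audit = (
    [("P1", "Group B")] * 42 + [("P1", "Group A")] * 8
    + [("P2", "Group A")] * 45 + [("P2", "Group B")] * 5
    + [("P3", "Group A")] * 30 + [("P3", "Group B")] * 20
)


def proxy_strength(rows):
    """Compare guessing group from the feature with always guessing the largest group."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for value, group in rows:
        by_value[value][group] += 1
        overall[group] += 1
    n = sum(overall.values())
    # Baseline: always guess the most common group, ignoring the feature.
    baseline = max(overall.values()) / n
    # Informed: guess the most common group within each feature value.
    informed = sum(max(counts.values()) for counts in by_value.values()) / n
    return baseline, informed


baseline, informed = proxy_strength(audit)
print(f"baseline {baseline:.1%}, using postcode {informed:.1%}")
```

On this sample, knowing only the postcode lifts the guess rate from about 55% to 78%. A lift of that size suggests the model can recover group membership from postcode even though the characteristic itself was never collected, which is the cue to build counterfactual pairs around that feature.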

NZ Regulatory Checkpoint — OPC Guidance

The Office of the Privacy Commissioner (OPC) has published specific guidance on generative AI and the Privacy Act 2020. Two provisions are directly relevant to fairness testing:

IPP 10 (Limits on use of personal information) — if training data includes personal information, its use must be consistent with the purpose for which it was collected. Biased training sets often contain personal data collected for a different purpose.
IPP 12A (Automated decision-making) — where an AI system makes or significantly influences decisions about individuals, affected people have rights to request human review. Your fairness test suite must identify which decisions qualify.

See OPC AI guidance for the full checklist. The Privacy Act 2020 applies to any agency (including private sector) processing personal information of NZ residents.

8 Building a Fairness Test Plan

A fairness test plan has four parts. Work through them in order for any system:

1. Identify the affected groups. From the Human Rights Act grounds and the risk register, list the characteristics plausibly in play for this model. For a Fern Bank lending model: age, sex, ethnicity (via proxy testing), family status.
2. Choose the fairness definition. Decide — explicitly — whether demographic parity, equal opportunity, or predictive parity is what this system owes its users, based on what a wrong decision costs. Record the decision and the reason.
3. Build the test datasets. For demographic parity, assemble a representative population labelled by group. For counterfactual fairness, construct matched pairs that differ only on the protected characteristic.
4. Set criteria and evidence. Define the acceptable disparity (e.g. “approval rates across groups within 5 percentage points”), and specify the evidence: the per-group results table, the pairs tested, the method, and a dated reviewer decision.

Here is a 42119-aligned demographic parity test case for a fictional Fern Bank lending model:

Test ID:             FAIR-DP-006
Risk category:       Fairness — demographic parity (Human Rights Act: age)
Test type:           Demographic parity
Description:         Compare loan-approval rates across age bands for applicants with
                     equivalent serviceability, on a representative test population.
Acceptance criteria: Approval rate for each age band is within 5 percentage points of
                     the overall approval rate, after controlling for serviceability.
Evidence required:   Per-age-band approval-rate table; population definition; the
                     serviceability control method; reviewer sign-off.
Traceability:        Risk R-09 (model disadvantages older or younger applicants).
Result:              [Pass / Fail] — bands outside tolerance listed.
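The acceptance criterion in FAIR-DP-006 can be automated along these lines. This is a sketch under stated assumptions: the population is invented, and "controlling for serviceability" is implemented here by comparing age bands within each serviceability band, which is one possible control method rather than the one the test case mandates.

```python
from collections import defaultdict

TOLERANCE_PP = 5.0  # from the acceptance criteria: 5 percentage points

# Hypothetical test population: (serviceability_band, age_band, approved).
population = (
    [("high", "18-34", True)] * 80 + [("high", "18-34", False)] * 20
    + [("high", "35-64", True)] * 80 + [("high", "35-64", False)] * 20
    + [("high", "65+", True)] * 68 + [("high", "65+", False)] * 32
    + [("low", "18-34", True)] * 30 + [("low", "18-34", False)] * 70
    + [("low", "35-64", True)] * 32 + [("low", "35-64", False)] * 68
    + [("low", "65+", True)] * 28 + [("low", "65+", False)] * 72
)


def approval_rates(rows):
    """Approval rate (%) per age band, plus the overall rate, for one stratum."""
    approved, total = defaultdict(int), defaultdict(int)
    for _, age_band, ok in rows:
        total[age_band] += 1
        approved[age_band] += ok  # True counts as 1
    overall = 100 * sum(approved.values()) / sum(total.values())
    return {band: 100 * approved[band] / total[band] for band in total}, overall


def run_fair_dp_006(rows):
    """Return (serviceability band, age band, rate, overall) for bands outside tolerance."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[0]].append(row)  # assumed control: stratify by serviceability
    failures = []
    for band, members in strata.items():
        rates, overall = approval_rates(members)
        for age_band, rate in rates.items():
            if abs(rate - overall) > TOLERANCE_PP:
                failures.append((band, age_band, rate, overall))
    return failures


failures = run_fair_dp_006(population)
print("Result:", "Pass" if not failures else f"Fail, bands outside tolerance: {failures}")
```

On this data the low-serviceability stratum passes, while applicants aged 65+ in the high-serviceability stratum are approved at 68% against a stratum rate of 76%, so the test fails and names the band. The per-band rates double as the evidence table the test case asks for.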

9 Bias Mitigation vs Bias Testing

This distinction trips up a lot of test plans, so be sharp about it.

Bias mitigation — the model team’s job

Techniques to reduce bias: rebalancing the training data, removing or transforming proxy features, applying fairness constraints during training, adjusting decision thresholds per group. This is something the data scientists and ML engineers do to the model.

Bias testing — the tester’s job

Independent measurement of whether the model is fair, against a defined criterion, with evidence. You do not de-bias the model; you verify — objectively and reproducibly — whether it meets the fairness criterion, and you report the result.

Why it matters: a test case that says “verify the model has been de-biased” is not a test — it assumes the mitigation worked and gives you nothing to measure. The correct test case names a fairness metric, a population or set of pairs, an acceptance criterion, and produces a pass/fail with evidence. The model team mitigates; you independently confirm. If you both do the same job, no one is actually checking.

10 Common Mistakes

🚫 Treating fairness as an ethics topic, not a test

Why it happens: Fairness sounds like values, and values feel like a governance committee’s job, not a tester’s.
The fix: Under 42119 fairness is a measurable quality characteristic. “Does this perform equally across groups?” is a test case with a metric, a criterion, and a pass/fail — the Revenue Analytics Unit gap. If it is not in the test plan, no committee statement will catch the disparity.

🚫 “We don’t collect ethnicity, so we can’t be biased on it”

Why it happens: Not collecting a characteristic feels like it removes the risk.
The fix: Neutral-looking features act as proxies — postcode for ethnicity, school for age. Not collecting the attribute means you cannot easily measure the bias, but the model can still act on it. Use counterfactual pairs and proxy testing to find it.

🚫 Writing “verify the model has been de-biased” as a test case

Why it happens: It conflates the model team’s mitigation with the tester’s testing.
The fix: That assumes the answer. A real fairness test names a metric, a population or set of pairs, an acceptance criterion, and produces a measured pass/fail. Mitigation is done to the model; testing independently confirms whether it worked.

🚫 Picking only one fairness metric and assuming it covers everything

Why it happens: One number is simpler, and the conflicts between metrics are uncomfortable.
The fix: Demographic parity, equal opportunity, and predictive parity generally cannot all hold at once. Decide explicitly which one the system owes its users given what a wrong decision costs, and use both a group test (parity) and an individual test (counterfactual).

Senior engineer insight

The moment that changed how I approach fairness testing: we were reviewing a credit-risk model for an NZ lender and the team was proud that overall approval rates across gender were within 2 percentage points. Then we sliced by creditworthy applicants only (people who subsequently repaid every dollar) and found women in that group were declined at nearly double the rate of equivalent men. The headline number was fine; the thing that mattered was completely broken. Demographic parity on overall rates is the minimum bar, not the finish line.

The most common mistake: treating a single fairness metric as sufficient and calling the system fair before you have asked what a wrong decision actually costs each affected group.

From the field

A public-sector team deploying a Benefits NZ benefit-eligibility triage tool assumed the model was free of ethnicity bias because ethnicity was not a model input — they had been careful about that. What the initial fairness review found was that suburb and primary contact method (landline vs mobile, paper vs online) together functioned as a near-perfect proxy for Māori and Pacific applicants in their region. The model routed those applications to a slower manual review queue at twice the rate of other groups, adding weeks to decisions for people with the most time-sensitive needs. The fix was not to remove suburb — it had legitimate use — but to add counterfactual pairs that varied suburb and contact method in isolation, set an explicit parity tolerance on queue-routing outcomes, and run those tests every time the model was retrained. The lesson that generalises: if you serve a public and your training data contains any geographic or behavioural signal, run proxy testing before you declare the model clean.

11 Now You Try

Three graded exercises across fairness metrics, test design, and counterfactual pairs. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Name the Violation

Below are outputs from a fictional NZ loan approval AI tested across demographic groups. Identify which fairness metric is being violated and name the 42119 fairness test type that should have caught it.

Of applicants who were genuinely creditworthy (later repaid in full):
  Group A: 88% were approved by the model
  Group B: 61% were approved by the model

Overall approval rate:  Group A 70%  |  Group B 68%   (roughly equal)

Counterfactual check: an identical application with only the applicant’s suburb
changed from a Group A suburb to a Group B suburb flips from APPROVE to DECLINE.

Identify the violation(s) and the test type(s):

Model answer

Metric violated: Equal opportunity. Of applicants who were genuinely creditworthy, 88% of Group A were approved but only 61% of Group B — the true-positive rate is far lower for Group B. Creditworthy Group B applicants are wrongly declined at a much higher rate.

42119 test type that should have caught it: Demographic parity testing in the broad sense (group comparison of outcomes), specifically measuring true-positive rate by group — i.e. an equal-opportunity comparison across groups. The counterfactual flip also means counterfactual fairness testing would have caught it.

What the "roughly equal overall approval rate" hides: This is the trap. Demographic parity on the headline approval rate looks fine (70% vs 68%), so a team checking only overall approval rate would declare the model fair. But equal overall rates can coexist with very unequal treatment of the people who actually deserve approval — Group B's approvals are going to less-creditworthy applicants while creditworthy ones are declined. This is exactly why you test equal opportunity (true-positive rate), not just raw approval rate, and why the counterfactual suburb-flip is decisive: suburb is acting as a proxy for group membership.

🔧 Exercise 2 of 3 — Fix the Test Case

The fairness “test case” below conflates bias mitigation with bias testing. Rewrite it as a valid 42119 demographic parity test case for a fictional KiwiSaver fund recommendation AI (recommends conservative / balanced / growth).

Original (conflates mitigation with testing): “Test that the model has been de-biased and treats everyone equally regardless of gender.”

Rewrite as a valid demographic parity test case:

Model answer

Test ID: FAIR-DP-013

Risk category: Fairness — demographic parity (Human Rights Act: sex)

Test type: Demographic parity

Description: On a representative test population of members with equivalent age, balance, time-to-retirement, and risk-questionnaire answers, compare the distribution of recommended fund types (conservative/balanced/growth) between women and men. The recommendation should not depend on gender once the relevant financial factors are held equivalent.

Acceptance criteria: For each fund type, the recommendation rate for women is within 5 percentage points of the rate for men, on the matched-equivalent population. Any fund type outside tolerance is a fail and must be listed.

Evidence required: Recommendation-distribution table by gender for the matched population; definition of how "equivalent" members were matched; population size per group; reviewer sign-off.

Traceability: AI risk register risk R-05 (fund recommendations differ by gender for equivalent members).

Why this is valid and the original was not: the original said "test that the model has been de-biased" — that assumes the mitigation worked and gives nothing to measure; de-biasing is the model team's job. This version independently MEASURES fairness: a named metric (demographic parity), a defined population, a numeric tolerance, concrete evidence, and traceability. It produces a pass/fail, not an assumption.

🏗️ Exercise 3 of 3 — Build Counterfactual Pairs

Design a counterfactual fairness test set of 4 test pairs for a fictional TransitNZ AI that assesses driver licence renewal eligibility. Each pair must be identical except for one protected characteristic. Cover at least two protected characteristics under the NZ Human Rights Act 1993. For each pair, state what is held constant, what is varied, and the expected result.

Model answer

Pair 1 | Protected characteristic varied: Age | Held constant: clean driving record, valid medical certificate, same licence class, same test scores | Variant A: applicant aged 35 | Variant B: applicant aged 78 | Expected result: identical eligibility decision — age alone must not change the outcome where the medical certificate already attests fitness to drive

Pair 2 | Protected characteristic varied: Disability | Held constant: same age, same clean record, same licence class; both hold a valid medical certificate clearing them to drive | Variant A: no declared disability | Variant B: declared mobility disability with an approved adapted-vehicle condition | Expected result: identical eligibility decision — a disability that has been medically cleared with appropriate conditions must not reduce eligibility

Pair 3 | Protected characteristic varied: Sex | Held constant: same age, record, licence class, scores | Variant A: male applicant | Variant B: female applicant | Expected result: identical decision

Pair 4 | Protected characteristic varied: Ethnic/national origin (proxy test) | Held constant: all driving and medical factors identical | Variant A: name and suburb associated with one group | Variant B: name and suburb associated with another group | Expected result: identical decision — if it changes, name/suburb is acting as a proxy and the model is discriminating on a prohibited ground

Strong sets: each pair changes exactly ONE protected characteristic and holds all legitimate factors constant, covers at least two grounds (here: age, disability, sex, ethnicity), and states the expected result as "identical decision" with a one-line justification. The disability and age pairs include the important nuance that a legitimate, already-assessed factor (the medical certificate) is held constant so the test isolates the protected characteristic — not a genuine fitness-to-drive issue. Pair 4 demonstrates proxy testing for a characteristic you don't collect directly.
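The four pairs can also be held as data and run against the model on every retrain. In this sketch the field names, the `BASE` case, and the `assess` function are hypothetical; `assess` is a placeholder that deliberately penalises age so the output shows what a failing pair looks like.

```python
# Shared base case; each pair overrides only the field(s) under test.
BASE = {
    "record": "clean", "medical_certificate": True, "licence_class": 1,
    "age": 45, "sex": "F", "disability": "none",
    "name": "Name A", "suburb": "Suburb A",
}

# (label, overrides for variant A, overrides for variant B)
PAIRS = [
    ("Pair 1: age", {"age": 35}, {"age": 78}),
    ("Pair 2: disability", {"disability": "none"},
     {"disability": "mobility, adapted vehicle approved"}),
    ("Pair 3: sex", {"sex": "M"}, {"sex": "F"}),
    ("Pair 4: ethnicity proxy", {"name": "Name A", "suburb": "Suburb A"},
     {"name": "Name B", "suburb": "Suburb B"}),
]


def assess(applicant: dict) -> str:
    """Placeholder for the model under test; deliberately penalises age over 75."""
    if applicant["record"] != "clean" or not applicant["medical_certificate"]:
        return "NOT ELIGIBLE"
    return "REFER" if applicant["age"] > 75 else "ELIGIBLE"


def run_pairs(model, base, pairs):
    """Run both variants of each pair; a pair passes only if the decisions match."""
    report = {}
    for label, variant_a, variant_b in pairs:
        decision_a = model({**base, **variant_a})
        decision_b = model({**base, **variant_b})
        report[label] = (decision_a, decision_b, decision_a == decision_b)
    return report


report = run_pairs(assess, BASE, PAIRS)
for label, (a, b, same) in report.items():
    print(f"{label}: {a} vs {b} -> {'PASS' if same else 'FAIL'}")
```

With this placeholder, Pair 1 fails (ELIGIBLE against REFER) and the other three pass. Keeping the medical certificate in the shared base case is what makes the age result meaningful: fitness to drive is already attested, so only age differs.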

Why teams fail here

Choosing the wrong fairness metric for the harm. A team picks demographic parity (equal overall approval rates) for an eligibility model where the real harm is missing people who genuinely qualify. Equal opportunity — equal true-positive rates — is the right metric, but it only appears when you slice by the ground truth. The headline number looks fine and the harm goes unmeasured.
Proxy blindness: “we don’t collect that attribute.” Teams drop ethnicity from the model and believe the problem is solved. Postcode, school attended, preferred contact channel, or even the combination of name patterns and suburb can reconstitute a protected characteristic with high fidelity. Not collecting it means you cannot measure it directly — it does not mean the model is not acting on it.
Conflating mitigation with testing. Data scientists de-bias the training data and the test plan reads “verify model has been de-biased.” That is not a test — it is an assumption. A valid fairness test independently measures an agreed metric against a representative population with a numeric tolerance and produces a pass/fail. If both teams trust the mitigation, no one is checking whether it actually worked.
Testing fairness only at launch, never after retraining. A model that passes fairness tests in month one can drift badly by month six as production feedback loops shift the training distribution — particularly where under-served groups interact less frequently and are progressively under-represented in new data. Fairness tests belong in the CI pipeline alongside performance tests, not just in the initial acceptance gate.
Leaving “fair” undefined until after the model is built. In NZ public-sector contexts — CoverNZ injury assessments, Benefits NZ eligibility, TransitNZ licensing — the choice between demographic parity, equal opportunity, and predictive parity has legal and Te Tiriti implications. Leaving it implicit means the team optimises for whichever definition makes the model look best, rather than the one that reflects what a wrong decision actually costs the people it affects.
Using only one test type and assuming coverage. A model can pass every counterfactual pair you constructed while failing demographic parity on a group you never built a pair for. It can pass demographic parity on overall approval rates while failing equal opportunity on the creditworthy subgroup. Running both group tests and individual counterfactual pairs is not redundant — they catch different things.

12 Self-Check

Q1: Why is fairness a testable quality characteristic and not just an ethics topic?

Because under 42119 it has concrete test types (counterfactual fairness, demographic parity), measurable metrics, acceptance criteria, and a pass/fail. The disparity in a model is detectable before deployment with a test case — you do not need to wait for a complaint or a committee.

Q2: What is the difference between counterfactual fairness and demographic parity testing?

Counterfactual fairness tests individuals — change only a protected characteristic in one case and check the decision does not flip (matched pairs). Demographic parity tests groups — compare outcomes across a population. Use both: each catches what the other misses.

Q3: Why can’t you satisfy demographic parity, equal opportunity, and predictive parity all at once?

They are mathematically incompatible except in special cases. So the team must decide explicitly which definition the system owes its users — based on what a wrong decision costs — and the tester tests against that chosen definition rather than leaving “fair” undefined.

Q4: A team says “we don’t collect ethnicity, so the model can’t be biased on it.” Why is that wrong?

Neutral-looking features act as proxies — postcode for ethnicity, school for age. Not collecting the attribute means you can’t easily measure the bias, but the model can still act on it through proxies. Counterfactual pairs and proxy testing are how you detect it.

Q5: What is the difference between bias mitigation and bias testing?

Mitigation is what the model team does to reduce bias (rebalancing data, removing proxies, fairness constraints). Testing is the tester independently measuring whether the model is fair against a defined criterion, with evidence. “Verify the model has been de-biased” is not a test — it assumes the answer.

13 Interview Prep

Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.

“How would you test an AI loan model for bias when we don’t collect ethnicity?”

Two ways. First, counterfactual pairs: I take applications that are identical on all the legitimate factors — income, deposit, serviceability — and vary only a proxy like name and suburb, then check whether the decision flips. If it does, a neutral-looking feature is acting as a stand-in for ethnicity. Second, proxy and demographic-parity analysis on the features we do hold: I’d look at whether outcomes differ across suburbs or other proxies in a way that tracks ethnicity. Not collecting the attribute means I can’t measure it head-on, but it doesn’t mean the model isn’t using it — so I test for the proxy.

“The data scientists say they’ve already de-biased the model. Is there anything left for you to test?”

Yes — de-biasing is mitigation, which is their job, and testing is independent verification, which is mine. “We de-biased it” is a claim, not evidence. I’d still run demographic parity and counterfactual fairness tests against an agreed metric and tolerance, on a representative population and matched pairs, and produce a measured pass/fail with the per-group results. If the tester and the model team both just trust the mitigation, no one is actually checking whether it worked.

“Which fairness metric should we use for our eligibility model?”

It depends on what a wrong decision costs, and it’s a decision we should make explicitly rather than default into. If the main harm is missing people who genuinely qualify, I’d argue for equal opportunity: equal true-positive rates across groups. If the main harm is a wrong positive, such as wrongly flagging someone for review or investigation, predictive parity matters more. If equal access to the outcome is the goal, demographic parity. They generally can’t all hold at once, so my job is to force that trade-off into the open, get it agreed and recorded, and then test against the chosen definition. I’d still run counterfactual pairs alongside whichever group metric we pick.

Key takeaway

Fairness is not a value statement about your system — it is a measurement you either ran or skipped, and “we intended to be fair” is not evidence that you were.
