Bias and Fairness Testing
Fairness is not an ethical opinion you hold about a system. It is a quality characteristic you can measure, with specific techniques, before the system ever goes live.
1 The Hook
Back to the NZ Revenue Analytics Unit from Lesson 1 — the agency whose document classification AI misrouted Māori land trust correspondence. In Lesson 1 we used it to show a data quality failure. Now look at the same failure through the fairness lens, because that is where the test gap really sat.
The test team had done real work. They ran functional tests — documents went somewhere. They ran performance tests — the API was fast and accurate on the test set. What they never ran was a single test that asked: does this model perform equally well across the different groups of people it serves? Not one test case compared classification accuracy for Māori land correspondence against the rest. The question was simply not in the test plan.
It was not in the plan because the team thought of fairness as an ethics topic — something for a governance committee, a values statement, a tick-box at the end. It did not occur to them that fairness was a thing you test, with a test case, an acceptance criterion, and a pass/fail, exactly like performance.
That is the misconception this lesson removes. Under ISO/IEC 42119, fairness is a measurable quality characteristic with its own test types. The disparity in the Revenue Analytics Unit model was detectable before go-live — a demographic parity test comparing accuracy across correspondence types would have failed loudly in the test environment. Nobody had to wait for a complaint. They just had to test for it.
This lesson gives you the concrete techniques: where bias comes from, the two 42119 fairness test types, the fairness metrics, how the NZ Human Rights Act 1993 tells you which groups to test, and how to build a fairness test plan that would have caught the Revenue Analytics Unit failure in week one.
2 The Rule
Bias in an AI system is detectable before deployment. ISO/IEC 42119 treats fairness as a measurable quality characteristic with concrete test types — not an ethics exercise done by a committee at the end. If “does this perform equally across groups?” is not a test case in your plan, you are not testing for fairness.
3 The Analogy
Two identical loan applications with different names.
Imagine two people walk into a bank with applications that are identical in every way that should matter — same income, same deposit, same debts, same job — and the only difference is their name and the suburb they live in. If one is approved and the other declined, you do not need an ethics debate to know something is wrong. You can see it, because you held everything else constant and changed only the thing that should not count.
That is exactly what fairness testing does to a model. You feed it cases that are identical except for one protected characteristic, and you watch whether the decision changes. Fairness is not a feeling about the model — it is the answer to a controlled experiment you can run, repeat, and put a pass/fail on. The two-identical-applications test even has a formal name in 42119: counterfactual fairness testing.
4 What AI Bias Is and Where It Comes From
AI bias is systematic, unfair difference in how a model treats different groups of people. “Systematic” matters: a model that is occasionally wrong is not biased; a model that is reliably more wrong for one group is. Keep it practical — there are three places bias enters.
1. Training data
The most common source, and the one Lesson 2 covered. If a group is under-represented (as Māori land trust correspondence was at the Revenue Analytics Unit) or was historically treated unfairly in the data the model learned from, the model reproduces and often amplifies that pattern. A model trained on past lending decisions learns the past’s biases as if they were rules.
2. Model and feature design
Bias can come from what the model is allowed to look at. A feature that seems neutral can be a stand-in for a protected characteristic — a postcode can proxy for ethnicity, a school name can proxy for age or wealth. The model never “sees” ethnicity, yet decides as if it did, because the proxy carries the signal.
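One rough way to check whether a feature is acting as a proxy is to see how well that feature alone predicts the protected characteristic, using an audit sample where the characteristic is known. The sketch below is illustrative only: `proxy_strength`, the field names, and the sample records are all hypothetical, and a real check would need a properly sourced audit sample.

```python
from collections import Counter, defaultdict

def proxy_strength(records, feature, protected):
    """Return (baseline, proxy_accuracy).

    baseline: how often you guess the protected value correctly with no
    information (always guessing the most common group).
    proxy_accuracy: how often you guess it correctly from the feature alone
    (guessing the most common group within each feature value).
    """
    overall = Counter(r[protected] for r in records)
    baseline = overall.most_common(1)[0][1] / len(records)

    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[feature]][r[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / len(records)

# Hypothetical audit sample: postcode labels and groups are made up.
audit_sample = [
    {"postcode": "P1", "group": "A"},
    {"postcode": "P1", "group": "A"},
    {"postcode": "P1", "group": "B"},
    {"postcode": "P2", "group": "B"},
    {"postcode": "P2", "group": "B"},
    {"postcode": "P2", "group": "A"},
]

baseline, proxy_acc = proxy_strength(audit_sample, "postcode", "group")
print(f"baseline={baseline:.2f} proxy_accuracy={proxy_acc:.2f}")
```

If the feature predicts the protected characteristic much better than the baseline, it can carry that signal into the model and is worth covering with counterfactual pairs.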
3. Feedback loops
Bias can grow after deployment. If a model sends fewer South Auckland claims to fast-track, fewer get resolved quickly, that outcome feeds the next round of training data, and the bias deepens. Feedback loops turn a small initial skew into an entrenched one — another reason continuous validation (Lesson 3) matters for fairness, not just performance.
5 The Two Fairness Test Types
42119 defines two fairness test types. They are complementary — one tests individuals, one tests groups — and a thorough fairness test plan uses both.
Counterfactual fairness testing
Tests individuals. Take a real or synthetic case, change only a protected characteristic, hold everything else identical, and check the decision does not change. This is the two-identical-applications analogy made formal. It is powerful because it isolates cause: if flipping ethnicity (and nothing else) flips the decision, the model is using ethnicity, directly or through a proxy. You build it as matched test pairs: the original and its counterfactual twin.
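As a minimal sketch of matched pairs, assuming the model can be called as a function that returns a decision string: `make_counterfactual`, `run_counterfactual_test`, and `toy_model` are hypothetical names, and the toy model is deliberately built to leak suburb so the test has something to catch.

```python
def make_counterfactual(original, field, new_value):
    """Copy a case and change exactly one field, leaving the rest identical."""
    twin = dict(original)
    twin[field] = new_value
    return twin

def run_counterfactual_test(model, pairs):
    """Return the pairs whose decision changed between original and twin."""
    return [(orig, twin) for orig, twin in pairs if model(orig) != model(twin)]

def toy_model(app):
    # Stand-in for the system under test (an assumption, not a real model).
    # It penalises one suburb, which is exactly the defect we want to detect.
    score = app["income"] / 1000 - app["debts"] / 1000
    if app["suburb"] == "Suburb B":
        score -= 20
    return "APPROVE" if score >= 50 else "DECLINE"

base = {"income": 90000, "debts": 30000, "suburb": "Suburb A"}
pairs = [(base, make_counterfactual(base, "suburb", "Suburb B"))]

failures = run_counterfactual_test(toy_model, pairs)
print("Pass" if not failures else f"Fail: {len(failures)} pair(s) flipped")
```

Each failing pair is itself the evidence: two cases that differ in one field and received different decisions.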
Demographic parity testing
Tests groups. Run the model across a representative population and compare outcomes between groups — is the approval rate, or the “needs review” rate, or the accuracy, materially different for one group versus another? This is what would have caught the Revenue Analytics Unit failure: compare classification accuracy for Māori land correspondence against all other types, and the gap shows up as a failed test. Demographic parity testing works on aggregates, so it catches patterns that individual cases might not reveal.
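A group comparison of this kind can be sketched in a few lines. The example below assumes a labelled test set tagged by correspondence type; the group names, counts, and the 5-point tolerance are invented for illustration and are not figures from the Revenue Analytics Unit case.

```python
from collections import defaultdict

def accuracy_by_group(results):
    """results: iterable of (group, predicted_label, true_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in results:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {g: correct[g] / total[g] for g in total}

def parity_check(rates, tolerance):
    """Pass if the gap between best and worst group is within tolerance."""
    gap = max(rates.values()) - min(rates.values())
    return gap <= tolerance, gap

# Hypothetical results: 6 of 10 land-trust documents classified correctly,
# 9 of 10 documents of all other types classified correctly.
results = (
    [("land_trust", "trust", "trust")] * 6
    + [("land_trust", "general", "trust")] * 4
    + [("other", "general", "general")] * 9
    + [("other", "trust", "general")] * 1
)

rates = accuracy_by_group(results)
passed, gap = parity_check(rates, tolerance=0.05)
print(rates, "Pass" if passed else f"Fail (gap {gap:.2f})")
```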
6 Fairness Metrics a Tester Needs
There is no single “fairness number.” Different definitions of fair can conflict, and part of the tester’s job is knowing which one the system needs. Three you must be able to name:
| Metric | Plain meaning | When it applies |
|---|---|---|
| Demographic parity | Each group gets a positive outcome at the same rate (e.g. equal approval rates across ethnicities). | When equal access to the outcome is the goal — e.g. who gets shortlisted, who gets an offer shown. |
| Equal opportunity | Of the people who genuinely should get a positive outcome, each group is caught at the same rate (equal true-positive rate). | When missing a genuine case is the harm — e.g. a fraud or eligibility model should catch real cases equally across groups. |
| Predictive parity | When the model gives a positive decision, it is correct at the same rate for each group (equal precision). | When a wrong positive harms the person — e.g. being wrongly flagged for review or investigation should be equally rare across groups. |
These can pull against each other. Precision is tied to the true-positive rate, the positive-outcome rate, and the group's underlying base rate, so whenever the base rates differ between groups it is mathematically impossible to satisfy all three at once (outside degenerate cases such as a model that approves no one). That is not a loophole; it is a decision the team must make explicitly: which definition of fairness does this system owe its users, given what a wrong decision costs them? The tester’s contribution is to force that decision into the open and then test against the chosen definition, rather than letting “it’s fair” stay undefined.
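To make the three definitions concrete, here is a small sketch that computes each metric per group from (predicted, actual) outcomes. The function name and the counts are hypothetical, chosen so that demographic parity holds while the other two metrics do not.

```python
def safe_ratio(numerator, denominator):
    """Avoid division by zero when a group has no cases in a category."""
    return numerator / denominator if denominator else float("nan")

def group_metrics(records):
    """records: list of (predicted_positive, actually_positive) booleans."""
    tp = sum(1 for p, a in records if p and a)
    fp = sum(1 for p, a in records if p and not a)
    fn = sum(1 for p, a in records if not p and a)
    return {
        "positive_rate": safe_ratio(tp + fp, len(records)),  # demographic parity
        "true_positive_rate": safe_ratio(tp, tp + fn),       # equal opportunity
        "precision": safe_ratio(tp, tp + fp),                # predictive parity
    }

def build(tp, fp, fn, tn):
    """Expand confusion-matrix counts into individual records."""
    return ([(True, True)] * tp + [(True, False)] * fp
            + [(False, True)] * fn + [(False, False)] * tn)

# Hypothetical counts for two groups of 100 people each.
metrics_a = group_metrics(build(tp=40, fp=10, fn=10, tn=40))
metrics_b = group_metrics(build(tp=20, fp=30, fn=20, tn=30))

for name in metrics_a:
    print(f"{name}: A={metrics_a[name]:.2f} B={metrics_b[name]:.2f}")
```

In this made-up data both groups get a positive outcome half the time, yet Group B's genuine cases are caught less often and its positive decisions are wrong more often, which is why the choice of metric has to be explicit.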
7 NZ Protected Characteristics — Human Rights Act 1993
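Which groups do you test across? In NZ, the starting list is the thirteen prohibited grounds of discrimination in section 21 of the Human Rights Act 1993: sex, marital status, religious belief, ethical belief, colour, race, ethnic or national origins, disability, age, political opinion, employment status, family status, and sexual orientation. If your AI makes or shapes a decision about a person, differential treatment on these grounds is a legal risk, not just a quality one.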
Two practical notes for the NZ context. First, you rarely test all thirteen on every system: the risk register (Lesson 1) tells you which grounds are plausibly in play for this model, given its inputs and decisions. A licence-renewal model has obvious exposure on age and disability; a marketing model may have different exposure. Second, te Tiriti o Waitangi obligations and the Crown’s responsibilities to Māori mean that ethnicity, and specifically outcomes for Māori, deserves explicit attention in public-sector AI, beyond the bare minimum of the Act. The Algorithm Charter for Aotearoa New Zealand reflects this.
8 Building a Fairness Test Plan
A fairness test plan has four moving parts. Walk them in order for any system:
1. Identify the affected groups. From the Human Rights Act grounds and the risk register, list the characteristics plausibly in play for this model. For a Kiwibank lending model: age, sex, ethnicity (via proxy testing), family status.
2. Choose the fairness definition. Decide explicitly whether demographic parity, equal opportunity, or predictive parity is what this system owes its users, based on what a wrong decision costs. Record the decision and the reason.
3. Build the test datasets. For demographic parity, assemble a representative population labelled by group. For counterfactual fairness, construct matched pairs that differ only on the protected characteristic.
4. Set criteria and evidence. Define the acceptable disparity (e.g. “approval rates across groups within 5 percentage points”), and specify the evidence: the per-group results table, the pairs tested, the method, and a dated reviewer decision.
Here is a 42119-aligned demographic parity test case for a fictional Kiwibank lending model:
Risk category: Fairness — demographic parity (Human Rights Act: age)
Test type: Demographic parity
Description: Compare loan-approval rates across age bands for applicants with
equivalent serviceability, on a representative test population.
Acceptance criteria: Approval rate for each age band is within 5 percentage points of
the overall approval rate, after controlling for serviceability.
Evidence required: Per-age-band approval-rate table; population definition; the
serviceability control method; reviewer sign-off.
Traceability: Risk R-09 (model disadvantages older or younger applicants).
Result: [Pass / Fail] — bands outside tolerance listed.
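The acceptance criterion in this test case can be evaluated mechanically. The sketch below assumes the test population has already been matched on serviceability, so it only checks each band against the overall rate; the function name, age bands, and counts are hypothetical.

```python
def demographic_parity_result(approvals_by_band, tolerance_pp=5.0):
    """approvals_by_band: {band: (approved_count, total_count)}.

    Compares each band's approval rate with the overall approval rate and
    lists any band outside the tolerance, in percentage points.
    """
    total_approved = sum(a for a, _ in approvals_by_band.values())
    total = sum(n for _, n in approvals_by_band.values())
    overall = 100.0 * total_approved / total

    outside = {}
    for band, (approved, count) in approvals_by_band.items():
        rate = 100.0 * approved / count
        if abs(rate - overall) > tolerance_pp:
            outside[band] = round(rate, 1)

    return {
        "overall_rate": round(overall, 1),
        "result": "Fail" if outside else "Pass",
        "bands_outside_tolerance": outside,
    }

# Hypothetical per-age-band counts (approved, total).
approvals = {
    "18-29": (62, 100),
    "30-49": (64, 100),
    "50-64": (63, 100),
    "65+": (51, 100),
}

outcome = demographic_parity_result(approvals)
print(outcome)
```

The returned dictionary maps directly onto the Result line: a pass/fail plus the list of bands outside tolerance, which goes into the evidence alongside the per-band table.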
9 Bias Mitigation vs Bias Testing
This distinction trips up a lot of test plans, so be sharp about it.
Bias mitigation — the model team’s job
Techniques to reduce bias: rebalancing the training data, removing or transforming proxy features, applying fairness constraints during training, adjusting decision thresholds per group. This is something the data scientists and ML engineers do to the model.
Bias testing — the tester’s job
Independent measurement of whether the model is fair, against a defined criterion, with evidence. You do not de-bias the model; you verify — objectively and reproducibly — whether it meets the fairness criterion, and you report the result.
Why it matters: a test case that says “verify the model has been de-biased” is not a test — it assumes the mitigation worked and gives you nothing to measure. The correct test case names a fairness metric, a population or set of pairs, an acceptance criterion, and produces a pass/fail with evidence. The model team mitigates; you independently confirm. If you both do the same job, no one is actually checking.
10 Common Mistakes
🚫 Treating fairness as an ethics topic, not a test
Why it happens: Fairness sounds like values, and values feel like a governance committee’s job, not a tester’s.
The fix: Under 42119 fairness is a measurable quality characteristic. “Does this perform equally across groups?” is a test case with a metric, a criterion, and a pass/fail — the Revenue Analytics Unit gap. If it is not in the test plan, no committee statement will catch the disparity.
🚫 “We don’t collect ethnicity, so we can’t be biased on it”
Why it happens: Not collecting a characteristic feels like it removes the risk.
The fix: Neutral-looking features act as proxies — postcode for ethnicity, school for age. Not collecting the attribute means you cannot easily measure the bias, but the model can still act on it. Use counterfactual pairs and proxy testing to find it.
🚫 Writing “verify the model has been de-biased” as a test case
Why it happens: It conflates the model team’s mitigation with the tester’s testing.
The fix: That assumes the answer. A real fairness test names a metric, a population or set of pairs, an acceptance criterion, and produces a measured pass/fail. Mitigation is done to the model; testing independently confirms whether it worked.
🚫 Picking only one fairness metric and assuming it covers everything
Why it happens: One number is simpler, and the conflicts between metrics are uncomfortable.
The fix: Demographic parity, equal opportunity, and predictive parity can’t all hold at once. Decide explicitly which one the system owes its users given what a wrong decision costs — and use both a group test (parity) and an individual test (counterfactual).
11 Now You Try
Three graded exercises across fairness metrics, test design, and counterfactual pairs. Write your answer, then compare it to the model answer.
Below are outputs from a fictional NZ loan approval AI tested across demographic groups. Identify which fairness metric is being violated and name the 42119 fairness test type that should have caught it.
Of genuinely creditworthy applicants:
Group A: 88% were approved by the model
Group B: 61% were approved by the model
Overall approval rate: Group A 70% | Group B 68% (roughly equal)
Counterfactual check: an identical application with only the applicant’s suburb
changed from a Group A suburb to a Group B suburb flips from APPROVE to DECLINE.
Identify the violation(s) and the test type(s):
Model answer
Metric violated: Equal opportunity. Of applicants who were genuinely creditworthy, 88% of Group A were approved but only 61% of Group B, so the true-positive rate is far lower for Group B. Creditworthy Group B applicants are wrongly declined at a much higher rate.
42119 test type that should have caught it: Demographic parity testing in the broad sense (group comparison of outcomes), specifically measuring true-positive rate by group, i.e. an equal-opportunity comparison across groups. The counterfactual flip also means counterfactual fairness testing would have caught it.
What the "roughly equal overall approval rate" hides: This is the trap. Demographic parity on the headline approval rate looks fine (70% vs 68%), so a team checking only overall approval rate would declare the model fair. But equal overall rates can coexist with very unequal treatment of the people who actually deserve approval: Group B's approvals are going to less-creditworthy applicants while creditworthy ones are declined. This is exactly why you test equal opportunity (true-positive rate), not just raw approval rate, and why the counterfactual suburb-flip is decisive: suburb is acting as a proxy for group membership.
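The arithmetic behind that answer is short enough to write out. This sketch uses the percentages from the exercise; the 5-point tolerance is an assumption borrowed from the example criterion earlier in the lesson, not part of the exercise.

```python
# Figures from the exercise, in percent.
true_positive_rate = {"Group A": 88, "Group B": 61}     # creditworthy applicants approved
overall_approval_rate = {"Group A": 70, "Group B": 68}  # all applicants approved

TOLERANCE_PP = 5  # assumed tolerance in percentage points

def gap(rates):
    """Difference between the best-served and worst-served group."""
    return max(rates.values()) - min(rates.values())

demographic_parity_gap = gap(overall_approval_rate)
equal_opportunity_gap = gap(true_positive_rate)

demographic_parity_passes = demographic_parity_gap <= TOLERANCE_PP
equal_opportunity_passes = equal_opportunity_gap <= TOLERANCE_PP

print(f"Demographic parity gap: {demographic_parity_gap} pp, pass={demographic_parity_passes}")
print(f"Equal opportunity gap: {equal_opportunity_gap} pp, pass={equal_opportunity_passes}")
```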
The fairness “test case” below conflates bias mitigation with bias testing. Rewrite it as a valid 42119 demographic parity test case for a fictional KiwiSaver fund recommendation AI (recommends conservative / balanced / growth).
Rewrite as a valid demographic parity test case:
Model answer
Test ID: FAIR-DP-013
Risk category: Fairness - demographic parity (Human Rights Act: sex)
Test type: Demographic parity
Description: On a representative test population of members with equivalent age, balance, time-to-retirement, and risk-questionnaire answers, compare the distribution of recommended fund types (conservative/balanced/growth) between women and men. The recommendation should not depend on gender once the relevant financial factors are held equivalent.
Acceptance criteria: For each fund type, the recommendation rate for women is within 5 percentage points of the rate for men, on the matched-equivalent population. Any fund type outside tolerance is a fail and must be listed.
Evidence required: Recommendation-distribution table by gender for the matched population; definition of how "equivalent" members were matched; population size per group; reviewer sign-off.
Traceability: AI risk register risk R-05 (fund recommendations differ by gender for equivalent members).
Why this is valid and the original was not: the original said "test that the model has been de-biased". That assumes the mitigation worked and gives nothing to measure; de-biasing is the model team's job. This version independently MEASURES fairness: a named metric (demographic parity), a defined population, a numeric tolerance, concrete evidence, and traceability. It produces a pass/fail, not an assumption.
Design a counterfactual fairness test set of 4 test pairs for a fictional Waka Kotahi AI that assesses driver licence renewal eligibility. Each pair must be identical except for one protected characteristic. Cover at least two protected characteristics under the NZ Human Rights Act 1993. For each pair, state what is held constant, what is varied, and the expected result.
Model answer
Pair 1 | Protected characteristic varied: Age | Held constant: clean driving record, valid medical certificate, same licence class, same test scores | Variant A: applicant aged 35 | Variant B: applicant aged 78 | Expected result: identical eligibility decision; age alone must not change the outcome where the medical certificate already attests fitness to drive
Pair 2 | Protected characteristic varied: Disability | Held constant: same age, same clean record, same licence class; both hold a valid medical certificate clearing them to drive | Variant A: no declared disability | Variant B: declared mobility disability with an approved adapted-vehicle condition | Expected result: identical eligibility decision; a disability that has been medically cleared with appropriate conditions must not reduce eligibility
Pair 3 | Protected characteristic varied: Sex | Held constant: same age, record, licence class, scores | Variant A: male applicant | Variant B: female applicant | Expected result: identical decision
Pair 4 | Protected characteristic varied: Ethnic/national origin (proxy test) | Held constant: all driving and medical factors identical | Variant A: name and suburb associated with one group | Variant B: name and suburb associated with another group | Expected result: identical decision; if it changes, name/suburb is acting as a proxy and the model is discriminating on a prohibited ground
Strong sets: each pair changes exactly ONE protected characteristic and holds all legitimate factors constant, covers at least two grounds (here: age, disability, sex, ethnicity), and states the expected result as "identical decision" with a one-line justification. The disability and age pairs include the important nuance that a legitimate, already-assessed factor (the medical certificate) is held constant so the test isolates the protected characteristic, not a genuine fitness-to-drive issue. Pair 4 demonstrates proxy testing for a characteristic you don't collect directly.
12 Self-Check
Q1: Why is fairness a testable quality characteristic and not just an ethics topic?
Because under 42119 it has concrete test types (counterfactual fairness, demographic parity), measurable metrics, acceptance criteria, and a pass/fail. The disparity in a model is detectable before deployment with a test case — you do not need to wait for a complaint or a committee.
Q2: What is the difference between counterfactual fairness and demographic parity testing?
Counterfactual fairness tests individuals — change only a protected characteristic in one case and check the decision does not flip (matched pairs). Demographic parity tests groups — compare outcomes across a population. Use both: each catches what the other misses.
Q3: Why can’t you satisfy demographic parity, equal opportunity, and predictive parity all at once?
They are mathematically incompatible except in special cases. So the team must decide explicitly which definition the system owes its users — based on what a wrong decision costs — and the tester tests against that chosen definition rather than leaving “fair” undefined.
Q4: A team says “we don’t collect ethnicity, so the model can’t be biased on it.” Why is that wrong?
Neutral-looking features act as proxies — postcode for ethnicity, school for age. Not collecting the attribute means you can’t easily measure the bias, but the model can still act on it through proxies. Counterfactual pairs and proxy testing are how you detect it.
Q5: What is the difference between bias mitigation and bias testing?
Mitigation is what the model team does to reduce bias (rebalancing data, removing proxies, fairness constraints). Testing is the tester independently measuring whether the model is fair against a defined criterion, with evidence. “Verify the model has been de-biased” is not a test — it assumes the answer.
13 Interview Prep
Real questions asked in NZ QA interviews for AI-adjacent roles. Read the model answers, then practise your own version.
“How would you test an AI loan model for bias when we don’t collect ethnicity?”
Two ways. First, counterfactual pairs: I take applications that are identical on all the legitimate factors — income, deposit, serviceability — and vary only a proxy like name and suburb, then check whether the decision flips. If it does, a neutral-looking feature is acting as a stand-in for ethnicity. Second, proxy and demographic-parity analysis on the features we do hold: I’d look at whether outcomes differ across suburbs or other proxies in a way that tracks ethnicity. Not collecting the attribute means I can’t measure it head-on, but it doesn’t mean the model isn’t using it — so I test for the proxy.
“The data scientists say they’ve already de-biased the model. Is there anything left for you to test?”
Yes — de-biasing is mitigation, which is their job, and testing is independent verification, which is mine. “We de-biased it” is a claim, not evidence. I’d still run demographic parity and counterfactual fairness tests against an agreed metric and tolerance, on a representative population and matched pairs, and produce a measured pass/fail with the per-group results. If the tester and the model team both just trust the mitigation, no one is actually checking whether it worked.
“Which fairness metric should we use for our eligibility model?”
It depends on what a wrong decision costs, and it’s a decision we should make explicitly rather than default into. If the main harm is missing people who genuinely qualify, I’d argue for equal opportunity — equal true-positive rates across groups. If the main harm is wrongly flagging or declining someone, predictive parity matters more. If equal access to the outcome is the goal, demographic parity. They can’t all hold at once, so my job is to force that trade-off into the open, get it agreed and recorded, and then test against the chosen definition — and I’d still run counterfactual pairs alongside whichever group metric we pick.