Māori Data Sovereignty — Resync Bootcamp

1 The Hook

It is 9 am on a Monday and you are the senior QA engineer on the Ministry of Health EHR migration project. The integration test sprint starts today. The extract for your test environment has just arrived from the data engineering team: 80,000 de-identified patient records to support testing of the new inter-agency data sharing hub.

You open the field manifest. Among the standard fields — patient ID, date of birth, NHI number, clinical codes — you see two fields you did not fully account for in the test data strategy: iwi_affiliation and whanau_group_id.

You check the standard anonymisation script the team has been using all project. It strips both fields. Every record that had an iwi affiliation value comes through with a null. That is technically PII-safe — no individual can be re-identified from a null field. The Privacy Act check passes.

Then you check the integration test suite. The equity analysis reporting module — one of the primary requirements of this migration, mandated by the Health Strategy — has 23 test cases that verify iwi-level health outcome aggregation. With nulls in the iwi field, every single one of those tests fails. Not because the code is wrong. Because there is no data to aggregate.

You consider your options. You could restore the real iwi affiliation values, unstripped. But that means 80,000 real patients' cultural and ethnic identity data sitting in a shared test environment, which is a clear breach of Te Mana Raraunga principles and arguably of the Privacy Act. You could skip the equity reporting tests. But they exist because the Ministry signed a commitment to iwi-level health equity measurement; skipping them means shipping without evidence the feature works. You could mark the tests as blocked and delay the sprint. But the migration has a hard deadline tied to funding.

Neither “strip it and break the tests” nor “keep it real and breach sovereignty” is acceptable. This lesson is about finding the third path.

2 What Is Te Mana Raraunga?

Te Mana Raraunga is the Māori Data Sovereignty Network. Its founding principle is that data about Māori people, drawn from Māori, or held by Māori, should be subject to Māori governance. This is not a software standard or a regulation. It is a declaration of indigenous rights, grounded in the same whakapapa of authority — tino rangatiratanga — that underpins the Treaty of Waitangi.

Key principle for testers: Māori data is not just another PII field. It carries cultural, relational, and collective context that standard anonymisation destroys. You can hash an email address and lose nothing of value to the test. You cannot hash iwi affiliation and expect the equity reporting module to work.

How Te Mana Raraunga differs from the Privacy Act 2020

This is the distinction that trips up most QA engineers. Both are real obligations, but they protect different things:

Dimension	Privacy Act 2020	Te Mana Raraunga
Rights holder	The individual whose data it is	The collective — the iwi, hapū, or whānau the data describes
Core concern	Individual identification and re-identification risk	Cultural integrity, collective rights, and governance authority
Satisfied by hashing?	Generally yes, if irreversible	No. Hashing destroys analytical value without resolving governance questions
Practical implication for test data	De-identify personal identifiers before use in non-prod	Synthetic generation with governed distribution — not redaction, not shuffling

A Privacy Act de-identification pass does not fulfil Te Mana Raraunga obligations. You must address both, separately, with different techniques.

3 What Counts as Māori-Specific Data?

Māori-specific data does not always announce itself with an obvious field name. Here is how to categorise it when you receive a dataset:

Direct

Iwi affiliation — explicit membership identifier for an iwi. The single most common field of this type.
Hapū membership — sub-tribal group affiliation; more granular and often more sensitive than iwi.
Marae association — the marae a person identifies with; carries geographic and relational implications.
Whānau group identifier — a link to a family grouping within a system; present in health, social services, and education systems that use whānau-centred service models.
Self-identified ethnicity (Māori) — a standard ethnicity field where the selected value is “Māori”; governed under both the Privacy Act (sensitive personal information) and Te Mana Raraunga at aggregate level.

Derived

Geographic data that implies iwi territory — addresses or location codes within specific rohe (territorial areas) can be used to infer iwi affiliation, particularly in smaller communities. This matters for datasets that lack an explicit iwi field but include addresses.
Aggregated health data with ethnicity breakdown — health conditions with known higher prevalence in Māori populations, when grouped by ethnicity, become de facto Māori population data even if no individual record has an iwi field. The aggregate carries collective meaning.

Content

Te reo Māori text fields — names of marae, whānau collectives, place names (especially within custom input fields), and user-supplied text that contains te reo content.
Transliterated names — personal names of Māori origin stored without macrons (e.g. “Maori” in place of “Māori”) are still Māori-identifying content even when the encoding is broken.

Relational

Whānau structure data — multi-household family networks, extended kinship relationships, and shared guardianship arrangements are more interconnected in Māori social structures than typical nuclear-family models. Systems that represent whānau as a network graph — common in health and social services — carry collective identification risk even when individual fields are de-identified: the network structure itself can identify the whānau.

Tester action: When you receive a dataset for test environment use, check your field manifest against all four categories above. Do not assume the data engineering team has flagged all Māori-specific fields — derived and relational fields in particular are frequently missed in standard data classification reviews.

4 The Anonymisation Challenge

Standard anonymisation techniques, developed for western data privacy frameworks, fail in specific ways when applied to Māori-specific fields. Here is what goes wrong and why:

Why hashing fails

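SHA-256 maps “Ngāti Porou” to an opaque 64-character hex digest. The digest cannot be inverted mathematically, but that protection is weaker than it looks: the set of iwi names is small and public, so an unsalted digest can be matched back to its label by hashing every candidate name. Add a per-record salt to close that hole and equal values no longer produce equal digests, so no report can group or count by iwi. Either way, the dashboard cannot show the iwi labels that the equity tests assert on. You have destroyed the analytical structure the test depends on while leaving the governance question unresolved: the data still came from real patients, and the decision to use it was made without Māori governance involvement.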
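The hashing trade-off is easy to check for yourself. The sketch below uses Python's standard hashlib on three illustrative values (not drawn from any dataset); the helper name sha256_hex and the salting scheme are assumptions made for the demonstration. It shows that the outcome depends on whether the hash is salted: unsalted digests keep counts but lose labels and can be matched back against a list of known names, while salted digests lose the grouping altogether.

```python
import hashlib
from collections import Counter

def sha256_hex(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of salt + value, encoded as UTF-8."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Illustrative values only; not drawn from any real dataset.
values = ["Ngāti Porou", "Ngāi Tahu", "Ngāti Porou"]

# Unsalted: equal inputs give equal digests, so counts survive but labels do not.
unsalted_counts = Counter(sha256_hex(v) for v in values)

# Per-record salt: digests differ even for equal inputs, so grouping is lost.
salted_counts = Counter(sha256_hex(v, salt=str(i)) for i, v in enumerate(values))

# Dictionary attack: the candidate list is small, so every unsalted digest
# can be matched back to its label by hashing each known name once.
known_names = ["Ngāi Tahu", "Ngāti Porou", "Ngāpuhi", "Waikato-Tainui"]
lookup = {sha256_hex(name): name for name in known_names}
recovered = [lookup[sha256_hex(v)] for v in values]
```

Neither variant gives the equity report what it needs, which is a labelled, governed iwi value on each record.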

Why generalisation fails

Replacing all iwi values with a single value like “Māori” removes the granularity that iwi-level equity reporting requires. An EHR system that reports health outcomes at iwi level — as many NZ health programmes require — cannot be tested with generalised values. You lose the distinctions the test is designed to verify.

Why shuffling fails

Randomly reassigning iwi affiliation values between records produces impossible combinations. A patient with a Northland address assigned to Ngāi Tahu (a South Island iwi) creates geographically implausible records. Downstream tests that check geographic plausibility, or that use address + iwi combination for service routing logic, will produce false failures. You also still have the governance problem: the data still originated from real patients.

The right approach: synthetic generation with preserved distribution. Generate a population of synthetic records that preserves the statistical distribution of the real data but contains no real individuals. This satisfies the Privacy Act (no real people), addresses Te Mana Raraunga concerns at the individual level (no real Māori patients' data), and keeps your integration tests valid.

How to do it: four steps

Step 1 — Profile the real data

Before the real data leaves production or is used in any test environment, extract the statistical distribution only, not the records. Record the relative proportion of each iwi affiliation in the dataset (e.g., 23% Ngāti Porou, 18% Ngāi Tahu, 15% Ngāpuhi, 11% Waikato-Tainui, and so on for all groups). Record the distribution of other fields you need to preserve: age bands, benefit types, region codes, gender. This metadata leaves production. The individual records do not.
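As a minimal sketch of the profiling step, assuming the rows are available as dictionaries inside the production boundary: the function name profile_field and the sample rows are hypothetical, and the proportions are made up for the example.

```python
from collections import Counter

def profile_field(records, field):
    """Return {value: proportion} for one field. A missing or None value
    is kept as its own bucket so the 'not stated' share is preserved."""
    counts = Counter(r.get(field) for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}

# Hypothetical stand-in for production rows. Only the resulting profile
# would leave production, never the rows themselves.
rows = [
    {"iwi_affiliation": "Ngāti Porou", "age_band": "30-39"},
    {"iwi_affiliation": "Ngāti Porou", "age_band": "40-49"},
    {"iwi_affiliation": "Ngāi Tahu", "age_band": "30-39"},
    {"iwi_affiliation": None, "age_band": "20-29"},
]
iwi_profile = profile_field(rows, "iwi_affiliation")
age_profile = profile_field(rows, "age_band")
```

Run the same function once per field you need to preserve. Where two fields must stay consistent with each other (iwi and region, for example), profile the pair rather than each field alone.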

Step 2 — Generate a synthetic population

Using the distribution profile, generate synthetic records by drawing each attribute from the appropriate distribution. Draw attributes that must stay consistent with one another, such as iwi affiliation and region, conditionally rather than independently, so that locations are realistic for the assigned iwi territory. Use NZ name frequency tables (Statistics NZ publishes name data) for realistic person names. Generate valid-format NHI numbers with a correct check digit (the long-standing format is AAANNNC: three letters, three digits, one check digit). They must be structurally valid but must not match any real NHI in the production dataset. No real individual is represented in the output: the population is statistically representative, not a derived copy.
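The generation step can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes the legacy seven-character NHI layout (three letters excluding I and O, three digits, one modulus-11 check digit), so verify the algorithm against the current HISO standard before relying on it. The profile values, the iwi-to-region mapping, and all function names are hypothetical.

```python
import random

# Assumption: NHI letters exclude I and O; a letter's value is its
# 1-based position in this alphabet.
NHI_LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"

def nhi_check_digit(stem: str):
    """Mod-11 check digit for a 3-letter + 3-digit stem, or None when
    the stem has no valid check digit."""
    values = [NHI_LETTERS.index(c) + 1 for c in stem[:3]]
    values += [int(c) for c in stem[3:6]]
    total = sum(v * w for v, w in zip(values, range(7, 1, -1)))  # weights 7..2
    remainder = total % 11
    if remainder == 0:
        return None
    return (11 - remainder) % 10  # a result of 10 is written as 0

def synthetic_nhi(rng, exclude):
    """Generate a structurally valid NHI that is not in `exclude`."""
    while True:
        stem = "".join(rng.choice(NHI_LETTERS) for _ in range(3))
        stem += "".join(rng.choice("0123456789") for _ in range(3))
        check = nhi_check_digit(stem)
        if check is not None and f"{stem}{check}" not in exclude:
            return f"{stem}{check}"

def generate_records(n, iwi_profile, regions_by_iwi, rng, production_nhis=frozenset()):
    """Draw iwi from the profile, then region conditional on the iwi."""
    iwi_values = list(iwi_profile)
    weights = list(iwi_profile.values())
    used = set(production_nhis)  # never reuse a real or already issued NHI
    records = []
    for _ in range(n):
        iwi = rng.choices(iwi_values, weights=weights, k=1)[0]
        nhi = synthetic_nhi(rng, used)
        used.add(nhi)
        records.append({
            "nhi": nhi,
            "iwi_affiliation": iwi,
            "region": rng.choice(regions_by_iwi[iwi]),
        })
    return records

# Hypothetical profile and mapping, for illustration only.
example_profile = {"Ngāi Tahu": 0.6, "Ngāpuhi": 0.4}
example_regions = {"Ngāi Tahu": ["Canterbury", "Otago"], "Ngāpuhi": ["Northland"]}
synthetic = generate_records(200, example_profile, example_regions,
                             random.Random(42), production_nhis={"ZZZ0016"})
```

Passing a seeded random.Random keeps the dataset reproducible between runs, which helps when a failing test needs to be re-examined against the same records.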

Step 3 — Validate against integration test requirements

Before loading the synthetic dataset, verify it satisfies test coverage requirements. Count records per iwi group: does the largest iwi group have enough records to test aggregation logic? (As a rough heuristic, ≈50 records per major iwi group; ≈10–20 for minor groups.) Verify that every iwi group needed by the test suite is represented. If your tests require edge cases — an iwi with only 3 records, a whānau group spanning two regions — insert those as deliberate synthetic records rather than hoping they appear naturally.
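A coverage check along these lines can run before the dataset is loaded. The function name coverage_gaps and the minimums are hypothetical; take the real thresholds from your own test suite's requirements.

```python
from collections import Counter

def coverage_gaps(records, required_minimums, field="iwi_affiliation"):
    """Return {group: (actual, required)} for every group that has fewer
    records than the test suite needs. An empty dict means full coverage."""
    counts = Counter(r.get(field) for r in records)
    return {
        group: (counts.get(group, 0), minimum)
        for group, minimum in required_minimums.items()
        if counts.get(group, 0) < minimum
    }

# Illustrative dataset: 50 records for one group, 3 for another, none for a third.
dataset = (
    [{"iwi_affiliation": "Ngāi Tahu"}] * 50
    + [{"iwi_affiliation": "Ngāpuhi"}] * 3
)
minimums = {"Ngāi Tahu": 50, "Ngāpuhi": 10, "Waikato-Tainui": 10}
gaps = coverage_gaps(dataset, minimums)
```

Any group reported in the result needs deliberate synthetic records added before the load, rather than a larger random draw in the hope that the gap closes.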

Step 4 — Apply governance controls to the synthetic dataset

This is the step most teams skip and should not. Even though the synthetic dataset contains no real individuals, it is a representative population of Māori iwi data. Under Te Mana Raraunga, the collective is the rights holder, and a dataset that accurately reflects iwi distribution carries collective cultural significance. Apply the same access controls to the synthetic dataset as you would to production: named access, audit logging, NZ data residency, and a defined retention period with secure disposal at sprint end. Do not store it on a developer’s laptop or in a publicly accessible S3 bucket.

5 Concrete Test Scenario

System: MSD case management upgrade

Dataset size: 50,000 client records including iwi_affiliation, used for whānau-centred service delivery tracking

Test requirement: Verify the iwi-level service utilisation dashboard aggregates correctly across 20+ iwi groups

Sprint constraint: Test environment must be ready in 3 working days

What data you need

Enough synthetic records per iwi group to exercise the aggregation logic. For a dashboard with 20+ iwi groups:

Major iwi groups (representing >5% of real population): at least 50 synthetic records each, so the percentage calculation in the dashboard is meaningful
Minor iwi groups (representing <2%): at least 10–20 records each, enough to confirm they appear in the output at all
Edge case: at least 1 iwi group with exactly 1 record, to test that the dashboard handles a count of 1 without dividing-by-zero or suppression logic errors
Edge case: at least 1 client record with iwi_affiliation = NULL to test the “not stated” bucket in the dashboard

What to generate

Realistic NZ personal names drawn from name frequency tables (not copied from any real person in the dataset)
Valid NHI format: three uppercase letters + three digits + check digit (AAANNNC), verified structurally correct but confirmed not present in production NHI list
Geographic locations (suburb, region code) that are plausible for the assigned iwi territory — assign a Ngāi Tahu record an address in Canterbury or Otago, not Northland
Benefit type, duration, and service codes drawn from the real distribution of the same, not random values
Date of birth generating valid ages that match the real age-band distribution

What NOT to do

Do not

Use real NHI numbers from any source, including public records or reference datasets
Source personal names from iwi member rolls, published whānau databases, or social media profiles
Store the synthetic dataset in a shared dev environment with open access (Slack DMs, shared drives, public repos)
Keep the dataset after the sprint ends — it should have a defined expiry date and be securely destroyed
Share the distribution profile (Step 1 metadata) outside the immediate test team without governance approval

Do

Generate NHI numbers algorithmically and run a cross-check against a production NHI list to confirm zero overlap
Use Statistics NZ name frequency data or a purpose-built NZ synthetic name generator
Store the synthetic dataset in an environment with named, audited access and NZ data residency
Set a retention end date in the test data register and assign a named owner responsible for disposal
Document the governance decision: who approved this approach, and on what basis

Test assertion example

Test ID: IWIAGG-007
Description: Dashboard aggregation for Ngāi Tahu returns correct count
Pre-condition: Synthetic dataset loaded; Ngāi Tahu records seeded = 63
Steps:
1. Navigate to /dashboard/iwi-utilisation
2. Apply date filter: current synthetic data period
3. Locate the Ngāi Tahu row in the aggregation table
Expected result: Ngāi Tahu count displayed is between 60 and 66 (seeded count ±5%)
Rationale: ±5% tolerance allows for legitimate filtering logic (e.g. records
excluded due to incomplete data) without failing on implementation detail
Pass criteria: Displayed count is within tolerance AND the percentage column
is calculated against total active records, not total seeded records
Notes: If count is 0, the iwi_affiliation field is likely being null-coalesced
or the filter is case-sensitive (test both “Ngāi Tahu” and “Ngai Tahu”)
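The pass criteria in IWIAGG-007 can be expressed as an automated check. This is a sketch only: the function names are hypothetical, and it assumes the dashboard shows percentages rounded to one decimal place, which the test case does not state.

```python
def within_tolerance(displayed: int, seeded: int, tolerance: float = 0.05) -> bool:
    """True when the displayed count is within ±tolerance of the seeded count."""
    return abs(displayed - seeded) <= seeded * tolerance

def check_iwi_row(displayed_count, displayed_pct, seeded_count, total_active):
    """Return a list of failure messages; an empty list means the row passes."""
    failures = []
    if displayed_count == 0:
        failures.append("count is 0: check null-coalescing and macron/case handling")
    elif not within_tolerance(displayed_count, seeded_count):
        failures.append(f"count {displayed_count} outside ±5% of seeded {seeded_count}")
    # The percentage must use total ACTIVE records as its denominator.
    expected_pct = round(100 * displayed_count / total_active, 1)
    if abs(displayed_pct - expected_pct) > 0.05:
        failures.append(f"percentage {displayed_pct} != expected {expected_pct}")
    return failures

# 63 seeded, 62 displayed, 1,000 active records: the correct percentage is 6.2.
passing = check_iwi_row(62, 6.2, seeded_count=63, total_active=1000)
# Same count, but the percentage was computed against 1,050 seeded records.
wrong_denominator = check_iwi_row(62, 5.9, seeded_count=63, total_active=1000)
```

With a seeded count of 63, the tolerance admits displayed counts from 60 to 66, matching the expected result in the test case.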

Pro tip: Seed a small number of records with deliberate errors — an iwi affiliation value that exists in your reference table but is misspelled in the case management system, a record with a mismatched region/iwi combination — and verify the dashboard either handles or surfaces them correctly. This is the boundary testing that will reveal whether the aggregation logic is fragile.

6 Te Reo Māori Field Handling

Fields containing te reo Māori content require specific technical verification that is separate from the data sovereignty question. The encoding and search behaviour of these fields is where implementation bugs surface.

Unicode normalisation

Form fields that accept te reo Māori content — marae names, whānau collective names, iwi names in custom text fields — must be stored as NFC-normalised Unicode. NFC (Canonical Decomposition, followed by Canonical Composition) ensures that a character like ā is stored as a single code point (U+0101) rather than as the decomposed pair a + combining macron (U+0061 + U+0304). This matters for database sorting, equality comparisons, and string length calculations. A field that stores the same string in two different normalisation forms will produce false negatives on exact-match queries.

Test this explicitly: submit a form with an iwi name that includes macrons, retrieve the value via API, and assert that the stored form equals the NFC-normalised input — not just that the character displays correctly in the browser.
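A minimal sketch of that assertion using Python's standard unicodedata module. The stored value here is simulated; in a real test it would come from the API response, and the helper name is_nfc is an assumption.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True when the string is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text

# The same visible string in two normalisation forms.
composed = "Ng\u0101i Tahu"      # ā as a single code point, U+0101
decomposed = "Nga\u0304i Tahu"   # a followed by combining macron, U+0304

# They render identically but are not equal, and their lengths differ.
same_text = composed == decomposed                 # False
lengths = (len(composed), len(decomposed))         # (9, 10)

# What the test should assert about a value read back from storage.
stored_value = unicodedata.normalize("NFC", decomposed)  # simulated stored form
```

Submitting the decomposed form is the more revealing test input, because a system that never normalises will store it unchanged and fail the check.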

Macron rendering in legacy systems

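This is not a trivial problem for older NZ government systems built on COBOL or on Oracle databases configured with a Latin-1 or Windows-1252 character set. Neither character set contains the macronised vowels at all, so they are lost or substituted on the way in. Even in a UTF-8 database, each macronised vowel occupies 2 bytes, so a field defined as VARCHAR2(50 BYTE) cannot hold a 50-character string that contains macrons: depending on the application layer, the write is rejected or the value is truncated. In the worst case, where every character is a macronised vowel, the usable length is halved. Test for this specifically: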

Submit a field value at exactly the declared maximum length, using macron-containing te reo text
Verify the value is stored and retrieved without truncation
If the value is rejected or truncated, the fix is VARCHAR2(50 CHAR) rather than VARCHAR2(50 BYTE) in Oracle, or equivalent in the target database
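To build the boundary value for that test, it helps to compare character length with UTF-8 byte length. The helper names below are hypothetical, and len() counts code points, which is only an approximation of how a given database counts characters.

```python
def utf8_byte_length(value: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))

def fits_column(value: str, limit: int, semantics: str = "BYTE") -> bool:
    """Would the value fit a column of `limit` under BYTE or CHAR semantics?"""
    size = utf8_byte_length(value) if semantics == "BYTE" else len(value)
    return size <= limit

# A 50-character value containing eight macronised vowels (NFC form).
boundary_value = ("Ngāti " * 9)[:50]
char_count = len(boundary_value)                 # 50
byte_count = utf8_byte_length(boundary_value)    # 58

# Worst case: every character is a macronised vowel.
worst_case_bytes = utf8_byte_length("ā" * 50)    # 100
```

Use a value like boundary_value as the test input: it sits exactly at the declared character limit but exceeds the same limit when counted in bytes.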

Search and filtering with and without macrons

This requires a defined acceptance criterion before you write the test, because either behaviour can be correct depending on the decision made:

Search input: "Ngai Tahu" (no macron)
Stored value: "Ngāi Tahu" (with macron)

Option A — Strict Unicode match (returns no results):
  Acceptable if documented. User should see a helpful “did you mean Ngāi Tahu?” suggestion.
  A silent no-results response is a usability defect regardless of the policy.

Option B — Normalised search (returns Ngāi Tahu results):
  Requires explicit transliteration/normalisation in the search layer.
  Verify this is implemented deliberately, not accidentally (some search
  libraries normalise silently; confirm the behaviour is intended and tested).

Whatever the decision: write it as an acceptance criterion and test both
directions. A silent mismatch either way is a defect.
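Both options can be sketched in a few lines, which also makes the expected results concrete for the test. The function names are hypothetical. Note that the folding step removes every combining mark, not only macrons, and lowercases the text, so treat it as one possible normalisation policy rather than the required one.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Decompose, drop all combining marks, and case-fold for comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()

def search(query, stored_values, normalised):
    """Return (matches, suggestions).
    normalised=True  -> Option B: macron-insensitive matching.
    normalised=False -> Option A: strict match, with 'did you mean' suggestions
                        when nothing matches exactly."""
    key = fold_diacritics(query)
    folded_hits = [v for v in stored_values if fold_diacritics(v) == key]
    if normalised:
        return folded_hits, []
    strict_query = unicodedata.normalize("NFC", query)
    matches = [v for v in stored_values
               if unicodedata.normalize("NFC", v) == strict_query]
    return matches, ([] if matches else folded_hits)

stored = ["Ngāi Tahu", "Ngāti Porou"]
option_a = search("Ngai Tahu", stored, normalised=False)
option_b = search("Ngai Tahu", stored, normalised=True)
```

Under Option A the empty match list comes with a suggestion, so the user is never left with a silent no-results page. To test the other direction, search for the macronised form against a stored value that lacks the macron.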

7 Governance Checklist for QA Teams

Before your test sprint begins, work through this checklist with your test lead and, where applicable, the project’s Māori data governance contact. If any item cannot be answered, treat it as a sprint blocker, not a post-launch action.

Who approved the test data strategy for datasets containing Māori-specific fields? A named individual or governance body should be on record. “The data engineering team decided” is not governance approval.
Has a Māori data governance advisor or relevant iwi data governance contact been consulted on the approach? This is required under Te Mana Raraunga for datasets that have collective significance. Document the consultation outcome even if it is a brief sign-off.
Is the synthetic dataset stored with the same access controls as the production data? Named, audited access; NZ data residency; no open S3 buckets or shared-drive folders.
Does the test data retention policy specify an expiry date and a named owner responsible for secure disposal? “Keep it until we no longer need it” is not a retention policy for taonga.
Has the distribution profile (Step 1 metadata) been treated as sensitive? Even anonymised population statistics can carry sensitivity for small communities. Confirm it is stored with appropriate controls.
Does the test coverage for the equity reporting module include test cases that would detect a systematic bias or filtering error that disproportionately affects one iwi group? Functional tests that only check “a value appears” will not catch equity-level defects.
Is there a Māori stakeholder review planned for the equity reporting module before go-live? This review should look at whether the aggregation logic, the dashboard labels, and the data categories reflect the iwi-level reality they are meant to represent. QA can surface the need for this review; the decision to conduct it belongs to the project and agency leadership.
Are the test assertions for iwi-level aggregation checking the correct denominator? (Total active records in the period, not total records in the synthetic dataset.) This is a common implementation error that produces systematically wrong percentages.
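One way to approach the equity-level item in the checklist is to compare, per iwi group, how many seeded records actually reach the dashboard. The sketch below is a rough screening heuristic, not a statistical test: the function name, the median comparison, and the 10-point threshold are all assumptions to be replaced by whatever the project's governance contact agrees.

```python
import statistics

def retention_outliers(seeded, displayed, max_gap=0.10):
    """Return {group: retention_ratio} for groups whose displayed/seeded
    ratio falls more than `max_gap` below the median ratio of all groups."""
    ratios = {
        group: displayed.get(group, 0) / count
        for group, count in seeded.items()
        if count > 0
    }
    if not ratios:
        return {}
    median = statistics.median(ratios.values())
    return {g: r for g, r in ratios.items() if median - r > max_gap}

# Illustrative counts: one group loses far more records than the others.
seeded_counts = {"Ngāi Tahu": 100, "Ngāpuhi": 100, "Waikato-Tainui": 50}
displayed_counts = {"Ngāi Tahu": 98, "Ngāpuhi": 97, "Waikato-Tainui": 30}
flagged = retention_outliers(seeded_counts, displayed_counts)
```

A flagged group is a prompt for investigation, for example a filter that drops records with macronised names, rather than proof of a defect on its own.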

8 Prompt Lab

Use this prompt lab to generate a synthetic test data specification for a real project you are working on, or to get AI feedback on a governance approach you are designing. Edit the prompt before running — the more specific your scenario, the more useful the output.

▶ Prompt Lab — Synthetic Data Specification

▶ Prompt Lab — Governance Review

9 Self-Check

1. Why does hashing iwi affiliation values fail to solve the Māori Data Sovereignty problem, even if it satisfies the Privacy Act?

Hashing addresses individual re-identification (the Privacy Act concern) but destroys the analytical structure the test depends on: a report cannot show labelled iwi-level groups, aggregates, or counts if every value is an opaque hash. More fundamentally, it does not resolve the Te Mana Raraunga governance question: the data still originated from real Māori patients, and the decision to use it was made without Māori governance involvement. The Privacy Act protects the individual; Te Mana Raraunga protects the collective. Hashing addresses only the first.

2. What is the practical difference between Te Mana Raraunga and the Privacy Act 2020 from a test data perspective?

The Privacy Act 2020 is about protecting individual identification — you meet it by ensuring no individual can be re-identified from your test data. Te Mana Raraunga is about collective rights and cultural governance — Māori data is taonga and Māori have governance authority over how it is used, even in aggregate. A dataset of 80,000 de-identified patient records with iwi affiliation preserved passes a Privacy Act check (no individual is identifiable) but may not satisfy Te Mana Raraunga (the iwi distribution and collective data was used without Māori governance involvement). Both must be addressed, with different techniques.

3. Why must a synthetic dataset containing representative iwi distribution data be treated with the same access controls as production data?

Because Te Mana Raraunga is about collective rights, not only individual identification. A synthetic dataset that accurately reflects the iwi distribution of a real population carries collective cultural and political significance — it is a representative picture of a community, even if no individual record maps to a real person. Leaving it in an open S3 bucket or on a shared drive treats it as meaningless generated data, which violates the principle of kaitiakitanga (guardianship). The same access controls as production — named access, audit logging, NZ data residency, defined retention — reflect appropriate respect for the collective data as taonga.

4. A search for “Ngai Tahu” (no macron) returns no results when the database contains “Ngāi Tahu”. Is this a defect?

It depends entirely on the acceptance criterion. If the specification says search must support normalised matching (so the non-macron form finds the macron form), then yes, it is a defect. If strict Unicode matching was the agreed behaviour, then it is conformant — but must be surfaced to users with a helpful message rather than a silent empty result. A silent no-results response with no suggestion is a usability defect in either case. The tester’s job is to confirm the behaviour was agreed in advance and to raise a defect if no acceptance criterion exists for this case.

5. What is the minimum governance documentation required before using a synthetic dataset containing representative iwi distribution data in a test environment?

At minimum: (1) Named approval of the test data strategy by someone with authority over Māori data governance on the project — not just the data engineering team. (2) A record that a Māori data governance advisor or relevant iwi data governance contact was consulted and the outcome documented. (3) A test data register entry specifying the dataset, who has access (named individuals), where it is stored (with confirmation of NZ data residency), and the retention end date with named responsible owner. (4) Confirmation that access controls are equivalent to production. If any of these cannot be provided, that is a sprint blocker, not a post-launch action.

10 ISTQB Alignment

ISTQB CTAL-TM v1.0 — Test Data Management
This lesson maps directly to CTAL-TM Chapter 5: Test Data. The distinction between test data that is derived from production, synthetic, and anonymised data; the risks of using production data in test environments; and the governance controls required for sensitive datasets are all core CTAL-TM topics. The Māori Data Sovereignty dimension is the NZ-specific application of what CTAL-TM calls “data governance obligations” in a regulated industry context.

ISTQB CTAL-TM v1.0 — Section 5.3: Anonymisation and Synthetic Data
CTAL-TM explicitly covers the trade-offs between anonymisation approaches (masking, pseudonymisation, synthetic generation) and their fitness for test purposes. The requirement to preserve statistical distribution for integration testing while removing individual identification is a canonical CTAL-TM problem. The lesson above is a worked application of that framework to a NZ regulatory context.

Privacy Act 2020 (NZ) — Information Privacy Principles 2, 3, 5, and 10
IPP 2 (collection from source), IPP 3 (collection of information from individuals), IPP 5 (storage and security), and IPP 10 (limits on use) are the principles most directly engaged when using production-derived data in test environments. A senior QA engineer should be able to name these principles and articulate how a synthetic data approach satisfies them, particularly IPP 10 (the purpose for which data was collected does not include use in a test environment without consent).

NZ Government Algorithm Charter — Bias Identification Obligation
Agencies that have signed the Algorithm Charter commit to actively identifying and managing unintended bias. For a QA engineer on a government programme, this creates a formal obligation to include equity-level test cases — checking that iwi-level aggregation logic does not produce systematically biased outcomes — in the test plan. The test assertions described in Section 5 of this lesson are the implementation of that obligation at the test execution level.