
Test Data Management

Test data is where privacy law and testing craft collide. Get it wrong and you either break tests for want of realistic data, or you break the law by using real customer records in an environment that was never built to protect them.

Specialised · NZ Privacy Act 2020 · CTAL-TM v3 · ISTQB TDM · ~25 min read

1 The Hook

Your UAT environment has real customer data: 47,000 names, IRD numbers, and bank accounts. Refreshed every Friday. Is this normal? Yes. Is it legal under the NZ Privacy Act 2020? Almost certainly not.

This is the situation at dozens of NZ organisations right now. The database refresh is a Friday afternoon job: five lines of SQL, prod backup to UAT, done. Developers, contractors, offshore support staff, and occasionally demo visitors all have access. Nobody asked those 47,000 people for consent. Nobody classified the data. Nobody masked a column.

Under IPP 10 (use limitation) of the Privacy Act 2020, that customer data was collected for one purpose — running their accounts — and using it in a test environment is a different purpose the customers never agreed to. Under IPP 5 (storage security), a UAT environment with lighter access controls than production cannot provide the same security safeguards that production data requires. Both failures are notifiable-breach territory if the data is later disclosed.

Nobody is malicious. Nobody thought about it. That is exactly the problem test data management solves.

Analogy

Training a new bank teller with live customer accounts.

You would never sit a new teller in front of open live accounts to practise on — real names, real balances, real credit histories. You would give them a practice environment that behaves exactly like the real thing, but is populated with fictional people who cannot be harmed. The teller learns everything they need; no customer is exposed. Test data management builds those practice accounts so that testers never have to bring real ones into the back room.

2 The Four Test Data Problems

Teams suffer from four distinct test data failure modes. Each needs a different fix.

Problem 1: No Data

The test environment is empty. Testers spend the first half of every sprint creating accounts by hand before they can test anything. Sprint velocity collapses. Edge cases are never reached because building them is too time-consuming.

Fix: Idempotent seed scripts that populate a known baseline in one command. Every new environment starts with the same dataset, already in place.

Problem 2: Wrong Data

The environment has data, but it does not exercise the right scenarios. The team has hundreds of happy-path accounts and zero accounts in arrears, at a threshold boundary, or with a missing mandatory field. Real bugs hide in the gaps.

Fix: Purpose-built synthetic datasets that deliberately include boundary values, error states, and rare-but-important configurations. Design your data to your test plan, not the other way around.

Problem 3: Stale Data

The environment was seeded three months ago. Since then, the schema has changed, test runs have mutated records, and the data no longer matches any realistic production state. Tests fail for data reasons, not code reasons.

Fix: Idempotent seeds (re-running them resets state to baseline) plus version-controlled seed scripts that evolve with the schema. Treat test data like code: it lives in source control and has a migration history.

Problem 4: Unsafe Data

The environment contains real customer personal information. IRD numbers are real. Names are real. Bank accounts belong to living people. A laptop export, a misconfigured access control, or a careless email turns this into a notifiable privacy breach.

Fix: Never use real personal data in test environments. Use synthetic data generated from scratch, or irreversibly masked derivatives of production data, applied before the data touches the lower environment.

Pro tip: Most organisations think they have Problem 2 or 3. They actually have Problem 4 as well — they just have not checked. Before any test data initiative, audit what is currently in each non-production environment and confirm it contains no real personal information.

3 NZ Privacy Act 2020

Two Information Privacy Principles land directly on test data. You should be able to cite them by number in any meeting about a database refresh.

IPP | Name | What it means for test data
IPP 5 | Storage and security | Personal information must be protected by security safeguards that are reasonable given the sensitivity of the information. A UAT environment with weaker access controls, no audit logging, and contractor access does not meet this standard for real customer data.
IPP 10 | Use limitation | Personal information must not be used for a purpose other than the purpose for which it was collected, unless an exception applies. Customers supplied their data so that their accounts could be run, not so that it could populate a test environment for UAT.

Production data in a test environment is a notifiable breach risk. Under Part 6 of the Privacy Act 2020 (sections 114 and 115), an agency must notify the Office of the Privacy Commissioner and affected individuals of a privacy breach that has caused, or is likely to cause, serious harm. Exposure of IRD numbers, bank account details, or financial position to unauthorised parties in a test environment can meet that threshold. “It was only UAT” is not a defence.

The practical implication: the test data decision is a legal decision, not just an engineering one. When a developer proposes a Friday database refresh from production, the correct QA response is to name IPP 5 and IPP 10, ask whether a privacy impact assessment (PIA) covers this use case, and decline until a safe alternative is in place.

What counts as personal information?

Under the Privacy Act 2020, personal information is any information about an identifiable individual. In a typical NZ application this includes:

  • Name, address, email, phone number
  • IRD number, NHI number, passport number
  • Bank account numbers and financial position
  • Date of birth, gender, ethnicity
  • IP address and device identifiers (these can be personal information when they can be linked to an identifiable individual)
  • Support ticket text, chat transcripts, complaint notes — even when the name is removed, if the text contains enough context to identify the person

The last point is critical: free-text fields are where masking silently fails. You can mask every structured column perfectly and still leave a person identifiable because a notes field reads: “spoke to Aroha about her overdue account, ph 021 …”.
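A rough sketch of a free-text scrubber follows. The regular expressions and the `redact_free_text` helper are illustrative assumptions, not a complete solution: pattern matching catches phone numbers and emails, a list of known customer names catches some names, and anything contextual still needs manual review or a dedicated PII-detection tool.

```python
import re

# Illustrative patterns only; tune them to the formats in your own data.
PHONE = re.compile(r"\b0\d{1,2}[ -]?\d{3}[ -]?\d{3,4}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_free_text(text, known_names=()):
    """Replace emails, NZ-style phone numbers, and known names with placeholders."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    for name in known_names:
        # Whole-word, case-insensitive match on each known customer name.
        text = re.sub(rf"\b{re.escape(name)}\b", "[name]", text, flags=re.IGNORECASE)
    return text


note = "Spoke to Aroha about her overdue account, ph 021 555 0199"
print(redact_free_text(note, known_names=["Aroha"]))
# Spoke to [name] about her overdue account, ph [phone]
```

The name list here would come from the same customer table being masked, so every name that appears in a structured column is also scrubbed from the notes.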

4 Data Masking Techniques

When you must derive test data from production records (for realistic volume or distribution), masking replaces real values with realistic-but-fake ones. Three core techniques:

Technique | How it works | Use when | NZ example
Substitution | Replace each value with a different plausible value from a lookup dictionary (e.g., real first names replaced with other first names) | Name, address, email fields where format realism matters | Replace “James Smith” with a randomly drawn name like “Wiremu Tane” from a NZ name dictionary
Shuffling | Reassign values between rows. Customer A gets Customer B’s value; Customer B gets Customer C’s. Each value is still realistic; it just belongs to the wrong person. | Fields where value distribution must match production exactly (e.g., income ranges, postcode spread) | Shuffle postcodes or income bands across rows so the distribution is unchanged but no value sits with its original owner. Shuffling IRD numbers keeps every checksum valid, but each number is still a real person’s identifier, so prefer FPE or synthetic values for direct identifiers.
Format-preserving encryption (FPE) | Deterministically encrypt a value so the output has the same format and length as the input. The same input always produces the same output (referential consistency across tables). | Keys that appear in multiple tables (customer_id in customers, orders, and support_tickets) where foreign keys must still join | FPE an IRD number: 049-688-326 becomes 083-241-098 (illustrative values), still 9 digits in the same NNN-NNN-NNN layout. FPE does not preserve a checksum by itself, so encrypt the 8-digit base and recompute the check digit.
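The three techniques can be sketched in a few lines of Python. Everything here is illustrative: `FAKE_NAMES`, the function names, and the key are assumptions, and `pseudonymise_id` is a keyed hash rather than true format-preserving encryption. It is deterministic and keeps the digit format, but it is not reversible and collisions are possible, so a production pipeline would use a vetted FPE implementation instead.

```python
import hashlib
import hmac
import random

FAKE_NAMES = ["Wiremu Tane", "Mere Rangi", "Sione Tupou", "Anna Kowalski"]


def substitute(value, rng):
    """Substitution: discard the real value and draw one from a dictionary."""
    return rng.choice(FAKE_NAMES)


def shuffle_column(rows, column, rng):
    """Shuffling: keep the set of values but reassign them between rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def pseudonymise_id(value, key, digits=8):
    """Deterministic keyed mapping: same input and key give the same output."""
    mac = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
    return str(int(mac, 16) % 10**digits).zfill(digits)


rng = random.Random(7)
rows = [{"id": i, "postcode": p} for i, p in enumerate(["0110", "6011", "8011", "9010"])]
print(shuffle_column(rows, "postcode", rng))
print(substitute("James Smith", rng))
print(pseudonymise_id("C-1001", b"test-key"))
```

Because the keyed mapping is deterministic, the same customer_id maps to the same masked value in every table, which is what keeps foreign keys joinable after masking.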

NZ-specific masking rules

IRD Number Checksum (must be preserved)

IRD numbers use a weighted checksum algorithm. If you generate or mask IRD numbers by substituting random digits, the result fails the checksum and will be rejected by any system that validates them — breaking your tests. Your masking must either: (a) use FPE that maps valid IRD numbers to other valid IRD numbers, (b) generate synthetic IRD numbers using the algorithm, or (c) use a lookup table of pre-validated fictional IRD numbers.

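Primary weights (applied to the 8-digit base, left-padded with a zero when the base has 7 digits): 3, 2, 7, 6, 5, 4, 3, 2
Secondary weights (used only when the primary pass yields a check digit of 10): 7, 4, 3, 2, 5, 2, 7, 6
Sum the weighted digits and take the remainder mod 11. Remainder 0 → check digit = 0. Otherwise check digit = 11 − remainder. If that gives 10, repeat with the secondary weights; if it gives 10 again, that base has no valid IRD number.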
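The checksum can be sketched as a small validator, assuming the usual two-pass modulus-11 scheme: primary weights first, secondary weights only when the primary pass produces a check digit of 10. The function names are hypothetical, and the range limits (10-000-000 to 150-000-000) are an assumption taken from this lesson; confirm them against Inland Revenue's current specification.

```python
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)


def ird_check_digit(base_digits):
    """Return the check digit for an 8-digit base, or None if none exists."""
    for weights in (PRIMARY_WEIGHTS, SECONDARY_WEIGHTS):
        remainder = sum(d * w for d, w in zip(base_digits, weights)) % 11
        check = 0 if remainder == 0 else 11 - remainder
        if check != 10:
            return check
    return None  # both passes gave 10, so this base is unusable


def is_valid_ird(number):
    """Validate an IRD number given as an int or a string with optional hyphens."""
    digits = [int(c) for c in str(number) if c.isdigit()]
    if not 8 <= len(digits) <= 9:
        return False
    value = int("".join(map(str, digits)))
    if not 10_000_000 <= value <= 150_000_000:  # assumed range, see lead-in
        return False
    base = [0] * (9 - len(digits)) + digits[:-1]  # pad the base to 8 digits
    return ird_check_digit(base) == digits[-1]


print(is_valid_ird("49-091-850"))   # True
print(is_valid_ird("136-410-132"))  # True (needs the secondary weights)
print(is_valid_ird("136-410-133"))  # False (wrong check digit)
```

Running every seeded or masked IRD number through a validator like this before a test run catches bad test data before it shows up as a confusing validation failure.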

NZ Bank Account BECS Format

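NZ bank accounts follow the BECS format: BB-bbbb-AAAAAAA-SS (2-digit bank, 4-digit branch, 7-digit account, 2–3 digit suffix). Most assigned bank codes fall in the range 01–38 (for example 01 ANZ, 02 BNZ, 03 Westpac, 12 ASB, 38 Kiwibank). For test data, use bank code 00 with branch 0000 to produce accounts that have the right shape but cannot be mistaken for real ones: 00-0000-0123456-00. Confirm that 00 is unassigned in the current bank and branch register, and expect any system that validates bank or branch codes to reject it; that rejection is itself a useful test.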

NZ Phone Numbers

NZ mobile numbers begin with 02x. For test data, this lesson uses 021 000 0000–021 000 9999 as its test range; confirm with your telecommunications provider that the range is unallocated before relying on it. NZ landlines begin with a regional code (04 = Wellington, 09 = Auckland). Use 04 000 0000 or 09 000 0000 for test landlines.

Masking tools

Faker.js (JavaScript)

The @faker-js/faker package with en_NZ locale generates NZ names, addresses, and phone numbers. Combine with custom IRD/bank account generators for full NZ coverage. Ideal for Node.js test seeds and Playwright globalSetup.

Python Faker

The Faker package with en_NZ locale. Best for SQL seed scripts, data pipeline masking, and pytest fixtures. Write a custom provider for IRD numbers and BECS bank accounts.

Mockaroo

Web-based data generator with custom field types, referential integrity across tables, and CSV/JSON/SQL export. Good for one-off seed files. Free tier: 1,000 rows per download. Add a custom formula for IRD checksum.

SQL masking queries

For in-database masking of production clones, a parameterised UPDATE run before the clone leaves the production network ensures unmasked data never reaches the test environment. Run in transit, not after-the-fact.

Critical rule: Masking must happen before data lands in the non-production environment, never after. If you copy production first and mask second, there is a window — however brief — where real personal information existed in the weaker environment. That window is the breach.
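One way to honour that rule is to mask rows while they stream out of the production side, so the only file that ever reaches the lower environment is already masked. The sketch below assumes a CSV export and uses hypothetical column names and a hypothetical `mask_stream` helper; in-memory buffers stand in for the real export and import files.

```python
import csv
import io


def mask_stream(source, sink, maskers):
    """Copy CSV rows from source to sink, masking the listed columns on the way."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column, mask in maskers.items():
            row[column] = mask(row[column])
        writer.writerow(row)  # only masked rows are ever written


# Stand-ins for the production export and the file shipped to UAT.
prod_export = io.StringIO("id,full_name,email\n1,Real Person,real@example.com\n")
uat_import = io.StringIO()

mask_stream(prod_export, uat_import, {
    "full_name": lambda _: "Test User",
    "email": lambda _: "test.user@resync-test.nz",
})
print(uat_import.getvalue())
```

The design point is that the masking step runs inside the production boundary: the unmasked export never needs to be copied, stored, or opened anywhere else.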

5 Synthetic Data Generation

The safest test data was never anyone’s in the first place. Synthetic data is generated from rules, not copied from real records. There is no real person behind it to re-identify, so Privacy Act obligations do not arise at all.

When to use synthetic data

  • First choice in all cases: if you can satisfy the test with synthetic data, do so. It carries no privacy risk and gives you full control over the dataset.
  • New features and greenfield systems: no production data exists yet; all data must be synthetic.
  • Specific edge cases: scenarios that are rare in production (a customer at exactly the income threshold, a KiwiSaver member who has made no contributions for 7 years) are designed into synthetic data rather than hunted for in production.
  • CI/CD pipelines: automated test runs need a deterministic seed that resets on every run. Synthetic data scripts run fast, produce the same output every time, and do not depend on a production copy.
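Determinism for CI usually comes down to seeding the random source. A minimal sketch, assuming a hypothetical `build_members` helper and made-up field names: the same seed always yields the same dataset, so a failing test can be reproduced exactly.

```python
import random


def build_members(seed, count):
    """Build `count` synthetic members; identical seed gives identical output."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    statuses = ["ACTIVE", "SUSPENDED", "WITHDRAWN"]
    return [
        {
            "id": f"TM-{n:04d}",
            "full_name": f"Test User {n}",
            "status": rng.choice(statuses),
            "balance_nzd": rng.randrange(0, 500_001),
        }
        for n in range(1, count + 1)
    ]


first_run = build_members(seed=2020, count=50)
second_run = build_members(seed=2020, count=50)
print(first_run == second_run)  # True
```

Storing the seed value alongside the test results means any run can be regenerated later without keeping the data itself.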

NZ-specific generators

Standard Faker libraries do not include NZ-specific identifiers. Write your own generators for these:

Valid NZ IRD number generator (JavaScript)

// Generate a checksum-valid IRD number for test use only.
// Never use in any real system. Output is zero-padded to NNN-NNN-NNN.
function generateIRD() {
  const primary = [3, 2, 7, 6, 5, 4, 3, 2];
  const secondary = [7, 4, 3, 2, 5, 2, 7, 6];

  // Check digit for an 8-digit base under the given weights (may be 10).
  const checkDigit = (digits, weights) => {
    const sum = digits.reduce((acc, d, i) => acc + d * weights[i], 0);
    const remainder = sum % 11;
    return remainder === 0 ? 0 : 11 - remainder;
  };

  while (true) {
    // Base without check digit: 1,000,000 to 14,999,999, so the full number
    // stays inside the 10-000-000 to 150-000-000 range.
    const base = Math.floor(Math.random() * 14000000) + 1000000;
    const digits = String(base).padStart(8, '0').split('').map(Number);

    let check = checkDigit(digits, primary);
    if (check === 10) check = checkDigit(digits, secondary); // second pass
    if (check === 10) continue; // no valid check digit for this base; retry

    const n = digits.join('') + check;
    return `${n.slice(0, 3)}-${n.slice(3, 6)}-${n.slice(6)}`;
  }
}

// Usage:
// generateIRD() // => e.g., "049-688-326"

Python Faker with NZ providers

from faker import Faker
from faker.providers import BaseProvider

fake = Faker('en_NZ')


class NZProvider(BaseProvider):
    """NZ-specific synthetic data for test environments."""

    # Representative sample only, not the full NZ postcode list.
    NZ_POSTCODES = [
        '0110', '0112', '0600', '0614', '1010', '1011',
        '4410', '6011', '6012', '7910', '8011', '9010',
    ]
    PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
    SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)

    def nz_postcode(self):
        return self.random_element(self.NZ_POSTCODES)

    def nz_bank_account(self):
        """Returns a test bank account using bank code 00 (unassigned)."""
        account = self.random_int(min=1000000, max=9999999)
        return f"00-0000-{account}-00"

    def nz_mobile(self):
        """Returns a test mobile in the 021 000 xxxx test range."""
        return f"021 000 {self.random_int(min=0, max=9999):04d}"

    @staticmethod
    def _check_digit(digits, weights):
        """Check digit under the given weights; may be 10 (unusable)."""
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        return 0 if remainder == 0 else 11 - remainder

    def nz_ird(self):
        """Generate a checksum-valid IRD number for test use."""
        while True:
            # Base without check digit: 1,000,000 to 14,999,999, padded to 8 digits.
            base = self.random_int(min=1000000, max=14999999)
            digits = [int(c) for c in f"{base:08d}"]
            check = self._check_digit(digits, self.PRIMARY_WEIGHTS)
            if check == 10:
                check = self._check_digit(digits, self.SECONDARY_WEIGHTS)
            if check == 10:
                continue  # no valid check digit for this base; retry
            s = f"{base:08d}{check}"
            return f"{s[:3]}-{s[3:6]}-{s[6:]}"


fake.add_provider(NZProvider)

# Usage (values vary per run unless you call Faker.seed() first):
# fake.name()            => e.g. "Wiremu Tane"
# fake.nz_ird()          => e.g. "049-688-326"
# fake.nz_bank_account() => e.g. "00-0000-4812037-00"
# fake.nz_postcode()     => e.g. "6012"
# fake.nz_mobile()       => e.g. "021 000 4829"

6 Environment Seeding Strategies

A seed script populates a known, deterministic dataset into a test environment. Good seeds are idempotent: running them twice produces the same result as running them once. They are version-controlled alongside the application code and run as part of CI/CD setup.
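Idempotency is easy to claim and easy to test. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real test database (it assumes SQLite 3.24 or later for `ON CONFLICT ... DO UPDATE`); the table and rows are hypothetical. It seeds, deliberately mutates a row the way a test run would, seeds again, and checks that the baseline is back.

```python
import sqlite3

SEED_ROWS = [
    ("TM-0001", "Test Active", "ACTIVE"),
    ("TM-0002", "Test Suspended", "SUSPENDED"),
]


def seed(conn):
    """Create the table if needed and upsert the baseline rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members ("
        "id TEXT PRIMARY KEY, full_name TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO members (id, full_name, status) VALUES (?, ?, ?) "
        "ON CONFLICT (id) DO UPDATE SET "
        "full_name = excluded.full_name, status = excluded.status",
        SEED_ROWS,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed(conn)
conn.execute("UPDATE members SET status = 'WITHDRAWN' WHERE id = 'TM-0001'")
seed(conn)  # second run restores the baseline instead of failing or duplicating
print(conn.execute("SELECT id, status FROM members ORDER BY id").fetchall())
# [('TM-0001', 'ACTIVE'), ('TM-0002', 'SUSPENDED')]
```

A check like this can run in CI as a test of the seed script itself, so a seed that stops being idempotent is caught before it damages an environment.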

SQL seeds

The simplest approach for database-backed applications. Use INSERT OR REPLACE / UPSERT semantics so the script is idempotent:

-- Idempotent KiwiSaver account seed (5 account states), PostgreSQL syntax.
-- Run before any test suite. Safe to re-run: ON CONFLICT updates existing rows.
-- IRD numbers are fictional but pass the IRD checksum.
INSERT INTO members (id, full_name, ird_number, email, join_date)
VALUES
  ('TM-0001', 'Test Active',     '049-688-326', 'active@resync-test.nz',    '2018-04-01'),
  ('TM-0002', 'Test Suspended',  '083-241-098', 'suspended@resync-test.nz', '2019-07-15'),
  ('TM-0003', 'Test Withdrawn',  '127-534-063', 'withdrawn@resync-test.nz', '2015-01-01'),
  ('TM-0004', 'Test Threshold',  '114-892-041', 'threshold@resync-test.nz', '2020-04-01'),
  ('TM-0005', 'Test No Contrib', '098-017-658', 'nocontrib@resync-test.nz', '2016-04-01')
ON CONFLICT (id) DO UPDATE SET
  full_name  = EXCLUDED.full_name,
  ird_number = EXCLUDED.ird_number,
  email      = EXCLUDED.email,
  join_date  = EXCLUDED.join_date;

-- Requires a unique constraint on kiwisaver_accounts.member_id.
INSERT INTO kiwisaver_accounts (member_id, status, balance_nzd, last_contribution_date)
VALUES
  ('TM-0001', 'ACTIVE',    42500.00,  CURRENT_DATE - INTERVAL '14 days'),
  ('TM-0002', 'SUSPENDED', 8300.00,   '2022-11-30'),
  ('TM-0003', 'WITHDRAWN', 0.00,      '2021-06-30'),
  ('TM-0004', 'ACTIVE',    500000.00, CURRENT_DATE - INTERVAL '3 days'),  -- at provider threshold
  ('TM-0005', 'ACTIVE',    15000.00,  CURRENT_DATE - INTERVAL '7 years')  -- no contributions 7 yrs
ON CONFLICT (member_id) DO UPDATE SET
  status                 = EXCLUDED.status,
  balance_nzd            = EXCLUDED.balance_nzd,
  last_contribution_date = EXCLUDED.last_contribution_date;

Factory patterns

In JavaScript/TypeScript test suites, object factories build test entities with sensible defaults that can be overridden:

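// factories/member.ts: used in Jest/Vitest/Playwright tests
import { generateIRD } from '../utils/nz-generators';

// Shape assumed for this example; import your real Member type instead.
export interface Member {
  id: string;
  fullName: string;
  irdNumber: string;
  email: string;
  joinDate: Date;
  status: 'ACTIVE' | 'SUSPENDED' | 'WITHDRAWN';
}

let _seq = 1;

// Call in beforeEach so ids restart at TM-0001 for every test.
export function resetMemberSeq(): void {
  _seq = 1;
}

export function memberFactory(overrides: Partial<Member> = {}): Member {
  const id = _seq++;
  return {
    id: `TM-${String(id).padStart(4, '0')}`,
    fullName: `Test User ${id}`,
    irdNumber: generateIRD(),
    email: `test.user.${id}@resync-test.nz`,
    joinDate: new Date('2020-04-01'),
    status: 'ACTIVE',
    ...overrides, // caller can override any field
  };
}

// Usage in a test:
// const active = memberFactory();
// const suspended = memberFactory({ status: 'SUSPENDED' });
// const longStanding = memberFactory({ joinDate: new Date('2016-04-01') });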

API seeding

When there is no direct database access (microservices, third-party SaaS), seed data through the application’s own API. This validates the creation path as well as the test scenario:

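// api-seed.ts: seeds via REST API, useful in staging/UAT
// BASE_URL is an assumed environment variable; fetch in Node needs an absolute URL.
const BASE_URL = process.env.BASE_URL ?? 'http://localhost:3000';

async function seedMember(overrides = {}) {
  const payload = {
    fullName: 'Test Suspended User',
    irdNumber: '083-241-098',
    email: 'suspended@resync-test.nz',
    ...overrides,
  };
  const resp = await fetch(`${BASE_URL}/api/v1/members`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Test-Seed': 'true', // flag so prod blocks it
    },
    body: JSON.stringify(payload),
  });
  if (!resp.ok) {
    // Fail loudly so a broken seed never looks like a passing setup step.
    throw new Error(`Seeding failed: ${resp.status} ${resp.statusText}`);
  }
  return resp.json();
}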

Playwright globalSetup

For end-to-end test suites, run seeding once before all tests and teardown after:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './tests/global-setup.ts',
  globalTeardown: './tests/global-teardown.ts',
  // BASE_URL is an assumed environment variable for this example.
  use: { baseURL: process.env.BASE_URL ?? 'http://localhost:3000' },
});

// tests/global-setup.ts
import { chromium, type FullConfig } from '@playwright/test';
import { seedDatabase } from './helpers/seed';

export default async function globalSetup(config: FullConfig) {
  await seedDatabase(); // idempotent, safe to re-run

  // Optionally: log in once and save auth state.
  // globalSetup runs outside the test fixtures, so pass baseURL explicitly.
  const { baseURL } = config.projects[0].use;
  const browser = await chromium.launch();
  const page = await browser.newPage({ baseURL });
  await page.goto('/login');
  await page.fill('[name=email]', 'active@resync-test.nz');
  await page.fill('[name=password]', 'TestPass123!');
  await page.click('[type=submit]');
  await page.context().storageState({ path: 'auth/active-user.json' });
  await browser.close();
}

The idempotency rule: Every seed script must be safe to run twice. Use UPSERT in SQL, check-before-insert in API seeding, and counter resets in factories. A seed that is not idempotent either fails on the second run with duplicate-key errors or quietly adds duplicate rows that corrupt your CI environment.

7 Edge Case Test Data

Bugs cluster at boundaries. The following edge cases are endemic to NZ applications and are regularly missed in test datasets. Design them in explicitly.

Timezone transitions: NZST/NZDT

New Zealand observes daylight saving. The clocks go forward 1 hour at 2:00 am on the last Sunday of September (moving to NZDT, UTC+13) and back at 3:00 am on the first Sunday of April (returning to NZST, UTC+12). Dates and times exactly at these transitions expose off-by-one and missing-hour bugs:

Scenario | Test value | Expected behaviour
Spring forward (September) | 2025-09-28 02:30:00 local time, Pacific/Auckland (this wall-clock time does not exist) | System rejects, or maps to 03:30 NZDT without losing data
Fall back (April) | 2025-04-06 02:30:00 local time, Pacific/Auckland (this wall-clock time occurs twice) | System disambiguates correctly; no duplicate records created
Midnight crossing | Event timestamped 2025-09-27T23:59:59 NZST | Reported on correct calendar date (Saturday 27 September, not Sunday 28)
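Python's standard zoneinfo module can detect both problem cases, which makes it a handy oracle when building these test values. This is a sketch under assumptions: `classify_local` is a hypothetical helper, and the host needs IANA time zone data available (on Windows that usually means installing the tzdata package).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NZ = ZoneInfo("Pacific/Auckland")


def classify_local(naive):
    """Return 'missing', 'ambiguous', or 'normal' for a naive NZ wall-clock time."""
    first = naive.replace(tzinfo=NZ, fold=0)
    second = naive.replace(tzinfo=NZ, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "normal"  # only one possible interpretation
    # Round-trip through UTC: a wall-clock time that exists comes back unchanged.
    round_trip = first.astimezone(timezone.utc).astimezone(NZ)
    return "ambiguous" if round_trip.replace(tzinfo=None) == naive else "missing"


print(classify_local(datetime(2025, 9, 28, 2, 30)))  # missing
print(classify_local(datetime(2025, 4, 6, 2, 30)))   # ambiguous
print(classify_local(datetime(2025, 6, 1, 12, 0)))   # normal
```

Note that a timestamp carrying an explicit offset, such as 02:30+12:00, always names a real instant; only the offset-free wall-clock time can be missing or ambiguous.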

Financial year boundary (1 April)

The NZ financial year runs 1 April to 31 March. Any system that aggregates by financial year must handle the boundary correctly:

  • Transaction on 2025-03-31 → FY2025
  • Transaction on 2025-04-01 → FY2026
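  • Transaction at 2025-04-01T00:30:00 NZDT (which is 2025-03-31T11:30:00Z): is it FY2025 or FY2026? Local time says FY2026 and UTC says FY2025, so the answer depends on which clock the system uses for the cut-off.
  • Income tax returns and other tax-year calculations reset on 1 April. Check each rule's own year before assuming this boundary: the KiwiSaver government contribution year, for example, runs 1 July to 30 June.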
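The local-versus-UTC question can be made concrete with a short sketch. `nz_financial_year` is a hypothetical helper that labels the year by its ending year, as the bullets above do, and it assumes the business rule is "local NZ date decides"; if your rule is different, the conversion line is the part to change.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NZ = ZoneInfo("Pacific/Auckland")


def nz_financial_year(moment):
    """Financial year (named by its ending year) for a timezone-aware datetime."""
    local = moment.astimezone(NZ)  # decide on the NZ calendar date, not UTC
    return local.year + 1 if local.month >= 4 else local.year


# 11:30 UTC on 31 March is 00:30 on 1 April in New Zealand (NZDT, UTC+13).
instant = datetime(2025, 3, 31, 11, 30, tzinfo=timezone.utc)

print(nz_financial_year(instant))                                 # 2026
print(instant.year + 1 if instant.month >= 4 else instant.year)   # 2025 (UTC-date bug)
```

A pair of transactions either side of local midnight on 31 March, stored in UTC, is enough to reveal which behaviour a system actually has.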

Leap years

  • Birthday: 1992-02-29, for customers born on a leap day. Systems that calculate age or send birthday emails must handle this without crashing in non-leap years.
  • Financial calculations: a monthly billing cycle anchored on the 31st falls due on 2024-02-29 in a leap year but on 2023-02-28 in a non-leap year. Add both to your dataset.
  • The year 2000 is a leap year (divisible by 400); 1900 is not (divisible by 100 but not by 400). Any system that stores or calculates dates back to 1900 may have this bug lurking.
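A leap-day birthday forces a policy decision, and the sketch below makes one explicit. It assumes the birthday is observed on 28 February in non-leap years; observing it on 1 March is an equally defensible rule, so treat the choice (and the hypothetical function names) as assumptions to confirm with the business.

```python
import calendar
from datetime import date


def birthday_in_year(birth_date, year):
    """Date the birthday is observed in `year`; 29 Feb falls back to 28 Feb."""
    if birth_date.month == 2 and birth_date.day == 29 and not calendar.isleap(year):
        return date(year, 2, 28)
    return birth_date.replace(year=year)


def age_on(birth_date, today):
    """Completed years of age on `today`, using the same observance rule."""
    had_birthday = today >= birthday_in_year(birth_date, today.year)
    return today.year - birth_date.year - (0 if had_birthday else 1)


leapling = date(1992, 2, 29)
print(birthday_in_year(leapling, 2025))    # 2025-02-28
print(age_on(leapling, date(2025, 2, 27)))  # 32
print(age_on(leapling, date(2025, 2, 28)))  # 33
print(calendar.isleap(1900), calendar.isleap(2000))  # False True
```

The naive alternative, `birth_date.replace(year=2025)`, raises ValueError for a leap-day birthday, which is exactly the crash the test data is meant to expose.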

Unicode and macrons (ā ē ī ō ū)

Te reo Māori names containing macrons (tohutō) are common in NZ systems. They are a reliable source of encoding bugs, truncation failures, and sort-order issues:

// NZ name edge cases: add all of these to your name field test data
const NZ_NAME_EDGE_CASES = [
  // Macrons (tohutō)
  'Māia Ngāti',     // ā, the most common macron
  'Tūhoe Pōhatu',   // ū, ō
  'Hēmi Rīpeka',    // ē, ī
  'Ātaahua Mōana',  // Ā (capital macron)

  // Very long Māori names
  'Tūhoe-o-te-Rangi Tūwharetoa Ngātai', // compound with hyphens, 34 chars

  // Mixed-origin names
  'Li Wei Tūhoe',   // East Asian given name + Māori surname

  // Special chars in email derived from name:
  // māia.ngāti@example.com must NOT appear in the email field as-is
  // (email with macrons is technically valid but usually causes issues)

  // Apostrophe variants
  "O'Brien Tūhoe",  // ASCII apostrophe
  '‘Ihi Māui',      // Unicode left single quote (opening curly apostrophe)
];

// Also test the max-length boundary: NZ names can be very long.
// NZ name-registration law rejects names that are "unreasonably long" rather
// than fixing a statutory length; this lesson uses 100 chars as its boundary.
const LONG_MAORI_NAME = 'Tūhoe-o-te-Rangi-Tūwharetoa-Arikirangi-Te-Tūruki';
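Many macron bugs come from the same letter having two Unicode representations. The sketch below uses Python's standard unicodedata module to show the difference and two common remedies; the helper names are hypothetical, and ASCII folding is only appropriate for derived values such as an email local-part, never for the stored or displayed name.

```python
import unicodedata


def nfc(text):
    """Normalise to composed form so equal-looking names compare equal."""
    return unicodedata.normalize("NFC", text)


def ascii_fold(text):
    """Strip combining marks, e.g. to derive an ASCII-only email local-part."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "M\u0101ia"     # ā as a single code point
decomposed = "Ma\u0304ia"  # a followed by a combining macron

print(composed == decomposed)            # False
print(nfc(composed) == nfc(decomposed))  # True
print(len(composed), len(decomposed), len(composed.encode("utf-8")))  # 4 5 5
print(ascii_fold("Māia Ngāti"))          # Maia Ngati
```

The three lengths on the third line are why a column limit needs a stated unit: the same name is 4 characters composed, 5 decomposed, and 5 bytes in UTF-8. Include both forms in the test data and normalise on input.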

Special characters in email addresses

Email fields connected to name fields are a common source of encoding bugs. When a customer’s name contains a macron and the system auto-generates an email, the result may contain invalid characters:

// Email edge cases for NZ systems
const EMAIL_EDGE_CASES = [
  // Valid but unusual
  'test+tag@resync-test.nz',             // plus-addressing
  'test.user@subdomain.resync-test.nz',  // subdomain
  '"test user"@resync-test.nz',          // quoted local-part (RFC 5321 valid)
  'tēst@resync-test.nz',                 // macron in local-part (UTF-8 email, RFC 6531)

  // Common truncation/encoding failure sources
  'a'.repeat(64) + '@resync-test.nz',    // max local-part length (64 chars)
  'user@' + 'a'.repeat(63) + '.nz',      // max DNS label length (63 chars)

  // Invalid: system must reject
  'not-an-email',                        // no @
  'missing@',                            // no domain
  '@nodomain.nz',                        // no local-part
  'two@@signs.nz',                       // double @
  'user@' + 'a'.repeat(64) + '.nz',      // DNS label longer than 63 chars
];

NZ edge case reference table

Keep this table as your test data checklist for any NZ application:

Data type | Edge case value | Why it matters
IRD number | 000-000-000 | Zero value; must be rejected; also tests null-handling
IRD number | 049-688-326 | Valid checksum; structurally correct test IRD
NZ postcode | 0110 (Whangarei), 9999 (Invercargill) | Min/max of valid range; both extremes must resolve to a city
NZ postcode | 1000 | Unassigned; valid format but maps to no real location
Bank account | 02-0500-0890754-00 (BNZ) | Real bank/branch code; test that the system doesn’t validate against live routing tables
Bank account | 00-0000-0000001-00 | Test bank code 00; right shape, cannot be a real account
Phone | 021 000 0000 | Test mobile range; use for all test records
Phone | 0800 123 456 | Freephone number; edge case if system requires a mobile
Name | Māia Ngāti | Macrons; tests UTF-8 storage, display, and sort order
Name | Tūhoe-o-te-Rangi-Tūwharetoa-Arikirangi | Very long Māori name; tests column length limits (this lesson uses 100 chars as the boundary)
Name | O'Brien vs O’Brien | ASCII apostrophe (U+0027) vs Unicode curly quote (U+2019); different codepoints, same intent
Date | 1992-02-29 | Leap day birthday; must not crash on non-leap years
Date | 2025-04-01 | NZ financial year start; FY attribution boundary
DateTime | 2025-09-28 02:30:00 local time (Pacific/Auckland) | Non-existent wall-clock time during the NZDT spring-forward

8 Self-Check


Q1. A developer wants to copy the production database into UAT on Friday afternoon to get realistic data for next week’s testing. What do you say, and which IPPs do you cite?

No. Under IPP 10 (use limitation), that customer data was collected to run customer accounts, not to populate a test environment, which is a different purpose the customers never agreed to. Under IPP 5 (storage security), UAT with lighter access controls and contractor access cannot provide the required safeguards for real personal information. The correct path is synthetic data or irreversibly masked derivatives applied before the data touches UAT. Cite both IPPs by number; this is a legal question, not just an engineering preference.

Q2. Your masking script updates the first_name, last_name, email, and ird_number columns. Is this sufficient? What have you missed?

Almost certainly not. The common gaps are: (1) free-text fields such as notes, complaint_text, or support_history that contain names and phone numbers in unstructured text; (2) quasi-identifier combinations — date of birth + postcode + a rare attribute can re-identify a person even with name and IRD removed; (3) phone numbers, addresses, and device identifiers which are personal information and must also be masked; (4) audit/log tables that duplicate PII from the main tables. Mask every column that could identify a person, not just the obvious structured ones.

Q3. What makes a seed script idempotent, and why does it matter?

An idempotent seed script produces the same result regardless of how many times it is run. In SQL, this means using INSERT ... ON CONFLICT DO UPDATE (UPSERT) rather than plain INSERT. In API seeding, it means checking whether the record already exists before creating it. It matters because CI/CD pipelines often run setup more than once (failed jobs, re-runs, parallel environments), and a non-idempotent seed will either fail with duplicate-key errors or create duplicate records that break test assumptions.

Q4. You are generating synthetic IRD numbers with random 9-digit numbers. Tests are failing with “invalid IRD number” validation errors. What is wrong?

IRD numbers use a weighted checksum. Only about one random value in ten ends in the right check digit, and most random 9-digit numbers also fall outside the issued range, so the large majority fail validation. You must use the IRD checksum algorithm to generate structurally valid test IRD numbers, either by implementing it yourself (as shown in the lesson code examples) or by using a pre-validated list of fictional IRD numbers. Never substitute random digits directly for an IRD field in any NZ system that validates the format.

Q5. Name three NZ-specific edge cases that should be in every test dataset for a customer-facing application.

Any three from: macrons in name fields (ā ē ī ō ū, as in Māori names); NZST/NZDT transition datetimes (the non-existent 2:30 am during spring-forward, or the ambiguous 2:30 am during fall-back); 1 April financial year boundary (transaction FY attribution); 29 February birthdate (leap day); IRD number with valid checksum (random digits will not pass validation); NZ bank account in BECS format using test bank code 00; very long Māori compound name (this lesson tests up to 100 characters; NZ name-registration law rejects names that are “unreasonably long” rather than fixing a statutory length); apostrophe variants in names (ASCII vs Unicode curly quote).