Database Testing · Lesson 4

Test Data Management & Masking

Copying production into a test system is the easiest way to get realistic data, and the easiest way to cause a privacy breach. This lesson teaches you to get realistic test data without exposing a single real person.

Database Testing · Database & Backend · Lesson 4 of 4 · ~30 min read · ~70 min with exercises

1 The Hook

A fictional NZ lender, Kahu Finance, needed realistic data to test a new loan-arrears feature. The quickest path was obvious: copy the production database into the test environment. A developer ran the copy on a Friday, the test environment filled with real customers, real IRD numbers, real loan balances and arrears histories, and the team got on with testing.

The test environment did not have production’s controls. It was accessible to contractors, its access logs were thin, and a copy of it was later restored onto a developer’s laptop to debug an issue. Thousands of real customers’ financial details — including who was behind on their loans — were now sitting in several places that were never assessed to hold them. No system was hacked. The data was simply taken out of the protected environment it belonged in and scattered into ones that were not.

Months later, during an unrelated review, the spread of real customer data across non-production systems came to light. Under the Privacy Act 2020, using and holding personal information this way — well beyond what testing required, in environments with weaker safeguards — was a serious problem. The breach was not a clever attack. It was a copy command, run because real data was the easy way to get realistic data.

Here is the lesson hidden in that story. Production data is the most realistic test data there is, which is exactly why teams reach for it — and exactly why it is dangerous. The job of test data management is to give testers data that behaves like production without being production: synthetic where you can, masked where you must, and never a raw copy of real people into a system that was not built to protect them.

2 The Rule

Real personal data does not belong in a test environment. Test data must behave like production without being production — synthetic where you can build it, irreversibly masked where you must derive it from real records. Under the Privacy Act 2020, a raw copy of customers into a less-protected environment is a breach waiting to be found, not a shortcut.

3 The Analogy

Training a new bank teller with real customers’ open accounts.

You would never train a new teller by handing them the live accounts of real customers to practise on — letting them riffle through real balances, real arrears, real names, in a back room with the door open. You would give them realistic practice accounts: ones that look and behave like the real thing, with the same kinds of balances and edge cases, but that belong to nobody. The teller learns everything they need without ever touching a real person’s money or secrets.

Test data is the same. Kahu Finance trained its test environment on real open accounts. The right move was practice accounts — synthetic or masked data that exercises the system exactly as production would, while belonging to no actual customer. A test data manager builds the practice accounts so nobody ever has to bring the real ones into the back room.

4 The Privacy Act 2020 and Privacy by Design

The Privacy Act 2020 governs how NZ organisations collect, hold, use, and disclose personal information. Two areas of its information privacy principles land directly on test data, and a tester should be able to name them.

Use and disclosure limits (principles 10 and 11). Personal information collected for one purpose, such as running a customer’s loan, may not simply be used for another. Copying it into a test environment is a use, and one the customer never agreed to.

Storage and security (principle 5). An organisation must protect personal information with safeguards reasonable for its sensitivity. Moving it into an environment with weaker controls, such as the Kahu test system or a developer’s laptop, fails that duty directly.

Privacy by Design is the principle that privacy protection is built into a system and its processes from the start, not added after a breach. For test data, Privacy by Design means the default path to test data is a safe one: synthetic generation and masking are the standard, copying raw production is blocked by design, and a tester does not have to remember to be careful because the easy path is already the safe path. The Kahu breach is what the absence of Privacy by Design looks like — the easy path was the unsafe one, so the unsafe thing happened.

For a tester this is not abstract compliance. A serious privacy breach can trigger mandatory notification to the Office of the Privacy Commissioner and to affected people, reputational damage, and regulator action. Knowing that real personal data in a test environment is itself a breach — before anything is “leaked” further — is the foundation everything else in this lesson rests on.

5 Masking, Anonymisation and De-identification

When you genuinely need data derived from production — for volume, for realistic distributions, for a tricky edge case — you mask it so it no longer identifies anyone. The terms matter and are often confused.

  • Masking is replacing real values with realistic-but-fake ones — a real name becomes a different plausible name, an IRD number becomes a different valid-format number. The data still looks and behaves right; it just no longer belongs to the real person.
  • De-identification is removing or altering the information that ties a record to a specific individual, so the record can no longer reasonably be linked back to them.
  • Anonymisation is the strong end: the data is altered so the individual cannot be re-identified by anyone, by any means reasonably available. True anonymisation, properly done, takes the data outside the Privacy Act’s reach because it is no longer personal information.

The critical property is irreversibility. Masking that can be undone — a reversible scramble, a lookup table that maps fake values back to real ones — has not actually protected anyone; it has just added a lock that someone holds the key to. And masking must be consistent and referentially safe: if customer 48213 appears in three tables, the same masked identity must apply across all three, or the joins between them break and the data becomes useless for testing.
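The two properties above, irreversibility and consistency, can be sketched together. The following Python is a minimal illustration rather than a production masking tool. It assumes a keyed hash (HMAC-SHA256) with a random key that is created for one refresh and discarded afterwards, plus a tiny hypothetical pool of fake names. The function names, the pool, and the table shapes are all invented for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical pool of fake names; a real pipeline would use a far larger one.
FAKE_NAMES = ["Aroha Test", "Ben Sample", "Mere Placeholder", "Sam Dummy"]


def new_refresh_key() -> bytes:
    """Random key for a single refresh (assumption: it is never stored)."""
    return secrets.token_bytes(32)


def masked_name(customer_id: int, key: bytes) -> str:
    """Map a customer ID to the same fake name wherever it appears."""
    digest = hmac.new(key, str(customer_id).encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


key = new_refresh_key()
customers = [{"customer_id": 48213, "name": "<real name>"}]
loans = [{"loan_id": 1, "customer_id": 48213, "borrower_name": "<real name>"}]

masked_customers = [
    {**c, "name": masked_name(c["customer_id"], key)} for c in customers
]
masked_loans = [
    {**row, "borrower_name": masked_name(row["customer_id"], key)} for row in loans
]

# Same customer, same masked identity in both tables, so joins still line up.
assert masked_customers[0]["name"] == masked_loans[0]["borrower_name"]

del key  # neither the key nor any fake-to-real table survives the refresh
```

Many customers share each fake name and no mapping is retained, so there is nothing to look up afterwards. If the key were kept, the same mapping could be recomputed from a real customer ID, which is why discarding it is part of the design.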

The most dangerous trap is re-identification through combination. You can mask the name and still leave a person identifiable. A record with date of birth, full postcode, ethnicity, and a rare medical condition can pin down one individual in a small NZ town even with the name removed. De-identification means looking at the combination of quasi-identifiers, not just the obvious name and IRD number fields. A tester reviewing masked data asks: could I pick a real person out of this by combining what is left?
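One way to put the tester's question into practice is a k-anonymity style count: group the rows by their quasi-identifiers and flag any combination shared by fewer than k rows. The sketch below assumes hypothetical column names and an arbitrary threshold; both would need to be agreed for a real data set.

```python
from collections import Counter

# Hypothetical quasi-identifier columns for illustration only.
QUASI_IDENTIFIERS = ("date_of_birth", "postcode", "condition")


def risky_combinations(rows, columns=QUASI_IDENTIFIERS, k=5):
    """Return each quasi-identifier combination shared by fewer than k rows."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


rows = [
    {"date_of_birth": "1980-01-01", "postcode": "6011", "condition": "asthma"},
    {"date_of_birth": "1980-01-01", "postcode": "6011", "condition": "asthma"},
    {"date_of_birth": "1975-06-30", "postcode": "9810", "condition": "rare-condition-x"},
]

print(risky_combinations(rows, k=2))
# → {('1975-06-30', '9810', 'rare-condition-x'): 1}
```

A combination that appears only once points at exactly one record, which is the situation the paragraph above warns about. Typical remedies are to generalise the values (year of birth instead of full date, region instead of postcode) or to drop the rare attribute.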

Pro tip: Free-text fields are where masking silently fails. You can mask the structured name and ird_number columns perfectly and still leak everything because a notes or complaint_text field contains “spoke to Materangi about her ACC claim, ph 021…”. Always check the unstructured columns, not just the obvious ones.
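A simple pattern scan can catch some of what hides in free text. The sketch below is a rough first pass under stated assumptions: the regular expressions are deliberately loose guesses at NZ mobile, IRD-like, and email formats, and the function name is invented for the example. Patterns cannot find names, so a real check would also need a comparison against known customer names, run where that list is allowed to live.

```python
import re

# Assumed, deliberately loose patterns; tune them to the formats in your data.
PATTERNS = {
    "nz_mobile": re.compile(r"\b02\d[\s-]?\d{3}[\s-]?\d{3,4}\b"),
    "ird_like": re.compile(r"\b\d{2,3}[\s-]?\d{3}[\s-]?\d{3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_free_text(text: str) -> list:
    """Return (label, matched_text) pairs for anything that looks like PII."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


# Fictional note text written for this example.
note = "rang customer on 021 555 0123 re IRD 123-456-789"
for label, value in scan_free_text(note):
    print(label, value)
```

Any hit in a notes or complaint_text column after masking is a reason to fail the refresh and inspect the column, not proof that the rest of the column is clean.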

6 Synthetic Data Design

The safest test data is data that was never anyone’s in the first place. Synthetic data is generated from rules and distributions rather than copied from real records, so it carries no privacy risk at all — there is no real person behind it to re-identify. Privacy by Design points here first: prefer synthetic, fall back to masking only when synthetic cannot meet a need.

But synthetic data is only useful if it is well designed. Random rubbish passes no realistic test. Good synthetic data is engineered to:

  • Be structurally valid: IRD numbers that pass the checksum, NHI numbers in the right format, dates that are plausible, foreign keys that actually link. Data the system will accept and process.
  • Cover the cases that matter: not just the happy path but the edge cases — a loan in arrears, a closed account, a customer with no email, a joint account, the boundary values. Synthetic data lets you design in the rare case you could wait months to see in real data.
  • Reflect realistic distributions: if 30% of customers are in one region, the synthetic set should roughly mirror that, so volume and performance behaviour looks like production.
  • Preserve referential integrity: every generated child row points at a generated parent that exists — the integrity rules from Lesson 2 apply to the data you build, too.
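Structural validity for IRD numbers can be illustrated with a check-digit routine. The sketch below follows the modulus-11 scheme with two weight sets as commonly described for Inland Revenue's check-digit validation. Treat the weights as an assumption to confirm against Inland Revenue's published specification before relying on them. The valid-range rule is not included, and the base range used by the generator is a guess. All function names are invented for the example.

```python
import random

# Assumed weights; confirm against Inland Revenue's published specification.
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)


def _weighted_digit(base: str, weights) -> int:
    """Modulus-11 result for an 8-digit base: 0 to 9, or 10 if unusable."""
    remainder = sum(int(d) * w for d, w in zip(base, weights)) % 11
    return 0 if remainder == 0 else 11 - remainder


def ird_check_digit(base_number: int):
    """Return the check digit for a 7 or 8 digit base, or None if none exists."""
    base = str(base_number).zfill(8)
    if len(base) != 8:
        return None
    digit = _weighted_digit(base, PRIMARY_WEIGHTS)
    if digit == 10:  # fall back to the secondary weights
        digit = _weighted_digit(base, SECONDARY_WEIGHTS)
    return None if digit == 10 else digit


def is_valid_format_ird(number: int) -> bool:
    """True when the last digit matches the computed check digit."""
    base, check = divmod(number, 10)
    return ird_check_digit(base) == check


def synthetic_ird(rng: random.Random) -> int:
    """Generate a number that passes the checksum (base range is assumed)."""
    while True:
        base = rng.randrange(1_000_000, 15_000_000)
        digit = ird_check_digit(base)
        if digit is not None:
            return base * 10 + digit


print(is_valid_format_ird(synthetic_ird(random.Random(7))))  # → True
```

A generated number that passes the checksum could, by chance, equal a number that has really been issued, so the overlap check against production in the verification test case still applies to synthetic sets.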

The quiet advantage of synthetic data is control. With production data you are stuck with whatever cases happen to exist; with synthetic data you build exactly the scenarios your tests need, including the dangerous edge cases that are rare and important — the MSD client with overlapping benefits, the KiwiSaver member at the contribution cap. You design the data to your test plan instead of hunting for it.
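The design goals above can be sketched as a small generator. This is an illustration under assumptions, not a recommended tool: the region shares, table shapes, and field names are invented, and the edge cases are simply written onto the first few rows so they are guaranteed to exist.

```python
import random

# Assumed regional shares, for illustration only.
REGION_WEIGHTS = {"Auckland": 0.3, "Wellington": 0.2, "Other": 0.5}


def build_synthetic_set(n_customers: int, seed: int = 42):
    """Build customers and loans with linked keys and designed-in edge cases."""
    if n_customers < 3:
        raise ValueError("need at least 3 customers to place the edge cases")
    rng = random.Random(seed)  # seeded so every run builds the same set
    regions = rng.choices(
        list(REGION_WEIGHTS), weights=list(REGION_WEIGHTS.values()), k=n_customers
    )
    customers = [
        {"customer_id": i, "region": regions[i - 1],
         "email": f"customer{i}@example.test"}
        for i in range(1, n_customers + 1)
    ]
    # Every loan points at a customer that was generated above.
    loans = [
        {"loan_id": i, "customer_id": c["customer_id"],
         "status": "current", "days_in_arrears": 0}
        for i, c in enumerate(customers, start=1)
    ]
    # Designed-in edge cases rather than cases we hope to find.
    loans[0].update(status="arrears", days_in_arrears=90)
    loans[1].update(status="closed")
    customers[2]["email"] = None
    return customers, loans


customers, loans = build_synthetic_set(100)
customer_ids = {c["customer_id"] for c in customers}
assert all(loan["customer_id"] in customer_ids for loan in loans)
```

Seeding the generator keeps the set repeatable, so a failing test can be reproduced against exactly the same data after the next rebuild.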

7 Refreshing Non-Production Data Safely

Test environments go stale — the data drifts from production’s shape, new code needs new structures, and teams want a refresh. The refresh is exactly the moment the Kahu breach happened, so it needs a safe, repeatable process rather than an ad-hoc copy command.

A safe refresh has these properties:

  • Masking happens before the data lands in non-prod, never after. If raw production data is copied in and masked later, there is a window — however short — where real data sat in the weaker environment. That window is itself the breach. Mask in transit, so unmasked data never touches non-prod.
  • It is automated and repeatable. A scripted pipeline applies the same masking rules every time, so a refresh cannot “forget” to mask a column. A manual copy relies on someone remembering — and the day they forget is the day real data leaks.
  • New columns fail safe. When production gains a new column, the refresh must not pass it through unmasked by default. The safe design is deny-by-default: any column not explicitly approved as non-personal is masked or excluded, so a newly added passport_number is not quietly copied in clear.
  • It is verified. After a refresh, a tester checks that the masking actually worked — no real IRD numbers, no real names, no re-identifiable combinations, including in free-text fields. A refresh is not done until someone has confirmed no real person is in it.
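The first three properties (mask in transit, automated, deny-by-default) can be sketched as a single streaming step that sits between the production read and the non-prod write. The column names and the constant placeholder masks below are hypothetical. A real pipeline would plug in consistent, irreversible masking functions in their place.

```python
from typing import Callable, Dict, Iterable, Iterator

# Columns explicitly approved as non-personal (hypothetical names).
PASS_THROUGH = {"loan_id", "status", "balance_band"}

# Columns with an approved masking rule; constants are placeholders only.
MASKERS: Dict[str, Callable[[object], object]] = {
    "full_name": lambda _value: "Masked Customer",
    "ird_number": lambda _value: "000-000-000",
}


def mask_in_transit(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield masked rows one at a time; raw rows are never written anywhere."""
    for row in rows:
        safe = {}
        for column, value in row.items():
            if column in PASS_THROUGH:
                safe[column] = value
            elif column in MASKERS:
                safe[column] = MASKERS[column](value)
            # Deny by default: an unknown column is dropped, not copied.
        yield safe


production_rows = [
    {"loan_id": 1, "status": "arrears", "full_name": "<real name>",
     "ird_number": "<real number>", "passport_number": "<new column>"},
]
for masked in mask_in_transit(production_rows):
    print(masked)
# → {'loan_id': 1, 'status': 'arrears', 'full_name': 'Masked Customer', 'ird_number': '000-000-000'}
```

The newly added passport_number column never reaches the output because nobody has classified it yet. Someone has to make an explicit decision before it can appear in non-prod, masked or otherwise.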

The role of the tester here is twofold: confirm the refreshed data is fit for testing (realistic, complete, referentially intact) and confirm it is safe (no real personal data survived the mask). Both must be true. Fit-but-unsafe is the Kahu breach; safe-but-useless wastes everyone’s time. Good test data management delivers both, by design.

8 Building Test-Data Test Cases

A test-data test case asserts that the data in a non-production environment is both safe and useful, and it is run as a gate after every refresh.

Here is a verification test case written to catch the exact Kahu failure — real data surviving into non-prod:

Test ID: TDM-MASK-005
Asserts: No real personal data survives in the test environment.
Risk category: Privacy — real PII in a less-protected environment.
Checks:
  1) No test IRD number matches any real production IRD number.
  2) No test customer name matches a real production name on the same record.
  3) Free-text fields (notes, complaints) contain no real names, phone numbers, or IRD numbers.
  4) Masking is consistent: customer 48213 has the same masked identity across all tables (joins still work).
  5) No re-identifiable combination (DOB + postcode + rare attribute).
Pass criterion: Checks 1, 2, 3 and 5 return zero real-data matches, and check 4 finds no inconsistent identities. Any failure fails the refresh.
Evidence required: Query results for each check; masking pipeline run ID; sign-off.
Traceability: Risk R-01 (real PII in non-prod, Privacy Act 2020 breach).
Result: [Pass / Fail]

Notice the shape: it tests both safety (checks 1, 2, 3, 5 prove no real person survived, including the free-text and re-identification traps) and usefulness (check 4 proves masking stayed consistent so the data still works for testing). The pass criterion is zero real-data matches — the exact gate Kahu Finance never ran before letting contractors near the environment.
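Two of the checks in TDM-MASK-005 can be sketched as automated queries. The example below uses Python's built-in sqlite3 module with an in-memory database and covers only check 1 and check 4. The table and column names are hypothetical, and it assumes the list of production IRD numbers is available to the gate somewhere it is permitted to be held, for example by running the comparison on the protected side.

```python
import sqlite3


def run_gate(conn: sqlite3.Connection) -> dict:
    """Return violation counts for check 1 and check 4."""
    ird_overlap = conn.execute(
        "SELECT COUNT(*) FROM test_customers t "
        "JOIN prod_ird_numbers p ON p.ird_number = t.ird_number"
    ).fetchone()[0]
    inconsistent = conn.execute(
        "SELECT COUNT(*) FROM test_loans l "
        "JOIN test_customers c ON c.customer_id = l.customer_id "
        "WHERE l.borrower_name <> c.full_name"
    ).fetchone()[0]
    return {
        "check_1_ird_overlap": ird_overlap,
        "check_4_inconsistent_identity": inconsistent,
    }


def gate_passes(results: dict) -> bool:
    """The refresh passes only when every check reports zero violations."""
    return all(count == 0 for count in results.values())


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prod_ird_numbers (ird_number TEXT PRIMARY KEY);
CREATE TABLE test_customers (
    customer_id INTEGER PRIMARY KEY, full_name TEXT, ird_number TEXT);
CREATE TABLE test_loans (
    loan_id INTEGER PRIMARY KEY, customer_id INTEGER, borrower_name TEXT);
INSERT INTO prod_ird_numbers VALUES ('111-111-111');
INSERT INTO test_customers VALUES (48213, 'Masked A', '999-999-999');
INSERT INTO test_loans VALUES (1, 48213, 'Masked A');
""")

results = run_gate(conn)
print(results, gate_passes(results))
# → {'check_1_ird_overlap': 0, 'check_4_inconsistent_identity': 0} True
```

Saving the returned counts alongside the masking pipeline run ID gives the query-result evidence the test case asks for.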

9 Common Mistakes

🚫 Copying raw production data into a test environment

Why it happens: Production is the most realistic data and a copy is one command away.
The fix: Real personal data in a less-protected environment is itself a Privacy Act 2020 breach — the Kahu trap — before anything is leaked further. Use synthetic data, or mask in transit so unmasked production never lands in non-prod. The easy path must be the safe one.

🚫 Masking only the obvious name and ID fields

Why it happens: Name and IRD number are the fields everyone thinks of as personal.
The fix: People are re-identifiable through combinations — date of birth plus postcode plus a rare attribute can pin down one person with the name removed. And free-text fields leak real names and numbers no structured mask touches. Mask the combination and the unstructured columns, not just the obvious ones.

🚫 Using masking that can be reversed

Why it happens: A reversible scramble or a fake-to-real lookup table feels convenient.
The fix: Masking that can be undone has not protected anyone — it just hands the key to whoever holds the mapping. Real protection is irreversible: no path exists from the masked value back to the real person. A reversible mask in a weak environment is a breach with extra steps.

🚫 Masking after the copy instead of in transit

Why it happens: It is simpler to copy everything in, then run a masking job.
The fix: If you mask after copying, there is a window where raw real data sat in the weaker environment — and that window is the breach. Mask in transit so unmasked data never touches non-prod, automate it so a refresh cannot forget, and deny-by-default so new columns are not passed through in clear.

10 Now You Try

Three graded exercises: spot the privacy risks, fix the unsafe refresh, then build a synthetic test data set. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Privacy Risks

Read the test-data setup for a fictional ACC claims system below. Identify 3 privacy risks under the Privacy Act 2020, and name the issue type of each (raw copy, re-identification by combination, free-text leak, reversible mask).

Test-data setup
To test a new claims dashboard, the team copied production into the UAT environment, then ran a job that replaced the full_name and ird_number columns with fake values via a lookup table that maps each fake value back to the real one (kept “in case we need to trace an issue”). The claim_notes free-text field was left untouched. The date_of_birth, suburb, and injury_type columns were also left as-is, “because they aren’t names”. UAT is accessible to an offshore support vendor.

Model answer
There are at least four real risks; any three, well explained, earn full marks.

1. Raw copy then mask-after / reversible mask — production was copied into UAT first and masked afterwards, so raw real data sat in the weaker environment (the breach window), AND the lookup table maps fake values back to real ones, so the mask is reversible and protects nobody. Issue type: raw copy + reversible mask. Why it breaches: real PII reached a less-protected environment accessible to an offshore vendor, and the "masked" data can be turned back into real people by anyone holding the table.

2. Free-text leak — claim_notes was left untouched. Free-text routinely contains real names, phone numbers, and details ("rang Hemi on 021..., ACC claim for back injury"). Issue type: free-text leak. Why it breaches: masking the structured name/IRD columns is pointless while the same information sits in clear in the notes field.

3. Re-identification by combination — date_of_birth + suburb + injury_type were left because "they aren't names", but together these quasi-identifiers can pin down one real person, especially a rare injury in a small suburb. Issue type: re-identification by combination. Why it breaches: the record is still personal information — someone can be re-identified — so removing the name alone did not de-identify it.

Bonus: offshore vendor access raises cross-border disclosure concerns under the Privacy Act 2020 (information privacy principle 12, which covers disclosure outside New Zealand).

The thread: masking the obvious fields is not enough. Mask in transit, irreversibly, including free-text and quasi-identifier combinations.

🔧 Exercise 2 of 3 — Fix the Unsafe Refresh

A fictional Kiwibank team refreshes its test environment like this: “Restore last night’s production backup into TEST, then run mask.sql to scramble the name and email columns. Anyone can trigger a refresh.” Rewrite this into a safe, repeatable refresh process that would satisfy Privacy by Design. State the order of operations and the controls.

Model answer
Why the current process is unsafe: (a) it restores raw production into TEST and masks AFTER, so real data sits in the weaker environment for the window between restore and mask — that window is the breach. (b) It masks only name and email, leaving IRD numbers, DOB, free-text, and quasi-identifier combinations in clear. (c) "Anyone can trigger a refresh" means no control and no record of who exposed what.

Safe refresh process (ordered steps):
1. Mask in transit: the pipeline reads from production (or a secured production snapshot) and applies masking BEFORE the data is written to TEST, so unmasked data never lands in the weaker environment.
2. Apply a complete, irreversible masking ruleset: name, email, IRD, DOB, phone, address, free-text fields, and any quasi-identifier combinations — using non-reversible transforms with no fake-to-real lookup retained, and consistent masking so the same customer maps to the same masked identity across tables.
3. Deny-by-default on columns: any column not explicitly classified as non-personal is masked or excluded, so a newly added column is never passed through in clear.
4. Verify and gate: run the post-refresh safety checks (no real IRD/name matches, free-text clean, no re-identifiable combos) AND usefulness checks (referential integrity intact, realistic volume) before the environment is released. Refresh is not "done" until verification passes.

Controls: masking is automated/scripted so it cannot be forgotten; irreversible with no retained mapping; deny-by-default for new columns; refresh restricted to an authorised, logged role, not "anyone"; sign-off recorded.

The core fix: mask in transit (never restore-then-mask), make it complete and irreversible, and verify every time.

🏗️ Exercise 3 of 3 — Build a Synthetic Test-Data Set

Design a synthetic test-data plan of 5 records/scenarios for a fictional MSD benefit-eligibility system (so no real client is ever used). Cover the edge cases that matter. For each, give: an ID, the scenario it exercises, the key field values that make it valid and realistic, and the test it enables. Ensure structural validity and referential integrity.

Model answer
SYN-01 | Scenario: standard eligible single adult (happy path) | Key field values: synthetic client_id, valid-format IRD (passes checksum), age 34, single income source under threshold, NZ resident flag true | Test it enables: a straightforward grant approval

SYN-02 | Scenario: applicant just over the income threshold (boundary) | Key field values: income $1 above the cut-off, all else eligible | Test it enables: the boundary decision — declined for income, proving the threshold is applied at the exact edge

SYN-03 | Scenario: client with overlapping benefits | Key field values: one client_id linked to two active benefit rows of types that should not co-exist, valid foreign keys to the client | Test it enables: the overlap/conflict rule — system flags or blocks the second benefit

SYN-04 | Scenario: incomplete application (missing mandatory data) | Key field values: client with NULL in a required field (e.g. no recorded residency evidence) | Test it enables: validation handling — application held/rejected rather than silently approved

SYN-05 | Scenario: edge identity / non-resident | Key field values: valid-format but non-resident applicant, plausible DOB making them just under/over an age rule | Test it enables: the residency and age eligibility rules at their boundaries

What makes this strong: every record is structurally valid (IRDs pass checksum, formats correct), referential integrity holds (benefit rows point at a real synthetic client), and the set DESIGNS IN the edge cases — boundary income, overlapping benefits, missing data, residency/age edges — that you could wait months to find in real data. And because it is all synthetic, there is zero privacy risk: no real client is ever touched. Weak plans generate five near-identical happy-path clients and miss the boundaries — that is the difference being marked.

11 Self-Check

Q1: Why is copying production into a test environment a breach, even if nothing is “leaked” further?

Because under the Privacy Act 2020, using personal information for a purpose the customer never agreed to and holding it in an environment with weaker safeguards is itself a failure of the use-limit and security principles. The breach is the real data sitting in the less-protected environment — the Kahu trap — not only what happens to it afterwards.

Q2: Why is masking only the name and IRD number fields insufficient?

Because people are re-identifiable through combinations of quasi-identifiers — date of birth plus postcode plus a rare attribute can pin down one person with the name removed — and free-text fields leak real names and numbers no structured mask touches. De-identification means addressing the combination and the unstructured columns, not just the obvious fields.

Q3: Why must masking be irreversible?

Because masking that can be undone — a reversible scramble or a fake-to-real lookup table — has not protected anyone. It just hands the key to whoever holds the mapping. Real protection means no path exists from the masked value back to the real person; a reversible mask in a weak environment is a breach with extra steps.

Q4: Why mask in transit rather than after copying into non-prod?

Because if you mask after copying, there is a window — however short — where raw real data sat in the weaker environment, and that window is itself the breach. Masking in transit means unmasked production data never touches non-prod at all. The refresh should also be automated and deny-by-default so it cannot forget a column.

Q5: What is the advantage of synthetic data over masked production data?

It carries no privacy risk at all — there is no real person behind it to re-identify — and it gives you control. You design in exactly the edge cases your tests need (a loan in arrears, overlapping benefits, boundary values) instead of hunting for cases that happen to exist in production. Privacy by Design points to synthetic first, with masking as the fallback.

12 Interview Prep

Real questions asked in NZ QA interviews for data-heavy roles. Read the model answers, then practise your own version.

“A developer wants to copy production into the test environment to get realistic data. What do you say?”

I would stop it. Under the Privacy Act 2020, real personal data in a test environment is itself a breach — it is being used for a purpose the customers never agreed to, and held somewhere with weaker controls. That is true before anything is leaked further. I would offer the safe alternatives: synthetic data designed to my test plan as the first choice, because it carries no privacy risk and lets me build the exact edge cases I need; and where I genuinely need production-derived data, an irreversible mask applied in transit so unmasked data never lands in non-prod. I would also point out the traps a quick copy ignores — free-text fields and re-identifiable combinations — and that the right fix is a Privacy-by-Design refresh process, not a copy command.

“What is the difference between masking, de-identification, and anonymisation?”

Masking replaces real values with realistic fake ones — the data still behaves right but no longer belongs to the real person. De-identification removes or alters what ties a record to a specific individual so it can no longer reasonably be linked back. Anonymisation is the strong end: the data is altered so no one can re-identify the individual by any means reasonably available, and done properly it takes the data outside the Privacy Act because it is no longer personal information. The properties that matter across all of them are irreversibility — no path back to the real person — and addressing combinations of quasi-identifiers and free-text, not just the obvious name and IRD fields, because that is where re-identification actually happens.

“How would you verify that a test environment refresh did not leak real data?”

I would run a verification gate that checks both safety and usefulness. For safety: no test IRD number matches a real production IRD, no test name matches a real name on the same record, the free-text fields contain no real names or phone numbers or IRD numbers, and there is no re-identifiable combination like date of birth plus postcode plus a rare attribute. For usefulness: masking is consistent so the same customer maps to one masked identity across tables and the joins still work, referential integrity holds, and the volume is realistic. The pass criterion is zero real-data matches — any match fails the refresh and the environment is not released. I would have that run automatically after every refresh, with the result and sign-off recorded, so it is a gate, not a hope.