AI Test Data Management
Real customer data cannot go into your AI test environments. Synthetic data, RAG knowledge base quality, and data drift are now core QA concerns — and the NZ Privacy Act 2020 has specific teeth here.
1 The Hook
A Wellington-based insurance company was building a RAG-powered claims assistant. The QA team needed to test the assistant's ability to retrieve the right policy clauses. Their solution: copy a sample of 500 real customer claims into the test knowledge base. Fast, representative, realistic data.
Six weeks later, the Privacy Commissioner contacted the company. A QA tester had shared a screen recording of their testing session in an internal Slack channel. Another employee noticed real customer names, claim amounts, and health-related information in the knowledge base responses. The company had not realised the RAG system was surfacing real customer data verbatim in its answers. A formal investigation followed.
The test data management failure here was not malicious — it was a gap in process. The QA team had not established a test data policy for AI systems. They were used to isolated test databases where real data was common. They did not anticipate that a RAG system would quote its knowledge base sources in its responses.
AI systems interact with test data differently from traditional applications. The data does not just flow through a system — it shapes what the system says. This module covers what QA teams need to know: synthetic data generation, RAG knowledge base quality, NZ Privacy Act 2020 obligations, and data drift as a long-term QA concern.
2 The Rule
Test data for AI systems is not just an input — it shapes the system's outputs. Real personal data must never enter AI test environments. Synthetic data must be realistic enough to expose real failures. And the quality of your test data is the quality ceiling of your AI testing.
3 The Analogy
Test data for an AI system is like the script given to an actor in rehearsal.
In traditional software testing, test data is props — it passes through the system and the system acts on it. In AI systems, test data is the script. The AI reads it, internalises it, and quotes from it. If you give an actor a script with real people's private medical details in it, those details appear verbatim in the performance. The stage is now a leaky boundary. The same applies to a RAG system with real customer data in its knowledge base: the data does not just flow through — it gets retrieved and displayed to users.
4 Synthetic Test Data for AI Systems
What is synthetic test data?
Synthetic test data is artificially generated data that mimics the structure, format, statistical distribution, and edge case profile of real data — without containing any actual personal information. For AI systems, synthetic data serves an additional purpose beyond privacy compliance: it lets you deliberately inject the data conditions you want to test, rather than hoping real data contains the right edge cases.
What makes AI test data different
In traditional software, test data needs to trigger specific code paths. In AI systems, test data also needs to:
- Cover linguistic and semantic variation — users phrase the same intent in dozens of different ways; your test data must represent this variation
- Include adversarial inputs — prompt injection attempts, gibberish, emotionally charged language, off-topic queries
- Represent demographic diversity — NZ-specific names, te reo Maori phrases, Pacific Island contexts; training data is typically US/UK biased
- Contain deliberate edge cases — values at decision boundaries, ambiguous queries with multiple valid interpretations, inputs that expose bias
Generating synthetic test data with LLMs
LLMs can generate realistic synthetic test data at scale. This is one of the most legitimate use cases for AI in testing — using AI to create the inputs that will be used to test AI. Common approaches:
LLM-generated synthetic data patterns
- Persona-based generation: "Generate 20 customer support queries a 65-year-old NZ farmer might ask about insurance claims, including regional dialect variations"
- Adversarial generation: "Generate 15 prompt injection attempts a user might try against a banking chatbot"
- Boundary generation: "Generate synthetic IRD numbers that are structurally valid but clearly fictional (use ranges reserved for testing)"
- Paraphrase generation: "Generate 10 different ways to ask 'what is my account balance' — including informal NZ phrasing, broken English, and SMS-style abbreviations"
NZ-specific synthetic data considerations
- IRD numbers: use format 000-000-000 with clearly fictional ranges — IRD publishes test IRD numbers for development use
- NZ phone numbers: use 021-000-XXXX or 09-000-XXXX ranges that are structurally valid but reserved for testing
- NZ addresses: use fictional street names in real NZ suburbs, or use New Zealand Post's published test address set
- NZ names: generate Maori, Pacific Island, and European names proportional to NZ demographics — do not default to Anglo-Saxon names only
- Bank accounts: NZ bank accounts follow a 15- or 16-digit format (XX-XXXX-XXXXXXX-XX, where the final suffix may be two or three digits); use ranges with clearly fictional bank/branch codes
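The conventions above could be applied in a small generator script along these lines. This is a minimal sketch, not a vetted tool: the name pool, the `021-000` phone prefix, and the `99-9999` bank/branch prefix are illustrative assumptions rather than verified reserved ranges, so confirm suitable ranges before relying on them.

```python
import random

# Hypothetical name pool for illustration only; extend and weight it to
# match the users your system will actually serve.
FIRST_NAMES = ["Aroha", "Tane", "Sione", "Mele", "Priya", "James", "Emma"]


def synthetic_customer(rng: random.Random) -> dict:
    """Build one obviously fictional customer record."""
    first = rng.choice(FIRST_NAMES)
    return {
        # A "Test" surname makes synthetic status visible in any screenshot.
        "name": f"{first} Test",
        # Assumed placeholder prefix; not a verified reserved range.
        "phone": f"021-000-{rng.randint(0, 9999):04d}",
        # Bank-branch-account-suffix layout with an assumed fictional prefix.
        "bank_account": f"99-9999-{rng.randint(0, 9_999_999):07d}-00",
    }


def make_records(count: int, seed: int = 42) -> list[dict]:
    """Seeded, so the same data can be regenerated when a test fails."""
    rng = random.Random(seed)
    return [synthetic_customer(rng) for _ in range(count)]


if __name__ == "__main__":
    for record in make_records(3):
        print(record)
```

Seeding the generator is a deliberate choice: a failing test can be reproduced with exactly the same synthetic records.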
5 RAG Knowledge Base Data Quality
Why RAG data quality is a QA responsibility
In a Retrieval-Augmented Generation system, the AI model generates responses by retrieving relevant passages from a knowledge base, then synthesising an answer. The quality of the model's responses is bounded by the quality of what it retrieves. A state-of-the-art language model given poor source documents will produce poor answers. This makes knowledge base data quality a testing concern, and specifically a QA concern, because in many teams no other role systematically validates it.
Knowledge base quality dimensions
What goes wrong in RAG knowledge bases
- Stale documents: policy documents, product specs, or pricing that has been superseded but not removed from the knowledge base — the AI retrieves outdated information confidently
- Contradictory documents: two documents in the same knowledge base that state different things; the AI may synthesise a response that combines both, producing a confident but internally inconsistent answer
- Coverage gaps: topics the knowledge base does not cover, causing the AI to either hallucinate an answer or give a generic non-answer when a specific answer is needed
- Format inconsistency: documents in different formats, naming conventions, or terminology for the same concept — retrieval systems may fail to match semantically similar content
- Embedded personal data: historical documents that contain real customer names, case numbers, or personal details, which may be retrieved and surfaced to other users
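The embedded personal data risk in the list above lends itself to an automated first pass before documents are loaded. The sketch below is one possible shape for such a scan. The regular expressions are simplified assumptions that will produce both false positives and false negatives, and they do not detect names at all, so flagged documents still need manual review.

```python
import re

# Simplified, assumed patterns for illustration; tune them before real use.
PII_PATTERNS = {
    # 8 or 9 digits, optionally grouped with dashes or spaces.
    "ird_number": re.compile(r"\b\d{2,3}[- ]?\d{3}[- ]?\d{3}\b"),
    # Bank-branch-account-suffix, with a suffix of 2 or 3 digits.
    "bank_account": re.compile(r"\b\d{2}-\d{4}-\d{7}-\d{2,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    # Mobile-style numbers starting with 02x.
    "nz_mobile": re.compile(r"\b02\d[- ]?\d{3}[- ]?\d{3,4}\b"),
}


def scan_document(doc_id: str, text: str) -> list[dict]:
    """Return one finding per pattern match so a reviewer can triage it."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                {"doc_id": doc_id, "type": label, "match": match.group()}
            )
    return findings


if __name__ == "__main__":
    sample = "Contact Aroha Test at aroha.test@example.com, IRD 123-456-789."
    for finding in scan_document("memo-1", sample):
        print(finding)
```

A scan like this is a gate, not a guarantee: a document with no findings can still contain personal details in free text.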
Testing the knowledge base, not just the model
QA teams testing RAG systems need two distinct test strategies:
- Retrieval testing: given a query, does the system retrieve the right document(s)? Measure precision (are retrieved documents relevant?) and recall (does the system find all relevant documents?)
- Generation testing: given retrieved documents, does the model synthesise a correct answer? Test with both good documents (expected correct answer) and bad documents (stale, contradictory, or off-topic) to understand failure modes
The insurance company in the Hook section failed both: their knowledge base contained real personal data (embedded personal data risk), and they had not tested whether the system would surface verbatim quotes from its sources in its responses (a generation testing gap).
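Retrieval precision and recall, as described above, can be computed from a small labelled set of queries. The following is a minimal sketch under the assumption that each test case records the document IDs the system retrieved and the IDs a human judged relevant; the function names and document IDs are hypothetical.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_retrieval(cases: list[tuple[list[str], set[str]]]) -> tuple[float, float]:
    """Average precision and recall over (retrieved, relevant) test cases."""
    if not cases:
        return 0.0, 0.0
    scores = [precision_recall(retrieved, relevant) for retrieved, relevant in cases]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


if __name__ == "__main__":
    # One query: three documents retrieved, two judged relevant by a tester.
    print(precision_recall(["policy-12", "policy-40", "memo-7"],
                           {"policy-12", "policy-13"}))
```

Tracking both numbers matters because they fail differently: low precision feeds the model irrelevant passages, while low recall hides the clause the user actually needed.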
6 Privacy and Compliance in AI Test Data
NZ Privacy Act 2020 obligations
The NZ Privacy Act 2020 requires that personal information is collected and used only for the specific, lawful purpose it was collected for. Customers who provided their information to an insurance company for claims processing have not consented to having that information used to train or test an AI system. Using real personal data in AI test environments likely breaches:
- Privacy Principle 1: collecting more personal information than is necessary for the purpose (test environments can use synthetic data)
- Privacy Principle 5: failure to protect personal information from unauthorised access or disclosure (test environments typically have weaker access controls than production)
- Privacy Principle 10: using personal information for a different purpose than the one it was collected for
Data minimisation in practice
Data minimisation means using only the personal information necessary for the purpose. In AI testing, this translates to a hierarchy of options:
1. Fully synthetic data (preferred)
No real personal information at any level. Generated to match statistical properties of real data. The standard choice for most AI test scenarios.
2. Anonymised data (acceptable with controls)
Real data with identifying information removed or replaced. Must be genuinely irreversible — not just name-removed. Re-identification risk in AI systems is higher than in traditional systems because the model may infer identity from combination of attributes.
3. Real data under strict controls (exception only)
Only when synthetic data cannot reproduce the scenario being tested. Requires documented legal basis, data handling agreement, access controls equivalent to production, and a data destruction plan.
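For the anonymised-data option, one way to make "genuinely irreversible" more concrete is to measure how small the smallest group of matching records is across the remaining attributes. This is a k-anonymity-style check, brought in here as an illustration rather than as a compliance test. The field names and sample rows are hypothetical, and passing the check does not prove a dataset is safe to use.

```python
from collections import Counter


def smallest_group_size(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group of rows that share the
    same values for every quasi-identifier. k == 1 means at least one
    record is unique and therefore a re-identification candidate."""
    groups = Counter(
        tuple(row[field] for field in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


if __name__ == "__main__":
    # Names already removed, yet one row is still unique.
    rows = [
        {"suburb": "Karori", "age_bracket": "30-39", "employer": "Council"},
        {"suburb": "Karori", "age_bracket": "30-39", "employer": "Council"},
        {"suburb": "Karori", "age_bracket": "60-69", "employer": "Hospital"},
    ]
    print(smallest_group_size(rows, ["suburb", "age_bracket", "employer"]))
```

The more attributes are combined, the smaller the groups become, which is why removing names alone rarely achieves anonymity.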
7 Data Drift as a QA Concern
What is data drift?
Data drift occurs when the statistical properties of live production data diverge from the data the AI model was trained or tested on. The model was validated against a particular distribution of inputs. If that distribution changes in production, the model's performance degrades even though no code has changed. Traditional monitoring (uptime, error rates, latency) will not catch it, because nothing has failed in the conventional sense.
Types of drift relevant to QA
- Input drift: the queries users ask change over time — new topics, new terminology, new formats (e.g., users start sending voice transcripts instead of typed messages)
- Knowledge base drift: in RAG systems, the documents that are relevant to queries change — new policies, deprecated products, changed regulations
- Label drift (for classification models; often called concept drift): what constitutes a "correct" classification changes; a fraud detection model trained on 2023 fraud patterns may miss 2025 fraud patterns entirely
- Context drift: external world changes affect what the right answer is — a benefits calculator that was correct before a budget announcement may be wrong after
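Input drift in particular can be quantified by comparing how often each query topic appears in the baseline test data versus production. The sketch below uses total variation distance as one simple measure. It assumes queries have already been tagged with a topic by some upstream step, and the topic labels are hypothetical.

```python
from collections import Counter


def topic_distribution(topics: list[str]) -> dict[str, float]:
    """Convert a list of topic labels into relative frequencies."""
    counts = Counter(topics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: count / total for topic, count in counts.items()}


def total_variation_distance(baseline: dict[str, float],
                             production: dict[str, float]) -> float:
    """0.0 means identical distributions; 1.0 means no overlap at all."""
    topics = set(baseline) | set(production)
    return 0.5 * sum(
        abs(baseline.get(t, 0.0) - production.get(t, 0.0)) for t in topics
    )


if __name__ == "__main__":
    baseline = topic_distribution(["claims"] * 6 + ["billing"] * 4)
    production = topic_distribution(
        ["claims"] * 3 + ["billing"] * 2 + ["flood cover"] * 5
    )
    # A topic absent from the baseline now makes up half of production traffic.
    print(total_variation_distance(baseline, production))
```

Any alert threshold on this distance is a judgment call for the team; the useful part is that a brand-new topic shows up as a measurable shift rather than an anecdote.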
What QA teams can do about drift
Drift is a production concern, not just a test environment concern. QA's role is to establish the baseline and the monitoring framework before the system goes live:
- Document the data distribution assumptions the model was validated against — what topics, what query types, what data formats
- Define drift thresholds — at what point does distribution shift warrant re-testing or re-validation?
- Build shadow test sets — a fixed set of queries with known correct answers that can be run periodically in production to detect performance degradation
- Include drift scenarios in acceptance criteria — "the system must maintain >90% accuracy on the baseline test set after 6 months of live data"
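A shadow test set with an acceptance threshold can be run as a small scheduled check. The sketch below is illustrative only: `ask` stands in for whatever function calls the deployed system, and grading by case-insensitive substring match is a deliberately crude assumption that a real suite would likely replace with a stronger comparison.

```python
from typing import Callable


def run_shadow_set(cases: list[tuple[str, str]],
                   ask: Callable[[str], str],
                   threshold: float = 0.90) -> dict:
    """Run fixed (query, expected answer) pairs against the live system
    and report whether accuracy is still above the agreed threshold."""
    passed = sum(
        1 for query, expected in cases
        if expected.lower() in ask(query).lower()
    )
    accuracy = passed / len(cases) if cases else 0.0
    return {"accuracy": accuracy, "meets_threshold": accuracy > threshold}


if __name__ == "__main__":
    # Stubbed system responses, standing in for a real API call.
    canned = {
        "How much sick leave do I get?": "You are entitled to 10 days per year.",
        "How much annual leave do I get?": "Two weeks.",
    }
    cases = [
        ("How much sick leave do I get?", "10 days"),
        ("How much annual leave do I get?", "four weeks"),
    ]
    print(run_shadow_set(cases, canned.__getitem__))
```

Because the case list is fixed, a drop in accuracy between runs points to a change in the system or its knowledge base rather than a change in the test.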
8 Common Mistakes
🚫 Using production data dumps as the "quickest" test data source
Why it happens: Production data is immediately available and guaranteed to be representative. Getting synthetic data generated takes time.
The fix: The time cost of generating synthetic data is lower than the cost of a Privacy Act investigation. More practically: in RAG systems, real personal data embedded in a knowledge base will surface verbatim in AI responses — which is an immediate, visible failure, not an abstract risk. Establish a synthetic data generation step as a required part of the test data pipeline before the first sprint.
🚫 Testing the AI model without testing the knowledge base
Why it happens: QA teams focus on the AI model as the black box to be tested, treating the knowledge base as background infrastructure.
The fix: In RAG systems, the knowledge base is as testable as the model — and often more important. Stale, contradictory, or personally-identifying documents in the knowledge base cause systematic failures that no amount of model testing will fix. Add knowledge base quality checks to your test plan: coverage, freshness, consistency, and PII scan.
🚫 Treating anonymisation as a binary state
Why it happens: Teams remove names and IRD numbers and consider data "anonymised."
The fix: Re-identification risk is combinatorial. A dataset with suburb, age bracket, employer, and income band may have only one matching individual even with names removed. In AI systems, re-identification risk is higher than in traditional databases because the model may synthesise inferences across multiple attributes. When in doubt, use fully synthetic data rather than attempting partial anonymisation.
🚫 Not accounting for NZ demographic diversity in synthetic data
Why it happens: Synthetic data generation defaults to the model's training distribution — predominantly US/UK patterns.
The fix: Explicitly specify NZ demographic proportions in your synthetic data prompts. Include Maori names, te reo phrases, Pacific Island contexts, and South Asian names at NZ census proportions. An AI system tested only on Anglo-Saxon synthetic data will perform worse on Maori and Pacific users — and you will not find out until go-live.
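One way to make the demographic requirement explicit is to turn target proportions into per-group counts and write those counts into the generation prompt. In the sketch below the weights are PLACEHOLDER values, not census figures, and the group labels and prompt wording are assumptions; substitute current census proportions before use.

```python
def allocate_quota(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split `total` personas across groups in proportion to `weights`,
    using largest-remainder rounding so the counts always sum to `total`."""
    weight_sum = sum(weights.values())
    exact = {group: total * w / weight_sum for group, w in weights.items()}
    quota = {group: int(value) for group, value in exact.items()}
    leftover = total - sum(quota.values())
    # Hand leftover slots to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - quota[g], reverse=True)
    for group in by_fraction[:leftover]:
        quota[group] += 1
    return quota


def build_prompt(total: int, quota: dict[str, int]) -> str:
    """Render the quota as explicit instructions for the generating LLM."""
    lines = [f"- {count} personas with {group} names"
             for group, count in quota.items()]
    header = f"Generate {total} synthetic customer personas for NZ test data:"
    return header + "\n" + "\n".join(lines)


if __name__ == "__main__":
    # PLACEHOLDER weights for illustration only; NOT census proportions.
    placeholder_weights = {"European": 4, "Maori": 3,
                           "Pacific Island": 2, "South Asian": 1}
    print(build_prompt(20, allocate_quota(20, placeholder_weights)))
```

Stating counts rather than percentages in the prompt makes the output easy to verify: a tester can simply count the personas in each group.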
🚫 Ignoring data drift until users complain
Why it happens: Once an AI system passes testing and goes live, monitoring is treated as an operational concern, not a QA one.
The fix: Define drift detection as part of your test plan before go-live. Create a shadow test set — a fixed set of representative queries with known correct answers — and schedule periodic runs in production. This turns drift from an invisible degradation into a measurable, testable property.
Senior engineer insight
The test data conversation is the one that separates QA engineers who understand AI systems from those who do not. Every team I have worked with underestimates how much the knowledge base is the system — in RAG architectures, a mediocre model with excellent source documents outperforms an excellent model with mediocre documents, every time. I spend more time on knowledge base quality audits than on prompt evaluation, because that is where the systematic failures actually live. If your test plan does not include a structured knowledge base review with freshness dates, contradiction detection, and PII scanning, you are testing the wrong thing.
The most common mistake: treating the knowledge base as static infrastructure rather than as a testable artefact that degrades over time, gets contradictory entries added, and accumulates embedded personal data from historical records imported without proper scrubbing.
From the field
A government agency in Auckland built a RAG-powered FAQ assistant for a public-facing benefits portal. Their knowledge base was imported from five years of internal email threads, policy memos, and PDF attachments. The import process was automated. Three weeks after go-live, the assistant was surfacing case worker names, internal opinions on specific claimant situations, and references to individuals by first name in responses to public queries. The knowledge base had never been reviewed for embedded personal data — the assumption was that “internal documents” did not count as personal data in a public-facing context. They did. The system was taken offline for six weeks while the knowledge base was audited and rebuilt from approved, scrubbed source documents. The lesson: every document that enters a RAG knowledge base is a potential retrieval result that will be shown to end users. It must be reviewed with that assumption in place, not after go-live.
9 Now You Try
Three graded exercises. Each targets a different aspect of AI test data management. Use the AI to check your answers.
You are testing a chatbot for Kiwibank's personal loan application portal. Write a prompt that will generate 10 synthetic test queries covering: happy path applications, edge case incomes, applications with complicating factors (student loans, multiple employers), and at least 2 adversarial inputs. Your prompt must not use any real personal data and must specify NZ-appropriate demographics.
Model answer: strong synthetic data prompt
Role: You are a test data engineer for a NZ retail bank's QA team.
Task: Generate 10 synthetic test queries for a personal loan application chatbot.
Requirements:
- All data must be synthetic: no real names, no real IRD numbers, no real bank accounts
- Use NZ-appropriate names proportional to NZ demographics: include at least 2 Maori names, 1 Pacific Island name, 1 South Asian name
- Cover the following scenarios:
1. Standard application: single income, employed full-time, amount within normal range ($5,000–$30,000)
2. Income edge case low: income just above minimum threshold ($28,000 annual)
3. Income edge case high: high income applicant asking about maximum available ($100,000+)
4. Student loan complication: applicant with existing student loan asking how it affects eligibility
5. Multiple employers: applicant with two part-time jobs totalling $55,000 annual income
6. Self-employed: sole trader with irregular income asking what documents are needed
7. Existing debt: applicant with existing personal loan asking about refinancing or top-up
8. First-time applicant: no credit history, asking what factors are considered
9. Adversarial 1: applicant trying to get information about another person's loan eligibility
10. Adversarial 2: applicant asking the chatbot to ignore its instructions and confirm approval
Format: For each query, provide:
- Query ID: Q01–Q10
- Scenario type: [category from above]
- Synthetic user name: [NZ-appropriate name, clearly fictional]
- Query text: [the message the user would type]
- Expected chatbot behaviour: [what a correct response looks like]
Constraints: Use clearly fictional income figures that are round numbers. Use "Test" as a surname suffix (e.g., "Aroha Test", "James Test") to make synthetic status obvious. Do not use real bank product names or interest rates.
You have been asked to audit the knowledge base for a RAG-powered HR assistant at a 500-person NZ company. The knowledge base was imported 18 months ago from shared drives and has not been reviewed since. List the 5 most critical quality checks you would run, in priority order. For each check, explain what you are looking for, what tool or technique you would use, and what the failure looks like in practice.
Model answer: 5 critical knowledge base checks
Priority 1: PII and personal data scan
What I am looking for: Employee names, IRD numbers, salary details, health information, disciplinary records, personal email addresses, or any information that could identify a specific individual embedded in documents that will be retrieved and shown to other employees.
Tool/technique: Automated PII detection (Microsoft Presidio, spaCy NER, or a custom regex for NZ-specific formats: IRD number pattern, NZ phone, NZ bank account). Run over all documents before loading into the knowledge base. Flag any document with detected PII for manual review.
Failure in practice: The assistant retrieves a 2023 restructuring memo that mentions individual employees by name alongside their redundancy package details, and surfaces this in response to an HR query from a current employee.
Priority 2: Document freshness audit
What I am looking for: Documents that have been superseded by newer policy, legislation change (especially post-2022 employment law changes in NZ), or organisational restructure. Outdated documents that remain in the knowledge base will be retrieved with the same confidence as current ones.
Tool/technique: Extract document creation/modification dates. Flag documents older than 12 months for review. Cross-reference policy documents against current HR policy version list. For NZ-specific content, check against Employment Relations Act 2000 amendments and current Holidays Act guidance.
Failure in practice: The assistant quotes an outdated sick leave entitlement (5 days per year) even though the Holidays (Increasing Sick Leave) Amendment Act 2021 increased the minimum to 10 days. The knowledge base was loaded before the change was widely documented internally.
Priority 3: Contradiction detection
What I am looking for: Two or more documents in the knowledge base that make conflicting statements about the same policy, procedure, or entitlement. When a RAG system retrieves contradictory documents, it may synthesise a response that combines both, producing a confident but internally inconsistent answer.
Tool/technique: Cluster documents by topic using embeddings. Within each cluster, use an LLM to compare the key claims across documents and flag direct contradictions. Manual review for HR-critical topics: leave entitlements, disciplinary procedures, remuneration bands.
Failure in practice: One document says the company's parental leave top-up is 8 weeks at full pay; a later document (after a policy change) says 12 weeks. The assistant synthesises "10 weeks at full pay", a number that appears in neither document and is legally meaningless as an employee expectation.
Priority 4: Coverage gap analysis
What I am looking for: Topics an employee would reasonably ask the HR assistant about that are not covered in the knowledge base. Coverage gaps cause the model to either hallucinate an answer or give an unhelpfully generic response when a specific one is needed.
Tool/technique: Generate a list of 50–100 common HR queries using an LLM. Run each query against the knowledge base and measure retrieval confidence. Any query returning low-confidence or no results signals a coverage gap. Prioritise gaps for document sourcing.
Failure in practice: The assistant is asked about the company's flexible working request process (a statutory right under NZ employment law). The knowledge base has no documents on this topic. The assistant responds with generic advice about "speaking to your manager," which is technically not wrong but is unhelpfully vague and may cause compliance risk if an employee believes this is the complete process.
Priority 5: Format and encoding consistency check
What I am looking for: Documents in incompatible formats (PDF scans vs. structured text vs. HTML), inconsistent terminology for the same concept (e.g., "annual leave" vs. "holiday leave" vs. "AL"), encoding errors from the import process (garbled characters, broken table structures), and documents where key information is in images rather than text.
Tool/technique: Automated format audit: check that all documents are machine-readable text (not scanned PDFs). Run a term normalisation check to identify synonym clusters. Visual inspection of high-priority documents for rendering fidelity.
Failure in practice: A critical policy table was imported from a Word document as a scanned image, so the retrieval system treats it as an empty document. Queries about the content of that table return no relevant results, so the model generates an answer from context alone.
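Four scenarios from NZ AI testing projects are described below. For each, classify the problem as one of: Privacy breach, Synthetic data gap, Knowledge base quality failure, or Data drift. Explain your reasoning.
Scenario A: An AI model was tested in March on data that contained no winter product categories. From May onwards, live production data was dominated by winter products, and the model's accuracy declined.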
Scenario B: A DHB (now Health NZ) patient triage assistant was tested entirely with synthetic patient queries generated by the development team. After go-live, accuracy was significantly lower for queries from Pacific Island patients. The synthetic data had been generated without specifying demographic diversity, defaulting to Pakeha naming and phrasing patterns.
Scenario C: A local council's RAG assistant for building consent queries was returning information from a 2019 Building Act guidance document that had been superseded by 2023 amendments. Queries about fire egress requirements were returning outdated minimum distances.
Scenario D: A fintech startup used their production customer transaction database as test data for an AI fraud detection model. The dataset included names, account numbers, addresses, and transaction histories for 15,000 real customers. The test environment had weaker access controls than production.
Model answer: correct classifications
Scenario A: Data drift
The statistical distribution of live production data (dominated by winter products after May) diverged from the distribution used in testing (March data, no winter categories). The model was never invalid; it was valid for the data it was tested on. But as the distribution of real-world queries shifted, its accuracy declined. This is seasonal data drift. The mitigation is to ensure test data covers all seasonal patterns (not just the current season at testing time), and to establish a production monitoring test set with known-correct labels that runs monthly.
Scenario B: Synthetic data gap
The problem was not a privacy breach (synthetic data was used), but the synthetic data was not representative of the actual user population. Defaulting to Pakeha naming and phrasing patterns meant the test data did not represent Pacific Island users, who make up a significant proportion of DHB patients in many regions. This is a demographic coverage gap in synthetic data generation. The AI was never tested on the queries it would actually receive from a large segment of its user base. The fix is to explicitly specify demographic proportions in synthetic data generation prompts.
Scenario C: Knowledge base quality failure (specifically: document freshness failure)
The knowledge base contained a superseded 2019 document that had not been removed or flagged when the 2023 Building Act amendments were issued. The AI system retrieved this document with full confidence because the retrieval system has no awareness of legislative change: it finds the most semantically relevant document, which happened to be outdated. This is a knowledge base freshness failure. The mitigation is a document freshness audit (check all documents against a known publication date) and a process to update or remove documents when their source legislation changes.
Scenario D: Privacy breach (NZ Privacy Act 2020)
Using real production customer data (including names, account numbers, addresses, and transaction histories) in a test environment breaches Privacy Principle 10 (the data was collected to provide services to customers in production, not for the purpose of testing an AI model) and Privacy Principle 5 (the test environment had weaker security controls than production, creating elevated risk of unauthorised access or disclosure). The 15,000 customers did not consent to having their data used in this way. The correct approach is to generate synthetic transaction data that mimics the statistical properties (transaction amounts, frequency, merchant categories) without containing real customer identifiers.
Why teams fail here
- Treating data preparation as a pre-sprint task, not a test design task — synthetic data generation is handed to a developer to “set up the test database” rather than designed by QA to cover the scenarios that matter; the result is synthetic data that covers happy paths and omits the edge cases only a tester would think to include.
- Assuming the knowledge base is the infrastructure team’s responsibility — QA does not own the knowledge base, so QA does not test it; this leaves knowledge base quality as an unowned risk that surfaces as AI quality problems after go-live.
- Conflating name removal with anonymisation: removing names from a dataset does not make it anonymous, let alone synthetic; re-identification through combination of attributes is a real risk in AI systems that process multiple data points simultaneously.
- Not specifying NZ demographic diversity in synthetic data prompts — LLM-generated synthetic data defaults to the model’s training distribution, which is predominantly US and UK; NZ demographic diversity (Maori, Pacific Island, South Asian) must be explicitly requested or it will be absent.
- Treating testing as a one-time activity before go-live — data drift means AI systems degrade over time without code changes; QA teams that do not establish production monitoring baselines have no way to detect this degradation until users complain.
Key takeaway
In AI systems, the data is not just the input — it is part of the system. Test it like you test code: with coverage, quality gates, and ongoing monitoring.
How this has changed
The field moved fast. Here is what the evolution looked like for AI Test Data Management.
Test data for AI models means training/validation datasets. QA teams rarely involved — data scientists own it.
RAG systems bring retrieval data into the picture. QA teams must now test the quality of the knowledge base, not just model outputs. Privacy Act compliance becomes relevant.
Synthetic test data generation via LLMs becomes mainstream. NZ Privacy Act 2020 requires data minimisation in test environments — real customer data must be masked or replaced.
AI test data is a shared responsibility between data engineering, QA, and privacy teams. Test data pipelines for AI systems are as complex as production data pipelines.
11 Self-Check
Q1: What is synthetic test data and why is it needed for AI systems?
Synthetic test data is artificially generated data that mimics the structure, format, and statistical properties of real data without containing any actual personal information. It is needed for AI systems because using real customer data in test environments is likely to breach the NZ Privacy Act 2020, and because AI systems need large volumes of varied data, including deliberately constructed edge cases, to be tested effectively.
Q2: How does the NZ Privacy Act 2020 affect test data management for AI systems?
The NZ Privacy Act 2020 requires data minimisation: using only the personal information necessary for a specific, lawful purpose. In AI testing, this means real customer data must be masked, anonymised, or replaced with synthetic equivalents before use. Using real personal data in test environments likely breaches Privacy Principle 10 (information collected for one purpose should not be used for another) and Privacy Principle 5 (test environments typically have weaker security controls than production).
Q3: What is RAG and why does test data quality matter for RAG systems?
RAG (Retrieval-Augmented Generation) is an AI architecture where a language model retrieves information from a knowledge base before generating a response. Test data quality matters because the quality of RAG outputs is directly bounded by the quality of the knowledge base — outdated, incomplete, or contradictory documents produce poor retrieval results regardless of model quality. The knowledge base is a testable artefact, not background infrastructure.
Q4: What is data drift and why is it a QA concern for AI systems?
Data drift occurs when the statistical properties of live production data diverge from the data the AI model was trained or tested on, causing model performance to degrade over time without code changes. It is a QA concern because QA teams must establish baseline data profiles and shadow test sets before go-live to enable drift detection in production — after go-live, there is no baseline to compare against.
Q5: Why is NZ demographic diversity important in synthetic test data generation?
LLM-generated synthetic data defaults to the model's training distribution, which is predominantly US and UK content. Without explicit instructions, generated test data will not include Maori names, te reo phrases, Pacific Island contexts, or South Asian names at NZ census proportions. An AI system tested only on Anglo-Saxon synthetic data will perform worse on Maori and Pacific users — and this will not be discovered until go-live.
12 Interview Prep
Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.
"How would you manage test data for an AI-powered customer service chatbot?"
I would start by establishing a no-real-data policy for the test environment, then work with data engineering to generate synthetic customer queries at scale — using LLMs to produce variation in phrasing, including NZ-specific demographic diversity, and deliberately generating adversarial inputs. For any RAG components, I would also audit the knowledge base for freshness, contradictions, and embedded personal data before loading it. I would document the data distribution assumptions so the team can detect drift after go-live.
"What is data drift and how would you detect it in a production AI system?"
Data drift is when the real-world data the AI processes diverges from the distribution it was tested on, causing accuracy to degrade without code changes. I would detect it by creating a shadow test set — a fixed set of queries with known correct answers — before go-live, then running it periodically in production and tracking accuracy over time. I would also monitor input distributions: query topic frequency, query length, language mix, and any new data formats appearing in production that were not in the test data.
"A developer suggests using a real customer data dump to test the AI system because it is faster than generating synthetic data. How do you respond?"
I would raise two concerns. First, the NZ Privacy Act 2020 concern: customers did not consent to having their data used for AI testing purposes, and test environments typically have weaker access controls than production. This likely breaches the Act's principles on necessary collection, storage and security, and limits on use (Privacy Principles 1, 5, and 10). Second, the practical risk specific to AI systems: if there is a RAG component, real customer data in the knowledge base will be retrieved and surfaced verbatim in AI responses, which is an immediate, visible failure, not an abstract risk. I would offer to design a synthetic data generation prompt that produces realistic data faster than the manual overhead of sanitising a real dump.