Managing AI Risks in Testing
Hallucination, bias, data privacy, and non-determinism are real — and manageable. Testers who understand these risks use AI more effectively than those who avoid it or trust it blindly.
1 The Hook
An NZ telecommunications company asked their QA team to use AI to generate test cases for a new billing system migration. The team used AI effectively — 400 test cases in a day. When they ran them, 340 passed. The project manager celebrated: 85% pass rate.
But three months after go-live, customers started receiving double charges. The investigation revealed that 18 of the 60 "failed" test cases had been dismissed as AI hallucinations — invented test conditions that seemed unrealistic. Twelve of those 18 were actually valid edge cases the team had not considered. The AI was right. The team had dismissed real test cases because they assumed AI output was unreliable.
This is not an argument against using AI. The team was right to use AI — 400 test cases in a day is a genuine productivity gain. The failure was not the tool. The failure was a lack of understanding about how AI tools behave and how to review their output critically.
AI tools in testing introduce a specific and learnable set of risks: hallucination (confident but incorrect output), reasoning errors (right facts, wrong conclusions), bias in what the model generates, data privacy hazards when test inputs contain real customer data, and non-determinism that makes test artefacts hard to reproduce.
Testers who understand these risks do not avoid AI — they work with it more effectively. They know which outputs to trust, which to verify, and which to question. That is the skill this module builds.
2 The Rule
Every AI model generates output that may be plausible but incorrect. This is not a bug — it is a fundamental characteristic of how LLMs work. A tester who understands hallucination, bias, data privacy risks, and non-determinism can use AI confidently and safely. A tester who does not will either avoid AI entirely or trust it blindly — both are wrong.
3 The Analogy
A very well-read junior tester who has never worked in your organisation.
An LLM generating test cases is like a junior tester who has studied thousands of test plans but has never worked in your specific organisation. They are fast, enthusiastic, and generally knowledgeable — but they will sometimes confuse your system's rules with another system they read about. They may generate IRD validation rules from memory that are slightly off, or produce phone number formats that are correct for Australia but not New Zealand. The solution is not to fire the junior tester. The solution is to review their work before it goes into production.
4 Hallucinations and Reasoning Errors
CT-GenAI-3.1.1, 3.1.2, 3.1.3, 3.1.4
What is a hallucination?
An LLM hallucination is when the model generates factually incorrect content with apparent confidence. The model does not "know" it is wrong — it produces the most statistically likely continuation of its input, regardless of whether that continuation is true. In a testing context, this means:
- Field names that do not exist in your system
- Status codes that are wrong (e.g., the model says the endpoint returns 201, but it returns 200)
- Business rules that are plausible but incorrect (e.g., "IRD numbers are always 9 digits" — they are actually 8 or 9)
- Test data that looks valid but fails your system's validation (e.g., an NZ mobile number formatted as a landline)
Reasoning errors (CT-GenAI-3.1.1)
Reasoning errors are distinct from hallucinations. The model may have the correct facts but draw the wrong conclusion. A common example: given a field with valid range 0–100, a reasoning error generates boundary tests at -1 and 101 but omits tests at 0 and 100 — the actual boundary values. The model understood boundary value analysis in principle but misapplied it.
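The sketch below (Python, with a stand-in validation rule) makes the distinction concrete: the correct boundary set for a 0 to 100 field versus the reasoning-error set described above. The field name and validation rule are illustrative assumptions, not taken from any particular system.

```python
# Minimal sketch: boundary value analysis for a field with valid range 0-100.
# The "reasoning error" version tests only the invalid neighbours (-1, 101)
# and misses the valid boundaries (0, 100) that the technique actually requires.

VALID_MIN, VALID_MAX = 0, 100

def is_valid(value: int) -> bool:
    """Stand-in for the system under test's validation rule."""
    return VALID_MIN <= value <= VALID_MAX

# Correct two-value BVA: each boundary plus its invalid neighbour.
correct_boundary_tests = {
    -1: False,   # just below the lower boundary
    0: True,     # lower boundary itself (often missed by the model)
    100: True,   # upper boundary itself (often missed by the model)
    101: False,  # just above the upper boundary
}

# Reasoning-error version: right idea, wrong application.
reasoning_error_tests = {-1: False, 101: False}

for value, expected in correct_boundary_tests.items():
    assert is_valid(value) == expected, f"Boundary test failed for {value}"

print("Correct set:", sorted(correct_boundary_tests),
      "| Reasoning-error set:", sorted(reasoning_error_tests))
```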
Bias in AI output (CT-GenAI-3.1.1)
- Training data bias: if the training data was dominated by US software examples, the model may default to US formats — ZIP codes, SSNs, MM/DD/YYYY dates — even when you explicitly ask for NZ context
- Confirmation bias in output: AI tends to generate positive test cases (happy path) over negative ones unless explicitly instructed otherwise
- Recency bias: the model may be more familiar with older API patterns or deprecated libraries than current standards
Identifying hallucinations (CT-GenAI-3.1.2)
Red flags in AI-generated test artefacts
- Field names you do not recognise from the actual specification
- Status codes that do not match your API documentation
- Business rules stated with certainty that contradict your spec or your domain knowledge
- Test data that passes a surface inspection but fails your system's validator when you run it
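One practical response to the last red flag is to run AI-generated test data through a cheap local format check before it ever reaches your system. The sketch below is illustrative only: the IRD length rule and mobile prefix patterns are simplified stand-ins rather than authoritative validation logic, and a real check would also apply the IRD check-digit algorithm and your system's own rules.

```python
import re

# Illustrative pre-flight checks for AI-generated NZ test data.
# These are simplified format rules, not authoritative validators:
# real IRD numbers also carry a check digit, and real systems will
# have their own canonical validation logic. Swap in yours.

def looks_like_ird_number(value: str) -> bool:
    """IRD numbers are 8 or 9 digits (a common AI hallucination is 'always 9')."""
    digits = re.sub(r"[\s-]", "", value)
    return digits.isdigit() and len(digits) in (8, 9)

def looks_like_nz_mobile(value: str) -> bool:
    """Accept 02x local format or +64 2x international format (simplified)."""
    digits = re.sub(r"[\s()-]", "", value)
    return bool(re.fullmatch(r"(02\d{7,9}|\+642\d{7,9})", digits))

ai_generated_rows = [
    {"ird": "049-123-456", "mobile": "+64 21 555 0123"},
    {"ird": "12345678", "mobile": "04 555 0123"},      # landline, not a mobile
    {"ird": "1234567890", "mobile": "021 555 0123"},   # 10 digits: invalid IRD length
]

for row in ai_generated_rows:
    problems = []
    if not looks_like_ird_number(row["ird"]):
        problems.append(f"suspect IRD number: {row['ird']}")
    if not looks_like_nz_mobile(row["mobile"]):
        problems.append(f"suspect mobile number: {row['mobile']}")
    print(row, "->", problems or "passes format checks")
```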
Mitigation strategies (CT-GenAI-3.1.3)
- Always provide the actual spec as context — models hallucinate significantly less when given ground truth to work from
- Use few-shot prompting with verified examples — provide one or two correct test cases to establish the right pattern before asking the model to generate more
- Ask the model to cite its reasoning: "For each test case, explain which requirement it is testing" — this surfaces hallucinations quickly, because a cited requirement that does not appear in the spec is an immediate red flag
- Cross-reference generated test cases against the actual spec before adding them to your test suite
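Here is a minimal sketch of the first two strategies combined, assuming a Python workflow. The specification excerpt and the verified example test case are hypothetical placeholders; in practice both come from your reviewed artefacts, and the assembled prompt goes to your organisation's approved tooling.

```python
# A minimal sketch of combining "provide the actual spec" with few-shot prompting.
# The spec excerpt and the verified example test case below are hypothetical
# placeholders; in practice, both come from your real, reviewed artefacts.

SPEC_EXCERPT = """\
Field: mobile_number
Rule: NZ mobile numbers only. Accepted prefixes: 021, 022, 027.
Field: ird_number
Rule: 8 or 9 digits, validated with the IRD check-digit algorithm.
"""

VERIFIED_EXAMPLE = """\
Test ID: TC-001
Requirement: mobile_number accepts prefix 027
Steps: Enter 027 555 0123 and submit
Expected: Number accepted, OTP sent
"""

prompt = f"""You are a senior NZ software tester.
Use ONLY the specification excerpt below as ground truth. Do not invent rules.

SPECIFICATION:
{SPEC_EXCERPT}

Here is one verified example of the expected format:
{VERIFIED_EXAMPLE}

Generate 5 more test cases in the same format. For each test case, cite the
exact specification line it is testing. If a rule is not in the specification,
say "not specified" instead of guessing.
"""

print(prompt)  # Send this to your organisation's approved LLM tooling.
```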
Non-determinism (CT-GenAI-3.1.4)
Running the exact same prompt twice can produce different outputs. This is an inherent property of how LLMs sample from probability distributions. It makes regression testing of AI-generated artefacts challenging — you cannot "re-run" an AI generation session and expect identical results.
- Set temperature to 0 (or the lowest available value) for the most deterministic output possible
- Use seed values if the API supports them
- Save and version the output of AI generation sessions — the saved output is your test artefact, not the prompt
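A sketch of how these three mitigations fit together, assuming a Python workflow. The call_llm function is a hypothetical stand-in for your organisation's approved client, and parameter names such as temperature and seed vary by provider and may not all be supported.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Sketch of the three mitigations. `call_llm` is a hypothetical stand-in for
# whatever approved client your organisation uses; the parameter names
# (temperature, seed) vary by provider.

def call_llm(prompt: str, temperature: float = 0.0, seed: int | None = 42) -> str:
    raise NotImplementedError("replace with your organisation's approved LLM client")

def generate_and_version(prompt: str, out_dir: Path = Path("ai_test_artefacts")) -> Path:
    """Generate once, then save the output with enough metadata to audit it later.

    The saved file, not the prompt, is the test artefact you version and review."""
    output = call_llm(prompt, temperature=0.0, seed=42)
    out_dir.mkdir(exist_ok=True)
    record = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
        "reviewed_by": None,  # filled in by the human reviewer before use
    }
    path = out_dir / f"generation_{record['prompt_sha256'][:12]}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```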
5 Data Privacy and Security
CT-GenAI-3.2.1, 3.2.2, 3.2.3
NZ Privacy Act 2020 context
Under the NZ Privacy Act 2020, personal information may only be collected for a lawful purpose and then used or disclosed in ways consistent with that purpose. Customers have not consented to their data being sent to an AI provider for model training or inference logging. Pasting real customer data into a public AI tool therefore likely breaches Information Privacy Principles 10 and 11 (limits on use and disclosure) and possibly Principles 5 (storage and security) and 12 (disclosure outside New Zealand).
What counts as sensitive data in a testing context
- Customer names, email addresses, phone numbers, dates of birth
- IRD numbers (New Zealand's unique national tax identifier — highly sensitive)
- Bank account numbers and payment card details
- Health information of any kind
- Government-issued ID details (passport number, driver licence number)
- Proprietary system specifications (commercially sensitive even if not personal data)
Data privacy risks when using AI for testing (CT-GenAI-3.2.1, 3.2.2)
- Training on inputs: some consumer-tier AI tools use your prompts to improve their models — your test data becomes training data
- Data residency: AI providers may process data in overseas jurisdictions, which can conflict with NZ data residency requirements in government and health contracts
- Logging and retention: AI providers log prompts for safety and debugging, creating an unintended data store of confidential information
- API key exposure: embedding AI API keys in test automation code creates a risk of unauthorised use if the code is committed to a public repository
Mitigation strategies (CT-GenAI-3.2.3)
- Use only AI tools approved by your organisation's security and privacy team
- Replace real customer data with synthetic equivalents — fake NZ names, fake IRD numbers, fake bank account numbers — before prompting
- Use on-premise or private-deployment LLM solutions for sensitive government and health contexts
- Review the AI provider's data processing agreement (DPA) before first use
- Never commit API keys to version control — use environment variables or a secrets manager
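The sketch below shows one way to assemble a prompt-safe synthetic record and keep the API key out of source code. The names, number formats, and the LLM_API_KEY variable are illustrative assumptions; prefer your organisation's approved synthetic-data or anonymisation tooling where it exists.

```python
import os
import random

# Illustrative only: synthetic NZ-flavoured test data for prompting.
# The names and number formats are fabricated placeholders, not valid
# identifiers; your organisation may already have an approved
# synthetic-data tool you should use instead.

SYNTHETIC_NAMES = ["Test User A", "Test User B", "Aroha Tester", "Sam Sample"]

def synthetic_ird_like_number(rng: random.Random) -> str:
    """An 8-digit string in IRD-like formatting; a fabricated placeholder, not a real identifier."""
    return f"0{rng.randint(1_000_000, 9_999_999)}"

def synthetic_nz_mobile(rng: random.Random) -> str:
    """Uses the 021 prefix with a clearly fictional subscriber number."""
    return f"021 555 {rng.randint(0, 9999):04d}"

rng = random.Random(2025)  # seeded so the synthetic set is reproducible
test_person = {
    "name": rng.choice(SYNTHETIC_NAMES),
    "ird_number": synthetic_ird_like_number(rng),
    "mobile": synthetic_nz_mobile(rng),
}
print("Prompt-safe synthetic record:", test_person)

# API keys come from the environment or a secrets manager, never from source code.
api_key = os.environ.get("LLM_API_KEY")  # hypothetical variable name
if api_key is None:
    print("LLM_API_KEY not set: configure it via your secrets manager, not in source")
```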
6 Environmental Impact
CT-GenAI-3.3.1
LLM inference has a measurable carbon footprint. A single complex prompt to a large general-purpose model may consume as much energy as running several automated tests. For individual use this is negligible, but at scale — generating test cases for thousands of user stories across a large organisation — the environmental cost accumulates.
Practical implications: prefer smaller, purpose-specific models for routine generation tasks over large general models; batch prompts where possible rather than submitting them one by one; cache useful outputs rather than regenerating identical prompts; and factor this into your organisation's AI strategy when selecting tooling (see CT-GenAI Ch 5).
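A minimal local cache keyed on a hash of the prompt is one way to avoid regenerating identical outputs. The sketch below assumes a hypothetical call_llm client; enterprise AI platforms may provide caching or batching natively, in which case prefer those.

```python
import hashlib
import json
from pathlib import Path

# Minimal local cache for LLM outputs, keyed on a hash of the prompt.
# Assumes a hypothetical `call_llm` client; real tooling may offer
# caching or batching natively.

CACHE_DIR = Path("llm_cache")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your approved LLM client")

def cached_generate(prompt: str) -> str:
    """Return a cached output if this exact prompt has been run before."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["output"]
    output = call_llm(prompt)  # only pay the inference cost on a cache miss
    cache_file.write_text(json.dumps({"prompt": prompt, "output": output}))
    return output
```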
7 AI Regulations and Standards
CT-GenAI-3.4.1
The regulatory landscape for AI is evolving rapidly. NZ testers need to be aware of the frameworks that apply to their context:
NZ Privacy Act 2020
The primary domestic framework governing personal information. Relevant to any testing activity that involves customer or staff data, whether handled by humans or AI tools.
EU AI Act
If your organisation sells to or operates in the EU, the EU AI Act introduces strict requirements for high-risk AI systems — healthcare, employment, critical infrastructure, and law enforcement. NZ organisations with EU exposure should understand which category their systems fall into.
ISO/IEC 42001 — AI Management Systems
The international AI management system standard, increasingly required in enterprise and government procurement. Organisations using AI in critical processes may need to demonstrate compliance.
NIST AI Risk Management Framework & NZ Government Algorithm Charter
The NIST AI RMF is widely referenced in NZ government and enterprise AI governance. The NZ Government Algorithm Charter applies to government agencies using algorithmic decision-making, including AI-assisted processes.
For testers: understand which standards apply in your organisation. Ensure AI-assisted test artefacts are documented and traceable — regulators may ask how test coverage was achieved, and "the AI generated it" is not a sufficient audit trail without human review records.
Working for a government agency? Use the NZ Privacy Checklist.
Five checks — data classification, PII, residency, tool approval, and documented de-identification — that every tester must complete before pasting data into an LLM in a government context.
8 Common Mistakes
🚫 Dismissing AI output as "hallucination" without checking
Why it happens: Teams learn that AI can hallucinate and become over-cautious, dismissing any output that looks unfamiliar.
The fix: Not all surprising output is wrong. Review it against the spec before dismissing it. The AI may have identified edge cases you have not considered. The NZ telco story at the start of this module is a real pattern — the AI was right and the humans dismissed it.
🚫 Using real customer data as test input in an AI prompt
Why it happens: Real data is available and seems like the easiest way to make prompts realistic.
The fix: This likely violates the NZ Privacy Act 2020 and your organisation's data handling policies. Always generate synthetic or anonymised test data before prompting. Fake NZ names, fake IRD numbers, and fake bank account numbers are straightforward to generate.
🚫 Assuming the same prompt will always produce the same output
Why it happens: Testers accustomed to deterministic tools expect identical inputs to produce identical outputs.
The fix: LLMs are non-deterministic. Running the same prompt twice can produce different test cases. Save and version your AI-generated outputs. Treat the saved output as the test artefact — not the prompt.
🚫 Using an unapproved AI tool because it is faster
Why it happens: Free or consumer-tier tools are immediately available; the approval process takes time.
The fix: Shadow AI — using tools not sanctioned by your organisation's security and compliance team — creates legal, security, and compliance risk. One data breach caused by an unapproved tool can cost more than years of accumulated productivity gains.
9 Now You Try
Three graded exercises. Each targets a different AI risk from this chapter. Use the AI to check your answers, then compare to the model answer.
Below is an AI-generated test case set for an ANZ online banking password reset. It contains 3 planted hallucinations — plausible-sounding test cases that reference incorrect business rules, invented field names, or wrong system behaviour. In the textarea, list what you think the 3 hallucinations are and why. Then run your answer to get AI feedback.
Test ID | Condition | Steps | Expected result | Type
TC-02 | Reset link expires after 15 minutes | Click reset link 16 minutes after sending | System shows "This link has expired. Please request a new one." | Negative
TC-03 | Password must meet NZISM minimum of 8 characters | Enter 7-character password → click Submit | System shows "Password must be at least 8 characters" | Negative
TC-04 | Reset blocked after 5 failed attempts in 1 hour | Attempt reset 6 times in 60 minutes | System shows "Too many attempts. Try again in 24 hours." | Negative
TC-05 | NZ mobile reset: must be in 021, 022, or 027 prefix format | Enter +64 21 555 0123 | System accepts and sends OTP | Positive
TC-06 | New password cannot match last 3 passwords | Enter a password used 2 cycles ago | System shows "Password must not match your last 5 passwords" | Negative
TC-07 | Password reset not available between 2am–4am NZST due to maintenance window | Attempt reset at 3am NZST | System shows "Password reset is unavailable during scheduled maintenance (2am–4am NZST)" | Negative
List the 3 hallucinations you found and explain why each is wrong:
Show the 3 planted hallucinations
TC-03: NZISM minimum password length is 16 characters (not 8). The NZ Information Security Manual (NZISM) v3.7 specifies a minimum of 16 characters for government systems. "8 characters" is a commonly hallucinated value that reflects older international standards, not NZISM.
TC-06: The test case says "last 5 passwords" in the expected result but "last 3 passwords" in the description. More importantly, the specific history depth (3 or 5) is an invented business rule — neither value is a universal standard and neither is specified in the scenario. The AI invented a specific number with false confidence.
TC-07: ANZ (and most NZ retail banks) do not have a published maintenance window that blocks password reset at 2–4am NZST. This is a plausible-sounding operational rule that does not exist. The AI generated it because "maintenance window" is a common pattern in banking systems — but applied it without any source in the spec.
A tester has written the prompt below to generate test cases for an IRD income tax assessment portal. It contains serious NZ Privacy Act 2020 violations. Rewrite it in the textarea to be privacy-safe while still being specific enough to generate useful test cases.
Generate test cases for the income assessment form. Our test user is John Smith, IRD number 049-123-456, DOB 12/03/1978, address 14 Aroha Street Petone Wellington 5012, employed at Fletcher Building earning $87,500/year. The form calculates PAYE deductions and student loan repayments. John has a student loan balance of $23,400.
Rewrite the prompt to remove all privacy violations while keeping it specific enough to be useful:
Show model answer — privacy-safe rewrite
Role: You are a senior tester for an NZ tax authority digital services team.
Context: The system is an online income tax assessment portal. The income assessment form calculates PAYE deductions and student loan repayments based on annual employment income.
Instruction: Generate boundary value and equivalence partition test cases for the income assessment form.
Input data (synthetic — no real personal data):
- Test user: [synthetic name], IRD number format: 9 digits (e.g. 049-000-001 — a known test value, not a real IRD number)
- Income range: $0 to $200,000 annual employment income
- Student loan repayment threshold: $22,828 (IRD repayment threshold; confirm the value for the tax year under test)
- PAYE rates: apply standard NZ tax rates for the income bands tested
- Student loan deduction rate: 12 cents per dollar over the threshold
Constraints: Use only synthetic test data. Do not use real names, real IRD numbers, real addresses, or real employer names. All values must be either publicly documented thresholds or clearly fictional.
Output format: Table with columns: Test ID | Scenario | Input values | Expected PAYE | Expected student loan deduction | Pass/Fail criteria
Privacy violations in original prompt:
1. Real name (John Smith) — use "Test User A" or a generic label
2. Real IRD number (049-123-456) — use known test/synthetic values
3. Real date of birth — not needed for this test; remove entirely
4. Real address (14 Aroha Street Petone) — use "Test Address NZ" or omit
5. Real employer (Fletcher Building) — use "Employer A" or omit
6. Specific real salary ($87,500) — acceptable to use as a test value but should be labelled synthetic
7. Student loan balance ($23,400) — specific individual balance; use threshold-relative values instead
Read the Toka Tū Ake EQC claim portal scenario below. Four problems are described. For each problem, classify it as one of: Hallucination, Privacy breach (NZ Privacy Act 2020), Non-determinism, or Shadow AI. Explain your reasoning, then run to check.
Problem A: The test team used ChatGPT (free tier) to generate test cases. They pasted in their full requirements document, which included Toka Tū Ake EQC claim numbers, property addresses, and photos of damaged properties belonging to real Canterbury earthquake claimants.
Problem B: The AI generated test cases stating that "Toka Tū Ake EQC covers damage up to $150,000 per claim for residential buildings." The residential building cap was actually increased from $150,000 (plus GST) to $300,000 (plus GST) for policies taken out or renewed from 1 October 2022, and land cover is assessed separately under its own rules. The AI's figure reflected the old cap, not the current scheme.
Problem C: The test lead ran the same prompt three times to generate claim validation test cases. Each run produced a different set of 10 test cases — different scenario descriptions, different input values, different expected results. She cannot decide which set to add to the regression suite.
Problem D: A junior tester discovered that if they run the claim amount calculation prompt at 9am they get one set of boundary values, but if they run it at 3pm they get a different set. She has been manually averaging the outputs to decide which test cases to keep.
Show model answer — correct classifications
Problem A: Shadow AI + Privacy breach (NZ Privacy Act 2020). Using the free tier of ChatGPT without organisational approval is Shadow AI — an unapproved tool handling work data. Pasting real Toka Tū Ake EQC claim numbers, property addresses, and photos of real claimants into a public AI tool is a Privacy Act 2020 breach: the claimants have not consented to their personal information being sent to a third-party AI provider. This likely breaches Information Privacy Principles 5 (storage and security), 11 (limits on disclosure), and 12 (disclosure outside New Zealand).
Problem B: Hallucination. The AI generated a specific dollar figure ($150,000) that no longer matches the current Toka Tū Ake EQC scheme. The residential building cap increased to $300,000 (plus GST) from 1 October 2022, and land cover is assessed separately. The AI's confident but outdated figure is a classic hallucination — plausible, specific, and wrong. Using this test case would mean the test suite validates incorrect business logic.
Problem C: Non-determinism. The same prompt producing three different outputs on three runs is a direct consequence of LLM non-determinism (temperature > 0). The mitigation is to save and version the first acceptable output rather than re-running. The test lead should pick one output, review it against the spec, and commit it to the regression suite — not run the prompt again hoping for a "better" version.
Problem D: Non-determinism (with a misguided workaround). This is also non-determinism, but with a dangerous response: the tester is averaging outputs. You cannot average test cases — each run produces a distinct set of scenarios. Averaging them creates a hybrid set that was never validated as coherent. The correct response is: set temperature to 0 if the tool allows it, or accept one run's output after review. The time-of-day variation is likely server-side randomness, not a meaningful signal.
10 Self-Check
Click each question to reveal the answer.
Q1: What is the difference between an AI hallucination and a reasoning error?
A hallucination is factually incorrect content — the model generates field names, values, or rules that do not exist. A reasoning error is when the model has the right facts but draws the wrong conclusion — for example, generating boundary values at -1 and 101 for a 0–100 range while missing the actual boundaries at 0 and 100.
Q2: Under the NZ Privacy Act 2020, why is pasting real customer data into a public AI tool problematic?
Customers have not consented to their personal information being sent to an AI provider for training or logging purposes. This likely breaches the Act's limits on use and disclosure of personal information (Information Privacy Principles 10 and 11) and may also engage the storage and security principle (5). Always use synthetic or anonymised test data when prompting AI tools.
Q3: What is shadow AI and why is it a risk?
Shadow AI is using AI tools that have not been approved by your organisation's security and compliance team. The risks include: data leakage via tool training on your inputs, non-compliance with data residency requirements, intellectual property exposure, and undefined liability in the event of a breach.
Q4: Why is non-determinism a challenge for AI-assisted test generation, and how do you mitigate it?
Running the same prompt twice produces different outputs, making it impossible to reproduce a specific generation session. Mitigation: save and version all AI-generated outputs, set temperature to 0 where possible, and treat the saved output as the test artefact rather than re-running the prompt when you need the test cases again.
Q5: Name two AI governance frameworks relevant to NZ testers.
Any two of: NZ Privacy Act 2020 (personal information governance), EU AI Act (for organisations with EU exposure), ISO/IEC 42001 (AI management systems), NIST AI Risk Management Framework (widely referenced in NZ government), NZ Government Algorithm Charter (government agencies using algorithmic decision-making).
11 Interview Prep
Real questions asked in NZ QA interviews. Read the model answers, then practise your own version.
"How do you handle AI hallucinations in your testing workflow?"
I treat all AI-generated test artefacts as first drafts that require human review before they enter the test suite. I cross-reference every generated test case against the actual specification, and I ask the model to explain its reasoning for each case — which surfaces hallucinations quickly because the model cannot credibly cite a requirement that does not exist. Surprising output goes back to the spec before it is dismissed.
"What would you do if a team member suggested using a free public AI tool to generate test cases for a banking application?"
I would raise it as a risk before any test data is used. I would check whether the tool's terms of service allow training on user inputs, whether it meets our organisation's data handling policies, and whether it has been approved by our security team. If not, I would suggest using our approved tooling instead — and explain the specific risk: one prompt containing real customer account data sent to an unapproved tool could constitute a Privacy Act breach.
"How does the NZ Privacy Act 2020 affect your use of AI in testing?"
It means I never use real customer data when prompting AI tools, even in internal testing environments. I use synthetic or anonymised test data. I also consider where the AI provider processes data — government and health contracts often have NZ data residency requirements that public cloud AI services may not satisfy. Before using any new AI tool, I confirm it has been reviewed against our data processing obligations.