Test with AI · GenAI Foundations

GenAI Foundations for Testers

Before you can use AI tools safely in testing, you need to understand what they actually are. This is not academic theory — it is the foundation that separates testers who use AI deliberately from those who cause incidents.

Test with AI CT-GenAI Ch 1 — GenAI-1.1.1 to 1.2.2 ~35 min read · ~60 min with exercises

1 The Hook

A Wellington government agency was excited about AI. Their business analyst, eager to move faster, used a publicly available AI chatbot to help write test cases for their RealMe integration. It seemed harmless enough — she pasted the full acceptance criteria into the chat window, got back a structured set of test cases, and felt productive. The team was delighted.

Three weeks later, someone in their security team actually read the terms of service for the tool they had been using. The terms allowed the provider to use inputs to train its models. Confidential requirements for a government identity system, one that handles authentication for millions of New Zealanders, had been sent to a third-party model operating under a consumer licence. The agency had to notify the Privacy Commissioner. The notification process consumed weeks of legal and leadership time. The analyst who had used the tool did not lose her job, but she was never quite trusted with a security-sensitive system again.

The failure was not carelessness. She had done something sensible on the surface: used a capable tool to do useful work. The failure was a gap in foundational knowledge. She did not know that a consumer AI chatbot and a properly governed enterprise AI tool are fundamentally different things — different data handling, different terms, different risk profiles. She did not know what an LLM was, where it ran, or who owned the data she fed it.

This is the problem with treating AI as a magic oracle. When something works, you stop asking questions. You stop asking where the computation happens, what happens to your input, what the model was trained on, and what it is incapable of knowing. Those questions are not theoretical. They are the difference between using AI as a professional and using it as a liability.

This page is about the foundations that make you the former. Understanding what generative AI is — and what it is not — lets you choose the right tool, prompt it effectively, evaluate its output critically, and explain your choices to a sceptical security team. Without this foundation, every subsequent lesson in AI-assisted testing is built on sand.

2 The Rule

Generative AI is a type of deep learning that generates new content by learning patterns from training data. Large language models (LLMs) are generative AI models trained on text. Testers who understand what an LLM is — and is not — can use it deliberately, evaluate its output critically, and avoid the harms that come from treating it as a magic oracle.

The word "generative" is doing important work in that definition. Unlike earlier AI systems that classified, predicted, or recommended from a fixed set of options, generative AI creates new artefacts — text, code, images, audio — that did not exist before. An LLM generating test cases is not looking up test cases from a database. It is constructing new text that matches the statistical patterns it learned during training.

This distinction matters for testing in two ways. First, it means the output can be genuinely useful: the model can produce test cases, Gherkin scenarios, or test data that would have taken you an hour to write manually. Second, it means the output can be confidently wrong. The model does not check its output against reality. It produces what looks right. A hallucinated field name or an invented business rule will appear in the output with exactly the same confidence as a correct one.

Testers are trained to be sceptical of systems under test. The same scepticism applies to AI tools. The foundation for that scepticism is understanding how these models work.

3 The Analogy

Knowing the difference between an AI chatbot and an LLM-powered test tool is like knowing the difference between a kitchen knife and a scalpel.

Both cut. In the right hands, both do useful work. But a surgeon who reaches for a kitchen knife when a scalpel is needed will cause harm — not because knives are dangerous, but because the wrong tool applied to a sensitive task produces unpredictable damage. And a chef who insists on using a scalpel to prepare vegetables is making their job unnecessarily difficult. The tool is only as safe as the person choosing when and how to use it. GenAI fluency means knowing which tool belongs in which situation, not just knowing that AI tools exist.

4 The AI Spectrum

Not all AI is the same. The ISTQB CT-GenAI syllabus identifies four categories that every tester working with AI should understand.

Symbolic AI (rule-based)

Encodes human knowledge as explicit rules. An early example: a system that calculates whether a taxpayer qualifies for Working for Families by checking income thresholds, number of dependent children, and residency status. Every decision is traceable to a rule. Symbolic AI is deterministic, auditable, and brittle — change the rules in the legislation and someone has to update the code manually.
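As a rough illustration of what "every decision is traceable to a rule" means, the sketch below encodes three eligibility rules by hand. The income threshold, the rule IDs, and the `Applicant` fields are all hypothetical and are not the real Working for Families criteria; the point is only that each outcome can be traced to a named rule, and that a legislative change means editing the code.

```python
from dataclasses import dataclass

# Invented for illustration only. NOT the real Working for Families threshold.
HYPOTHETICAL_INCOME_LIMIT = 80_000


@dataclass
class Applicant:
    annual_income: int
    dependent_children: int
    is_nz_resident: bool


def check_eligibility(applicant: Applicant) -> tuple[bool, list[str]]:
    """Return (eligible, reasons); every failing rule is listed by its ID."""
    reasons = []
    if not applicant.is_nz_resident:
        reasons.append("R1: applicant is not an NZ resident")
    if applicant.dependent_children < 1:
        reasons.append("R2: no dependent children")
    if applicant.annual_income > HYPOTHETICAL_INCOME_LIMIT:
        reasons.append("R3: income above threshold")
    return (not reasons, reasons)


print(check_eligibility(Applicant(50_000, 2, True)))
print(check_eligibility(Applicant(90_000, 0, True)))
```

Because the same input always yields the same output and the same reasons, a rule-based system like this is straightforward to test with conventional techniques such as decision tables.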

Classical Machine Learning (data-driven)

Learns patterns from historical data to make predictions. Example: a defect prediction model trained on your team's issue tracker. It learns which code modules have historically had high defect rates and flags them for extra testing attention in the next sprint. No rules were written by hand — the model inferred the patterns. Classical ML classifies, regresses, or clusters; it does not create new content.

Deep Learning (neural networks)

A subset of machine learning using many-layered neural networks capable of learning complex representations from raw data. Example: image-based UI testing tools that compare screenshots pixel-by-pixel and use deep learning to distinguish intentional design changes from visual regressions. Deep learning powers the visual AI engines in tools like Applitools. It requires substantial training data and compute, but once trained, it can detect subtle UI defects a human reviewer would miss.

Generative AI (creates new content)

Learns patterns from vast amounts of data and uses them to generate new artefacts. For testers: paste in acceptance criteria, receive a draft set of test cases. Paste in a manual test case, receive a Gherkin scenario. Describe a bug, receive a suggested root cause. The model was trained once (on enormous data) and you use the pre-trained result immediately — no training phase required on your side. This is the key practical difference: you can start using a GenAI tool today with zero ML expertise.

The critical distinction for testers: GenAI uses pre-trained models you can use immediately — no training phase needed — but this also means you cannot control what it learned. The model was trained on public internet data, not your organisation's internal systems, business rules, or test standards. Anything domain-specific must be provided in your prompt.

5 How LLMs Work

You do not need to understand transformer architecture to use LLMs effectively. But three concepts will directly affect the quality of your testing work.

Tokenisation and context windows

LLMs do not process text character by character or word by word. They process tokens: chunks of text that typically correspond to common words or word fragments. A common word such as "Testing" is usually a single token, while a less common one such as "KiwiSaver" might be split into two or three; the exact split depends on the model's tokeniser. This matters because every LLM has a context window limit, a maximum number of tokens it can process in a single interaction.

A 10,000-token context window is roughly 7,500 words of English text. That sounds generous until you paste in a 120-page functional specification. If your input exceeds the context window, many tools silently truncate it rather than report an error. You get no warning. The model generates test cases based on whatever portion of the document it received and discards the rest.

Pro tip: When prompting with a long test specification, check its length first: count the words and convert that to an approximate token count. Every model has a context limit. If your spec exceeds it, you risk silent truncation and test cases based on incomplete requirements. Break large specs into sections and prompt per section.
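A quick length check can be sketched in a few lines. The example below uses the rule of thumb from this section (10,000 tokens is roughly 7,500 words, so about 0.75 words per token). The function names and the `reserve_for_output` allowance are assumptions made for illustration, and a real tokeniser will give different counts, so treat the result as an estimate only.

```python
import math

# Ratio from the rule of thumb above: 7,500 words per 10,000 tokens.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_context(text: str, context_limit: int, reserve_for_output: int = 1000) -> bool:
    """True if the estimated prompt size leaves room for the model's reply.

    reserve_for_output is an assumed allowance, not a documented figure.
    """
    return estimate_tokens(text) + reserve_for_output <= context_limit


spec = "word " * 9000  # stand-in for a 9,000-word specification
print(estimate_tokens(spec))       # 12000
print(fits_context(spec, 10_000))  # False
```

If the check fails, that is the signal to split the specification before prompting rather than after discovering gaps in the generated test cases.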

Foundation, instruction-tuned, and reasoning models

A foundation model is trained on raw text from the internet. It learns to predict the next token given the previous ones. In isolation, a foundation model is not very useful for test tasks — it will complete your sentence rather than answer your question.
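To make "predict the next token" concrete, here is a deliberately tiny sketch: a word-level bigram counter. It is nothing like a real transformer (no neural network, no sub-word tokens), and the training sentences are invented, but it shows the core behaviour: the model returns the statistically most frequent continuation, not a verified fact.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = (
    "the user enters a valid password "
    "the user enters a valid email "
    "the user enters a valid password"
)
model = train_bigram(corpus)
print(predict_next(model, "valid"))   # password (seen twice, email once)
print(predict_next(model, "refund"))  # None: the word never appeared in training
```

The toy model answers "password" because that is the most common pattern in its data, regardless of whether it is correct for the form you are actually testing.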

An instruction-tuned model has been further trained using human feedback to follow instructions. When you ask it to "generate 10 boundary value test cases for this field", it actually does that. Almost every AI assistant you interact with today is instruction-tuned. These are the models you will use for testing tasks.

A reasoning model is instruction-tuned with additional training that causes it to think through multi-step problems before generating a final answer. For complex test design tasks — such as deriving equivalence partitions from ambiguous requirements — a reasoning model will often produce more thorough and logically consistent output, at the cost of higher latency and token usage.

Multimodal LLMs

Most recent LLMs accept not just text but also images. This opens a direct use case for UI testing: take a screenshot of a broken or suspect screen, attach it to your prompt, and ask the model to identify accessibility issues, describe what a user would see, or compare it against a design specification. For testers who work across web and mobile, multimodal capability significantly expands what AI assistance can do beyond generating text artefacts.

Multimodal A11y (Accessibility) Testing

You can use multimodal models to perform a "first pass" accessibility audit. Upload a screenshot and ask: "Identify potential WCAG 2.1 violations in this UI, focusing on colour contrast, missing labels for icons, and clear navigation paths." While not a replacement for screen reader testing, it identifies "low-hanging fruit" defects seconds after a build is deployed.
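Colour contrast is one finding you can confirm deterministically instead of taking the model's word for it. The sketch below implements the WCAG 2.x relative luminance and contrast ratio formulas and compares the result with the 4.5:1 AA threshold for normal-size text. The function names are illustrative, and the colour values are assumed to have been sampled from the screenshot by some other means.

```python
def _linearise(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    r, g, b = (_linearise(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple, background: tuple) -> float:
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # minimum ratio for normal-size text at level AA

grey_on_white = contrast_ratio((119, 119, 119), (255, 255, 255))
print(round(grey_on_white, 2), grey_on_white >= AA_NORMAL_TEXT)
```

A check like this turns a model's "possible contrast issue" into a pass or fail result that can be attached to the defect report.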

6 LLMs for Test Tasks

Knowing what LLMs can and cannot do well is a core professional competency for AI-assisted testing. Overestimating capability leads to incidents. Underestimating it means leaving genuine productivity gains on the table.

What LLMs are genuinely good at (for testing)

Generating test case drafts from requirements text
Reformatting test cases (plain English to Gherkin, manual steps to automation script skeletons)
Summarising defect patterns from a list of bug descriptions
Writing synthetic test data (fake NZ addresses, phone numbers, Revenue NZ numbers, bank account numbers)
Reviewing acceptance criteria for ambiguity, contradictions, or missing edge cases
Translating test intent across formats and frameworks
Explaining error messages, stack traces, and log output in plain language

What LLMs are NOT good at

Guaranteeing correctness. Hallucination is not a bug that will be fixed — it is an inherent property of probabilistic text generation. Every output requires human review.
Knowing your system's specific business rules unless you provide them explicitly in the prompt. The model has never seen your application.
Replacing human judgment on risk. Deciding which tests matter most, what to skip, and what constitutes acceptable quality requires contextual understanding the model does not have.
Running tests. LLMs generate text. They cannot execute test scripts, interact with your system, or verify that a test actually passes. They produce artefacts for you to run.

AI chatbots vs LLM-powered test tools

This distinction is critical. AI chatbots (general-purpose consumer tools) are trained on broad internet data. They have no knowledge of your CI/CD pipeline, test framework, project conventions, or system under test. They are excellent for general-purpose tasks but produce generic output when asked about your specific system.

LLM-powered test platforms embed an LLM into a test tool that already understands test structure, integrates with your CI/CD pipeline, and can observe your actual test results. The LLM in such a platform has domain context the general chatbot lacks. The output is correspondingly more actionable. When choosing between the two, ask: does this tool know what I am actually testing, or am I providing all that context myself?

7 Worked Example

You are testing a KiwiSaver online enrolment form for an NZ fund manager. Here is how to use an AI model effectively for this task.

A good prompt

A well-structured prompt includes role, context, instruction, constraints, and output format. Every element is doing work:

Role: You are a senior software tester with expertise in financial services testing.
Context: I am testing a KiwiSaver online enrolment form for an NZ fund manager.
The form collects: full name, Revenue NZ number (8 or 9 digits), date of birth, employer name,
contribution rate (3%, 4%, 6%, 8%, or 10%), and bank account number.
Instruction: Generate a set of test cases for the Revenue NZ number field only.
Constraints: Focus on boundary values, invalid formats, and edge cases specific to
Revenue NZ numbers. Revenue NZ numbers are 8 or 9 digits. The last digit is a check digit
calculated using a specific algorithm.
Output format: A table with columns: Test Case ID, Input, Expected Result, Test Type.

This kind of prompt produces output like:

Test Case ID	Input	Expected Result	Test Type
TC-Revenue NZ-01	123456782 (valid 9-digit, valid check digit)	Accepted, form advances	Happy path
TC-Revenue NZ-02	12345678 (valid 8-digit, valid check digit)	Accepted, form advances	Happy path
TC-Revenue NZ-03	1234567 (7 digits — below minimum)	Validation error: "Revenue NZ number must be 8 or 9 digits"	Boundary value
TC-Revenue NZ-04	1234567890 (10 digits — above maximum)	Validation error: "Revenue NZ number must be 8 or 9 digits"	Boundary value
TC-Revenue NZ-05	12345678X (contains non-numeric character)	Validation error: "Revenue NZ number must contain digits only"	Invalid format
TC-Revenue NZ-06	123456789 (9-digit, invalid check digit)	Validation error: "Revenue NZ number is not valid"	Invalid check digit
TC-Revenue NZ-07	(empty — field left blank)	Validation error: "Revenue NZ number is required"	Required field
TC-Revenue NZ-08	000-000-000 (formatted with hyphens)	Either accepted with formatting stripped, or validation error — confirm with BA	Edge case
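Generated rows like these can be partly cross-checked in code before a human reviews them. The sketch below applies only the format rules stated in the prompt (mandatory, digits only, 8 or 9 digits) and compares the result with what each row claims. The category names, the order in which the rules are applied, and the shortened test case IDs are assumptions for illustration. The check digit is deliberately not verified here because the prompt does not spell out the algorithm, so the "valid check digit" and "invalid check digit" rows still need to be confirmed against the real rule. TC-08 is omitted because its expected result is an open question for the BA.

```python
def classify_input(value: str) -> str:
    """Classify an input using only the format rules stated in the prompt."""
    if value == "":
        return "required"
    if not (value.isascii() and value.isdigit()):
        return "digits_only"
    if len(value) not in (8, 9):
        return "length"
    # Format is acceptable; the check digit must still be verified separately.
    return "format_ok"


# (test case ID, input, category the generated table claims)
generated_rows = [
    ("TC-01", "123456782", "format_ok"),
    ("TC-02", "12345678", "format_ok"),
    ("TC-03", "1234567", "length"),
    ("TC-04", "1234567890", "length"),
    ("TC-05", "12345678X", "digits_only"),
    ("TC-07", "", "required"),
]


def find_mismatches(rows: list) -> list:
    """Return the IDs of rows whose claimed category disagrees with the rules."""
    return [case_id for case_id, value, claimed in rows
            if classify_input(value) != claimed]


print(find_mismatches(generated_rows))  # [] means the format claims are consistent
```

An empty result only says the length and format expectations agree with the stated rules; it says nothing about whether the sample numbers really pass or fail the check digit.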

A bad prompt for comparison

Write test cases for my form

This is bad for three reasons. There is no context (the model does not know what form, what fields, or what business rules apply). There are no constraints (the model will invent typical field names that may not match your system). There is no output format (you will receive prose test descriptions, not a usable table). The model will generate generic, plausible-looking test cases that may have no relevance to your actual application.

Pro tip: Always tell the model what you are testing, what the business rules are, and what format you want the output in. Without context, the model invents context — and its invented context may be wrong. A test case with the right structure and wrong expected value is worse than no test case at all, because it creates false confidence.
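One way to make the five-part structure hard to skip is to assemble prompts through a small helper that refuses to build a prompt with an empty element. This is a minimal sketch; `build_prompt` and its behaviour are hypothetical, not part of any tool or library.

```python
def build_prompt(role: str, context: str, instruction: str,
                 constraints: str, output_format: str) -> str:
    """Assemble a prompt from the five elements; reject any that are blank."""
    parts = {
        "Role": role,
        "Context": context,
        "Instruction": instruction,
        "Constraints": constraints,
        "Output format": output_format,
    }
    missing = [name for name, value in parts.items() if not value.strip()]
    if missing:
        raise ValueError(f"Prompt is missing: {', '.join(missing)}")
    return "\n".join(f"{name}: {value.strip()}" for name, value in parts.items())


prompt = build_prompt(
    role="You are a senior software tester with expertise in financial services testing.",
    context="I am testing an online enrolment form for an NZ fund manager.",
    instruction="Generate a set of test cases for one field only.",
    constraints="Focus on boundary values, invalid formats, and edge cases.",
    output_format="A table with columns: Test Case ID, Input, Expected Result, Test Type.",
)
print(prompt)
```

Passing only "Write test cases for my form" as the instruction and leaving the rest blank raises a `ValueError`, which is the point: the gap is caught before the model has a chance to fill it with invented context.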

8 Common Mistakes

🚫 Using a consumer AI chatbot for confidential test data

What happens: Consumer tools may use your inputs to train their models. Pasting real customer data, proprietary acceptance criteria, or confidential requirements into a public AI tool can breach the NZ Privacy Act 2020, violate your organisation's information security policy, and — as the Wellington agency discovered — require Privacy Commissioner notification.

Correction: Use only approved, governed tools for sensitive work. If your organisation has not approved an AI tool for use with sensitive information, treat it as unapproved. Always use synthetic or anonymised data when generating test inputs with AI.

🚫 Treating AI output as ground truth

What happens: A test case that looks correct may contain invented field names, wrong HTTP status codes, fictional business rules, or expected values that bear no relation to your system. The model produces confident text — not verified facts.

Correction: Every AI-generated test artefact must be reviewed by a human who knows the system before it is added to the test suite. Treat AI output as a first draft written by a capable contractor who has never seen your application.
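Part of that review can be automated as a pre-check. The sketch below scans generated test text for snake_case identifiers and reports any that are not in a list of fields known to exist in the system. It assumes field names follow a snake_case convention with at least one underscore, and the example field names are hypothetical; in practice the known set would come from your API schema or data dictionary. It supplements human review and does not replace it.

```python
import re

# Matches identifiers such as customer_id; assumes a snake_case naming convention.
SNAKE_CASE = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")


def unknown_fields(generated_text: str, known_fields: set) -> set:
    """Return snake_case identifiers in the text that are not known fields."""
    return set(SNAKE_CASE.findall(generated_text)) - known_fields


# Hypothetical schema and generated test step, for illustration only.
known = {"customer_id", "status_code"}
generated = "Verify that account_ref and customer_id are returned and status_code is 200."

print(unknown_fields(generated, known))  # {'account_ref'}
```

Anything the check reports is either a hallucinated field or a gap in the known-field list; both are worth resolving before the test case enters the suite.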

🚫 Using the wrong tool type

What happens: A general AI chatbot has no knowledge of your test framework, CI/CD pipeline, or project conventions. Asking it to generate Playwright tests for your specific component structure will produce syntactically correct code that does not match your codebase.

Correction: For structured test automation tasks, prefer LLM-powered test platforms with domain context, or provide explicit framework context in your prompt. For ad-hoc test case drafting and reformatting, a general chatbot is often sufficient — but verify everything.

🚫 Ignoring the context window limit

What happens: If your requirements document is longer than the model's context window, the model silently truncates it. You receive test cases for an incomplete specification and have no way of knowing which requirements were dropped.

Correction: Check document length before prompting. Break large specifications into logical sections and prompt per section. Explicitly tell the model which section you are providing: "This is Section 3 of 5 of the functional specification."
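The labelling step is easy to script once the specification has been split. The sketch below assumes the sections have already been separated into a list of strings (how you split them depends on your document format) and prefixes each with the "Section N of M" statement suggested above. The function name and default document name are illustrative.

```python
def label_sections(sections: list, doc_name: str = "the functional specification") -> list:
    """Prefix each section with its position so the model knows it is seeing a part."""
    total = len(sections)
    return [
        f"This is Section {number} of {total} of {doc_name}.\n\n{body.strip()}"
        for number, body in enumerate(sections, start=1)
    ]


# Hypothetical section texts, for illustration only.
parts = label_sections([
    "1. Enrolment form fields ...",
    "2. Validation rules ...",
    "3. Error messages ...",
])
for part in parts:
    print(part.splitlines()[0])
```

Each labelled part is then sent as its own prompt, so the generated test cases can be traced back to the section they were derived from.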

Senior engineer insight

The thing that changed how I think about this: an LLM is a next-token predictor, not a fact retrieval system. Once you genuinely internalise that, you stop being surprised by confident hallucinations and start designing your prompts to constrain the probability space. I had a team spend two weeks debugging test cases before they realised the model had invented a field called account_ref that did not exist in the API — it just looked right because similar fields exist in similar systems across the internet. Knowing whether you are dealing with a foundation model, an instruction-tuned model, or a reasoning model matters equally: the same prompt sent to each produces results with very different structure and depth, and the right choice depends entirely on your task.

The most common mistake: teams skip understanding the model type entirely, then blame “AI” when they should be blaming their tool selection.

From the field

A central government agency in Wellington was piloting an AI assistant to help testers draft acceptance criteria for a new digital identity service. The team assumed the model already knew how the GCDO foundation models guidance classified acceptable use — after all, it was a widely used public LLM. What they discovered was that the model’s training data predated the GCDO guidance entirely, so it generated acceptance criteria referencing data handling standards that had since been superseded. When the privacy team reviewed the draft, they flagged it: the criteria would have passed a system that violated the updated Information Security Manual requirements. The fix was simple but instructive — the team added the relevant guidance text directly into the system prompt as grounding context, and the model’s output became accurate immediately. The lesson that generalises: any domain-specific regulation, policy, or standard that postdates the model’s training cutoff is invisible to it unless you inject it explicitly into your prompt.

9 Now You Try

🤖 Live AI Prompt Lab — Revenue NZ Number Test Cases

You are testing a KiwiSaver online enrolment form. The Revenue NZ number field accepts Revenue NZ numbers of 8 or 9 digits. The last digit is a check digit. Write a prompt below and run it against a real AI model, then evaluate the output critically. Does it cover the right boundaries? Did the AI hallucinate any rules?

Model answer prompt:

You are a senior software tester specialising in NZ financial services.

Context: I am testing a KiwiSaver online enrolment form. The Revenue NZ number field accepts Revenue NZ numbers, which are 8 or 9 digits long. The last digit is a check digit calculated using a specific weighting algorithm. The field is mandatory.

Instruction: Generate boundary value test cases for the Revenue NZ number field.

Constraints:
- Cover: too short (7 digits), minimum valid length (8 digits), maximum valid length (9 digits), too long (10 digits)
- Cover: valid check digit, invalid check digit
- Cover: non-numeric input (letters, special characters)
- Cover: empty field (mandatory validation)

Output format: A table with columns: Test Case ID | Input Value | Expected Result | Test Type

Key takeaway

An LLM is a pattern-completion engine, not an oracle — it produces text that looks right, and your job as a tester is to verify that it actually is right; that distinction is the whole ballgame.

Why teams fail here

Treating hallucination as a fixable bug: Teams assume hallucinations will disappear with a better model or better prompt. They won’t — hallucination is an inherent property of probabilistic generation. Your review process must account for it, not hope it away.
Conflating model types: Using a foundation model when you need an instruction-tuned one, or a general chatbot when you need a reasoning model for complex test design. The outputs look similar enough that you miss the gap until you’re debugging wrong expected values in a test suite.
Ignoring context window boundaries: Pasting a full functional specification into a prompt without checking its length against the model’s token limit. The model silently truncates. You get test cases for 60% of your requirements and believe you have full coverage.
Not providing domain-specific rules in the prompt: The model has never seen your application, your data formats, or your business rules. If you don’t give it the actual check digit algorithm for Revenue NZ numbers, it will invent a plausible but wrong one, and the test case will look correct to anyone who doesn’t verify it.
Using consumer tools for sensitive work without governance: Teams reach for the most capable public chatbot without checking organisational policy. In NZ public sector and financial services contexts, this is not a minor oversight — it is a potential Privacy Act 2020 breach and a reportable incident.
Assuming the model knows current standards: Any regulation or guidance published after the model’s training cutoff — GCDO foundation model policy updates, NZ Privacy Act amendments, new WCAG versions — is invisible to it unless you include it explicitly in the prompt.

10 Self-Check

Q1. What is the difference between classical machine learning and generative AI?

Classical ML learns patterns from historical data to make predictions about new data (classification, regression, clustering). Generative AI learns patterns to create new content (text, code, images). For testers: classical ML can predict which modules or tests are most likely to reveal defects; generative AI can draft the test cases themselves. They serve different purposes and should not be confused.

Q2. Why does tokenisation matter when using an LLM for testing?

Models have a context window limit measured in tokens. If your requirements document exceeds the limit, the model silently truncates input and generates test cases based on incomplete requirements. You receive no warning. Always check document length before prompting and break large specs into sections.

Q3. What is the difference between a foundation LLM and an instruction-tuned LLM?

A foundation model is trained on raw text and predicts the next token — it needs careful prompting to be useful for task-directed work. An instruction-tuned model has been fine-tuned to follow instructions, making it far more useful for test tasks out of the box. Most AI assistants you interact with are instruction-tuned. Reasoning models go further, thinking through multi-step problems before generating a final answer.

Q4. Name three test tasks where generative AI provides genuine value.

Any three of: Generating test case drafts from requirements, converting manual test cases to Gherkin format, writing synthetic test data, reviewing acceptance criteria for ambiguity or contradiction, summarising defect patterns from a bug list, generating test script skeletons from plain-English descriptions, explaining error messages and stack traces in plain language.

Q5. Why should you not paste real customer data into a public AI chatbot when generating test data?

Consumer AI tools may use your inputs to train their models. Pasting real customer data can breach the NZ Privacy Act 2020 and the customer's reasonable expectation of privacy. It may also breach your organisation's information security policy and contractual obligations. Always use synthetic or anonymised data when working with AI tools, regardless of how trusted the tool appears.

11 Interview Questions

Common interview questions on GenAI foundations for testing roles.

Q: "How would you explain what a large language model is to a non-technical stakeholder?"

An LLM is a type of AI that has read enormous amounts of text and learned the patterns of language. It can complete sentences, answer questions, write code, and generate test cases — but it does not understand what it is writing the way a human does. It predicts what word should come next based on patterns learned during training. This means it is genuinely capable for many tasks, but also capable of generating plausible-sounding nonsense with complete confidence. Any output that matters needs a human to check it.

Q: "What would you check before using an AI tool to generate test cases for a banking application?"

I would check: (1) whether the tool is approved under our organisation's security and information handling policy; (2) whether it can accept sensitive input without using it for model training; (3) what the context window limit is and whether our specification fits within it; (4) whether I need to provide business rules explicitly in the prompt or whether the tool has embedded domain knowledge; and (5) what our review process is for AI-generated artefacts before they enter the test suite. I would not use an unapproved tool on a production banking system regardless of how capable it appeared.

Q: "Can you describe a risk of using an AI model for testing and how you would mitigate it?"

Hallucination — the model generates test cases with invented field names, fictional status codes, or wrong expected values. The output looks correct and is formatted correctly, which makes it easy to miss in review. Mitigation: every AI-generated test artefact is reviewed by a tester who knows the system before it is added to the suite. We treat AI output as a first draft, not a final artefact. For high-risk areas like financial calculations or authentication flows, we require the reviewing tester to trace each expected value back to a requirement.
