GenAI Foundations for Testers
Before you can use AI tools safely in testing, you need to understand what they actually are. This is not academic theory — it is the foundation that separates testers who use AI deliberately from those who cause incidents.
1 The Hook
A Wellington government agency was excited about AI. Their business analyst, eager to move faster, used a publicly available AI chatbot to help write test cases for their RealMe integration. It seemed harmless enough — she pasted the full acceptance criteria into the chat window, got back a structured set of test cases, and felt productive. The team was delighted.
Three weeks later, someone in the security team read the terms of service for the tool. The terms allowed the provider to use inputs to train its models. Confidential requirements for a government identity system, one that handles authentication for millions of New Zealanders, had been sent to a third-party model operating under a consumer licence. The agency had to notify the Privacy Commissioner. The notification process consumed weeks of legal and leadership time. The analyst who had used the tool did not lose her job, but she was never quite trusted with a security-sensitive system again.
The failure was not carelessness. She had done something sensible on the surface: used a capable tool to do useful work. The failure was a gap in foundational knowledge. She did not know that a consumer AI chatbot and a properly governed enterprise AI tool are fundamentally different things — different data handling, different terms, different risk profiles. She did not know what an LLM was, where it ran, or who owned the data she fed it.
This is the problem with treating AI as a magic oracle. When something works, you stop asking questions. You stop asking where the computation happens, what happens to your input, what the model was trained on, and what it is incapable of knowing. Those questions are not theoretical. They are the difference between using AI as a professional and using it as a liability.
This page is about the foundations that make you the former. Understanding what generative AI is — and what it is not — lets you choose the right tool, prompt it effectively, evaluate its output critically, and explain your choices to a sceptical security team. Without this foundation, every subsequent lesson in AI-assisted testing is built on sand.
2 The Rule
Generative AI is a type of deep learning that generates new content by learning patterns from training data. Large language models (LLMs) are generative AI models trained on text. Testers who understand what an LLM is — and is not — can use it deliberately, evaluate its output critically, and avoid the harms that come from treating it as a magic oracle.
The word "generative" is doing important work in that definition. Unlike earlier AI systems that classified, predicted, or recommended from a fixed set of options, generative AI creates new artefacts — text, code, images, audio — that did not exist before. An LLM generating test cases is not looking up test cases from a database. It is constructing new text that matches the statistical patterns it learned during training.
This distinction matters for testing in two ways. First, it means the output can be genuinely useful: the model can produce test cases, Gherkin scenarios, or test data that would have taken you an hour to write manually. Second, it means the output can be confidently wrong. The model does not check its output against reality. It produces what looks right. A hallucinated field name or an invented business rule will appear in the output with exactly the same confidence as a correct one.
Testers are trained to be sceptical of systems under test. The same scepticism applies to AI tools. The foundation for that scepticism is understanding how these models work.
3 The Analogy
Knowing the difference between an AI chatbot and an LLM-powered test tool is like knowing the difference between a kitchen knife and a scalpel.
Both cut. In the right hands, both do useful work. But a surgeon who reaches for a kitchen knife when a scalpel is needed will cause harm — not because knives are dangerous, but because the wrong tool applied to a sensitive task produces unpredictable damage. And a chef who insists on using a scalpel to prepare vegetables is making their job unnecessarily difficult. The tool is only as safe as the person choosing when and how to use it. GenAI fluency means knowing which tool belongs in which situation, not just knowing that AI tools exist.
4 The AI Spectrum
Not all AI is the same. The ISTQB CT-GenAI syllabus identifies four categories that every tester working with AI should understand.
Symbolic AI (rule-based)
Encodes human knowledge as explicit rules. An early example: a system that calculates whether a taxpayer qualifies for Working for Families by checking income thresholds, number of dependent children, and residency status. Every decision is traceable to a rule. Symbolic AI is deterministic, auditable, and brittle — change the rules in the legislation and someone has to update the code manually.
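To make "every decision is traceable to a rule" concrete, here is a minimal sketch of a rule-based check. The thresholds, the `Applicant` fields, and the `assess` function are hypothetical placeholders invented for illustration; they are not the real Working for Families rules.

```python
from dataclasses import dataclass

# Placeholder values for illustration only; NOT the real legislative thresholds.
INCOME_LIMIT = 100_000
MIN_DEPENDENT_CHILDREN = 1


@dataclass
class Applicant:
    annual_income: int
    dependent_children: int
    is_nz_resident: bool


def assess(applicant: Applicant) -> tuple[bool, list[str]]:
    """Return the decision plus the rules that failed, so every outcome is traceable."""
    failed = []
    if applicant.annual_income > INCOME_LIMIT:
        failed.append("income above limit")
    if applicant.dependent_children < MIN_DEPENDENT_CHILDREN:
        failed.append("no dependent children")
    if not applicant.is_nz_resident:
        failed.append("not NZ resident")
    return (not failed, failed)


print(assess(Applicant(annual_income=50_000, dependent_children=2, is_nz_resident=True)))
```

The same input always gives the same answer, and the list of failed rules explains why. The brittleness is also visible: if the legislation changes, someone has to edit the constants and conditions by hand.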
Classical Machine Learning (data-driven)
Learns patterns from historical data to make predictions. Example: a defect prediction model trained on your team's issue tracker. It learns which code modules have historically had high defect rates and flags them for extra testing attention in the next sprint. No rules were written by hand — the model inferred the patterns. Classical ML classifies, regresses, or clusters; it does not create new content.
Deep Learning (neural networks)
A subset of machine learning using many-layered neural networks capable of learning complex representations from raw data. Example: image-based UI testing tools that compare screenshots pixel-by-pixel and use deep learning to distinguish intentional design changes from visual regressions. Deep learning powers the visual AI engines in tools like Applitools. It requires substantial training data and compute, but once trained, it can detect subtle UI defects a human reviewer would miss.
Generative AI (creates new content)
Learns patterns from vast amounts of data and uses them to generate new artefacts. For testers: paste in acceptance criteria, receive a draft set of test cases. Paste in a manual test case, receive a Gherkin scenario. Describe a bug, receive a suggested root cause. The model was trained once (on enormous data) and you use the pre-trained result immediately — no training phase required on your side. This is the key practical difference: you can start using a GenAI tool today with zero ML expertise.
The critical distinction for testers: because the model is pre-trained, you cannot control what it learned. It was trained largely on public data, not on your organisation's internal systems, business rules, or test standards. Anything domain-specific must be provided in your prompt.
5 How LLMs Work
You do not need to understand transformer architecture to use LLMs effectively. But three concepts will directly affect the quality of your testing work.
Tokenisation and context windows
LLMs do not process text character by character or word by word. They process tokens: chunks of text that typically correspond to common words or word fragments. A common word such as "Testing" is usually a single token, while a less common one such as "KiwiSaver" might be split into two or three; the exact split depends on the model's tokeniser. This matters because every LLM has a context window limit, a maximum number of tokens it can process in a single interaction.
A 10,000-token context window is roughly 7,500 words of English. That sounds generous until you paste in a 120-page functional specification. If your input exceeds the context window, some APIs reject the request with an error, but many chat tools silently truncate the input or drop the earliest content. In that case there is no warning: the model generates test cases from whatever portion of the document it received and ignores the rest.
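The arithmetic above can be turned into a quick pre-flight check. This is a rough sketch that assumes the 0.75 words-per-token heuristic implied by the figures in the text and assumes the model's reply shares the same window as the input. The function names and the `reserved_for_output` default are hypothetical, and a real tokeniser will give different counts.

```python
import math

WORDS_PER_TOKEN = 0.75  # heuristic from the text: 10,000 tokens is about 7,500 words


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokeniser)."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_context(text: str, context_limit: int, reserved_for_output: int = 1_000) -> bool:
    """True if the estimated input still leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= context_limit


spec = "word " * 7_500  # stand-in for a 7,500-word document
print(estimate_tokens(spec))       # 10000
print(fits_context(spec, 10_000))  # False: no room left for the reply
```

Treat the result as an early warning only. If the estimate is anywhere near the limit, split the document rather than trusting the heuristic.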
Foundation, instruction-tuned, and reasoning models
A foundation model is trained on raw text from the internet. It learns to predict the next token given the previous ones. In isolation, a foundation model is not very useful for test tasks — it will complete your sentence rather than answer your question.
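"Predict the next token given the previous ones" can be illustrated with a toy model. The sketch below is not how a real LLM is built (those use neural networks over sub-word tokens and far more context); it only counts which word followed which in a tiny made-up corpus. It does show the key behaviour: the output is whatever was most common in training, not something checked against reality.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: list[str]) -> dict[str, Counter]:
    """Count which word follows which across the training sentences."""
    follows: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(model: dict[str, Counter], word: str):
    """Return the most frequent follower seen in training, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Tiny illustrative corpus (invented for this example).
corpus = [
    "the field is required",
    "the field is mandatory",
    "the field is required",
    "the form is submitted",
]
model = train_bigram(corpus)
print(predict_next(model, "is"))         # required
print(predict_next(model, "kiwisaver"))  # None: never seen in training
```

The toy model answers "required" because that pattern was most frequent, whether or not the field in your system is actually required.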
An instruction-tuned model has been further trained using human feedback to follow instructions. When you ask it to "generate 10 boundary value test cases for this field", it actually does that. Almost every AI assistant you interact with today is instruction-tuned. These are the models you will use for testing tasks.
A reasoning model is instruction-tuned with additional training that causes it to think through multi-step problems before generating a final answer. For complex test design tasks — such as deriving equivalence partitions from ambiguous requirements — a reasoning model will often produce more thorough and logically consistent output, at the cost of higher latency and token usage.
Multimodal LLMs
Most recent LLMs accept not just text but also images. This opens a direct use case for UI testing: take a screenshot of a broken or suspect screen, attach it to your prompt, and ask the model to identify accessibility issues, describe what a user would see, or compare it against a design specification. For testers who work across web and mobile, multimodal capability significantly expands what AI assistance can do beyond generating text artefacts.
Multimodal A11y (Accessibility) Testing
You can use multimodal models to perform a "first pass" accessibility audit. Upload a screenshot and ask: "Identify potential WCAG 2.1 violations in this UI, focusing on colour contrast, missing labels for icons, and clear navigation paths." While not a replacement for screen reader testing, it identifies "low-hanging fruit" defects seconds after a build is deployed.
6 LLMs for Test Tasks
Knowing what LLMs can and cannot do well is a core professional competency for AI-assisted testing. Overestimating capability leads to incidents. Underestimating it means leaving genuine productivity gains on the table.
What LLMs are genuinely good at (for testing)
- Generating test case drafts from requirements text
- Reformatting test cases (plain English to Gherkin, manual steps to automation script skeletons)
- Summarising defect patterns from a list of bug descriptions
- Writing synthetic test data (fake NZ addresses, phone numbers, IRD numbers, bank account numbers)
- Reviewing acceptance criteria for ambiguity, contradictions, or missing edge cases
- Translating test intent across formats and frameworks
- Explaining error messages, stack traces, and log output in plain language
What LLMs are NOT good at
- Guaranteeing correctness. Hallucination is not a bug that will be fixed — it is an inherent property of probabilistic text generation. Every output requires human review.
- Knowing your system's specific business rules unless you provide them explicitly in the prompt. The model has never seen your application.
- Replacing human judgment on risk. Deciding which tests matter most, what to skip, and what constitutes acceptable quality requires contextual understanding the model does not have.
- Running tests. On its own, an LLM generates text. It cannot execute test scripts, interact with your system, or verify that a test actually passes unless a surrounding tool does that for it. It produces artefacts for you to run.
AI chatbots vs LLM-powered test tools
This distinction is critical. AI chatbots (general-purpose consumer tools) are trained on broad internet data. They have no knowledge of your CI/CD pipeline, test framework, project conventions, or system under test. They are excellent for general-purpose tasks but produce generic output when asked about your specific system.
LLM-powered test platforms embed an LLM into a test tool that already understands test structure, integrates with your CI/CD pipeline, and can observe your actual test results. The LLM in such a platform has domain context the general chatbot lacks. The output is correspondingly more actionable. When choosing between the two, ask: does this tool know what I am actually testing, or am I providing all that context myself?
7 Worked Example
You are testing a KiwiSaver online enrolment form for an NZ fund manager. Here is how to use an AI model effectively for this task.
A good prompt
A well-structured prompt includes role, context, instruction, constraints, and output format. Every element is doing work:
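One way to keep all five elements present every time is to assemble the prompt from named parts. The sketch below reuses the wording of the model answer given later on this page; the `StructuredPrompt` class and its `render` method are hypothetical helpers, not part of any tool.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredPrompt:
    """Holds the five prompt elements so none is forgotten."""
    role: str
    context: str
    instruction: str
    constraints: list[str] = field(default_factory=list)
    output_format: str = ""

    def render(self) -> str:
        """Join the elements into one prompt string, one labelled block each."""
        constraint_lines = "\n".join(f"- {c}" for c in self.constraints)
        return (
            f"{self.role}\n\n"
            f"Context: {self.context}\n\n"
            f"Instruction: {self.instruction}\n\n"
            f"Constraints:\n{constraint_lines}\n\n"
            f"Output format: {self.output_format}"
        )


prompt = StructuredPrompt(
    role="You are a senior software tester specialising in NZ financial services.",
    context=(
        "I am testing a KiwiSaver online enrolment form. The IRD number field "
        "accepts NZ IRD numbers, which are 8 or 9 digits long. The last digit is "
        "a check digit calculated using a specific weighting algorithm. "
        "The field is mandatory."
    ),
    instruction="Generate boundary value test cases for the IRD number field.",
    constraints=[
        "Cover: too short (7 digits), minimum valid length (8 digits), "
        "maximum valid length (9 digits), too long (10 digits)",
        "Cover: valid check digit, invalid check digit",
        "Cover: non-numeric input (letters, special characters)",
        "Cover: empty field (mandatory validation)",
    ],
    output_format=(
        "A table with columns: Test Case ID | Input Value | "
        "Expected Result | Test Type"
    ),
)
print(prompt.render())
```

Keeping the elements as separate fields makes a missing one obvious in review, and lets a team reuse the same role and output format across prompts.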
This kind of prompt produces output like:
| Test Case ID | Input | Expected Result | Test Type |
|---|---|---|---|
| TC-IRD-01 | 123456785 (valid 9-digit, valid check digit) | Accepted, form advances | Happy path |
| TC-IRD-02 | 12345674 (valid 8-digit, valid check digit) | Accepted, form advances | Happy path |
| TC-IRD-03 | 1234567 (7 digits — below minimum) | Validation error: "IRD number must be 8 or 9 digits" | Boundary value |
| TC-IRD-04 | 1234567890 (10 digits — above maximum) | Validation error: "IRD number must be 8 or 9 digits" | Boundary value |
| TC-IRD-05 | 12345678X (contains non-numeric character) | Validation error: "IRD number must contain digits only" | Invalid format |
| TC-IRD-06 | 123456789 (9-digit, invalid check digit) | Validation error: "IRD number is not valid" | Invalid check digit |
| TC-IRD-07 | (empty — field left blank) | Validation error: "IRD number is required" | Required field |
| TC-IRD-08 | 000-000-000 (formatted with hyphens) | Either accepted with formatting stripped, or validation error — confirm with BA | Edge case |
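A model can label an input "valid check digit" without computing anything, so values like those in the table are worth verifying independently before they enter a test suite. The sketch below assumes the modulus-11 check digit scheme as commonly described for NZ IRD numbers: primary weights 3, 2, 7, 6, 5, 4, 3, 2, with a secondary weight set used when the first pass yields 10. Confirm the weights and rules against the current Inland Revenue specification before relying on it. The official specification is also understood to include a numeric range check, which this sketch omits.

```python
# Weights are an assumption based on the commonly described IRD scheme;
# verify against the current Inland Revenue specification.
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)


def _check_digit(base: str, weights: tuple) -> int:
    """Weighted sum modulo 11; a result of 10 means 'try again or reject'."""
    remainder = sum(int(d) * w for d, w in zip(base, weights)) % 11
    return 0 if remainder == 0 else 11 - remainder


def is_valid_ird(number: str) -> bool:
    """Check length, digits-only, and check digit. Range check not modelled."""
    if not (number.isascii() and number.isdigit()) or len(number) not in (8, 9):
        return False
    base = number[:-1].zfill(8)  # pad 7-digit bases with a leading zero
    expected = _check_digit(base, PRIMARY_WEIGHTS)
    if expected == 10:
        expected = _check_digit(base, SECONDARY_WEIGHTS)
        if expected == 10:
            return False
    return expected == int(number[-1])


for candidate in ("49091850", "136410132", "136410133", "12345678X"):
    print(candidate, is_valid_ird(candidate))
```

Running every "valid" and "invalid" input from an AI-generated table through a checker like this is a cheap way to catch a hallucinated test value before a reviewer has to.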
A bad prompt for comparison
This is bad for three reasons. There is no context (the model does not know what form, what fields, or what business rules apply). There are no constraints (the model will invent typical field names that may not match your system). There is no output format (you will receive prose test descriptions, not a usable table). The model will generate generic, plausible-looking test cases that may have no relevance to your actual application.
8 Common Mistakes
🚫 Using a consumer AI chatbot for confidential test data
What happens: Consumer tools may use your inputs to train their models. Pasting real customer data, proprietary acceptance criteria, or confidential requirements into a public AI tool can breach the NZ Privacy Act 2020, violate your organisation's information security policy, and — as the Wellington agency discovered — require Privacy Commissioner notification.
Correction: Use only approved, governed tools for sensitive work. If your organisation has not approved an AI tool for use with sensitive information, treat it as unapproved. Always use synthetic or anonymised data when generating test inputs with AI.
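As a last line of defence, some teams scrub obvious identifiers from text before it goes anywhere near a prompt. The sketch below is deliberately simple: the two patterns and the `redact` function are hypothetical examples, and pattern matching of this kind will miss names, addresses, and anything else it was not written for. It does not turn an unapproved tool into an approved one.

```python
import re

# Hypothetical patterns for illustration; real data needs a far broader set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IRD_LIKE_RE = re.compile(r"\b\d{2,3}-?\d{3}-?\d{3}\b")  # 8 or 9 digits, optional hyphens


def redact(text: str) -> str:
    """Replace email addresses and IRD-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IRD_LIKE_RE.sub("[IRD]", text)


print(redact("Contact jane.doe@example.com, IRD 136-410-132"))
# Contact [EMAIL], IRD [IRD]
```

Synthetic data created from scratch remains the safer option; redaction only reduces the damage when real text slips through.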
🚫 Treating AI output as ground truth
What happens: A test case that looks correct may contain invented field names, wrong HTTP status codes, fictional business rules, or expected values that bear no relation to your system. The model produces confident text — not verified facts.
Correction: Every AI-generated test artefact must be reviewed by a human who knows the system before it is added to the test suite. Treat AI output as a first draft written by a capable contractor who has never seen your application.
🚫 Using the wrong tool type
What happens: A general AI chatbot has no knowledge of your test framework, CI/CD pipeline, or project conventions. Asking it to generate Playwright tests for your specific component structure will produce syntactically correct code that does not match your codebase.
Correction: For structured test automation tasks, prefer LLM-powered test platforms with domain context, or provide explicit framework context in your prompt. For ad-hoc test case drafting and reformatting, a general chatbot is often sufficient — but verify everything.
🚫 Ignoring the context window limit
What happens: If your requirements document is longer than the model's context window, the tool may silently truncate it. You receive test cases for an incomplete specification and have no way of knowing which requirements were dropped.
Correction: Check document length before prompting. Break large specifications into logical sections and prompt per section. Explicitly tell the model which section you are providing: "This is Section 3 of 5 of the functional specification."
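The per-section approach can be scripted so the label is never forgotten. This sketch assumes the specification has already been exported as text with a known delimiter line between sections; the `"\n---\n"` delimiter and both function names are hypothetical, so adapt the splitting to however your document is actually structured.

```python
def split_sections(spec: str, delimiter: str = "\n---\n") -> list[str]:
    """Split the document on the delimiter and drop empty parts."""
    return [part.strip() for part in spec.split(delimiter) if part.strip()]


def label_sections(sections: list[str],
                   doc_name: str = "the functional specification") -> list[str]:
    """Prefix each section with its position so the model knows what it has."""
    total = len(sections)
    return [
        f"This is Section {i} of {total} of {doc_name}.\n\n{body}"
        for i, body in enumerate(sections, start=1)
    ]


spec = "Introduction\n---\nLogin rules\n---\nEnrolment rules"
for prompt_text in label_sections(split_sections(spec)):
    print(prompt_text)
    print("=====")
```

Each labelled section then becomes the context part of its own prompt, and the "Section 3 of 5" prefix makes it obvious in the output which part of the specification a set of test cases came from.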
9 Now You Try
You are testing a KiwiSaver online enrolment form. The IRD number field accepts 8- or 9-digit NZ IRD numbers. The last digit is a check digit. Write a prompt below and run it against a real AI model, then evaluate the output critically. Does it cover the right boundaries? Did the AI hallucinate any rules?
Model answer prompt
You are a senior software tester specialising in NZ financial services.
Context: I am testing a KiwiSaver online enrolment form. The IRD number field accepts NZ IRD numbers, which are 8 or 9 digits long. The last digit is a check digit calculated using a specific weighting algorithm. The field is mandatory.
Instruction: Generate boundary value test cases for the IRD number field.
Constraints:
- Cover: too short (7 digits), minimum valid length (8 digits), maximum valid length (9 digits), too long (10 digits)
- Cover: valid check digit, invalid check digit
- Cover: non-numeric input (letters, special characters)
- Cover: empty field (mandatory validation)
Output format: A table with columns: Test Case ID | Input Value | Expected Result | Test Type
10 Self-Check
Q1. What is the difference between classical machine learning and generative AI?
Classical ML learns patterns to make predictions on existing categories (classification, regression). Generative AI learns patterns to create new content (text, code, images). For testers: classical ML can predict which tests are likely to find bugs; generative AI can write the test cases themselves. They serve different purposes and should not be confused.
Q2. Why does tokenisation matter when using an LLM for testing?
Models have a context window limit measured in tokens. If your requirements document exceeds the limit, the tool may silently truncate the input and generate test cases from incomplete requirements, with no warning. Always check document length before prompting and break large specs into sections.
Q3. What is the difference between a foundation LLM and an instruction-tuned LLM?
A foundation model is trained on raw text and predicts the next token — it needs careful prompting to be useful for task-directed work. An instruction-tuned model has been fine-tuned to follow instructions, making it far more useful for test tasks out of the box. Most AI assistants you interact with are instruction-tuned. Reasoning models go further, thinking through multi-step problems before generating a final answer.
Q4. Name three test tasks where generative AI provides genuine value.
Any three of: Generating test case drafts from requirements, converting manual test cases to Gherkin format, writing synthetic test data, reviewing acceptance criteria for ambiguity or contradiction, summarising defect patterns from a bug list, generating test script skeletons from plain-English descriptions, explaining error messages and stack traces in plain language.
Q5. Why should you not paste real customer data into a public AI chatbot when generating test data?
Consumer AI tools may use your inputs to train their models. Pasting real customer data into one can breach the NZ Privacy Act 2020 and the customer's reasonable expectation of privacy. It may also breach your organisation's information security policy and contractual obligations. Always use synthetic or anonymised data when working with AI tools, regardless of how trusted the tool appears.
11 Interview Questions
Common interview questions on GenAI foundations for testing roles.
Q: "How would you explain what a large language model is to a non-technical stakeholder?"
An LLM is a type of AI that has read enormous amounts of text and learned the patterns of language. It can complete sentences, answer questions, write code, and generate test cases — but it does not understand what it is writing the way a human does. It predicts what word should come next based on patterns learned during training. This means it is genuinely capable for many tasks, but also capable of generating plausible-sounding nonsense with complete confidence. Any output that matters needs a human to check it.
Q: "What would you check before using an AI tool to generate test cases for a banking application?"
I would check: (1) whether the tool is approved under our organisation's security and information handling policy; (2) whether it can accept sensitive input without using it for model training; (3) what the context window limit is and whether our specification fits within it; (4) whether I need to provide business rules explicitly in the prompt or whether the tool has embedded domain knowledge; and (5) what our review process is for AI-generated artefacts before they enter the test suite. I would not use an unapproved tool on a production banking system regardless of how capable it appeared.
Q: "Can you describe a risk of using an AI model for testing and how you would mitigate it?"
Hallucination — the model generates test cases with invented field names, fictional status codes, or wrong expected values. The output looks correct and is formatted correctly, which makes it easy to miss in review. Mitigation: every AI-generated test artefact is reviewed by a tester who knows the system before it is added to the suite. We treat AI output as a first draft, not a final artefact. For high-risk areas like financial calculations or authentication flows, we require the reviewing tester to trace each expected value back to a requirement.