Prompt Engineering for Testing
The AI is not the problem. The prompt is. Master the deliberate craft of structuring instructions and you will generate better test artefacts in seconds than most testers produce in hours.
1 The Hook
An NZ insurance company gave their test team access to an AI assistant for test case generation. Two weeks later, the lead tester reported: "it doesn't work — the test cases are terrible." The manager sat with her and watched her type: "write test cases for the claims form." The AI produced five generic test cases. No claim types. No validation rules. No NZ-specific fields. Just five variations of "enter valid data, click submit, verify success."
The problem was not the AI. The problem was the prompt.
When the manager rewrote the prompt — including the actual claims fields, the validation rules, the NZ-specific business logic around ACC levies, and a specific output format — the AI produced 23 high-quality test cases in 40 seconds. The only difference was the prompt. The AI was the same. The model was the same. The company's subscription had not changed. The tester's access had not changed.
This is the most important lesson in the CT-GenAI syllabus: AI does not fail because it is a bad tool. It produces poor output because it receives poor input. The model has no knowledge of your system, your business rules, or your team's standards unless you tell it. Every gap in your prompt is a gap the model fills with a plausible-sounding guess — and in testing, a plausible-sounding guess is worse than no test case at all.
Prompt engineering is not a dark art reserved for AI specialists. It is a testable, learnable skill built on the same discipline that makes you a good tester: precision, completeness, and clarity of specification. You have already been writing prompts your entire career — you called them acceptance criteria.
2 The Rule
A prompt is an instruction to an AI model. Prompt engineering is the deliberate craft of structuring prompts to get accurate, relevant, and useful output. For testers, this is the most practical skill in the CT-GenAI syllabus — it directly determines the quality of every AI-assisted test artefact you produce.
A well-engineered prompt is not longer; it is more precise. It eliminates ambiguity, supplies context the model cannot infer, and specifies the exact output format your workflow needs. The difference between a vague prompt and a structured one is not effort — it is discipline.
3 The Analogy
Writing a prompt is like writing an acceptance criterion.
A vague acceptance criterion — "the form should work" — produces useless test cases. A specific one — "when a user submits an IRD number with fewer than 8 digits, the system must display the error message 'IRD number must be 8 or 9 digits'" — produces precise, verifiable tests. The same discipline that makes you a good tester makes you a good prompter. Specificity is the tool. Ambiguity is the enemy. You already know this. You just have not applied it to AI interactions yet.
4 Prompt Structure
CT-GenAI-2.1.1 defines six components of a well-structured prompt. Each component eliminates a different class of ambiguity. Together they give the model everything it needs to produce output you can actually use.
| Component | What it does | Example |
|---|---|---|
| Role | Sets the AI's expertise and perspective | "You are a senior tester specialising in NZ financial services" |
| Context | The system, technology, and business domain | "The system is an ANZ online banking portal built on React with a .NET backend" |
| Instruction | The specific task to perform | "Generate boundary value test cases for the password reset flow" |
| Input data | The spec, user story, or field definitions | Paste the actual user story or field list here |
| Constraints | NZ-specific rules, format restrictions, limits | "Passwords must meet NZ government NZISM guidelines, minimum 16 characters" |
| Output format | How you want the result delivered | "Output as a Gherkin Feature file with Given/When/Then" |
Before and After: ANZ Password Reset
Vague prompt
Write test cases for the password reset.
Structured prompt
Role: You are a senior tester specialising in NZ financial services.
Context: The system is the ANZ NZ online banking portal. The password reset flow is accessible from the login page and allows registered customers to reset their password via their registered email or registered NZ mobile number.
Instruction: Generate boundary value and negative test cases for the password reset flow.
Input data: Rules: minimum 10 characters, must contain 1 uppercase, 1 number, 1 special character; NZ mobile must be in 02x format; reset link expires after 15 minutes; maximum 3 reset attempts per hour.
Constraints: Include NZ-specific inputs (NZ mobile format, NZ email domains). Do not include any test cases that require access to the email server.
Output format: A numbered table with columns: Test ID, Test Description, Input, Expected Result, Test Type.
The structured prompt takes 90 seconds to write. It produces test cases you can use immediately without reformatting or fact-checking every field name.
5 Core Prompting Techniques
CT-GenAI-2.1.2 and 2.1.3 cover five prompting techniques. Each suits a different testing task. Choosing the wrong one is like choosing the wrong test design technique — you get output, but not the output you need.
Zero-shot prompting
No examples given. The model uses only its training knowledge and your instructions. Best for simple, well-defined tasks where the output format is universally understood.
Example: "Generate 5 boundary value test cases for a login form with a username field that accepts 6–30 characters."
Use when: quick, one-off tasks; simple test case generation; exploring what the model knows about a standard technology.
One-shot prompting
One example provided to show the required format. The model adapts to match it. Useful when your team uses a non-standard test case format that the model would not guess correctly.
Use when: you need a specific output structure and one example is enough to demonstrate it.
Few-shot prompting
Two to three examples establish a clear pattern. The model learns format, tone, level of detail, and domain vocabulary from your examples. Best for generating test cases in your team's exact format across many user stories.
Use when: generating multiple test artefacts consistently; the format is complex; first-draft quality must be high to save review time.
Prompt chaining
Break a complex task into sequential steps. Each step's output feeds the next. The model performs better on focused tasks than on instructions that require simultaneous multi-step reasoning.
Example sequence for a user story:
Step 1 → Analyse the user story and identify ambiguities and missing acceptance criteria
Step 2 → (feed Step 1 output) Write 5 specific acceptance criteria for the identified gaps
Step 3 → (feed Step 2 output) Generate Gherkin test scenarios for each acceptance criterion
Use when: complex test analysis; multi-step reasoning; output of one stage must be reviewed before proceeding.
Meta-prompting
Ask the model to improve your prompt before using it. Useful when you know your prompt is incomplete but are not sure what is missing.
Example: "Here is my prompt: [your prompt]. Improve it to produce better test cases for NZ financial services testing. Explain what you added and why."
Use when: onboarding to a new domain; building a reusable prompt for a new test area; your first draft is generating poor output.
System prompts vs user prompts
A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might say "You are a tester specialising in NZ government digital services. Always use NZ English. Never invent field names that are not in the specification." User prompts then send individual test tasks against that persistent context.
Use system prompts when: setting up a shared AI tool for a team; configuring a testing chatbot or IDE plugin; ensuring consistent persona across a long session.
6 Applying AI to Test Analysis
CT-GenAI-2.2.1 covers using AI to support test analysis: identifying missing acceptance criteria, ambiguous requirements, and untestable specifications before test design begins. This is where prompt chaining delivers the most value.
Worked Example: Kiwisaver Contribution Rate Change
User story: "As a Kiwisaver member, I want to change my contribution rate online so that my contributions reflect my current financial situation."
This story is incomplete. It is missing contribution rate options, effective date rules, confirmation requirements, and employer notification logic. A tester who writes test cases directly from this story will miss at least half the scenarios.
Prompt 1 — Analyse for gaps
Example output (Step 1)
Prompt 2 — Generate acceptance criteria for the gaps
Prompt 3 — Review the acceptance criteria
Human review is essential at Step 2
The model's Step 2 output may contain hallucinated business rules — plausible-sounding Kiwisaver rules that do not exist in the KiwiSaver Act 2006 or your specific provider's implementation. A tester with domain knowledge must review the generated acceptance criteria against the actual spec before they are used for test design. Skipping this review is the most common mistake in AI-assisted test analysis.
7 Applying AI to Test Design
CT-GenAI-2.2.2 and 2.2.3 cover AI-assisted test case generation, test suite prioritisation, and keyword-driven script generation. Each requires a different prompting strategy.
Test Case Generation: Few-shot with Gherkin
User story: "As a taxpayer, I want to submit my IR3 tax return online so that my tax position is assessed by IRD."
Provide two Gherkin examples to establish format, then ask for five more:
Example output (partial)
Test Suite Prioritisation via Prompt Chaining
Step 1: Generate the test cases (use any technique above).
Step 2: Rank by risk.
Step 3: Identify dependencies.
Keyword-driven Script Generation
8 Applying AI to Regression and Monitoring
CT-GenAI-2.2.4 and 2.2.5 address using AI to maintain regression suites and interpret test run data — two of the highest-effort, lowest-value activities in a mature test team's calendar.
Regression Suite Maintenance
Regression suites grow by accretion. Tests are added but rarely removed. After two years, a 600-test suite may contain 150 redundant or outdated cases. AI can identify candidates for removal:
Test Run Reporting for Stakeholders
Most stakeholders cannot interpret a JUnit XML report. AI can translate raw test results into plain-English executive summaries:
Selecting the Right Technique
| Scenario | Best technique |
|---|---|
| Quick, one-off task | Zero-shot |
| Need specific output format | Few-shot |
| Complex multi-step analysis | Prompt chaining |
| Repetitive task across many items | Meta-prompt + few-shot |
| Setting up a shared team tool | System prompt |
9 Evaluating and Refining Prompts
CT-GenAI-2.3.1 and 2.3.2 cover evaluating AI output and iteratively improving prompts. The first response is a starting point. Professional prompt engineering is a refinement loop.
Metrics for Evaluating AI Test Output
| Metric | What to check |
|---|---|
| Correctness | Does the output match your actual business rules and field definitions? |
| Completeness | Has it covered all equivalence partitions from the spec? |
| Format compliance | Does the output match the format you requested? |
| Hallucination rate | What percentage of test cases reference invented field names, values, or rules? |
| Relevance | Are all generated cases testable, non-redundant, and within scope? |
Iterative Refinement: Three-round Loop
Round 1 — Baseline
Prompt: "Write test cases for the IRD GST return form."
Output problem: Generic test cases with made-up field names. No NZ GST rules. Wrong output format.
Round 2 — Add context and constraints
Add: Role (NZ tax systems tester), Context (IRD myIR portal, 2-monthly GST filers), Input data (GST return fields from IRD website), Constraints (NZ GST rate 15%, return period must be 2 months), Output format (table with Test ID, Condition, Input, Expected Result).
Output improvement: Correct field names, correct GST rate, correct format. But missing boundary cases for the filing deadline.
Round 3 — Add missing coverage
Add: "Include test cases for: submission on the due date (28th of the month after period end), submission 1 day before due date, submission 1 day after due date, and submission when the due date falls on a weekend."
Output: Complete coverage including boundary dates, with accurate NZ business rules throughout.
10 Common Mistakes
🚫 Writing vague prompts and blaming the AI
The AI cannot read your mind. It has no knowledge of your system unless you tell it. Every missing piece of context is a gap the model will fill with a plausible-sounding guess. If the output is poor, the first question is: what did I not tell it? Add the role, context, and constraints that were missing and run it again before concluding the tool does not work.
🚫 Accepting first-draft output without review
Prompt engineering is iterative. The first response is a starting point, not a finished product. Review every draft for hallucinated field names, incorrect business rules, incomplete equivalence partition coverage, and format issues before copying anything into a test management tool. A 10-minute review is not optional — it is the step that makes AI-generated test cases safe to use.
🚫 Using the same prompt for every task
Different test tasks require different prompting strategies. Test analysis (prompt chaining to analyse ambiguity → write criteria → review) is a fundamentally different task from test case generation (few-shot to match format) or test reporting (single structured prompt with raw data). Using a test case prompt for test analysis, or a zero-shot approach where few-shot is needed, produces technically valid output that is practically useless.
🚫 Forgetting the output format instruction
Without a specified format, the model chooses its own — often a conversational paragraph or a format that requires significant reformatting before it fits your test management tool. Always specify: table with named columns, Gherkin Feature file, JSON array, numbered list, or Robot Framework .robot file. The format instruction is not optional; it is what converts AI output into a usable work product.
11 Now You Try
Three graded exercises. Each builds on the technique covered in this page. Run your prompt, read the AI feedback, then check the model answer.
Below is a weak prompt for a Kiwisaver contribution rate change feature. It will produce generic, unusable output. Your task: rewrite it in the textarea using the 6-component structure (Role, Context, Instruction, Input data, Constraints, Output format). Then run your version and compare the output quality.
Write test cases for the Kiwisaver contribution rate change.
Your structured rewrite (edit below, then run):
Show model answer
Role: You are a senior tester specialising in NZ financial services and KiwiSaver products. Context: The system is a KiwiSaver provider's online member portal. The contribution rate change feature allows members to change their contribution rate from the current rate to any of the prescribed rates under the KiwiSaver Act 2006: 3%, 4%, 6%, 8%, or 10%. Instruction: Generate boundary value and negative test cases for the contribution rate change feature. Input data: - Valid rates: 3%, 4%, 6%, 8%, 10% (no others are valid under the KiwiSaver Act 2006) - Members on a contributions holiday cannot change their rate without ending the holiday first - Rate changes take effect from the next pay period after the employer is notified - Members cannot select a rate identical to their current rate - Maximum 1 rate change per calendar month Constraints: Include NZ-specific rules only. Do not invent rates or rules not listed above. Output format: A numbered table with columns: Test ID | Test Description | Input | Expected Result | Test Type (Positive/Negative/Boundary)
You are testing an IRD myGST return submission form. One Gherkin example is given below. Add a second and third example to complete the few-shot pattern, then ask for 5 more scenarios covering specific edge cases. Run your completed prompt and check whether the AI matches your format.
Show model answer examples 2 & 3
--- EXAMPLE 2 --- Scenario: Submission blocked after filing deadline Given a registered GST-registered business with a 2-monthly filing period And the GST period ended 31 March And today's date is 1 June (32 days after the 28 April deadline) When the business attempts to submit the return Then the system displays "The filing deadline for this period has passed. Contact IRD to file a late return." And the Submit button is disabled --- EXAMPLE 3 --- Scenario: Submission blocked when GST number is invalid format Given a user attempts to register a new GST return And they enter a GST number of "12345" (fewer than 8 digits) When they click "Continue" Then the system displays "Please enter a valid NZ GST number (8 or 9 digits)" And the form does not advance to the return entry screen
Prompt chaining breaks complex test analysis into steps. Below is a user story for a RealMe authentication flow. Write Step 1 of a 3-step chain: a prompt that analyses this story for ambiguities, missing acceptance criteria, and edge cases — without jumping to test case generation. The AI should give you raw analysis material that you would then refine in Step 2.
"As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records."
Show model answer — full 3-step chain
STEP 1 — Analyse for gaps (run this first): You are a senior test analyst for NZ government digital services. Analyse the following user story and identify: 1. Ambiguous terms requiring clarification 2. Missing acceptance criteria (what the story does not specify) 3. Implicit business rules (rules that must apply but are unstated) 4. Edge cases and failure modes not addressed User story: "As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records." Do NOT generate test cases. Output only a structured analysis under the four headings above. --- STEP 2 — Write acceptance criteria (feed Step 1 output): Based on the gaps identified in the analysis above, write 6 specific, testable acceptance criteria for the RealMe login feature. Each criterion must: start with "Given/When/Then" or "The system must/shall", reference specific RealMe assurance levels (AL1, AL2) where relevant, and be verifiable by a tester without backend access. --- STEP 3 — Review the criteria (feed Step 2 output): Review the acceptance criteria above. For each, assess: Is it testable? Is it unambiguous? Does it specify both the trigger and the expected system response? Flag any that fail and explain why.
12 Self-Check
Click each question to reveal the answer.
Q1. What are the 6 components of a well-structured prompt?
Role, Context, Instruction, Input Data, Constraints, Output Format. Each contributes to precision. Missing context forces the model to guess; missing output format means you will need to reformat the result before you can use it. Omitting the role means the model will not apply the domain expertise that produces accurate NZ-specific content.
Q2. When would you choose few-shot prompting over zero-shot?
When you need the output in a specific format, or when the task is complex enough that examples clarify the expected result better than any description. Few-shot works especially well for generating multiple test cases in your team's exact format — provide 2–3 examples from the first user story and the model will match that format for all subsequent stories in the sprint.
Q3. What is prompt chaining and why is it useful for test analysis?
Prompt chaining breaks a complex task into sequential steps where each step's output feeds the next. It is useful for test analysis because models perform better on focused tasks. Analysing ambiguities, then writing acceptance criteria, then reviewing them for testability are three separate cognitive tasks. Combining them in a single prompt forces the model to multitask — doing each as a separate chained step produces better output at every stage.
Q4. How would you evaluate whether an AI-generated test case suite is high quality?
Check for correctness against business rules, completeness across equivalence partitions, format compliance, absence of hallucinated field names or values, and no redundant cases. A practical approach: spot-check 10–15% of cases against the spec. If the hallucination rate in that sample is above 10%, the prompt needs more input data and constraints before re-running.
Q5. What is the difference between a system prompt and a user prompt?
A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might set "you are a tester specialised in NZ financial services; always use NZ English; never invent field names not present in the specification" while user prompts send individual test analysis or test design tasks against that persistent context. System prompts are configured once; user prompts change with each task.
13 Interview Prep
These questions appear in CT-GenAI-focused interviews and in general senior tester interviews at NZ organisations that have adopted AI tooling.
Q: "Describe how you have used AI to improve your testing process."
Focus on a specific task: test case generation, defect analysis, or acceptance criteria review. Explain the prompt structure you used — role, context, instruction, format — the output you got, and how you validated it. Be honest about the review step. Interviewers are impressed by testers who understand AI limitations and have a process for catching hallucinations, not testers who claim AI does everything perfectly.
Q: "What techniques do you use to get consistent output from an AI model for testing?"
Few-shot prompting establishes the format pattern. System prompts set the persistent context for a session or tool. Prompt chaining breaks complex tasks into reliable steps. And I save prompts that work — reusing a validated prompt is far more consistent than writing a new one each time. For a team, I would store approved prompts in the wiki so everyone generates output in the same format.
Q: "How would you explain the risk of AI hallucination to a project manager who wants to use AI for all test case generation?"
I would explain that AI confidently generates plausible-sounding content that may be factually wrong. In testing, this means test cases with invented field names, wrong status codes, or incorrect business rules — tests that pass but test the wrong thing. The mitigation is a mandatory review step: AI generates the draft, a tester who knows the system verifies it against the spec. This is still significantly faster than writing from scratch, but it is not zero-effort. The risk of skipping the review is a test suite that passes while bugs remain.