Test with AI · CT-GenAI Ch 2

Prompt Engineering for Testing

The AI is not the problem. The prompt is. Master the deliberate craft of structuring instructions and you will generate better test artefacts in seconds than most testers produce in hours.

Test with AI CT-GenAI Ch 2 — GenAI-2.1.1 to 2.3.2 ~45 min read · ~90 min with exercises

1 The Hook

An NZ insurance company gave their test team access to an AI assistant for test case generation. Two weeks later, the lead tester reported: "it doesn't work — the test cases are terrible." The manager sat with her and watched her type: "write test cases for the claims form." The AI produced five generic test cases. No claim types. No validation rules. No NZ-specific fields. Just five variations of "enter valid data, click submit, verify success."

The problem was not the AI. The problem was the prompt.

When the manager rewrote the prompt — including the actual claims fields, the validation rules, the NZ-specific business logic around ACC levies, and a specific output format — the AI produced 23 high-quality test cases in 40 seconds. The only difference was the prompt. The AI was the same. The model was the same. The company's subscription had not changed. The tester's access had not changed.

This is the most important lesson in the CT-GenAI syllabus: AI does not fail because it is a bad tool. It produces poor output because it receives poor input. The model has no knowledge of your system, your business rules, or your team's standards unless you tell it. Every gap in your prompt is a gap the model fills with a plausible-sounding guess — and in testing, a plausible-sounding guess is worse than no test case at all.

Prompt engineering is not a dark art reserved for AI specialists. It is a testable, learnable skill built on the same discipline that makes you a good tester: precision, completeness, and clarity of specification. You have already been writing prompts your entire career — you called them acceptance criteria.

2 The Rule

A prompt is an instruction to an AI model. Prompt engineering is the deliberate craft of structuring prompts to get accurate, relevant, and useful output. For testers, this is the most practical skill in the CT-GenAI syllabus — it directly determines the quality of every AI-assisted test artefact you produce.

A well-engineered prompt is not longer; it is more precise. It eliminates ambiguity, supplies context the model cannot infer, and specifies the exact output format your workflow needs. The difference between a vague prompt and a structured one is not effort — it is discipline.

3 The Analogy

Analogy

Writing a prompt is like writing an acceptance criterion.

A vague acceptance criterion — "the form should work" — produces useless test cases. A specific one — "when a user submits an IRD number with fewer than 8 digits, the system must display the error message 'IRD number must be 8 or 9 digits'" — produces precise, verifiable tests. The same discipline that makes you a good tester makes you a good prompter. Specificity is the tool. Ambiguity is the enemy. You already know this. You just have not applied it to AI interactions yet.

4 Prompt Structure

CT-GenAI-2.1.1 defines six components of a well-structured prompt. Each component eliminates a different class of ambiguity. Together they give the model everything it needs to produce output you can actually use.

Component What it does Example
Role Sets the AI's expertise and perspective "You are a senior tester specialising in NZ financial services"
Context The system, technology, and business domain "The system is an ANZ online banking portal built on React with a .NET backend"
Instruction The specific task to perform "Generate boundary value test cases for the password reset flow"
Input data The spec, user story, or field definitions Paste the actual user story or field list here
Constraints NZ-specific rules, format restrictions, limits "Passwords must meet NZ government NZISM guidelines, minimum 16 characters"
Output format How you want the result delivered "Output as a Gherkin Feature file with Given/When/Then"

Before and After: ANZ Password Reset

Vague prompt

Write test cases for the password reset.

Structured prompt

Role: You are a senior tester specialising in NZ financial services.
Context: The system is the ANZ NZ online banking portal. The password reset flow is accessible from the login page and allows registered customers to reset their password via their registered email or registered NZ mobile number.
Instruction: Generate boundary value and negative test cases for the password reset flow.
Input data: Rules: minimum 10 characters, must contain 1 uppercase, 1 number, 1 special character; NZ mobile must be in 02x format; reset link expires after 15 minutes; maximum 3 reset attempts per hour.
Constraints: Include NZ-specific inputs (NZ mobile format, NZ email domains). Do not include any test cases that require access to the email server.
Output format: A numbered table with columns: Test ID, Test Description, Input, Expected Result, Test Type.

The structured prompt takes 90 seconds to write. It produces test cases you can use immediately without reformatting or fact-checking every field name.

5 Core Prompting Techniques

CT-GenAI-2.1.2 and 2.1.3 cover five prompting techniques. Each suits a different testing task. Choosing the wrong one is like choosing the wrong test design technique — you get output, but not the output you need.

Zero-shot prompting

No examples given. The model uses only its training knowledge and your instructions. Best for simple, well-defined tasks where the output format is universally understood.

Example: "Generate 5 boundary value test cases for a login form with a username field that accepts 6–30 characters."

Use when: quick, one-off tasks; simple test case generation; exploring what the model knows about a standard technology.

One-shot prompting

One example provided to show the required format. The model adapts to match it. Useful when your team uses a non-standard test case format that the model would not guess correctly.

Use when: you need a specific output structure and one example is enough to demonstrate it.

Few-shot prompting

Two to three examples establish a clear pattern. The model learns format, tone, level of detail, and domain vocabulary from your examples. Best for generating test cases in your team's exact format across many user stories.

Use when: generating multiple test artefacts consistently; the format is complex; first-draft quality must be high to save review time.

Prompt chaining

Break a complex task into sequential steps. Each step's output feeds the next. The model performs better on focused tasks than on instructions that require simultaneous multi-step reasoning.

Example sequence for a user story:
Step 1 → Analyse the user story and identify ambiguities and missing acceptance criteria
Step 2 → (feed Step 1 output) Write 5 specific acceptance criteria for the identified gaps
Step 3 → (feed Step 2 output) Generate Gherkin test scenarios for each acceptance criterion

Use when: complex test analysis; multi-step reasoning; output of one stage must be reviewed before proceeding.

Meta-prompting

Ask the model to improve your prompt before using it. Useful when you know your prompt is incomplete but are not sure what is missing.

Example: "Here is my prompt: [your prompt]. Improve it to produce better test cases for NZ financial services testing. Explain what you added and why."

Use when: onboarding to a new domain; building a reusable prompt for a new test area; your first draft is generating poor output.

System prompts vs user prompts

A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might say "You are a tester specialising in NZ government digital services. Always use NZ English. Never invent field names that are not in the specification." User prompts then send individual test tasks against that persistent context.

Use system prompts when: setting up a shared AI tool for a team; configuring a testing chatbot or IDE plugin; ensuring consistent persona across a long session.

Pro tip: For repetitive test generation tasks — writing test cases for each of 20 user stories — use few-shot prompting with 2–3 examples from your first story. The model will match your team's exact test case format for all subsequent stories. Invest five minutes in a good example set and save hours of reformatting.

6 Applying AI to Test Analysis

CT-GenAI-2.2.1 covers using AI to support test analysis: identifying missing acceptance criteria, ambiguous requirements, and untestable specifications before test design begins. This is where prompt chaining delivers the most value.

Worked Example: Kiwisaver Contribution Rate Change

User story: "As a Kiwisaver member, I want to change my contribution rate online so that my contributions reflect my current financial situation."

This story is incomplete. It is missing contribution rate options, effective date rules, confirmation requirements, and employer notification logic. A tester who writes test cases directly from this story will miss at least half the scenarios.

Prompt 1 — Analyse for gaps

You are a senior test analyst specialising in NZ financial services. Analyse the following user story and identify: 1. Ambiguous terms that need clarification 2. Missing acceptance criteria 3. Implicit business rules that should be made explicit 4. Edge cases that are not addressed User story: "As a Kiwisaver member, I want to change my contribution rate online so that my contributions reflect my current financial situation." Format your response as a numbered list under each of the four headings above.

Example output (Step 1)

Ambiguous terms: 1. "Change my contribution rate" - allowable rates not specified (3%, 4%, 6%, 8%, 10% per KiwiSaver Act) 2. "Online" - web only, or also mobile app? Missing acceptance criteria: 3. What happens when a rate change is submitted mid-pay-period? 4. Is employer notification required, and when? 5. Is there a cooling-off period or waiting period before the new rate takes effect? Implicit business rules: 6. KiwiSaver Act 2006 restricts contribution rates to prescribed values 7. Members on contributions holiday cannot change rate without ending the holiday first Edge cases: 8. Member attempts to set the same rate as their current rate 9. Member submits change when employer payroll has already been processed for the current period

Prompt 2 — Generate acceptance criteria for the gaps

Based on the gaps identified above, write 5 specific, testable acceptance criteria for the Kiwisaver contribution rate change feature. Each criterion must: - Start with "Given/When/Then" or "The system must/shall" - Reference specific contribution rate values (3%, 4%, 6%, 8%, 10%) - Be verifiable by a tester without access to backend systems Use the gaps identified in the previous analysis as your source material.

Prompt 3 — Review the acceptance criteria

Review the following acceptance criteria for the Kiwisaver contribution rate change feature. For each criterion, assess: - Is it testable? (Can a tester verify it without business analyst support?) - Is it unambiguous? (Would two testers write the same test case from it?) - Is it complete? (Does it specify both the action and the expected system response?) [Paste Step 2 output here] Flag any criteria that fail one or more checks and explain why.

Human review is essential at Step 2

The model's Step 2 output may contain hallucinated business rules — plausible-sounding Kiwisaver rules that do not exist in the KiwiSaver Act 2006 or your specific provider's implementation. A tester with domain knowledge must review the generated acceptance criteria against the actual spec before they are used for test design. Skipping this review is the most common mistake in AI-assisted test analysis.

7 Applying AI to Test Design

CT-GenAI-2.2.2 and 2.2.3 cover AI-assisted test case generation, test suite prioritisation, and keyword-driven script generation. Each requires a different prompting strategy.

Test Case Generation: Few-shot with Gherkin

User story: "As a taxpayer, I want to submit my IR3 tax return online so that my tax position is assessed by IRD."

Provide two Gherkin examples to establish format, then ask for five more:

You are a senior test analyst for a NZ government digital services team. Generate 5 Gherkin scenarios for the IRD IR3 tax return submission feature. Follow the exact format of the examples below. --- EXAMPLE 1 --- Feature: IR3 Tax Return Submission Scenario: Successful submission with employment income only Given a registered taxpayer with an active RealMe login And the current tax year is 2025 When the taxpayer enters total employment income of $75,000 And selects "No" to all additional income sources And clicks "Submit Return" Then the system displays a confirmation with reference number And the return status changes to "Submitted" in the taxpayer's dashboard --- EXAMPLE 2 --- Scenario: Submission blocked when income total is zero Given a registered taxpayer with an active RealMe login And the taxpayer enters $0 for all income fields When the taxpayer clicks "Submit Return" Then the system displays the error "Total income cannot be zero for an IR3 return" And the return status remains "Draft" --- END EXAMPLES --- Now generate 5 more scenarios covering: 1. Rental income with property address validation 2. Self-employment income requiring GST number 3. Submission after the 7 July deadline 4. Session timeout mid-completion 5. Duplicate submission attempt

Example output (partial)

Scenario: Rental income requires valid NZ property address Given a registered taxpayer with an active RealMe login And the taxpayer selects "Yes" to rental income When the taxpayer enters a property address without a valid NZ postcode And clicks "Save and Continue" Then the system displays "Please enter a valid NZ property address including suburb and postcode" And the taxpayer remains on the rental income section

Test Suite Prioritisation via Prompt Chaining

Step 1: Generate the test cases (use any technique above).
Step 2: Rank by risk.

Here are 20 test cases for the IRD IR3 submission feature: [paste list] Rank them by testing risk using these criteria: - Business impact if this scenario fails (High/Medium/Low) - Likelihood of failure based on complexity - Regulatory risk (IRD compliance, privacy) Output a prioritised list with a one-sentence risk rationale for each.

Step 3: Identify dependencies.

From the prioritised list above, identify test cases that have ordering dependencies (i.e., Test B cannot run unless Test A has passed). List each dependency as "TC-01 must precede TC-04 because [reason]".

Keyword-driven Script Generation

Convert the following manual test case into a Robot Framework keyword-driven test script. Use only standard Selenium2Library keywords. Manual test: Title: Lodge Kiwisaver contribution change Steps: 1. Open browser to https://kiwisaver.example.co.nz 2. Log in with test credentials (username: test_member_01, password: use env var) 3. Navigate to "My Contributions" 4. Select contribution rate 6% 5. Click "Save Changes" 6. Verify confirmation message "Your contribution rate has been updated to 6%" 7. Verify new rate is displayed on the contributions summary page Output as a complete Robot Framework .robot file.

8 Applying AI to Regression and Monitoring

CT-GenAI-2.2.4 and 2.2.5 address using AI to maintain regression suites and interpret test run data — two of the highest-effort, lowest-value activities in a mature test team's calendar.

Regression Suite Maintenance

Regression suites grow by accretion. Tests are added but rarely removed. After two years, a 600-test suite may contain 150 redundant or outdated cases. AI can identify candidates for removal:

You are a test architect reviewing a regression suite for an NZ banking application. Here is a list of 50 regression test cases with their descriptions and last-modified dates: [paste test list] Identify: 1. Tests with overlapping coverage (where two tests appear to test the same condition) 2. Tests that reference features that have been deprecated or removed (based on the descriptions) 3. Tests that test implementation details rather than business behaviour 4. Tests that have not been modified in over 12 months and may be testing stable, low-risk functionality For each finding, explain your reasoning and suggest whether to retire, merge, or retain the test.

Test Run Reporting for Stakeholders

Most stakeholders cannot interpret a JUnit XML report. AI can translate raw test results into plain-English executive summaries:

You are a test manager writing a sprint test summary for a non-technical product owner. Here are the test run results for Sprint 47: Total tests: 312 Passed: 287 Failed: 18 Skipped: 7 New failures (not present in Sprint 46): 6 Failures resolved from Sprint 46: 4 Environment: UAT (uat.bnz-digital.example.co.nz) Sprint focus: Online home loan pre-approval flow Write a 3-paragraph plain-English summary covering: 1. Overall quality signal (is this sprint ready to release?) 2. The most significant new failures and their likely user impact 3. What the team should prioritise before the release sign-off Do not use technical jargon. The audience is a product owner and a business analyst.

Selecting the Right Technique

Scenario Best technique
Quick, one-off task Zero-shot
Need specific output format Few-shot
Complex multi-step analysis Prompt chaining
Repetitive task across many items Meta-prompt + few-shot
Setting up a shared team tool System prompt

9 Evaluating and Refining Prompts

CT-GenAI-2.3.1 and 2.3.2 cover evaluating AI output and iteratively improving prompts. The first response is a starting point. Professional prompt engineering is a refinement loop.

Metrics for Evaluating AI Test Output

Metric What to check
Correctness Does the output match your actual business rules and field definitions?
Completeness Has it covered all equivalence partitions from the spec?
Format compliance Does the output match the format you requested?
Hallucination rate What percentage of test cases reference invented field names, values, or rules?
Relevance Are all generated cases testable, non-redundant, and within scope?

Iterative Refinement: Three-round Loop

Round 1 — Baseline

Prompt: "Write test cases for the IRD GST return form."

Output problem: Generic test cases with made-up field names. No NZ GST rules. Wrong output format.

Round 2 — Add context and constraints

Add: Role (NZ tax systems tester), Context (IRD myIR portal, 2-monthly GST filers), Input data (GST return fields from IRD website), Constraints (NZ GST rate 15%, return period must be 2 months), Output format (table with Test ID, Condition, Input, Expected Result).

Output improvement: Correct field names, correct GST rate, correct format. But missing boundary cases for the filing deadline.

Round 3 — Add missing coverage

Add: "Include test cases for: submission on the due date (28th of the month after period end), submission 1 day before due date, submission 1 day after due date, and submission when the due date falls on a weekend."

Output: Complete coverage including boundary dates, with accurate NZ business rules throughout.

Pro tip: Save your best prompts. A prompt that works well for generating IRD validation test cases will work again when you test the next form with IRD validation. Build a personal prompt library in a shared document or your team's wiki. A validated, reusable prompt is far more consistent than writing a new one each time — and far easier to hand to a new team member.

10 Common Mistakes

🚫 Writing vague prompts and blaming the AI

The AI cannot read your mind. It has no knowledge of your system unless you tell it. Every missing piece of context is a gap the model will fill with a plausible-sounding guess. If the output is poor, the first question is: what did I not tell it? Add the role, context, and constraints that were missing and run it again before concluding the tool does not work.

🚫 Accepting first-draft output without review

Prompt engineering is iterative. The first response is a starting point, not a finished product. Review every draft for hallucinated field names, incorrect business rules, incomplete equivalence partition coverage, and format issues before copying anything into a test management tool. A 10-minute review is not optional — it is the step that makes AI-generated test cases safe to use.

🚫 Using the same prompt for every task

Different test tasks require different prompting strategies. Test analysis (prompt chaining to analyse ambiguity → write criteria → review) is a fundamentally different task from test case generation (few-shot to match format) or test reporting (single structured prompt with raw data). Using a test case prompt for test analysis, or a zero-shot approach where few-shot is needed, produces technically valid output that is practically useless.

🚫 Forgetting the output format instruction

Without a specified format, the model chooses its own — often a conversational paragraph or a format that requires significant reformatting before it fits your test management tool. Always specify: table with named columns, Gherkin Feature file, JSON array, numbered list, or Robot Framework .robot file. The format instruction is not optional; it is what converts AI output into a usable work product.

11 Now You Try

Three graded exercises. Each builds on the technique covered in this page. Run your prompt, read the AI feedback, then check the model answer.

🤖 Exercise 1 of 3 — Weak vs Strong Prompt

Below is a weak prompt for a Kiwisaver contribution rate change feature. It will produce generic, unusable output. Your task: rewrite it in the textarea using the 6-component structure (Role, Context, Instruction, Input data, Constraints, Output format). Then run your version and compare the output quality.

Weak prompt (read only):
Write test cases for the Kiwisaver contribution rate change.

Your structured rewrite (edit below, then run):

Show model answer
Role: You are a senior tester specialising in NZ financial services and KiwiSaver products.
Context: The system is a KiwiSaver provider's online member portal. The contribution rate change feature allows members to change their contribution rate from the current rate to any of the prescribed rates under the KiwiSaver Act 2006: 3%, 4%, 6%, 8%, or 10%.
Instruction: Generate boundary value and negative test cases for the contribution rate change feature.
Input data:
- Valid rates: 3%, 4%, 6%, 8%, 10% (no others are valid under the KiwiSaver Act 2006)
- Members on a contributions holiday cannot change their rate without ending the holiday first
- Rate changes take effect from the next pay period after the employer is notified
- Members cannot select a rate identical to their current rate
- Maximum 1 rate change per calendar month
Constraints: Include NZ-specific rules only. Do not invent rates or rules not listed above.
Output format: A numbered table with columns: Test ID | Test Description | Input | Expected Result | Test Type (Positive/Negative/Boundary)
🤖 Exercise 2 of 3 — Build a Few-Shot Prompt

You are testing an IRD myGST return submission form. One Gherkin example is given below. Add a second and third example to complete the few-shot pattern, then ask for 5 more scenarios covering specific edge cases. Run your completed prompt and check whether the AI matches your format.

Show model answer examples 2 & 3
--- EXAMPLE 2 ---
Scenario: Submission blocked after filing deadline
  Given a registered GST-registered business with a 2-monthly filing period
  And the GST period ended 31 March
  And today's date is 1 June (32 days after the 28 April deadline)
  When the business attempts to submit the return
  Then the system displays "The filing deadline for this period has passed. Contact IRD to file a late return."
  And the Submit button is disabled

--- EXAMPLE 3 ---
Scenario: Submission blocked when GST number is invalid format
  Given a user attempts to register a new GST return
  And they enter a GST number of "12345" (fewer than 8 digits)
  When they click "Continue"
  Then the system displays "Please enter a valid NZ GST number (8 or 9 digits)"
  And the form does not advance to the return entry screen
🤖 Exercise 3 of 3 — Write a Prompt Chain Step

Prompt chaining breaks complex test analysis into steps. Below is a user story for a RealMe authentication flow. Write Step 1 of a 3-step chain: a prompt that analyses this story for ambiguities, missing acceptance criteria, and edge cases — without jumping to test case generation. The AI should give you raw analysis material that you would then refine in Step 2.

User story:
"As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records."
Show model answer — full 3-step chain
STEP 1 — Analyse for gaps (run this first):
You are a senior test analyst for NZ government digital services.

Analyse the following user story and identify:
1. Ambiguous terms requiring clarification
2. Missing acceptance criteria (what the story does not specify)
3. Implicit business rules (rules that must apply but are unstated)
4. Edge cases and failure modes not addressed

User story: "As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records."

Do NOT generate test cases. Output only a structured analysis under the four headings above.

---
STEP 2 — Write acceptance criteria (feed Step 1 output):
Based on the gaps identified in the analysis above, write 6 specific, testable acceptance criteria for the RealMe login feature. Each criterion must: start with "Given/When/Then" or "The system must/shall", reference specific RealMe assurance levels (AL1, AL2) where relevant, and be verifiable by a tester without backend access.

---
STEP 3 — Review the criteria (feed Step 2 output):
Review the acceptance criteria above. For each, assess: Is it testable? Is it unambiguous? Does it specify both the trigger and the expected system response? Flag any that fail and explain why.

12 Self-Check

Click each question to reveal the answer.

Q1. What are the 6 components of a well-structured prompt?

Role, Context, Instruction, Input Data, Constraints, Output Format. Each contributes to precision. Missing context forces the model to guess; missing output format means you will need to reformat the result before you can use it. Omitting the role means the model will not apply the domain expertise that produces accurate NZ-specific content.

Q2. When would you choose few-shot prompting over zero-shot?

When you need the output in a specific format, or when the task is complex enough that examples clarify the expected result better than any description. Few-shot works especially well for generating multiple test cases in your team's exact format — provide 2–3 examples from the first user story and the model will match that format for all subsequent stories in the sprint.

Q3. What is prompt chaining and why is it useful for test analysis?

Prompt chaining breaks a complex task into sequential steps where each step's output feeds the next. It is useful for test analysis because models perform better on focused tasks. Analysing ambiguities, then writing acceptance criteria, then reviewing them for testability are three separate cognitive tasks. Combining them in a single prompt forces the model to multitask — doing each as a separate chained step produces better output at every stage.

Q4. How would you evaluate whether an AI-generated test case suite is high quality?

Check for correctness against business rules, completeness across equivalence partitions, format compliance, absence of hallucinated field names or values, and no redundant cases. A practical approach: spot-check 10–15% of cases against the spec. If the hallucination rate in that sample is above 10%, the prompt needs more input data and constraints before re-running.

Q5. What is the difference between a system prompt and a user prompt?

A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might set "you are a tester specialised in NZ financial services; always use NZ English; never invent field names not present in the specification" while user prompts send individual test analysis or test design tasks against that persistent context. System prompts are configured once; user prompts change with each task.

13 Interview Prep

These questions appear in CT-GenAI-focused interviews and in general senior tester interviews at NZ organisations that have adopted AI tooling.

Q: "Describe how you have used AI to improve your testing process."

Focus on a specific task: test case generation, defect analysis, or acceptance criteria review. Explain the prompt structure you used — role, context, instruction, format — the output you got, and how you validated it. Be honest about the review step. Interviewers are impressed by testers who understand AI limitations and have a process for catching hallucinations, not testers who claim AI does everything perfectly.

Q: "What techniques do you use to get consistent output from an AI model for testing?"

Few-shot prompting establishes the format pattern. System prompts set the persistent context for a session or tool. Prompt chaining breaks complex tasks into reliable steps. And I save prompts that work — reusing a validated prompt is far more consistent than writing a new one each time. For a team, I would store approved prompts in the wiki so everyone generates output in the same format.

Q: "How would you explain the risk of AI hallucination to a project manager who wants to use AI for all test case generation?"

I would explain that AI confidently generates plausible-sounding content that may be factually wrong. In testing, this means test cases with invented field names, wrong status codes, or incorrect business rules — tests that pass but test the wrong thing. The mitigation is a mandatory review step: AI generates the draft, a tester who knows the system verifies it against the spec. This is still significantly faster than writing from scratch, but it is not zero-effort. The risk of skipping the review is a test suite that passes while bugs remain.