Test with AI · CT-GenAI Ch 2

Prompt Engineering for Testing

The AI is not the problem. The prompt is. Master the deliberate craft of structuring instructions and you will generate better test artefacts in seconds than most testers produce in hours.

Test with AI CT-GenAI Ch 2 — GenAI-2.1.1 to 2.3.2 ~45 min read · ~90 min with exercises

1 The Hook

An NZ insurance company gave their test team access to an AI assistant for test case generation. Two weeks later, the lead tester reported: "it doesn't work — the test cases are terrible." The manager sat with her and watched her type: "write test cases for the claims form." The AI produced five generic test cases. No claim types. No validation rules. No NZ-specific fields. Just five variations of "enter valid data, click submit, verify success."

The problem was not the AI. The problem was the prompt.

When the manager rewrote the prompt — including the actual claims fields, the validation rules, the NZ-specific business logic around CoverNZ levies, and a specific output format — the AI produced 23 high-quality test cases in 40 seconds. The only difference was the prompt. The AI was the same. The model was the same. The company's subscription had not changed. The tester's access had not changed.

This is the most important lesson in the CT-GenAI syllabus: AI does not fail because it is a bad tool. It produces poor output because it receives poor input. The model has no knowledge of your system, your business rules, or your team's standards unless you tell it. Every gap in your prompt is a gap the model fills with a plausible-sounding guess — and in testing, a plausible-sounding guess is worse than no test case at all.

Prompt engineering is not a dark art reserved for AI specialists. It is a testable, learnable skill built on the same discipline that makes you a good tester: precision, completeness, and clarity of specification. You have already been writing prompts your entire career — you called them acceptance criteria.

From the field

A Wellington-based test team at a Crown entity were rolling out an AI assistant to help generate test cases for their online licensing portal. After two sprints they declared it "useless" and reverted to manual writing. When a lead tester reviewed their prompt history, every prompt was a single sentence: "write tests for the renewal form". No fields. No validation rules. No mention of the RealMe authentication step or the NZ Business Number format. The model had no choice but to invent plausible-sounding inputs — none of which existed in the actual system. The turnaround came when the team adopted the NZ government's Algorithm Assessment Framework principle of explicit context: treat the AI as a new contractor on day one who has never seen your system and has no access to your documentation unless you paste it in. Once they embedded the actual field spec and Revenue NZ number validation rules directly in the prompt, first-draft quality jumped from roughly 30% usable to over 80% within a week. The discipline that fixed their prompts was the same discipline they already applied to writing acceptance criteria: specify everything, assume nothing.

2 The Rule

A prompt is an instruction to an AI model. Prompt engineering is the deliberate craft of structuring prompts to get accurate, relevant, and useful output. For testers, this is the most practical skill in the CT-GenAI syllabus — it directly determines the quality of every AI-assisted test artefact you produce.

A well-engineered prompt is not longer; it is more precise. It eliminates ambiguity, supplies context the model cannot infer, and specifies the exact output format your workflow needs. The difference between a vague prompt and a structured one is not effort — it is discipline.

3 The Analogy

Analogy

Writing a prompt is like writing an acceptance criterion.

A vague acceptance criterion — "the form should work" — produces useless test cases. A specific one — "when a user submits an Revenue NZ number with fewer than 8 digits, the system must display the error message 'Revenue NZ number must be 8 or 9 digits'" — produces precise, verifiable tests. The same discipline that makes you a good tester makes you a good prompter. Specificity is the tool. Ambiguity is the enemy. You already know this. You just have not applied it to AI interactions yet.

4 Prompt Structure

CT-GenAI-2.1.1 defines six components of a well-structured prompt. Each component eliminates a different class of ambiguity. Together they give the model everything it needs to produce output you can actually use.

Component	What it does	Example
Role	Sets the AI's expertise and perspective	"You are a senior tester specialising in NZ financial services"
Context	The system, technology, and business domain	"The system is an Harbour Bank online banking portal built on React with a .NET backend"
Instruction	The specific task to perform	"Generate boundary value test cases for the password reset flow"
Input data	The spec, user story, or field definitions	Paste the actual user story or field list here
Constraints	NZ-specific rules, format restrictions, limits	"Passwords must meet NZ government NZISM guidelines, minimum 16 characters"
Output format	How you want the result delivered	"Output as a Gherkin Feature file with Given/When/Then"

Before and After: Harbour Bank Password Reset

Vague prompt

Write test cases for the password reset.

Structured prompt

Role: You are a senior tester specialising in NZ financial services.
Context: The system is the Harbour Bank online banking portal. The password reset flow is accessible from the login page and allows registered customers to reset their password via their registered email or registered NZ mobile number.
Instruction: Generate boundary value and negative test cases for the password reset flow.
Input data: Rules: minimum 10 characters, must contain 1 uppercase, 1 number, 1 special character; NZ mobile must be in 02x format; reset link expires after 15 minutes; maximum 3 reset attempts per hour.
Constraints: Include NZ-specific inputs (NZ mobile format, NZ email domains). Do not include any test cases that require access to the email server.
Output format: A numbered table with columns: Test ID, Test Description, Input, Expected Result, Test Type.

The structured prompt takes 90 seconds to write. It produces test cases you can use immediately without reformatting or fact-checking every field name.

5 Core Prompting Techniques

CT-GenAI-2.1.2 and 2.1.3 cover five prompting techniques. Each suits a different testing task. Choosing the wrong one is like choosing the wrong test design technique — you get output, but not the output you need.

Zero-shot prompting

No examples given. The model uses only its training knowledge and your instructions. Best for simple, well-defined tasks where the output format is universally understood.

Example: "Generate 5 boundary value test cases for a login form with a username field that accepts 6–30 characters."

Use when: quick, one-off tasks; simple test case generation; exploring what the model knows about a standard technology.

One-shot prompting

One example provided to show the required format. The model adapts to match it. Useful when your team uses a non-standard test case format that the model would not guess correctly.

Use when: you need a specific output structure and one example is enough to demonstrate it.

Few-shot prompting

Two to three examples establish a clear pattern. The model learns format, tone, level of detail, and domain vocabulary from your examples. Best for generating test cases in your team's exact format across many user stories.

Use when: generating multiple test artefacts consistently; the format is complex; first-draft quality must be high to save review time.

Prompt chaining

Break a complex task into sequential steps. Each step's output feeds the next. The model performs better on focused tasks than on instructions that require simultaneous multi-step reasoning.

Example sequence for a user story:
Step 1 → Analyse the user story and identify ambiguities and missing acceptance criteria
Step 2 → (feed Step 1 output) Write 5 specific acceptance criteria for the identified gaps
Step 3 → (feed Step 2 output) Generate Gherkin test scenarios for each acceptance criterion

Use when: complex test analysis; multi-step reasoning; output of one stage must be reviewed before proceeding.

Meta-prompting

Ask the model to improve your prompt before using it. Useful when you know your prompt is incomplete but are not sure what is missing.

Example: "Here is my prompt: [your prompt]. Improve it to produce better test cases for NZ financial services testing. Explain what you added and why."

Use when: onboarding to a new domain; building a reusable prompt for a new test area; your first draft is generating poor output.

System prompts vs user prompts

A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might say "You are a tester specialising in NZ government digital services. Always use NZ English. Never invent field names that are not in the specification." User prompts then send individual test tasks against that persistent context.

Use system prompts when: setting up a shared AI tool for a team; configuring a testing chatbot or IDE plugin; ensuring consistent persona across a long session.

Pro tip: For repetitive test generation tasks — writing test cases for each of 20 user stories — use few-shot prompting with 2–3 examples from your first story. The model will match your team's exact test case format for all subsequent stories. Invest five minutes in a good example set and save hours of reformatting.

6 Applying AI to Test Analysis

CT-GenAI-2.2.1 covers using AI to support test analysis: identifying missing acceptance criteria, ambiguous requirements, and untestable specifications before test design begins. This is where prompt chaining delivers the most value.

Worked Example: Kiwisaver Contribution Rate Change

User story: "As a Kiwisaver member, I want to change my contribution rate online so that my contributions reflect my current financial situation."

This story is incomplete. It is missing contribution rate options, effective date rules, confirmation requirements, and employer notification logic. A tester who writes test cases directly from this story will miss at least half the scenarios.

Prompt 1 — Analyse for gaps

You are a senior test analyst specialising in NZ financial services.

Analyse the following user story and identify:
1. Ambiguous terms that need clarification
2. Missing acceptance criteria
3. Implicit business rules that should be made explicit
4. Edge cases that are not addressed

User story: "As a Kiwisaver member, I want to change my contribution rate online
so that my contributions reflect my current financial situation."

Format your response as a numbered list under each of the four headings above.

Example output (Step 1)

Ambiguous terms:
1. "Change my contribution rate" - allowable rates not specified (3%, 4%, 6%, 8%, 10% per KiwiSaver Act)
2. "Online" - web only, or also mobile app?

Missing acceptance criteria:
3. What happens when a rate change is submitted mid-pay-period?
4. Is employer notification required, and when?
5. Is there a cooling-off period or waiting period before the new rate takes effect?

Implicit business rules:
6. KiwiSaver Act 2006 restricts contribution rates to prescribed values
7. Members on contributions holiday cannot change rate without ending the holiday first

Edge cases:
8. Member attempts to set the same rate as their current rate
9. Member submits change when employer payroll has already been processed for the current period

Prompt 2 — Generate acceptance criteria for the gaps

Based on the gaps identified above, write 5 specific, testable acceptance criteria
for the Kiwisaver contribution rate change feature.

Each criterion must:
- Start with "Given/When/Then" or "The system must/shall"
- Reference specific contribution rate values (3%, 4%, 6%, 8%, 10%)
- Be verifiable by a tester without access to backend systems

Use the gaps identified in the previous analysis as your source material.

Prompt 3 — Review the acceptance criteria

Review the following acceptance criteria for the Kiwisaver contribution rate
change feature. For each criterion, assess:
- Is it testable? (Can a tester verify it without business analyst support?)
- Is it unambiguous? (Would two testers write the same test case from it?)
- Is it complete? (Does it specify both the action and the expected system response?)

[Paste Step 2 output here]

Flag any criteria that fail one or more checks and explain why.

Human review is essential at Step 2

The model's Step 2 output may contain hallucinated business rules — plausible-sounding Kiwisaver rules that do not exist in the KiwiSaver Act 2006 or your specific provider's implementation. A tester with domain knowledge must review the generated acceptance criteria against the actual spec before they are used for test design. Skipping this review is the most common mistake in AI-assisted test analysis.

7 Applying AI to Test Design

CT-GenAI-2.2.2 and 2.2.3 cover AI-assisted test case generation, test suite prioritisation, and keyword-driven script generation. Each requires a different prompting strategy.

Test Case Generation: Few-shot with Gherkin

User story: "As a taxpayer, I want to submit my IR3 tax return online so that my tax position is assessed by Revenue NZ."

Provide two Gherkin examples to establish format, then ask for five more:

You are a senior test analyst for a NZ government digital services team.

Generate 5 Gherkin scenarios for the Revenue NZ IR3 tax return submission feature.
Follow the exact format of the examples below.

--- EXAMPLE 1 ---
Feature: IR3 Tax Return Submission

  Scenario: Successful submission with employment income only
    Given a registered taxpayer with an active RealMe login
    And the current tax year is 2025
    When the taxpayer enters total employment income of $75,000
    And selects "No" to all additional income sources
    And clicks "Submit Return"
    Then the system displays a confirmation with reference number
    And the return status changes to "Submitted" in the taxpayer's dashboard

--- EXAMPLE 2 ---
  Scenario: Submission blocked when income total is zero
    Given a registered taxpayer with an active RealMe login
    And the taxpayer enters $0 for all income fields
    When the taxpayer clicks "Submit Return"
    Then the system displays the error "Total income cannot be zero for an IR3 return"
    And the return status remains "Draft"

--- END EXAMPLES ---

Now generate 5 more scenarios covering:
1. Rental income with property address validation
2. Self-employment income requiring GST number
3. Submission after the 7 July deadline
4. Session timeout mid-completion
5. Duplicate submission attempt

Example output (partial)

  Scenario: Rental income requires valid NZ property address
    Given a registered taxpayer with an active RealMe login
    And the taxpayer selects "Yes" to rental income
    When the taxpayer enters a property address without a valid NZ postcode
    And clicks "Save and Continue"
    Then the system displays "Please enter a valid NZ property address including suburb and postcode"
    And the taxpayer remains on the rental income section

Test Suite Prioritisation via Prompt Chaining

Step 1: Generate the test cases (use any technique above).
Step 2: Rank by risk.

Here are 20 test cases for the Revenue NZ IR3 submission feature: [paste list]

Rank them by testing risk using these criteria:
- Business impact if this scenario fails (High/Medium/Low)
- Likelihood of failure based on complexity
- Regulatory risk (Revenue NZ compliance, privacy)

Output a prioritised list with a one-sentence risk rationale for each.

Step 3: Identify dependencies.

From the prioritised list above, identify test cases that have ordering dependencies
(i.e., Test B cannot run unless Test A has passed).
List each dependency as "TC-01 must precede TC-04 because [reason]".

Keyword-driven Script Generation

Convert the following manual test case into a Robot Framework keyword-driven
test script. Use only standard Selenium2Library keywords.

Manual test:
Title: Lodge Kiwisaver contribution change
Steps:
1. Open browser to https://kiwisaver.example.co.nz
2. Log in with test credentials (username: test_member_01, password: use env var)
3. Navigate to "My Contributions"
4. Select contribution rate 6%
5. Click "Save Changes"
6. Verify confirmation message "Your contribution rate has been updated to 6%"
7. Verify new rate is displayed on the contributions summary page

Output as a complete Robot Framework .robot file.

8 Applying AI to Regression and Monitoring

CT-GenAI-2.2.4 and 2.2.5 address using AI to maintain regression suites and interpret test run data — two of the highest-effort, lowest-value activities in a mature test team's calendar.

Regression Suite Maintenance

Regression suites grow by accretion. Tests are added but rarely removed. After two years, a 600-test suite may contain 150 redundant or outdated cases. AI can identify candidates for removal:

You are a test architect reviewing a regression suite for an NZ banking application.

Here is a list of 50 regression test cases with their descriptions and last-modified dates:
[paste test list]

Identify:
1. Tests with overlapping coverage (where two tests appear to test the same condition)
2. Tests that reference features that have been deprecated or removed (based on the descriptions)
3. Tests that test implementation details rather than business behaviour
4. Tests that have not been modified in over 12 months and may be testing stable, low-risk functionality

For each finding, explain your reasoning and suggest whether to retire, merge, or retain the test.

Test Run Reporting for Stakeholders

Most stakeholders cannot interpret a JUnit XML report. AI can translate raw test results into plain-English executive summaries:

You are a test manager writing a sprint test summary for a non-technical product owner.

Here are the test run results for Sprint 47:
Total tests: 312
Passed: 287
Failed: 18
Skipped: 7
New failures (not present in Sprint 46): 6
Failures resolved from Sprint 46: 4
Environment: UAT (uat.bnz-digital.example.co.nz)
Sprint focus: Online home loan pre-approval flow

Write a 3-paragraph plain-English summary covering:
1. Overall quality signal (is this sprint ready to release?)
2. The most significant new failures and their likely user impact
3. What the team should prioritise before the release sign-off

Do not use technical jargon. The audience is a product owner and a business analyst.

Selecting the Right Technique

Scenario	Best technique
Quick, one-off task	Zero-shot
Need specific output format	Few-shot
Complex multi-step analysis	Prompt chaining
Repetitive task across many items	Meta-prompt + few-shot
Setting up a shared team tool	System prompt

9 Evaluating and Refining Prompts

CT-GenAI-2.3.1 and 2.3.2 cover evaluating AI output and iteratively improving prompts. The first response is a starting point. Professional prompt engineering is a refinement loop.

Metrics for Evaluating AI Test Output

Metric	What to check
Correctness	Does the output match your actual business rules and field definitions?
Completeness	Has it covered all equivalence partitions from the spec?
Format compliance	Does the output match the format you requested?
Hallucination rate	What percentage of test cases reference invented field names, values, or rules?
Relevance	Are all generated cases testable, non-redundant, and within scope?

Iterative Refinement: Three-round Loop

Round 1 — Baseline

Prompt: "Write test cases for the Revenue NZ GST return form."

Output problem: Generic test cases with made-up field names. No NZ GST rules. Wrong output format.

Round 2 — Add context and constraints

Add: Role (NZ tax systems tester), Context (Revenue NZ myIR portal, 2-monthly GST filers), Input data (GST return fields from Revenue NZ website), Constraints (NZ GST rate 15%, return period must be 2 months), Output format (table with Test ID, Condition, Input, Expected Result).

Output improvement: Correct field names, correct GST rate, correct format. But missing boundary cases for the filing deadline.

Round 3 — Add missing coverage

Add: "Include test cases for: submission on the due date (28th of the month after period end), submission 1 day before due date, submission 1 day after due date, and submission when the due date falls on a weekend."

Output: Complete coverage including boundary dates, with accurate NZ business rules throughout.

Pro tip: Save your best prompts. A prompt that works well for generating Revenue NZ validation test cases will work again when you test the next form with Revenue NZ validation. Build a personal prompt library in a shared document or your team's wiki. A validated, reusable prompt is far more consistent than writing a new one each time — and far easier to hand to a new team member.

From the field

A Wellington-based team building a Benefits NZ digital form assumed that zero-shot prompting would be sufficient for test case generation — the form fields were standard, the rules seemed obvious. What they discovered after the first sprint review was that the AI had generated technically valid test cases that missed every NZ-specific business rule: income thresholds referenced Australian dollar amounts, Revenue NZ number format validation was wrong, and benefit eligibility edge cases that existed only in NZ social security legislation were absent entirely. The AI had no knowledge of NZ legislative context and the team had not told it. They rebuilt their prompt library with mandatory constraint blocks — NZ legislation references, NZ field formats, NZ-specific business rules — and embedded those blocks in a shared system prompt that every team member's tool session loaded automatically. The lesson generalises: domain context the model cannot infer must be explicitly supplied, and on government projects, that context is always jurisdiction-specific.

10 Common Mistakes

🚫 Writing vague prompts and blaming the AI

The AI cannot read your mind. It has no knowledge of your system unless you tell it. Every missing piece of context is a gap the model will fill with a plausible-sounding guess. If the output is poor, the first question is: what did I not tell it? Add the role, context, and constraints that were missing and run it again before concluding the tool does not work.

🚫 Accepting first-draft output without review

Prompt engineering is iterative. The first response is a starting point, not a finished product. Review every draft for hallucinated field names, incorrect business rules, incomplete equivalence partition coverage, and format issues before copying anything into a test management tool. A 10-minute review is not optional — it is the step that makes AI-generated test cases safe to use.

🚫 Using the same prompt for every task

Different test tasks require different prompting strategies. Test analysis (prompt chaining to analyse ambiguity → write criteria → review) is a fundamentally different task from test case generation (few-shot to match format) or test reporting (single structured prompt with raw data). Using a test case prompt for test analysis, or a zero-shot approach where few-shot is needed, produces technically valid output that is practically useless.

🚫 Forgetting the output format instruction

Without a specified format, the model chooses its own — often a conversational paragraph or a format that requires significant reformatting before it fits your test management tool. Always specify: table with named columns, Gherkin Feature file, JSON array, numbered list, or Robot Framework .robot file. The format instruction is not optional; it is what converts AI output into a usable work product.

Senior engineer insight

The biggest shift in how I think about prompt engineering came when I stopped treating prompts as questions and started treating them as specifications. When our team integrated an AI assistant for test case generation on a NZ government digital services project, the prompts that failed were the ones written by testers — experienced, skilled testers — who had never written a formal specification in their lives. They knew what they wanted but could not articulate it without ambiguity. The discipline of writing a structured prompt exposed gaps in their understanding of the system that would have produced incomplete test coverage regardless of the tool.

When we mandated the six-component structure as a team standard, prompt quality and output quality both improved within a sprint — not because the AI changed, but because the testers had to think harder before asking.

The most common mistake: teams invest in AI tooling but not in prompt engineering discipline, then blame the tool when generic prompts produce generic output.

Why teams fail here

They judge the technique by the first response. Zero-shot on a complex domain-specific task will almost always produce generic output. Teams try it once, declare it broken, and abandon prompt engineering entirely — before reaching few-shot or chaining, which is where the real productivity gains live.
They paste the user story and nothing else. A user story is a conversation starter, not a specification. Without field names, validation rules, and NZ-specific constraints (Revenue NZ numbers, RealMe assurance levels, NZISM password requirements), the model fills the gaps with generic, plausible-sounding content that does not match the actual system.
They skip the output format instruction and then blame the AI. Without a specified format, the model defaults to conversational prose or an ad-hoc table structure. The team then spends more time reformatting the output than they would have spent writing test cases manually — and concludes AI is slower, not that their prompt was incomplete.
They use the same prompt for every type of testing task. Test analysis (finding gaps in a spec) requires prompt chaining. Test case generation benefits from few-shot examples. Test run summarisation needs structured data input. Using a generation prompt for analysis, or a zero-shot approach where a chain is needed, produces output that is technically valid but practically useless for the next step in the workflow.

Key takeaway

The quality of your prompt is the quality of your test cases — garbage in, garbage out has never been more literal, and the AI will always sound confident either way.

11 Now You Try

Three graded exercises. Each builds on the technique covered in this page. Run your prompt, read the AI feedback, then check the model answer.

🤖 Exercise 1 of 3 — Weak vs Strong Prompt

Below is a weak prompt for a Kiwisaver contribution rate change feature. It will produce generic, unusable output. Your task: rewrite it in the textarea using the 6-component structure (Role, Context, Instruction, Input data, Constraints, Output format). Then run your version and compare the output quality.

Weak prompt (read only):
Write test cases for the Kiwisaver contribution rate change.

Your structured rewrite (edit below, then run):

Show model answer

Role: You are a senior tester specialising in NZ financial services and KiwiSaver products.
Context: The system is a KiwiSaver provider's online member portal. The contribution rate change feature allows members to change their contribution rate from the current rate to any of the prescribed rates under the KiwiSaver Act 2006: 3%, 4%, 6%, 8%, or 10%.
Instruction: Generate boundary value and negative test cases for the contribution rate change feature.
Input data:
- Valid rates: 3%, 4%, 6%, 8%, 10% (no others are valid under the KiwiSaver Act 2006)
- Members on a contributions holiday cannot change their rate without ending the holiday first
- Rate changes take effect from the next pay period after the employer is notified
- Members cannot select a rate identical to their current rate
- Maximum 1 rate change per calendar month
Constraints: Include NZ-specific rules only. Do not invent rates or rules not listed above.
Output format: A numbered table with columns: Test ID | Test Description | Input | Expected Result | Test Type (Positive/Negative/Boundary)

🤖 Exercise 2 of 3 — Build a Few-Shot Prompt

You are testing an Revenue NZ myGST return submission form. One Gherkin example is given below. Add a second and third example to complete the few-shot pattern, then ask for 5 more scenarios covering specific edge cases. Run your completed prompt and check whether the AI matches your format.

You are a senior QA analyst for an NZ government digital services team.

Generate Gherkin test scenarios for the Revenue NZ myGST return submission form.
Follow the exact format of the examples below.

--- EXAMPLE 1 (given — do not change) ---
Scenario: Successful GST return submission for 2-monthly filer
  Given a registered GST-registered business with a 2-monthly filing period
  And the GST period is 1 February to 31 March
  When the business enters total sales of $50,000 and GST collected of $7,500
  And clicks "Submit Return"
  Then the system displays a confirmation with Revenue NZ reference number
  And the return status changes to "Filed" in the business dashboard

--- EXAMPLE 2 (write your own here) ---
[Your second Gherkin scenario — try: submission after the filing deadline]

--- EXAMPLE 3 (write your own here) ---
[Your third Gherkin scenario — try: GST number format validation]

--- END EXAMPLES ---

Now generate 5 more scenarios covering:
1. Zero sales for the period (nil return)
2. GST amount that doesn't match 15% of sales (inconsistency error)
3. Session timeout mid-completion
4. Duplicate submission attempt for the same period
5. Submission when bank account for refund is not on file

Show model answer examples 2 & 3

--- EXAMPLE 2 ---
Scenario: Submission blocked after filing deadline
  Given a registered GST-registered business with a 2-monthly filing period
  And the GST period ended 31 March
  And today's date is 1 June (32 days after the 28 April deadline)
  When the business attempts to submit the return
  Then the system displays "The filing deadline for this period has passed. Contact Revenue NZ to file a late return."
  And the Submit button is disabled

--- EXAMPLE 3 ---
Scenario: Submission blocked when GST number is invalid format
  Given a user attempts to register a new GST return
  And they enter a GST number of "12345" (fewer than 8 digits)
  When they click "Continue"
  Then the system displays "Please enter a valid NZ GST number (8 or 9 digits)"
  And the form does not advance to the return entry screen

🤖 Exercise 3 of 3 — Write a Prompt Chain Step

Prompt chaining breaks complex test analysis into steps. Below is a user story for a RealMe authentication flow. Write Step 1 of a 3-step chain: a prompt that analyses this story for ambiguities, missing acceptance criteria, and edge cases — without jumping to test case generation. The AI should give you raw analysis material that you would then refine in Step 2.

User story:
"As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records."

Show model answer — full 3-step chain

STEP 1 — Analyse for gaps (run this first):
You are a senior test analyst for NZ government digital services.

Analyse the following user story and identify:
1. Ambiguous terms requiring clarification
2. Missing acceptance criteria (what the story does not specify)
3. Implicit business rules (rules that must apply but are unstated)
4. Edge cases and failure modes not addressed

User story: "As a citizen, I want to log in to the government portal using RealMe so that my identity is verified and I can access my personal records."

Do NOT generate test cases. Output only a structured analysis under the four headings above.

---
STEP 2 — Write acceptance criteria (feed Step 1 output):
Based on the gaps identified in the analysis above, write 6 specific, testable acceptance criteria for the RealMe login feature. Each criterion must: start with "Given/When/Then" or "The system must/shall", reference specific RealMe assurance levels (AL1, AL2) where relevant, and be verifiable by a tester without backend access.

---
STEP 3 — Review the criteria (feed Step 2 output):
Review the acceptance criteria above. For each, assess: Is it testable? Is it unambiguous? Does it specify both the trigger and the expected system response? Flag any that fail and explain why.

Why teams fail here

Skipping the human review step after AI generation — teams treat AI output as final, then discover hallucinated field names, incorrect business rules, or missing equivalence partitions when a defect reaches UAT. The review is not optional; it is the step that makes the workflow safe.
Using the same prompting technique for every task — applying zero-shot prompting to complex test analysis (where prompt chaining is needed) or few-shot prompting to a one-off quick task produces technically valid but practically unusable output.
Omitting NZ-specific constraints from prompts — models default to Australian or US regulatory context when jurisdiction is not specified; Revenue NZ numbers, NZ mobile formats, KiwiSaver Act rates, NZISM password rules, and RealMe assurance levels must all be supplied explicitly.
Not specifying the output format — without a format instruction the model chooses conversational prose or a format incompatible with the team's test management tool, adding reformatting overhead that negates the time saving.
Treating prompts as one-shot rather than iterative — writing a single prompt, getting mediocre output, and concluding the tool does not work, rather than applying the three-round refinement loop: run, identify the gap, add what was missing, run again.
Not building a team prompt library — each tester reinvents prompts independently, producing inconsistent formats across the team; a shared, validated prompt library for common tasks eliminates this and makes the quality gain permanent.

12 Self-Check

Click each question to reveal the answer.

Q1. What are the 6 components of a well-structured prompt?

Role, Context, Instruction, Input Data, Constraints, Output Format. Each contributes to precision. Missing context forces the model to guess; missing output format means you will need to reformat the result before you can use it. Omitting the role means the model will not apply the domain expertise that produces accurate NZ-specific content.

Q2. When would you choose few-shot prompting over zero-shot?

When you need the output in a specific format, or when the task is complex enough that examples clarify the expected result better than any description. Few-shot works especially well for generating multiple test cases in your team's exact format — provide 2–3 examples from the first user story and the model will match that format for all subsequent stories in the sprint.

Q3. What is prompt chaining and why is it useful for test analysis?

Prompt chaining breaks a complex task into sequential steps where each step's output feeds the next. It is useful for test analysis because models perform better on focused tasks. Analysing ambiguities, then writing acceptance criteria, then reviewing them for testability are three separate cognitive tasks. Combining them in a single prompt forces the model to multitask — doing each as a separate chained step produces better output at every stage.

Q4. How would you evaluate whether an AI-generated test case suite is high quality?

Check for correctness against business rules, completeness across equivalence partitions, format compliance, absence of hallucinated field names or values, and no redundant cases. A practical approach: spot-check 10–15% of cases against the spec. If the hallucination rate in that sample is above 10%, the prompt needs more input data and constraints before re-running.

Q5. What is the difference between a system prompt and a user prompt?

A system prompt sets the AI's persistent persona and constraints — it runs before every user message. A user prompt is the per-request instruction. In test tooling, a system prompt might set "you are a tester specialised in NZ financial services; always use NZ English; never invent field names not present in the specification" while user prompts send individual test analysis or test design tasks against that persistent context. System prompts are configured once; user prompts change with each task.

13 Interview Prep

These questions appear in CT-GenAI-focused interviews and in general senior tester interviews at NZ organisations that have adopted AI tooling.

Q: "Describe how you have used AI to improve your testing process."

Focus on a specific task: test case generation, defect analysis, or acceptance criteria review. Explain the prompt structure you used — role, context, instruction, format — the output you got, and how you validated it. Be honest about the review step. Interviewers are impressed by testers who understand AI limitations and have a process for catching hallucinations, not testers who claim AI does everything perfectly.

Q: "What techniques do you use to get consistent output from an AI model for testing?"

Few-shot prompting establishes the format pattern. System prompts set the persistent context for a session or tool. Prompt chaining breaks complex tasks into reliable steps. And I save prompts that work — reusing a validated prompt is far more consistent than writing a new one each time. For a team, I would store approved prompts in the wiki so everyone generates output in the same format.

Q: "How would you explain the risk of AI hallucination to a project manager who wants to use AI for all test case generation?"

I would explain that AI confidently generates plausible-sounding content that may be factually wrong. In testing, this means test cases with invented field names, wrong status codes, or incorrect business rules — tests that pass but test the wrong thing. The mitigation is a mandatory review step: AI generates the draft, a tester who knows the system verifies it against the spec. This is still significantly faster than writing from scratch, but it is not zero-effort. The risk of skipping the review is a test suite that passes while bugs remain.

Key takeaway

The AI is not the variable — the prompt is; discipline your prompts with the same rigour you bring to acceptance criteria and you will produce test artefacts faster, more consistently, and with fewer gaps than anything you can write from scratch.

← LLM Infrastructure All AI Testing topics Next: Adopting GenAI →