Adopting GenAI in Your Test Organisation
Enthusiasm is not a strategy. Successful GenAI adoption requires approved tools, defined use cases, trained testers, documented workflows, and measurable outcomes — not just access to an AI tool on Monday morning.
1 The Hook
An NZ SaaS company's CTO made an announcement on a Monday morning: the company was adopting AI for testing, starting immediately. The team was told to find tools, get productive, and report back in a fortnight. The intent was genuine — the industry was changing, competitors were using AI, and the CTO wanted the company to keep pace.
By Wednesday, half the test team was using different AI tools. None of them had been reviewed by IT security. Two of the tools had terms of service that permitted training on user inputs. One tester had used a browser-based AI assistant to generate integration test cases and, to get relevant output, had pasted the client's proprietary API documentation into the prompt. A developer on another project had used a public code assistant to help write a test script, inadvertently including environment variables from the test configuration file in the context window.
The productivity gains were real. Test cases were generated faster. Some testers were saving two hours a day. But when the security team conducted a post-incident review three weeks later — triggered by an unrelated event — they found a trail of sensitive data flowing to unapproved services. The CTO's announcement had created a situation where 12 testers were making individual decisions about data handling that should have been made once, carefully, by the right people.
The company spent six weeks cleaning up the compliance exposure. Client contracts had to be reviewed for data handling clauses. Legal had to assess privacy risk. One client relationship was damaged when the client discovered their API documentation had been processed by a tool with unfavourable data terms. The productivity gains from the first fortnight were substantially offset by the six weeks of remediation work that followed.
The CTO's heart was in the right place. The problem was the absence of strategy. GenAI adoption without governance does not eliminate risk — it creates new risk while adding capability. This module is about doing the adoption correctly: with a strategy that enables the team to move fast without creating exposure they cannot see.
2 The Rule
Successful GenAI adoption in testing requires a deliberate strategy: approved tools, defined use cases, trained testers, documented workflows, and measurable outcomes. Shadow AI — using unapproved tools outside organisational controls — creates security, legal, and quality risks that can outweigh any productivity gains.
The word "deliberate" is doing important work here. It does not mean slow or bureaucratic. A one-page approved tool list, a simple data classification rule, and a two-hour prompt engineering workshop can be in place in a week. That is a strategy. What it prevents is the alternative: a dozen testers making ad-hoc decisions under pressure, each one creating a small exposure that adds up to a large one.
The goal is to enable fast, safe adoption — not to slow down capability adoption out of caution, and not to rush capability adoption at the expense of compliance. Both failure modes are real, and both have costs. Strategy is what keeps you between them.
3 The Analogy
Adopting GenAI without a strategy is like giving every tester a company credit card and saying "buy whatever tools you need."
Some will buy exactly what the team needs. Others will buy conflicting tools, duplicate subscriptions, or tools that expose the company to liability. A few will buy subscriptions they stop using after a week. The spending is not malicious — every individual decision seems reasonable in the moment. But the aggregate result is incoherent and expensive to unwind. A strategy is not bureaucracy — it is the difference between 20 testers rowing in the same direction and 20 testers rowing in random directions. The boat moves either way; only one version gets anywhere useful.
4 Shadow AI
CT-GenAI-5.1.1 defines shadow AI as the use of AI tools not approved by organisational security, compliance, or IT governance. In testing, shadow AI is particularly high-risk because testers routinely handle materials that carry significant sensitivity: system designs, acceptance criteria, client requirements, authentication flows, and defect histories that may contain real system details.
Data security
Unapproved tools may lack enterprise security controls, logging, encryption at rest, or data residency guarantees. Consumer-grade AI tools are typically optimised for general use, not for handling confidential business information. Once data is sent to an external service, your organisation has no control over how it is stored, processed, or retained.
Compliance risk
Tools that train on user inputs may breach client NDAs, the NZ Privacy Act 2020, or data processing agreements with enterprise clients. If a tester pastes a client's system specifications into a public AI tool that uses inputs for training, the client's intellectual property may become part of a public model. The legal exposure is real and difficult to undo.
IP exposure
Proprietary test plans, system architecture documentation, and business logic pasted into public tools may end up in a vendor's training data. Even if the tool does not explicitly train on inputs, the data has left your organisation's control. For NZ companies working with government clients or in regulated industries, this is a significant contract risk.
Quality inconsistency
Ten testers using ten different AI tools produce ten different quality levels and output formats. Without a shared approach, the team cannot build on each other's prompts, cannot establish quality benchmarks, and cannot assess whether AI adoption is actually improving outcomes. You need consistency to measure improvement.
Vendor risk
Teams that adopt unapproved AI tools organically can become dependent on tools that later change their pricing, modify their terms, or shut down. A team whose test case generation workflow depends on a specific consumer tool has taken on a dependency without an organisational decision having been made about that risk.
Several NZ government agencies have published guidance requiring risk assessment before AI tools are used with government data. The GCDO (Government Chief Digital Officer) provides a framework for evaluating AI tools against government information security classifications. The principle is the same for private sector organisations: know what you are approving before you approve it, and apply that governance consistently across the team.
5 GenAI Strategy
CT-GenAI-5.1.2 and 5.1.3 cover the elements of a deliberate GenAI testing strategy and the criteria for selecting models. A strategy does not need to be complex — it needs to be explicit and consistently applied.
Key strategy elements
1. Approved tool list
Which AI tools are cleared for use, by whom, and on what basis? Include: tool name, data handling terms, approved data classifications, and who signed off. Review the list quarterly. New tools must go through the approval process before use, not after.
2. Use case catalogue
Which test tasks can use AI? Which are prohibited? Examples: test case generation from user stories (approved); final sign-off on security test coverage (prohibited — requires human judgment); automated defect triage summaries (approved with review gate); generating test cases for PII-handling features using real data (prohibited — use synthetic data only).
3. Data classification rules
What data can be sent to AI tools? A simple three-tier rule works for most teams: Never (PII, health data, government-classified data, real authentication credentials); With care (internal specifications with client names removed); Freely (synthetic test data, public API documentation, generic acceptance criteria).
4. Quality gates
What review process applies to AI-generated artefacts before they enter the test suite? At minimum: peer review by a tester who knows the system, traceability check against requirements, and sign-off in the test management tool noting AI provenance. For high-risk test areas, require a senior tester to review and approve.
5. Model selection criteria
How do you choose which model to use for a given task? Document the criteria so individual testers are not making these decisions ad hoc. Key factors are covered below.
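Elements 1 and 3 above (the approved tool list and the three-tier data rule) lend themselves to a single mechanical check. The following is a minimal Python sketch, not a prescribed implementation. The tool names `internal-llm` and `vendor-assistant`, the sign-off role, and the tier each tool is cleared for are hypothetical placeholders; your own security team would define the real entries.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataTier(IntEnum):
    """Three-tier data rule; a higher value means more sensitive."""
    FREELY = 0      # synthetic test data, public API documentation
    WITH_CARE = 1   # internal specifications with client names removed
    NEVER = 2       # PII, health data, real authentication credentials


@dataclass(frozen=True)
class ApprovedTool:
    name: str
    max_tier: DataTier   # most sensitive tier the tool is cleared for
    signed_off_by: str


# Hypothetical entries for illustration only.
APPROVED_TOOLS = {
    "internal-llm": ApprovedTool("internal-llm", DataTier.WITH_CARE, "security-lead"),
    "vendor-assistant": ApprovedTool("vendor-assistant", DataTier.FREELY, "security-lead"),
}


def may_send(tool_name: str, tier: DataTier) -> bool:
    """Return True only if the tool is approved and cleared for this tier."""
    if tier == DataTier.NEVER:
        return False  # never sent to any AI tool, approved or not
    tool = APPROVED_TOOLS.get(tool_name)
    if tool is None:
        return False  # tool is not on the approved list (shadow AI)
    return tier <= tool.max_tier
```

A check like this cannot classify the data for you; a tester still has to decide which tier a document belongs to. What it does is make the decision once, in one place, instead of twelve times under pressure.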
Selecting LLMs and SLMs (CT-GenAI-5.1.3)
Not all tasks require the same model. Routing tasks to the right model reduces cost and latency without sacrificing quality. Decision criteria:
- Task complexity: Generating boilerplate test cases from a clear user story is a low-complexity task suitable for a smaller, faster model. Deriving equivalence partitions from ambiguous requirements benefits from a larger reasoning model.
- Cost per token vs quality trade-off: For high-volume pipeline tasks, the cost difference between model tiers compounds quickly. Benchmark quality on a sample before committing to a more expensive model for scale.
- Data residency requirements: If your organisation cannot send data to offshore cloud services, an on-premise small language model (SLM) may be required, with a quality trade-off accepted.
- Context window needed: Generating test cases from a 150-page functional specification requires a model with a large context window. Shorter tasks can use smaller-context models.
- Latency requirements: Interactive tools where testers are waiting for a response need fast models. Batch pipelines that run overnight can use slower, more thorough models.
Model Routing Matrix for Testers
| Task Type | Model Tier | Examples |
|---|---|---|
| Mechanical / Boilerplate | Fast / SLM | Reformatting test cases, generating synthetic data, drafting happy-path Gherkin. |
| Creative / Analytical | Large (General) | Identifying edge cases from vague specs, reviewing acceptance criteria for ambiguity. |
| Complex Logic / Code | Reasoning | Deriving complex business rule combinations, refactoring legacy test frameworks, debugging flaky async tests. |
Pro tip: Start with the lowest-tier model that could plausibly do the task. Escalate only if the output quality or reasoning logic shows clear gaps.
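The routing matrix and the "start low, escalate on evidence" tip can be written down as a small lookup. This is a sketch under stated assumptions: the task-type keys (`mechanical`, `analytical`, `complex_logic`) are illustrative labels for the three table rows, and the rule that only the small model is hosted on-premise is a hypothetical stand-in for whatever your data residency constraints actually are.

```python
from enum import IntEnum


class ModelTier(IntEnum):
    FAST_SLM = 1
    LARGE_GENERAL = 2
    REASONING = 3


# Starting tier per task type, mirroring the three rows of the matrix.
STARTING_TIER = {
    "mechanical": ModelTier.FAST_SLM,
    "analytical": ModelTier.LARGE_GENERAL,
    "complex_logic": ModelTier.REASONING,
}


def choose_tier(task_type: str, must_stay_on_premise: bool = False) -> ModelTier:
    """Pick the starting tier for a task.

    Assumption: only the SLM runs on-premise, so a residency
    constraint overrides the task-based choice.
    """
    if must_stay_on_premise:
        return ModelTier.FAST_SLM
    return STARTING_TIER[task_type]


def escalate(current: ModelTier) -> ModelTier:
    """Move up one tier after review shows clear quality gaps; cap at the top tier."""
    return ModelTier(min(current + 1, ModelTier.REASONING))
```

Keeping escalation as an explicit step, rather than defaulting to the largest model, is what makes the cost and latency trade-off visible to the team.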
6 Adoption Phases
CT-GenAI-5.1.4 describes a structured approach to rolling out GenAI in a test organisation. Skipping phases is how teams create the shadow AI problems described above — going straight from zero to scale without the governance foundations in place.
Phase 1: Exploration
Run a small, controlled pilot. One or two testers, one approved tool, one well-defined use case. Measure carefully: time per test case before and after, review time, hallucination rate, tester satisfaction. Document what works and what does not. The goal of exploration is learning, not productivity. Do not scale anything during this phase. A Wellington fintech running this phase might have two testers generating acceptance criteria from user stories for one product team across three sprints, tracking time savings and output quality before presenting findings.
Phase 2: Formalisation
Take the learnings from exploration and build the governance foundations: approved tool list, use case catalogue, data classification rules, quality gates. Train the team — not just on how to use the tool, but on prompt engineering, output evaluation, and data handling. Version-control the prompts that worked in the pilot. Only after formalisation is complete should you expand to more testers. This phase typically takes two to four weeks.
Phase 3: Scale
Roll out the approved use cases across the team or organisation. Integrate AI into CI/CD pipelines where it adds value. Monitor quality metrics. Run a shared prompt library where testers can access and contribute validated prompts. Establish a feedback loop for reporting AI output quality issues. This is the phase where productivity gains become substantial and measurable.
Phase 4: Optimisation
Review outcomes quarterly against the original metrics. Retire use cases that did not deliver value. Invest in RAG or fine-tuning for high-value tasks that justified the additional infrastructure. Evaluate new models as they become available. Keep the governance foundations updated as tools and team practices evolve. Optimisation is an ongoing discipline, not a final destination.
7 Skills and Change Management
Essential skills for testing with GenAI (CT-GenAI-5.2.1)
AI tools do not replace skill — they require a different skill set. Testers who use AI effectively have developed competencies that are not automatically present just because AI access is available.
- Prompt engineering: The ability to write prompts that produce high-quality, relevant output. This includes providing role, context, constraints, and output format, and knowing how to iterate when the first result is not right.
- AI output evaluation: Critical review of AI-generated test artefacts: spotting hallucinated field names, invented business rules, and plausible-but-wrong expected values. This requires domain knowledge of the system under test — which the model does not have.
- Basic LLM architecture understanding: Enough to explain context windows, hallucination, and data handling to a sceptical security team or a non-technical manager. Testers do not need to understand transformers, but they need to be able to explain why AI output requires review.
- Data privacy awareness: Understanding what data classification rules apply, which information can be sent to which tools, and how to use synthetic or anonymised data for AI-assisted tasks involving sensitive systems.
- Test artefact documentation: How to record AI provenance in test records — noting which artefacts were AI-assisted, which model and prompt version produced them, and what review process was applied.
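The prompt engineering skill listed above names four parts: role, context, constraints, and output format. One way to make that structure habitual is a small template function. The sketch below is illustrative only; the function name `build_prompt` and the example user story are assumptions, not part of the syllabus.

```python
def build_prompt(role: str, context: str, constraints: list[str], output_format: str) -> str:
    """Assemble a prompt from role, context, constraints, and output format."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Role: {role}\n"
        f"Context: {context}\n"
        f"Constraints:\n{constraint_lines}\n"
        f"Output format: {output_format}"
    )


# Hypothetical example using synthetic, non-sensitive content.
prompt = build_prompt(
    role="Senior test analyst for a payments API",
    context="User story: a customer can cancel a scheduled payment before its due date.",
    constraints=["Use synthetic data only", "Cover negative and boundary cases"],
    output_format="Gherkin scenarios",
)
print(prompt)
```

Iterating on a prompt then means changing one named part at a time, which makes it easier to see which change improved the output.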
Building AI capability in test teams (CT-GenAI-5.2.2)
Skill development does not happen through access alone. Practical approaches that work:
- Prompt engineering workshop: A two-hour hands-on session where testers write prompts for real test tasks, review each other's outputs, and discuss what improved quality. One workshop changes outcomes more than weeks of unsupported access.
- Shared prompt library: A team knowledge base — in Confluence, Notion, or a Git repository — where validated prompts are stored with notes on when they work and when they do not. Reduces duplicated experimentation.
- Peer review in retrospectives: Include AI output quality as a standing agenda item in sprint retrospectives for the first three months. What worked? What needed heavy revision? What should be added to the prompt library?
- CT-GenAI certification: The ISTQB Certified Tester AI Testing (CT-AI) and CT-GenAI certifications map directly to this module. Setting team certification as a goal creates a structured learning pathway and a common vocabulary.
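If the shared prompt library lives in a Git repository, each validated prompt needs at least a name, its versions, and notes on when it works. The following is a minimal in-memory sketch of that record. The class name `PromptEntry` and its fields are hypothetical; a real library would more likely be files under version control than a Python object.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptEntry:
    name: str
    versions: list[str] = field(default_factory=list)  # index 0 holds version 1
    notes: str = ""  # when the prompt works and when it does not

    def add_version(self, template: str) -> int:
        """Store a newly validated template and return its version number."""
        self.versions.append(template)
        return len(self.versions)

    def get(self, version: Optional[int] = None) -> str:
        """Return a specific version, or the latest when none is given."""
        if not self.versions:
            raise LookupError(f"no validated versions for {self.name!r}")
        if version is None:
            return self.versions[-1]
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"{self.name!r} has no version {version}")
        return self.versions[version - 1]
```

Being able to fetch an exact version matters later: a test record that cites a prompt version is only useful if that version can still be retrieved.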
How test processes shift with GenAI (CT-GenAI-5.2.3)
GenAI does not replace testing processes — it shifts the effort distribution within them. Understanding this prevents unrealistic expectations in both directions.
Test design
AI accelerates the first draft. A tester who previously spent 60 minutes writing test cases for a feature can now produce a first draft in 10 minutes and spend 50 minutes reviewing, enriching, and adding context-specific edge cases the AI did not generate. The total time may or may not decrease, but the output often improves because human effort shifts from mechanical writing to thoughtful review.
Test estimation
Effort distribution changes. Less time writing; more time reviewing AI output, managing prompt quality, and performing exploratory testing on areas the AI may have missed. Estimation models that treat test case writing as a primary time driver need updating. A team that has not recalibrated its estimates after AI adoption may find it is consistently finishing early or underestimating review time.
Test reporting
AI can automate the production of routine summary reports — sprint test summaries, defect pattern analyses, coverage overviews — freeing testers to focus on the insight work: interpreting what the data means, identifying risk, and communicating quality signals to stakeholders who need them.
Test roles
The test analyst role evolves toward "AI prompt author and output validator." Mechanical test case writing shrinks as a proportion of the job. Risk analysis, exploratory testing strategy, stakeholder communication, and AI output quality assurance grow. Testers who invest in these higher-order skills become more valuable, not less, as AI adoption matures.
8 Common Mistakes
🚫 Adopting AI tools before establishing governance
What happens: Tools spread organically across the team. Different testers make different data handling decisions. Sensitive information reaches unapproved services. When the security or compliance team eventually investigates, the remediation work is far more expensive than the governance would have been.
Correction: Governance does not need to be heavy. A one-page approved tool list and a simple data classification rule eliminate most shadow AI risk. Produce these before the first tool is deployed to the wider team. The exploration phase can happen before governance is fully formalised, as long as it stays small and controlled.
🚫 Measuring success by speed alone
What happens: A team reports that AI generates test cases 10 times faster. Leadership declares success and expands the programme. No one measures output quality. Six months later, someone notices the defect detection rate has fallen. The AI was generating test cases that looked right but missed the edge cases a human would have spotted. Speed without quality is worse than no speed improvement at all.
Correction: Measure quality alongside speed: hallucination rate (how often does AI output need correction?), review time (how long does it take to verify AI output?), and defect detection rate (are AI-assisted tests finding as many defects as human-written ones?). Set baselines in the exploration phase before making expansion decisions.
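Two of the measures above reduce to simple arithmetic once the pilot data is recorded. The sketch below shows one way to compute them; the record layout and the sample figures are hypothetical and only illustrate the calculation. Here "AI minutes" is assumed to include review time, so a fast draft that needs long review does not look like a saving.

```python
from dataclasses import dataclass


@dataclass
class PilotSample:
    baseline_minutes: float        # time per test case before AI
    ai_minutes: float              # drafting plus review time per test case with AI
    cases_generated: int
    cases_needing_correction: int


def time_saving_pct(s: PilotSample) -> float:
    """Percentage of time saved per test case against the baseline."""
    if s.baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return 100 * (s.baseline_minutes - s.ai_minutes) / s.baseline_minutes


def hallucination_rate_pct(s: PilotSample) -> float:
    """Percentage of AI-generated test cases that needed correction."""
    if s.cases_generated <= 0:
        raise ValueError("cases_generated must be positive")
    return 100 * s.cases_needing_correction / s.cases_generated


# Hypothetical figures for illustration only.
sample = PilotSample(baseline_minutes=12.0, ai_minutes=9.0,
                     cases_generated=40, cases_needing_correction=6)
print(time_saving_pct(sample))         # 25.0
print(hallucination_rate_pct(sample))  # 15.0
```

Defect detection rate needs a longer observation window than one sprint, which is another reason to set baselines in the exploration phase.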
🚫 Expecting testers to adopt AI without training
What happens: A team is given access to an AI tool and told to use it. Without prompt engineering training, most testers write poor prompts and receive poor output. They conclude the tool is not useful. A few enthusiasts figure it out independently. The adoption fragments: some testers using AI effectively, most not, no shared knowledge base.
Correction: A two-hour hands-on prompt engineering workshop — run before or at the same time as tool access — changes outcomes dramatically. Pair it with a shared prompt library so good prompts are preserved. Do not measure tool success until the team has been trained.
🚫 Not documenting AI provenance in test records
What happens: The test plan says the feature was fully tested. It does not say that 40% of the test cases were AI-generated and reviewed only by the tester who generated them. An external auditor asks how coverage was achieved. The team cannot answer clearly. In a regulated industry — banking, health, government — this is a compliance gap, not just a documentation gap.
Correction: The test plan should document which artefacts were AI-assisted, which model and prompt version were used, and what review process was applied. This does not need to be onerous: a field in the test case record and a paragraph in the test plan cover it. Set the expectation before the first AI-generated artefact enters a regulated test suite.
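The "field in the test case record" can be as small as the following sketch. The field names are assumptions chosen to match the three things the correction asks for (AI-assisted or not, model and prompt version, review applied); your test management tool will have its own schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AIProvenance:
    ai_assisted: bool
    model: Optional[str] = None
    prompt_version: Optional[str] = None
    reviewed_by: Optional[str] = None
    review_process: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """List the provenance fields that are required but still empty."""
        if not self.ai_assisted:
            return []  # nothing further to record for human-written artefacts
        required = ("model", "prompt_version", "reviewed_by", "review_process")
        return [name for name in required if not getattr(self, name)]


# Hypothetical record: generated with AI but not yet reviewed.
record = AIProvenance(ai_assisted=True, model="example-model", prompt_version="v3")
print(record.missing_fields())  # ['reviewed_by', 'review_process']
```

A check like `missing_fields()` could run as part of the quality gate, so an AI-assisted test case cannot enter the suite with its review left undocumented.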
9 Now You Try
You are the QA lead at a 15-person NZ financial services test team. The CTO has approved a 3-month AI testing pilot. Design your pilot plan below — use cases, data rules, success metrics, training approach, and what you will do after 3 months. The AI will evaluate it against CT-GenAI Chapter 5 criteria.
Model answer prompt
Review this AI testing pilot plan for an NZ financial services test team and provide structured feedback:
CONTEXT:
- 15-person QA team at NZ financial services company
- CTO has approved a 3-month pilot with budget for one approved AI tool
- Must comply with NZ Privacy Act 2020 and internal data classification policy
MY PILOT PLAN:
Use cases: (1) Test case generation from user stories, (2) Acceptance criteria review for ambiguity, (3) Defect triage summary generation
Data rules: Synthetic data only; no real customer names, IRD numbers, bank accounts, or transaction data in any prompt
Success metrics: Time per test case (before vs after), hallucination rate (% of AI test cases requiring correction), sprint test coverage achieved
Training: 2-hour prompt engineering workshop before pilot starts. Shared prompt library in Confluence from week 2.
After 3 months: Present to CTO with data. Formalise top 2 use cases. Retire anything with less than 20% time saving or more than 15% hallucination rate.
Evaluate against CT-GenAI Chapter 5. What is strong? What is missing? What governance elements should be added?
10 Self-Check
Q1. What is shadow AI and why is it particularly risky in a testing context?
Shadow AI is the use of AI tools not approved by organisational security, compliance, or IT governance. In testing, testers regularly handle sensitive materials: system specifications, authentication flows, acceptance criteria, and defect histories that may contain real system details. Unapproved tools may train on this input, creating data breach, IP exposure, and compliance risk under the NZ Privacy Act 2020. The risk is higher in testing than in many other roles precisely because the information testers work with is so detailed and sensitive.
Q2. What are the four phases of GenAI adoption in a test organisation?
Exploration (small controlled pilot, one use case, measure quality and time savings carefully); Formalisation (establish governance: approved tool list, data rules, quality gates, team training, prompt versioning); Scale (roll out approved use cases, integrate into CI/CD, shared prompt library, quality monitoring); Optimisation (quarterly review, retire underperforming use cases, invest in RAG or fine-tuning for high-value tasks).
Q3. Name three skills a tester needs to work effectively with GenAI.
Any three of: Prompt engineering (writing prompts that produce relevant, high-quality output); AI output evaluation (critical review to identify hallucinations, invented business rules, wrong expected values); data privacy awareness (understanding which data can go to which tools); basic LLM architecture understanding (enough to explain risks to stakeholders); test artefact documentation (recording AI provenance in test records).
Q4. How does AI adoption change the role of a test analyst?
The test analyst shifts from mechanical test case writing to "AI prompt author and output validator." The first draft of test cases is faster via AI; human effort shifts toward critical review, context-specific edge case identification, exploratory testing, and risk analysis. Effort distribution in estimation changes: less time writing, more time reviewing AI output. The role becomes more analytical and less mechanical, which increases the value of domain expertise and quality judgement.
Q5. Why should AI provenance be documented in test records?
Regulators and auditors may ask how test coverage was achieved. In regulated industries such as banking, health, and government, the test plan must account for how artefacts were produced and what review process was applied. Undocumented AI use is a compliance risk: if 40% of the test suite was AI-generated and reviewed only by a single tester, that needs to be stated and the review process documented. Documentation also enables quality tracking over time — if output from a particular model or prompt version is later found to be systematically wrong, you can identify which test runs were affected.
11 Interview Prep
Common interview questions on GenAI adoption for senior testing and QA lead roles.
Q: "How would you introduce AI-assisted testing to a team that has never used it?"
Start with a small, low-risk pilot on a single use case before anything is rolled out broadly. I would choose acceptance criteria review or test case generation for a non-critical feature — something with clear inputs and easily evaluated outputs. Run it for one sprint with two willing testers, measure quality and time savings honestly, and share the results with the team before deciding on broader adoption. Training comes first: a two-hour prompt engineering session before anyone touches the tool, with a shared prompt library seeded with three validated templates on day one. The goal of the pilot is learning, not productivity. The productivity comes after formalisation.
Q: "What would you include in an AI testing governance policy?"
Five core elements: an approved tool list with security sign-off noting which tools are cleared for which data classifications; data classification rules stating clearly what can and cannot be sent to AI tools (PII, client IP, and authentication data never); quality gate requirements for AI-generated artefacts, including who must review them before they enter the test suite; prompt versioning requirements for any AI used in automated or CI/CD pipelines; and a process for reporting concerns about AI output quality or unexpected behaviour. The policy does not need to be long — one page is sufficient for most teams — but it must be explicit and consistently applied before the first tool goes into wider use.
Q: "How has the role of the tester changed with the introduction of AI?"
The mechanical parts of the job — writing boilerplate test cases for obvious happy paths, formatting test case lists, producing sprint summary reports — are increasingly handled by AI, faster and with less effort. The genuinely valuable parts — understanding system risk, designing exploratory test strategies, evaluating AI output critically, identifying edge cases the AI did not consider, communicating quality signals to stakeholders — are more important than ever. AI does not replace good testing judgement. It amplifies it: a tester with strong domain knowledge and critical evaluation skills produces far better AI-assisted output than one who treats AI as a black box. The testers who will be most valued are those who invest in the higher-order skills, not those who become dependent on the tool.