
Adopting GenAI in Your Test Organisation

Enthusiasm is not a strategy. Successful GenAI adoption requires approved tools, defined use cases, trained testers, documented workflows, and measurable outcomes — not just access to an AI tool on Monday morning.

Test with AI CT-GenAI Ch 5 — GenAI-5.1.1 to 5.2.3 ~35 min read · ~60 min with exercises

1 The Hook

An NZ SaaS company's CTO made an announcement on a Monday morning: the company was adopting AI for testing, starting immediately. The team was told to find tools, get productive, and report back in a fortnight. The intent was genuine — the industry was changing, competitors were using AI, and the CTO wanted the company to keep pace.

By Wednesday, half the test team was using different AI tools. None of them had been reviewed by IT security. Two of the tools had terms of service that permitted training on user inputs. One tester had used a browser-based AI assistant to generate integration test cases and, to get relevant output, had pasted the client's proprietary API documentation into the prompt. A developer on another project had used a public code assistant to help write a test script, inadvertently including environment variables from the test configuration file in the context window.

The productivity gains were real. Test cases were generated faster. Some testers were saving two hours a day. But when the security team conducted a post-incident review three weeks later — triggered by an unrelated event — they found a trail of sensitive data flowing to unapproved services. The CTO's announcement had created a situation where 12 testers were making individual decisions about data handling that should have been made once, carefully, by the right people.

The company spent six weeks cleaning up the compliance exposure. Client contracts had to be reviewed for data handling clauses. Legal had to assess privacy risk. One client relationship was damaged when the client discovered their API documentation had been processed by a tool with unfavourable data terms. The productivity gains from the first fortnight were substantially offset by the remediation work in the following month.

The CTO's heart was in the right place. The problem was the absence of strategy. GenAI adoption without governance does not eliminate risk — it creates new risk while adding capability. This module is about doing the adoption correctly: with a strategy that enables the team to move fast without creating exposure they cannot see.

2 The Rule

Successful GenAI adoption in testing requires a deliberate strategy: approved tools, defined use cases, trained testers, documented workflows, and measurable outcomes. Shadow AI — using unapproved tools outside organisational controls — creates security, legal, and quality risks that can outweigh any productivity gains.

The word "deliberate" is doing important work here. It does not mean slow or bureaucratic. A one-page approved tool list, a simple data classification rule, and a two-hour prompt engineering workshop can be in place in a week. That is a strategy. What it prevents is the alternative: a dozen testers making ad-hoc decisions under pressure, each one creating a small exposure that adds up to a large one.

The goal is to enable fast, safe adoption — not to slow down capability adoption out of caution, and not to rush capability adoption at the expense of compliance. Both failure modes are real, and both have costs. Strategy is what keeps you between them.

3 The Analogy

Adopting GenAI without a strategy is like giving every tester a company credit card and saying "buy whatever tools you need."

Some will buy exactly what the team needs. Others will buy conflicting tools, duplicate subscriptions, or tools that expose the company to liability. A few will buy subscriptions they stop using after a week. The spending is not malicious — every individual decision seems reasonable in the moment. But the aggregate result is incoherent and expensive to unwind. A strategy is not bureaucracy — it is the difference between 20 testers rowing in the same direction and 20 testers rowing in random directions. The boat moves either way; only one version gets anywhere useful.

4 Shadow AI

CT-GenAI-5.1.1 defines shadow AI as the use of AI tools not approved by organisational security, compliance, or IT governance. In testing, shadow AI is particularly high-risk because testers routinely handle materials that carry significant sensitivity: system designs, acceptance criteria, client requirements, authentication flows, and defect histories that may contain real system details.

Data security

Unapproved tools may lack enterprise security controls, logging, encryption at rest, or data residency guarantees. Consumer-grade AI tools are typically optimised for general use, not for handling confidential business information. Once data is sent to an external service, your organisation has no control over how it is stored, processed, or retained.

Compliance risk

Tools that train on user inputs may breach client NDAs, the NZ Privacy Act 2020, or data processing agreements with enterprise clients. If a tester pastes a client's system specifications into a public AI tool that uses inputs for training, the client's intellectual property may become part of a public model. The legal exposure is real and difficult to undo.

IP exposure

Proprietary test plans, system architecture documentation, and business logic pasted into public tools may end up influencing training data. Even if the tool does not explicitly train on inputs, the data has left your organisation's control. For NZ companies working with government clients or in regulated industries, this is a significant contract risk.

Quality inconsistency

Ten testers using ten different AI tools produce ten different quality levels and output formats. Without a shared approach, the team cannot build on each other's prompts, cannot establish quality benchmarks, and cannot assess whether AI adoption is actually improving outcomes. You need consistency to measure improvement.

Vendor risk

Teams that adopt unapproved AI tools organically can become dependent on tools that later change their pricing, modify their terms, or shut down. A team whose test case generation workflow depends on a specific consumer tool has taken on a dependency without an organisational decision having been made about that risk.

Several NZ government agencies have published guidance requiring risk assessment before AI tools are used with government data. The GCDO (Government Chief Digital Officer) provides a framework for evaluating AI tools against government information security classifications. The principle is the same for private sector organisations: know what you are approving before you approve it, and apply that governance consistently across the team.

5 GenAI Strategy

CT-GenAI-5.1.2 and 5.1.3 cover the elements of a deliberate GenAI testing strategy and the criteria for selecting models. A strategy does not need to be complex — it needs to be explicit and consistently applied.

Key strategy elements

1. Approved tool list

Which AI tools are cleared for use, by whom, and on what basis? Include: tool name, data handling terms, approved data classifications, and who signed off. Review the list quarterly. New tools must go through the approval process before use, not after.

2. Use case catalogue

Which test tasks can use AI? Which are prohibited? Examples: test case generation from user stories (approved); final sign-off on security test coverage (prohibited — requires human judgment); automated defect triage summaries (approved with review gate); generating test cases for PII-handling features using real data (prohibited — use synthetic data only).

3. Data classification rules

What data can be sent to AI tools? A simple three-tier rule works for most teams: Never (PII, health data, government-classified data, real authentication credentials); With care (internal specifications with client names removed); Freely (synthetic test data, public API documentation, generic acceptance criteria).

4. Quality gates

What review process applies to AI-generated artefacts before they enter the test suite? At minimum: peer review by a tester who knows the system, traceability check against requirements, and sign-off in the test management tool noting AI provenance. For high-risk test areas, require a senior tester to review and approve.

5. Model selection criteria

How do you choose which model to use for a given task? Document the criteria so individual testers are not making these decisions ad hoc. Key factors are covered below.
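As an illustration only, the approved tool list, the three-tier data rule, and the quality gates can be expressed as small checks that a team could run before a prompt is sent or an artefact is accepted. This is a minimal Python sketch, not a prescribed implementation: the tool name `example-assistant`, the field names, and the failure messages are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NEVER = "never"          # PII, health data, classified data, real credentials
    WITH_CARE = "with_care"  # internal specifications with client names removed
    FREELY = "freely"        # synthetic data, public API docs, generic criteria


@dataclass(frozen=True)
class ApprovedTool:
    name: str
    data_terms: str           # summary of the vendor's data handling terms
    cleared_tiers: frozenset  # data tiers this tool is approved for
    signed_off_by: str


# Hypothetical entry; real entries come from your own security sign-off.
APPROVED_TOOLS = {
    "example-assistant": ApprovedTool(
        name="example-assistant",
        data_terms="no training on inputs",
        cleared_tiers=frozenset({Tier.FREELY, Tier.WITH_CARE}),
        signed_off_by="security-lead",
    ),
}


def may_send(tool_name: str, tier: Tier) -> bool:
    """Allow a prompt only for a listed tool that is cleared for this data tier."""
    if tier is Tier.NEVER:
        return False  # never-tier data goes to no tool at all
    tool = APPROVED_TOOLS.get(tool_name)
    return tool is not None and tier in tool.cleared_tiers


@dataclass
class Review:
    peer_reviewed: bool          # reviewed by a tester who knows the system
    traced_to_requirement: bool  # traceability check passed
    provenance_noted: bool       # AI provenance recorded at sign-off
    high_risk: bool = False
    senior_approved: bool = False


def gate_failures(review: Review) -> list[str]:
    """Return unmet gate conditions; an empty list means the artefact may enter the suite."""
    checks = [
        (review.peer_reviewed, "peer review missing"),
        (review.traced_to_requirement, "traceability check missing"),
        (review.provenance_noted, "AI provenance not recorded"),
        (review.senior_approved or not review.high_risk, "senior approval missing"),
    ]
    return [message for passed, message in checks if not passed]
```

Unlisted tools and never-tier data both fail closed here, which mirrors the rule that new tools go through approval before use, not after.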

Selecting LLMs and SLMs (CT-GenAI-5.1.3)

Not all tasks require the same model. Routing tasks to the right model reduces cost and latency without sacrificing quality. Decision criteria:

Task complexity: Generating boilerplate test cases from a clear user story is a low-complexity task suitable for a smaller, faster model. Deriving equivalence partitions from ambiguous requirements benefits from a larger reasoning model.
Cost per token vs quality trade-off: For high-volume pipeline tasks, the cost difference between model tiers compounds quickly. Benchmark quality on a sample before committing to a more expensive model for scale.
Data residency requirements: If your organisation cannot send data to offshore cloud services, an on-premise small language model (SLM) may be required, with a quality trade-off accepted.
Context window needed: Generating test cases from a 150-page functional specification requires a model with a large context window. Shorter tasks can use smaller-context models.
Latency requirements: Interactive tools where testers are waiting for a response need fast models. Batch pipelines that run overnight can use slower, more thorough models.

Model Routing Matrix for Testers

Task Type	Model Tier	Examples
Mechanical / Boilerplate	Fast / SLM	Reformatting test cases, generating synthetic data, drafting happy-path Gherkin.
Creative / Analytical	Large (General)	Identifying edge cases from vague specs, reviewing acceptance criteria for ambiguity.
Complex Logic / Code	Reasoning	Deriving complex business rule combinations, refactoring legacy test frameworks, debugging flaky async tests.

Pro tip: Start with the lowest-tier model that could plausibly do the task. Escalate only if the output quality or reasoning logic shows clear gaps.
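The routing matrix and the "start low, escalate on evidence" tip can be sketched as a lookup plus a one-step escalation. The tier labels and task-type keys below are illustrative names chosen for this example, not identifiers from any tool or from the syllabus.

```python
# Tiers ordered from cheapest and fastest to most capable (labels are illustrative).
TIERS = ["fast_slm", "large_general", "reasoning"]

# Starting tier per task type, following the routing matrix.
START_TIER = {
    "mechanical": "fast_slm",
    "analytical": "large_general",
    "complex_logic": "reasoning",
}


def route(task_type: str) -> str:
    """Pick the starting tier; unknown task types start at the lowest tier."""
    return START_TIER.get(task_type, TIERS[0])


def escalate(tier: str) -> str:
    """Move up one tier after review finds clear quality gaps; the top tier stays put."""
    index = TIERS.index(tier)
    return TIERS[min(index + 1, len(TIERS) - 1)]
```

In practice the decision to call `escalate` would come from a human reviewer, since judging whether output quality "shows clear gaps" is itself a review task.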

6 Adoption Phases

CT-GenAI-5.1.4 describes a structured approach to rolling out GenAI in a test organisation. Skipping phases is how teams create the shadow AI problems described above — going straight from zero to scale without the governance foundations in place.

Phase 1

Exploration

Run a small, controlled pilot. One or two testers, one approved tool, one well-defined use case. Measure carefully: time per test case before and after, review time, hallucination rate, tester satisfaction. Document what works and what does not. The goal of exploration is learning, not productivity. Do not scale anything during this phase. A Wellington fintech running this phase might have two testers generating acceptance criteria from user stories for one product team across three sprints, tracking time savings and output quality before presenting findings.

Phase 2

Formalisation

Take the learnings from exploration and build the governance foundations: approved tool list, use case catalogue, data classification rules, quality gates. Train the team — not just on how to use the tool, but on prompt engineering, output evaluation, and data handling. Version-control the prompts that worked in the pilot. Only after formalisation is complete should you expand to more testers. This phase typically takes two to four weeks.

Phase 3

Scale

Roll out the approved use cases across the team or organisation. Integrate AI into CI/CD pipelines where it adds value. Monitor quality metrics. Run a shared prompt library where testers can access and contribute validated prompts. Establish a feedback loop for reporting AI output quality issues. This is the phase where productivity gains become substantial and measurable.

Phase 4

Optimisation

Review outcomes quarterly against the original metrics. Retire use cases that did not deliver value. Invest in RAG or fine-tuning for high-value tasks that justified the additional infrastructure. Evaluate new models as they become available. Keep the governance foundations updated as tools and team practices evolve. Optimisation is an ongoing discipline, not a final destination.

7 Skills and Change Management

Essential skills for testing with GenAI (CT-GenAI-5.2.1)

AI tools do not replace skill — they require a different skill set. Testers who use AI effectively have developed competencies that are not automatically present just because AI access is available.

Prompt engineering: The ability to write prompts that produce high-quality, relevant output. This includes providing role, context, constraints, and output format, and knowing how to iterate when the first result is not right.
AI output evaluation: Critical review of AI-generated test artefacts: spotting hallucinated field names, invented business rules, and plausible-but-wrong expected values. This requires domain knowledge of the system under test — which the model does not have.
Basic LLM architecture understanding: Enough to explain context windows, hallucination, and data handling to a sceptical security team or a non-technical manager. Testers do not need to understand transformers, but they need to be able to explain why AI output requires review.
Data privacy awareness: Understanding what data classification rules apply, which information can be sent to which tools, and how to use synthetic or anonymised data for AI-assisted tasks involving sensitive systems.
Test artefact documentation: How to record AI provenance in test records — noting which artefacts were AI-assisted, which model and prompt version produced them, and what review process was applied.
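The prompt structure named under prompt engineering (role, context, constraints, output format) can be captured in a small template function so that every tester starts from the same shape. This is a sketch under assumptions: the section labels and the example values are hypothetical, and the context shown is synthetic.

```python
def build_prompt(role: str, context: str, constraints: list[str],
                 output_format: str, task: str) -> str:
    """Assemble a prompt with explicit role, context, constraints, and output format."""
    lines = [f"ROLE: {role}", f"CONTEXT: {context}", "CONSTRAINTS:"]
    lines += [f"- {item}" for item in constraints]
    lines += [f"OUTPUT FORMAT: {output_format}", f"TASK: {task}"]
    return "\n".join(lines)


# Example values are placeholders; the user story is synthetic.
prompt = build_prompt(
    role="Senior test analyst for a payments system",
    context="User story: a customer can schedule a one-off payment (synthetic example)",
    constraints=["Use synthetic data only", "Flag any assumed business rule"],
    output_format="Numbered test cases with steps and expected results",
    task="Draft test cases covering happy path and boundary dates",
)
print(prompt)
```

Keeping the structure in code rather than in each tester's head also makes the prompt easy to version-control alongside the shared prompt library.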

Building AI capability in test teams (CT-GenAI-5.2.2)

Skill development does not happen through access alone. Practical approaches that work:

Prompt engineering workshop: A two-hour hands-on session where testers write prompts for real test tasks, review each other's outputs, and discuss what improved quality. One workshop changes outcomes more than weeks of unsupported access.
Shared prompt library: A team knowledge base — in Confluence, Notion, or a Git repository — where validated prompts are stored with notes on when they work and when they do not. Reduces duplicated experimentation.
Peer review in retrospectives: Include AI output quality as a standing agenda item in sprint retrospectives for the first three months. What worked? What needed heavy revision? What should be added to the prompt library?
CT-GenAI certification: The ISTQB Certified Tester AI Testing (CT-AI) and CT-GenAI certifications map directly to this module. Setting team certification as a goal creates a structured learning pathway and a common vocabulary.

How test processes shift with GenAI (CT-GenAI-5.2.3)

GenAI does not replace testing processes — it shifts the effort distribution within them. Understanding this prevents unrealistic expectations in both directions.

Test design

AI accelerates the first draft. A tester who previously spent 60 minutes writing test cases for a feature can now produce a first draft in 10 minutes and spend 50 minutes reviewing, enriching, and adding context-specific edge cases the AI did not generate. The total time may or may not decrease, but the output often improves because human effort shifts from mechanical writing to thoughtful review.

Test estimation

Effort distribution changes. Less time writing; more time reviewing AI output, managing prompt quality, and performing exploratory testing on areas the AI may have missed. Estimation models that treat test case writing as a primary time driver need updating. A team that has not recalibrated its estimates after AI adoption may find it is consistently finishing early or underestimating review time.

Test reporting

AI can automate the production of routine summary reports — sprint test summaries, defect pattern analyses, coverage overviews — freeing testers to focus on the insight work: interpreting what the data means, identifying risk, and communicating quality signals to stakeholders who need them.

Test roles

The test analyst role evolves toward "AI prompt author and output validator." Mechanical test case writing shrinks as a proportion of the job. Risk analysis, exploratory testing strategy, stakeholder communication, and AI output quality assurance grow. Testers who invest in these higher-order skills become more valuable, not less, as AI adoption matures.

Senior engineer insight

The teams that adopt GenAI fastest are not always the ones that get the most value from it. I have watched QA leads push AI into every part of their workflow in the first month, then spend the next two months unpicking AI-generated test cases that looked correct but were silently wrong because the model had no knowledge of their business rules. The turning point for most teams is when they stop asking "how do we use AI more" and start asking "how do we review AI output faster" — because that is where the real bottleneck moves to.

The most common mistake: treating AI output evaluation as a quick skim rather than a deliberate skill — the hallucinations that slip through are almost always in the expected values, not the test steps.

From the field

A Wellington government agency ran a GenAI pilot for test case generation under GCDO guidance, with a formal AI risk assessment completed before any tool was approved. The team assumed their main governance challenge would be the approval process — they expected IT security to be the bottleneck. What they discovered was that the harder problem was internal: testers with deep domain knowledge in the agency's legacy benefits system were producing excellent AI-assisted output, while newer testers were generating plausible-looking test cases that missed critical business rules specific to NZ Social Security Act obligations. The team's response was to create a two-tier review process — AI-generated cases for legacy workflows required sign-off from a senior analyst familiar with the business domain, not just any peer. The lesson that generalises: the quality of AI output in testing is bounded by the reviewer's domain knowledge, not the model's capability.

8 Common Mistakes

🚫 Adopting AI tools before establishing governance

What happens: Tools spread organically across the team. Different testers make different data handling decisions. Sensitive information reaches unapproved services. When the security or compliance team eventually investigates, the remediation work is far more expensive than the governance would have been.

Correction: Governance does not need to be heavy. A one-page approved tool list and a simple data classification rule eliminate most shadow AI risk. Produce these before the first tool is deployed to the wider team. The exploration phase can happen before governance is fully formalised, as long as it stays small and controlled.

🚫 Measuring success by speed alone

What happens: A team reports that AI generates test cases 10 times faster. Leadership declares success and expands the programme. No one measures output quality. Six months later, someone notices the defect detection rate has fallen. The AI was generating test cases that looked right but missed the edge cases a human would have spotted. Speed without quality is worse than no speed improvement at all.

Correction: Measure quality alongside speed: hallucination rate (how often does AI output need correction?), review time (how long does it take to verify AI output?), and defect detection rate (are AI-assisted tests finding as many defects as human-written ones?). Set baselines in the exploration phase before making expansion decisions.
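A minimal sketch of how these three measures might be computed from pilot tallies follows. The field names and the sample figures are invented for illustration, and the detection ratio assumes the AI-assisted and human-written tests covered comparable scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PilotSample:
    cases_generated: int      # AI-generated test cases in the sample
    cases_corrected: int      # of those, how many needed correction in review
    review_minutes: float     # total time spent verifying AI output
    defects_found_ai: int     # defects found by AI-assisted tests
    defects_found_human: int  # defects found by human-written tests (baseline)


def hallucination_rate(sample: PilotSample) -> float:
    """Share of AI-generated cases that needed correction."""
    if sample.cases_generated == 0:
        return 0.0
    return sample.cases_corrected / sample.cases_generated


def review_minutes_per_case(sample: PilotSample) -> float:
    """Average verification time per AI-generated case."""
    if sample.cases_generated == 0:
        return 0.0
    return sample.review_minutes / sample.cases_generated


def detection_ratio(sample: PilotSample) -> Optional[float]:
    """Defects found by AI-assisted tests relative to the human baseline, or None without one."""
    if sample.defects_found_human == 0:
        return None
    return sample.defects_found_ai / sample.defects_found_human


# Invented numbers for illustration only.
sample = PilotSample(cases_generated=80, cases_corrected=12,
                     review_minutes=240.0, defects_found_ai=9, defects_found_human=10)
```

Collecting the same tallies before the pilot starts gives the baseline that later expansion decisions are compared against.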

🚫 Expecting testers to adopt AI without training

What happens: A team is given access to an AI tool and told to use it. Without prompt engineering training, most testers write poor prompts and receive poor output. They conclude the tool is not useful. A few enthusiasts figure it out independently. The adoption fragments: some testers using AI effectively, most not, no shared knowledge base.

Correction: A two-hour hands-on prompt engineering workshop — run before or at the same time as tool access — changes outcomes dramatically. Pair it with a shared prompt library so good prompts are preserved. Do not measure tool success until the team has been trained.

🚫 Not documenting AI provenance in test records

What happens: The test plan says the feature was fully tested. It does not say that 40% of the test cases were AI-generated and reviewed only by the tester who generated them. An external auditor asks how coverage was achieved. The team cannot answer clearly. In a regulated industry — banking, health, government — this is a compliance gap, not just a documentation gap.

Correction: The test plan should document which artefacts were AI-assisted, which model and prompt version were used, and what review process was applied. This does not need to be onerous: a field in the test case record and a paragraph in the test plan cover it. Set the expectation before the first AI-generated artefact enters a regulated test suite.
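One way to picture the "field in the test case record" is a small structured provenance entry that can be serialised into whatever test management tool the team uses. The field names, identifiers, and values below are hypothetical placeholders, not a format required by any standard or tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AIProvenance:
    test_case_id: str
    ai_assisted: bool
    model: str           # model name and version as reported by the tool
    prompt_version: str  # version tag from the shared prompt library
    reviewed_by: str
    review_process: str  # for example "peer review" or "senior analyst sign-off"


# All values below are placeholders for illustration.
record = AIProvenance(
    test_case_id="TC-0001",
    ai_assisted=True,
    model="example-model-v1",
    prompt_version="testgen-prompt@1.2.0",
    reviewed_by="senior.analyst",
    review_process="senior analyst sign-off",
)

# Serialise for a custom field in the test management tool or an audit export.
print(json.dumps(asdict(record), indent=2))
```

Recording the prompt version alongside the model makes it possible to find affected test cases later if a particular prompt turns out to produce systematically wrong output.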

9 Now You Try

🤖 Live AI Prompt Lab — Pilot Plan Review

You are the QA lead at a 15-person NZ financial services test team. The CTO has approved a 3-month AI testing pilot. Design your pilot plan below — use cases, data rules, success metrics, training approach, and what you will do after 3 months. The AI will evaluate it against CT-GenAI Chapter 5 criteria.

Model answer prompt

Review this AI testing pilot plan for an NZ financial services test team and provide structured feedback:

CONTEXT:
- 15-person QA team at NZ financial services company
- CTO has approved a 3-month pilot with budget for one approved AI tool
- Must comply with NZ Privacy Act 2020 and internal data classification policy

MY PILOT PLAN:
Use cases: (1) Test case generation from user stories, (2) Acceptance criteria review for ambiguity, (3) Defect triage summary generation

Data rules: Synthetic data only. No real customer names, IRD numbers, bank accounts, or transaction data in any prompt

Success metrics: Time per test case (before vs after), hallucination rate (% of AI test cases requiring correction), sprint test coverage achieved

Training: 2-hour prompt engineering workshop before pilot starts. Shared prompt library in Confluence from week 2.

After 3 months: Present to CTO with data. Formalise top 2 use cases. Retire anything with less than 20% time saving or more than 15% hallucination rate.

Evaluate against CT-GenAI Chapter 5. What is strong? What is missing? What governance elements should be added?

Key takeaway

GenAI does not give your test team a shortcut — it gives them a lever, and the strength of that lever depends entirely on the governance, training, and critical judgment the team brings to reviewing what the model produces.

Why teams fail here

Skipping formalisation and going straight to scale — running a two-week pilot and immediately rolling out to the whole team without approved tool lists, data rules, or quality gates. This is how the shadow AI problem gets institutionalised rather than prevented.
Measuring only speed, not quality — declaring the pilot a success because test cases were generated faster, without tracking hallucination rates, defect detection rates, or how much revision AI output required before it was usable.
No training before access — giving testers tool access and expecting organic skill development. Testers who write poor prompts conclude the tool is useless; the team fragments into enthusiasts and sceptics with no shared approach or prompt library.
Treating all use cases as equal risk — applying the same light-touch review to AI-generated test cases for a payment gateway as for a marketing content form. High-risk test areas need senior review gates; not everything the model generates carries the same consequence if it is wrong.
Failing to document AI provenance in regulated test artefacts — not recording which test cases were AI-assisted, which model version was used, or what review process was applied. In NZ's banking, health, and government sectors, this is a compliance gap that surfaces at the worst possible time: during an external audit.
Letting the pilot run indefinitely without a decision gate — a pilot with no defined end date and no success criteria becomes permanent shadow practice. Define upfront what metrics at what threshold mean you formalise, and what means you stop.
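A decision gate of this kind can be written down as a rule before the pilot starts, so the outcome is not negotiated after the fact. The sketch below borrows the thresholds from the model answer prompt above (at least 20% time saving, at most 15% hallucination rate) as defaults; they are example values, and the function name and return labels are hypothetical.

```python
def pilot_decision(time_saving_pct: float, hallucination_pct: float,
                   min_saving_pct: float = 20.0,
                   max_hallucination_pct: float = 15.0) -> str:
    """Return 'formalise' when both thresholds are met, otherwise 'retire'."""
    if time_saving_pct < min_saving_pct or hallucination_pct > max_hallucination_pct:
        return "retire"
    return "formalise"


# Example: a use case saving 35% of time with 8% of cases needing correction.
print(pilot_decision(time_saving_pct=35.0, hallucination_pct=8.0))
```

A real gate would likely add a third outcome such as "extend with changes", but even a two-outcome rule forces the team to agree on thresholds upfront.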

Enterprise reality

Large QA organisations rolling out AI tools across multiple teams and programmes

AI tool adoption requires an organisation-wide use policy — approved tool lists, data classification standards, and accepted use rules — before any team can begin using a tool. Without this, each team makes independent decisions and the organisation ends up with 20 different tools, 20 different data handling practices, and no way to enforce consistent governance.
Data classification determines which systems AI tools can access. In enterprise environments with multiple data sensitivity tiers — public, internal, confidential, restricted — not all test data is suitable for any AI tool. Regulated data (banking transaction records, health information, government-classified material) may be completely off-limits for cloud-based tools regardless of how the tool is marketed.
Productivity gains must be measurable and reported to justify the AI investment. Enterprise programmes need ROI evidence: time saved per test cycle, reduction in defect escape rates, cost per test case. Without baselines collected before rollout, teams cannot demonstrate value to the executives who approved the budget — and second-phase funding depends on those numbers.
Training and change management for AI tool adoption takes as long as the tooling work — and is frequently underestimated. Rolling out an approved tool to 80 testers across 12 programmes requires structured onboarding, prompt engineering workshops at team level, and a sustained change management effort to shift habits. Organisations that treat training as a one-day event discover that most testers revert to old workflows within a month.

How this has changed

The field moved fast. Here is how GenAI adoption in QA has evolved.

2022

ChatGPT launches. QA teams start using LLMs for test case generation and documentation. No standards exist. Teams experiment ad hoc.

2023

Prompt engineering for testing becomes a recognised skill. First QA-focused LLM tools ship (Testsigma AI, Applitools, Mabl AI). Hallucination is the dominant concern.

2024

AI coding assistants (Copilot, Cursor) enter QA workflows. AI agents start running test scripts autonomously. ISO/IEC begins work on AI testing standards. Governance becomes a concern.

2025

ISO/IEC TS 42119-2, a technical specification giving an overview of testing AI systems, is published. Enterprise risk classification frameworks arrive. AI agent testing becomes its own discipline.

Now

Organisations treat AI as a regulated system with formal test obligations. QA teams own AI quality, not just AI-assisted testing.

10 Self-Check


Q1. What is shadow AI and why is it particularly risky in a testing context?

Shadow AI is use of AI tools not approved by organisational security and compliance. In testing, testers regularly handle sensitive materials: system specifications, authentication flows, acceptance criteria, and defect histories that may contain real system details. Unapproved tools may train on this input, creating data breach, IP exposure, and compliance risk under the NZ Privacy Act 2020. The risk is higher in testing than in many other roles precisely because the information testers work with is so detailed and sensitive.

Q2. What are the four phases of GenAI adoption in a test organisation?

Exploration (small controlled pilot, one use case, measure quality and time savings carefully); Formalisation (establish governance: approved tool list, data rules, quality gates, team training, prompt versioning); Scale (roll out approved use cases, integrate into CI/CD, shared prompt library, quality monitoring); Optimisation (quarterly review, retire underperforming use cases, invest in RAG or fine-tuning for high-value tasks).

Q3. Name three skills a tester needs to work effectively with GenAI.

Any three of: Prompt engineering (writing prompts that produce relevant, high-quality output); AI output evaluation (critical review to identify hallucinations, invented business rules, wrong expected values); data privacy awareness (understanding which data can go to which tools); basic LLM architecture understanding (enough to explain risks to stakeholders); test artefact documentation (recording AI provenance in test records).

Q4. How does AI adoption change the role of a test analyst?

The test analyst shifts from mechanical test case writing to "AI prompt author and output validator." The first draft of test cases is faster via AI; human effort shifts toward critical review, context-specific edge case identification, exploratory testing, and risk analysis. Effort distribution in estimation changes: less time writing, more time reviewing AI output. The role becomes more analytical and less mechanical, which increases the value of domain expertise and quality judgement.

Q5. Why should AI provenance be documented in test records?

Regulators and auditors may ask how test coverage was achieved. In regulated industries such as banking, health, and government, the test plan must account for how artefacts were produced and what review process was applied. Undocumented AI use is a compliance risk: if 40% of the test suite was AI-generated and reviewed only by a single tester, that needs to be stated and the review process documented. Documentation also enables quality tracking over time — if output from a particular model or prompt version is later found to be systematically wrong, you can identify which test runs were affected.

11 Interview Prep

Common interview questions on GenAI adoption for senior testing and QA lead roles.

Q: "How would you introduce AI-assisted testing to a team that has never used it?"

Start with a small, low-risk pilot on a single use case before anything is rolled out broadly. I would choose acceptance criteria review or test case generation for a non-critical feature — something with clear inputs and easily evaluated outputs. Run it for one sprint with two willing testers, measure quality and time savings honestly, and share the results with the team before deciding on broader adoption. Training comes first: a two-hour prompt engineering session before anyone touches the tool, with a shared prompt library seeded with three validated templates on day one. The goal of the pilot is learning, not productivity. The productivity comes after formalisation.

Q: "What would you include in an AI testing governance policy?"

Five core elements: an approved tool list with security sign-off noting which tools are cleared for which data classifications; data classification rules stating clearly what can and cannot be sent to AI tools (PII, client IP, and authentication data never); quality gate requirements for AI-generated artefacts, including who must review them before they enter the test suite; prompt versioning requirements for any AI used in automated or CI/CD pipelines; and a process for reporting concerns about AI output quality or unexpected behaviour. The policy does not need to be long — one page is sufficient for most teams — but it must be explicit and consistently applied before the first tool goes into wider use.

Q: "How has the role of the tester changed with the introduction of AI?"

The mechanical parts of the job — writing boilerplate test cases for obvious happy paths, formatting test case lists, producing sprint summary reports — are increasingly handled by AI, faster and with less effort. The genuinely valuable parts — understanding system risk, designing exploratory test strategies, evaluating AI output critically, identifying edge cases the AI did not consider, communicating quality signals to stakeholders — are more important than ever. AI does not replace good testing judgement. It amplifies it: a tester with strong domain knowledge and critical evaluation skills produces far better AI-assisted output than one who treats AI as a black box. The testers who will be most valued are those who invest in the higher-order skills, not those who become dependent on the tool.
