New · AI Testing Engineer

AI Testing Engineer — Evaluation & Governance

Testing generative AI is its own craft. This track teaches the three skills that separate an AI testing engineer from a tester who happens to use AI.

You already know why AI testing differs from traditional testing, and how to test data, models, and fairness against the standard. This track goes a layer deeper into the systems teams actually ship today: retrieval-augmented generation, prompts that can be attacked, and agents that take actions on their own. Each lesson is hands-on, with three prompt-lab exercises that mark your work against a real LLM.

Start with Lesson 1 → Back to Test with AI

This track covers

RAG Evaluation Prompt-Injection Testing Agent Testing Faithfulness & Grounding Guardrails Human-in-the-Loop Metamorphic Testing Neural Network Coverage

Builds on

The CT-GenAI foundations and the ISO/IEC 42119 module. This track assumes you understand what a large language model is, the five AI failure modes, and risk-based AI test scope.

Who this is for

Senior testers, Test Leads, and QA engineers putting generative AI features into NZ production systems — chatbots, document assistants, and agents that act on a customer's behalf.

The 8 lessons

Evaluate. Attack. Govern.

Lesson 1

RAG Evaluation

Evaluating retrieval-augmented generation. Faithfulness, answer relevance, and context precision and recall. How to build a RAG eval set, and the confident-but-ungrounded failure that passes every smoke test.

~30 min read · ~70 min with exercises · AI Evaluation

Lesson 2

Prompt-Injection Testing

Defensive prompt-injection vulnerability testing. Direct and indirect injection, jailbreaks, and data exfiltration. Testing system-prompt robustness and writing the defensive test cases that catch it before an attacker does.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 3

Agent Testing

Testing AI agents and agentic systems. Multi-step tool-use validation, non-determinism, guardrails, and human-in-the-loop sign-off. Model benchmarking and deterministic-consistency checks for systems that take actions.

~35 min read · ~80 min with exercises · AI Evaluation

Lesson 4

Model Benchmarking

Benchmarking model quality the right way: task-specific eval sets, golden datasets and scoring rubrics, A/B comparison of models and prompts, accuracy-vs-latency-vs-cost trade-offs, and regression-gating a model upgrade before it ships.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 5

Deterministic & Consistency Testing

Testing non-deterministic output: temperature and seed control, semantic-equivalence assertions instead of exact-match, tolerance bands, repeat-N stability checks, and the flaky-assertion patterns that wreck a generative test suite.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 6

Human-in-the-Loop Sign-off

Governance sign-off frameworks: when a human must approve, confidence thresholds and escalation, audit trails, reviewer sampling, four-eyes approval, and a RACI for AI decisions under the Privacy Act 2020 and public-sector accountability.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 7

Metamorphic Testing Relations

Testing AI systems when no oracle exists. The four MR types — invariance, monotonicity, equivalence, consistency — how to spot spurious correlations like demographic bias, and how to assert MRs on probabilistic LLM output.

~30 min read · ~75 min with exercises · CT-AI v2.0

Lesson 8

Neural Network Coverage Metrics

Why 99% accuracy is not enough. Neuron Coverage, k-multisection (k-MNAC), Neuron Boundary Coverage, SNAC, and layer-wise analysis — how to identify which parts of a model your test suite has never exercised.

~30 min read · ~75 min with exercises · CT-AI v2.0

Testing approaches compared

Different parts of an AI system need different testing approaches. Use this to decide which combination applies to your system.

Approach	What it verifies	What it misses	Automation	When to use
Unit / function test	Individual helper functions, prompt builders, parsers	Model behaviour, reasoning quality	Full	Always — fast, cheap, catches regressions in glue code
Prompt regression test	That a known prompt still produces an acceptable output	Novel inputs, distribution shift	Full	Before every model or prompt change
API / integration test	Service behaviour, contract conformance, latency SLAs	Agent decision quality, multi-step reasoning	Full	Whenever the AI system exposes an API surface
Agent / end-to-end test	Multi-step task completion, tool use, goal achievement	Reasoning quality, edge cases, adversarial inputs	Partial	For any agent with real-world tool access or user impact
LLM-as-judge eval	Output quality at scale — relevance, faithfulness, tone	Reproducibility — judge model introduces its own variance	Full	When human review cannot scale to cover the eval dataset
Security / red team	Prompt injection, jailbreaks, data extraction, IDOR	Novel attack patterns not in the scanner library	Partial	Mandatory before any AI system with external-facing access
Human review	Context, judgement, cultural nuance, ethical concerns	Scale — cannot review every output in production	Manual	High-stakes decisions, regulatory sign-off, tone/cultural checks

Why this track

The systems you ship now are generative

Most AI features going live in NZ today are not custom-trained models. They are generative AI wired into a business: a retrieval system that answers customer questions from policy documents, a prompt that summarises a case file, an agent that books an appointment or files a form. These systems fail in ways a model-accuracy test never catches. They answer confidently from the wrong document. They follow instructions hidden in the data they read. They take an action no one signed off on.

This track teaches you to test for exactly those failures. Lesson 1 makes the output measurable — is the answer actually grounded in what the system retrieved? Lesson 2 takes the attacker's view defensively — what happens when the input is hostile? Lesson 3 governs the riskiest case of all — an AI that acts, not just answers.

Throughout, the context is NZ: a HealthNZ patient assistant, an Revenue NZ help bot, an Benefits NZ case agent, RealMe identity flows, and the Privacy Act 2020. If an AI feature shapes a decision about a person, you will be asked how it was tested. This track is how you answer.

Context

How we got here

AI testing techniques evolve in response to what teams actually ship. Each wave of system architecture introduced failure modes the previous testing approach could not catch.

2017–2020 — Traditional ML

Testing focused on accuracy, precision, recall, and AUC on static datasets. Inputs were structured; outputs were classifications. Exact-match assertions worked. Fairness was rarely measured systematically.

2021–2022 — Foundation models go mainstream

Outputs became free text. The same prompt produced different wording each run. Exact-match assertions became flaky by design. Teams needed semantic-equivalence assertions, repeat-N stability checks, and new tools for asserting on meaning rather than strings.

2023 — RAG patterns gain adoption

Organisations wired LLMs to their own knowledge bases. Retrieval quality, faithfulness, and document staleness became primary failure modes. Teams learned that retrieval recall and answer faithfulness are different metrics requiring different tests.

2024 — Agentic AI emerges

AI systems began taking actions: booking appointments, sending emails, filing forms, initiating payments. Wrong answers became wrong actions. Testing had to cover action safety, tool-use validation, unbounded execution loops, and human approval workflows.

2025 — Governance frameworks mature

ISO/IEC 42119, CT-AI v2.0, and the OWASP LLM Top 10 gave structure to what teams were discovering empirically. Metamorphic testing for fairness, neural network coverage, and prompt injection testing moved from research into standard practice for regulated systems.

2026 — Operational assurance

The focus shifts from “can the model do this?” to “can we trust it to keep doing it in production, at scale, under adversarial conditions, across demographic groups, with full audit trails?” This track teaches those skills.

Decision guide

Which technique, and when?

Match your situation to the technique. Most real systems need more than one.

Your situation	Start with
Output varies on the same input — test suite is flaky	Deterministic-Consistency Testing
No correct answer to assert against (no test oracle)	Metamorphic Testing
System answers from retrieved documents	RAG Evaluation
Users can type free-form input to the AI	Prompt Injection Testing (mandatory)
AI takes actions — pays, books, files, sends	Agent Testing + HITL Sign-Off
Comparing two models or prompt versions	Model Benchmarking
Decision affects individuals — needs human accountability	Human-in-the-Loop Sign-Off
Need to verify fairness across demographic groups	Metamorphic Testing (demographic MRs)
Test suite may have blind spots in model behaviour	Neural Network Coverage

Where this fits

CT-GenAI

GenAI Foundations

How large language models work, and what they can and cannot do for testers.

ISO/IEC 42119

Testing AI Systems

Data, model, fairness, drift, and audit-ready testing against the international standard.

CT-GenAI

Managing AI Risks

Hallucination, bias, data privacy under the NZ Privacy Act 2020, and non-determinism.

Track context

This track

AI Testing & Quality — testing AI systems themselves, not using AI to test

Best for

Senior testers, test leads, test managers, and developers working on AI products or integrating LLMs into production systems

Foundation

Senior Tester track
Risk-based testing and API testing foundations make this track significantly more effective.

Standard

ISO/IEC TS 42119-2:2025
The international standard for testing AI systems — covered in depth in the ISO module.

AI Testing Engineer — Evaluation & Governance

This track covers

Builds on

Who this is for

Evaluate. Attack. Govern.

RAG Evaluation

Prompt-Injection Testing

Agent Testing

Model Benchmarking

Deterministic & Consistency Testing

Human-in-the-Loop Sign-off

Metamorphic Testing Relations

Neural Network Coverage Metrics

Testing approaches compared

The systems you ship now are generative

How we got here

Which technique, and when?

Where this fits

GenAI Foundations

Testing AI Systems

Managing AI Risks

Related techniques