AI Testing Engineer — Evaluation & Governance
Testing generative AI is its own craft. This track teaches the three skills that separate an AI testing engineer from a tester who happens to use AI.
You already know why AI testing differs from traditional testing, and how to test data, models, and fairness against the standard. This track goes a layer deeper into the systems teams actually ship today: retrieval-augmented generation, prompts that can be attacked, and agents that take actions on their own. Each lesson is hands-on, with three prompt-lab exercises that mark your work against a real LLM.
Evaluate. Attack. Govern.
Lesson 1: RAG Evaluation
Evaluating retrieval-augmented generation. Faithfulness, answer relevance, and context precision and recall. How to build a RAG eval set, and the confident-but-ungrounded failure that passes every smoke test.
~30 min read · ~70 min with exercises · AI Evaluation
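To make the retrieval metrics named in this lesson concrete, here is a minimal sketch of context precision and recall for a single eval-set question. It assumes a simple set-based definition (some evaluation frameworks use rank-weighted variants instead), and the document IDs and the function name `context_precision_recall` are hypothetical, not part of the course material.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved context for one question.

    precision: share of retrieved documents that are actually relevant.
    recall:    share of relevant documents that were actually retrieved.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical eval item: what the retriever returned vs. what a reviewer
# marked as the documents needed to answer the question.
retrieved = ["policy-12", "policy-40", "faq-3"]
relevant = ["policy-12", "policy-07"]

precision, recall = context_precision_recall(retrieved, relevant)
print(f"context precision={precision:.2f}, recall={recall:.2f}")
```

Low precision means the model is being handed irrelevant context; low recall means the document it needed never arrived. Neither score says whether the final answer is faithful to the context, which is why faithfulness is measured separately.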
Lesson 2: Prompt-Injection Testing
Defensive prompt-injection vulnerability testing. Direct and indirect injection, jailbreaks, and data exfiltration. Testing system-prompt robustness and writing the defensive test cases that catch it before an attacker does.
~30 min read · ~75 min with exercises · AI Evaluation
Lesson 3: Agent Testing
Testing AI agents and agentic systems. Multi-step tool-use validation, non-determinism, guardrails, and human-in-the-loop sign-off. Model benchmarking and deterministic-consistency checks for systems that take actions.
~35 min read · ~80 min with exercises · AI Evaluation
Lesson 4: Model Benchmarking
Benchmarking model quality the right way: task-specific eval sets, golden datasets and scoring rubrics, A/B comparison of models and prompts, accuracy-vs-latency-vs-cost trade-offs, and regression-gating a model upgrade before it ships.
~30 min read · ~75 min with exercises · AI Evaluation
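As a rough illustration of regression-gating a model upgrade, the sketch below compares a candidate's mean score on a golden dataset with the baseline's and blocks the upgrade if quality drops by more than an allowed margin. The per-item scores, the `max_drop` value, and the function name `regression_gate` are assumptions chosen for the example; a real gate would also weigh latency and cost.

```python
def regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Return True if the candidate may ship.

    Both lists hold per-item scores (0.0 to 1.0) on the same golden dataset.
    The candidate passes if its mean is no more than max_drop below baseline.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("Both models must be scored on the same items")
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean >= baseline_mean - max_drop


# Hypothetical rubric scores on a five-item golden set.
baseline = [1.0, 1.0, 0.0, 1.0, 1.0]     # mean 0.8
regressed = [1.0, 0.0, 0.0, 1.0, 1.0]    # mean 0.6
comparable = [1.0, 1.0, 1.0, 1.0, 0.0]   # mean 0.8, different failure

print(regression_gate(baseline, regressed))
print(regression_gate(baseline, comparable))
```

Note that the second candidate passes on the mean while failing a different item than the baseline did, so a per-item diff is worth reviewing alongside the aggregate gate.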
Lesson 5: Deterministic & Consistency Testing
Testing non-deterministic output: temperature and seed control, semantic-equivalence assertions instead of exact-match, tolerance bands, repeat-N stability checks, and the flaky-assertion patterns that wreck a generative test suite.
~30 min read · ~75 min with exercises · AI Evaluation
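A repeat-N stability check with a tolerance band could look something like the sketch below. It calls a generator several times, normalises the outputs, and asserts that the most common answer appears in at least a minimum share of runs. The `fake_generate` stub stands in for a real model call, the 0.8 threshold is an arbitrary assumption, and the whitespace-and-case normalisation is only a placeholder for a proper semantic-equivalence comparison.

```python
from collections import Counter
import itertools


def normalise(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def stability_rate(generate, prompt, n=5):
    """Call generate(prompt) n times and return the share of runs that
    match the most common normalised output."""
    outputs = [normalise(generate(prompt)) for _ in range(n)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / n


# Hypothetical stub: cycles through canned responses instead of calling an LLM.
_canned = itertools.cycle([
    "Your refund is due 7 July.",
    "your refund is due  7 July.",   # differs only in case and spacing
    "Your refund is due 7 July.",
    "Refunds arrive on 7 July.",     # same meaning, different wording
    "Your refund is due 7 July.",
])


def fake_generate(prompt):
    return next(_canned)


MIN_STABILITY = 0.8  # tolerance band: assumed value for illustration
rate = stability_rate(fake_generate, "When is my refund due?", n=5)
assert rate >= MIN_STABILITY, f"unstable output: {rate:.0%} agreement"
```

The fourth canned response means the same thing as the others but still counts as a mismatch here, which shows why exact-match comparison, even after normalisation, understates stability and is usually replaced by a semantic-equivalence check.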
Lesson 6: Human-in-the-Loop Sign-off
Governance sign-off frameworks: when a human must approve, confidence thresholds and escalation, audit trails, reviewer sampling, four-eyes approval, and a RACI for AI decisions under the Privacy Act 2020 and public-sector accountability.
~30 min read · ~75 min with exercises · AI Evaluation
The systems you ship now are generative
Most AI features going live in NZ today are not custom-trained models. They are generative AI wired into a business: a retrieval system that answers customer questions from policy documents, a prompt that summarises a case file, an agent that books an appointment or files a form. These systems fail in ways a model-accuracy test never catches. They answer confidently from the wrong document. They follow instructions hidden in the data they read. They take an action no one signed off on.
This track teaches you to test for exactly those failures. Lesson 1 makes the output measurable — is the answer actually grounded in what the system retrieved? Lesson 2 takes the attacker's view defensively — what happens when the input is hostile? Lesson 3 governs the riskiest case of all — an AI that acts, not just answers.
Throughout, the context is NZ: a Te Whatu Ora patient assistant, an IRD help bot, an MSD case agent, RealMe identity flows, and the Privacy Act 2020. If an AI feature shapes a decision about a person, you will be asked how it was tested. This track is how you answer.
Where this fits
GenAI Foundations
How large language models work, and what they can and cannot do for testers.
ISO/IEC 42119: Testing AI Systems
Data, model, fairness, drift, and audit-ready testing against the international standard.
CT-GenAI: Managing AI Risks
Hallucination, bias, data privacy under the NZ Privacy Act 2020, and non-determinism.