New · AI Testing Engineer

AI Testing Engineer — Evaluation & Governance

Testing generative AI is its own craft. This track teaches the three skills that separate an AI testing engineer from a tester who happens to use AI.

You already know why AI testing differs from traditional testing, and how to test data, models, and fairness against the standard. This track goes a layer deeper into the systems teams actually ship today: retrieval-augmented generation, prompts that can be attacked, and agents that take actions on their own. Each lesson is hands-on, with three prompt-lab exercises that mark your work against a real LLM.

This track covers

RAG Evaluation Prompt-Injection Testing Agent Testing Faithfulness & Grounding Guardrails Human-in-the-Loop

Builds on

The CT-GenAI foundations and the ISO/IEC 42119 module. This track assumes you understand what a large language model is, the five AI failure modes, and risk-based AI test scope.

Who this is for

Senior testers, Test Leads, and QA engineers putting generative AI features into NZ production systems — chatbots, document assistants, and agents that act on a customer's behalf.

The 6 lessons

Evaluate. Attack. Govern.

Lesson 1

RAG Evaluation

Evaluating retrieval-augmented generation. Faithfulness, answer relevance, and context precision and recall. How to build a RAG eval set, and the confident-but-ungrounded failure that passes every smoke test.

~30 min read · ~70 min with exercises · AI Evaluation

Lesson 2

Prompt-Injection Testing

Defensive prompt-injection vulnerability testing. Direct and indirect injection, jailbreaks, and data exfiltration. Testing system-prompt robustness and writing the defensive test cases that catch it before an attacker does.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 3

Agent Testing

Testing AI agents and agentic systems. Multi-step tool-use validation, non-determinism, guardrails, and human-in-the-loop sign-off. Model benchmarking and deterministic-consistency checks for systems that take actions.

~35 min read · ~80 min with exercises · AI Evaluation

Lesson 4

Model Benchmarking

Benchmarking model quality the right way: task-specific eval sets, golden datasets and scoring rubrics, A/B comparison of models and prompts, accuracy-vs-latency-vs-cost trade-offs, and regression-gating a model upgrade before it ships.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 5

Deterministic & Consistency Testing

Testing non-deterministic output: temperature and seed control, semantic-equivalence assertions instead of exact-match, tolerance bands, repeat-N stability checks, and the flaky-assertion patterns that wreck a generative test suite.

~30 min read · ~75 min with exercises · AI Evaluation

Lesson 6

Human-in-the-Loop Sign-off

Governance sign-off frameworks: when a human must approve, confidence thresholds and escalation, audit trails, reviewer sampling, four-eyes approval, and a RACI for AI decisions under the Privacy Act 2020 and public-sector accountability.

~30 min read · ~75 min with exercises · AI Evaluation

Why this track

The systems you ship now are generative

Most AI features going live in NZ today are not custom-trained models. They are generative AI wired into a business: a retrieval system that answers customer questions from policy documents, a prompt that summarises a case file, an agent that books an appointment or files a form. These systems fail in ways a model-accuracy test never catches. They answer confidently from the wrong document. They follow instructions hidden in the data they read. They take an action no one signed off on.

This track teaches you to test for exactly those failures. Lesson 1 makes the output measurable — is the answer actually grounded in what the system retrieved? Lesson 2 takes the attacker's view defensively — what happens when the input is hostile? Lesson 3 governs the riskiest case of all — an AI that acts, not just answers.

Throughout, the context is NZ: a Te Whatu Ora patient assistant, an IRD help bot, an MSD case agent, RealMe identity flows, and the Privacy Act 2020. If an AI feature shapes a decision about a person, you will be asked how it was tested. This track is how you answer.

Related

Where this fits