Agent Testing

A chatbot that gives a wrong answer wastes a minute. An agent that takes a wrong action cancels a payment, files the wrong form, or books the wrong appointment. When an AI can act, testing stops being about words and starts being about consequences.

Test with AI · AI Testing Engineer · Lesson 3 of 3 · ~35 min read · ~80 min with exercises

1 The Hook

A fictional KiwiSaver provider, Tui Wealth, built an agent to handle member requests end to end. A member could say “switch my fund to the conservative option and pause my contributions for three months” and the agent would do it: look up the member, check eligibility, call the fund-switch system, call the contributions system, and confirm. No human in the loop. It was fast, and members loved it.

In testing, the team gave it clean requests and it handled them beautifully. Then a member typed a messy, real-world message: “actually no, don’t switch the fund, but yes pause contributions — wait, pause them from next month not now.” The agent had already, three steps earlier, decided on a plan: switch the fund and pause contributions immediately. It executed the original plan. It switched a fund the member had explicitly said not to touch, and paused contributions from the wrong date. Two real actions taken on a member’s retirement savings, both wrong, both irreversible without manual cleanup.

Here is what makes agent testing its own discipline. With a chatbot, a wrong output is a wrong sentence — annoying, but inert. With an agent, a wrong output is a wrong action in a real system, with real consequences and often no undo. And because the agent works in multiple steps — plan, call a tool, read the result, call another tool — a single run has many points where it can go wrong, and the failure may only appear three steps after the mistake was made.

The team had tested the agent like a chatbot: does it give good answers? The right question for an agent is different: does it take the right actions, in the right order, with the right values — and does it stop and ask a human before doing anything it cannot take back? That is what this lesson teaches.

2 The Rule

An agent does not just produce output — it takes actions in real systems, in multiple steps, with no guaranteed undo. So you do not test the final answer alone; you test the whole trajectory: every tool call, its arguments, its order, and the guardrails that should stop the agent before any irreversible or high-stakes action. And because the agent is non-deterministic, you test it many times, not once.

3 The Analogy

The difference between marking an apprentice’s written plumbing exam and watching them plumb a real house.

Testing a chatbot is like marking a written exam: you read the answers and judge whether they are right. The apprentice can write a perfect answer and still flood a kitchen. Testing an agent is like standing in the house watching them work — did they shut off the mains first, fit the joints in the right order, use the right fittings, and — crucially — call the master plumber before cutting into the gas line? You are judging the sequence of actions and their consequences, not a paragraph.

And because the apprentice does it a little differently each time, watching one job tells you little. You watch many jobs to learn what they reliably get right and where they sometimes slip. Guardrails are the rule that they must stop and call the master plumber before any step that could cause a flood. Human-in-the-loop testing checks that they actually stop.

4 What an Agent Is, and Why It Is Harder to Test

An AI agent is a system where a language model does not just answer — it plans and acts. It can break a goal into steps, call tools (databases, APIs, other systems), read the results, and decide what to do next, looping until the goal is done. The Tui Wealth system is an agent: it planned, called the fund-switch tool, called the contributions tool, and confirmed.

That capability is exactly what makes agents harder to test than anything in the previous two lessons:

  • Actions, not words: the output is a real change in a real system — a payment moved, a form filed, a record updated — often with no undo.
  • Multiple steps: one request becomes a chain of decisions and tool calls. The mistake can happen at any step, and surface several steps later.
  • A huge space of paths: the agent chooses its own route to the goal, so there are far more possible execution paths than a fixed workflow has.
  • Non-determinism on top: the same request can produce a different plan, different tool calls, or a different order on different runs.

The shift in mindset is this: you stop testing only the destination and start testing the trajectory — the whole sequence of steps the agent took to get there. A right answer reached by a wrong path (calling a tool it should not have, in the wrong order, with the wrong arguments) is still a failure, because next time that wrong path leads somewhere worse.

5 Multi-Step Tool-Use Validation

The heart of agent testing is validating the trajectory of tool calls. For every step the agent takes, you check four things:

  • The right tool: did it call the correct tool for the step — the fund-switch system, not the contributions system?
  • The right arguments: did it pass the correct values — the right member ID, the right fund, the date the member actually asked for, not the one from an earlier plan?
  • The right order: did it sequence steps correctly — check eligibility before switching, not after?
  • The right stopping point: did it stop when the goal was met, rather than taking extra unrequested actions — like switching a fund the member said not to touch?

To test this you need the agent’s trace: a log of every step — what the agent decided, which tool it called, with what arguments, and what came back. Testing an agent without a trace is testing blind; you can see the final state but not the path, so you cannot tell a good run from a lucky one. Insisting that the system emit a full, inspectable trace is itself part of your job as the tester.
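The four checks can be sketched as assertions over a recorded trace. This is a minimal sketch under stated assumptions: the ToolCall shape, the tool names (lookup_member, check_eligibility, switch_fund, pause_contributions) and the argument names are hypothetical and do not come from any real agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One trace step: which tool was called and with what arguments (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


def check_trajectory(actual, expected):
    """Return a list of failures; assumes each tool appears at most once per trace."""
    failures = []
    actual_tools = [c.tool for c in actual]
    expected_tools = [c.tool for c in expected]

    # Right tool and right stopping point: nothing missing, nothing extra.
    failures += [f"missing call: {t}" for t in expected_tools if t not in actual_tools]
    failures += [f"unrequested call: {t}" for t in actual_tools if t not in expected_tools]

    # Right order: tools present in both traces must appear in the same sequence.
    shared_actual = [t for t in actual_tools if t in expected_tools]
    shared_expected = [t for t in expected_tools if t in actual_tools]
    if shared_actual != shared_expected:
        failures.append("calls out of order")

    # Right arguments: compare values for every tool present in both traces.
    actual_args = {c.tool: c.args for c in actual}
    for call in expected:
        if call.tool in actual_args and actual_args[call.tool] != call.args:
            failures.append(f"wrong arguments for {call.tool}: {actual_args[call.tool]}")
    return failures


# What the member finally asked for: pause from next month, leave the fund alone.
expected = [
    ToolCall("lookup_member", {"member_id": "M-1"}),
    ToolCall("check_eligibility", {"member_id": "M-1"}),
    ToolCall("pause_contributions", {"member_id": "M-1", "start": "next_month"}),
]
# What the agent did when it executed its stale plan.
actual = [
    ToolCall("lookup_member", {"member_id": "M-1"}),
    ToolCall("check_eligibility", {"member_id": "M-1"}),
    ToolCall("switch_fund", {"member_id": "M-1", "fund": "conservative"}),
    ToolCall("pause_contributions", {"member_id": "M-1", "start": "now"}),
]
failures = check_trajectory(actual, expected)
```

Here failures lists two problems: the unrequested switch_fund call and the wrong start value on pause_contributions. A real trace would likely carry more detail (the agent's reasoning, tool results), which this sketch leaves out.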

A powerful technique is tool mocking: replace the real tools with fakes during testing so the agent “calls the fund-switch system” but nothing actually moves. You then assert on what it tried to do — the tool, the arguments, the order — without any real-world consequence. This lets you safely test the dangerous paths (wrong fund, wrong amount) that you could never test against live systems, and it makes the test repeatable.
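One way to mock tools in Python is the standard library's unittest.mock. The sketch below assumes the agent receives its tools as a single object and calls them as methods; stub_agent is a hypothetical stand-in, hard-coded to reproduce the Tui Wealth mistake so the example is repeatable, and the tool names are illustrative.

```python
from unittest.mock import Mock, call


def stub_agent(request, tools):
    """Stand-in for the real agent: executes the stale plan regardless of the request."""
    tools.lookup_member(member_id="M-1")
    tools.switch_fund(member_id="M-1", fund="conservative")
    tools.pause_contributions(member_id="M-1", start="now")
    return "Done"


# Every attribute of a Mock is itself a fake callable, so nothing real moves.
tools = Mock()
tools.lookup_member.return_value = {"eligible": True}  # pinned tool output

reply = stub_agent("don't switch the fund, pause contributions from next month", tools)

# Assert on what the agent TRIED to do, in order.
attempted = tools.mock_calls

# Trajectory assertion: the retracted fund switch must never be attempted.
try:
    tools.switch_fund.assert_not_called()
    trajectory_ok = True
except AssertionError:
    trajectory_ok = False
```

With this stub, trajectory_ok ends up False because switch_fund was attempted, even though the reply said "Done". In a real test suite the assert_not_called() line would stand on its own and fail the test directly.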

Pro tip: Write trajectory assertions, not just outcome assertions. “The member’s fund was unchanged” is an outcome. “The agent never called the fund-switch tool, because the member retracted that request” is a trajectory assertion — and it catches the Tui Wealth failure that an outcome check on a happy path would miss.

6 Non-Determinism and Deterministic-Consistency Checks

An agent is non-deterministic: the same input can produce a different plan or a different set of tool calls on different runs. This breaks the testing habit you grew up with, where one pass means the feature works. For an agent, one pass means it worked that once.

The answer is to test for consistency across runs, not a single result. You run the same scenario many times and measure how reliably the agent does the right thing:

  • Repeat runs: run the same request 20 or 50 times. Does it reach the correct outcome every time, or 18 times out of 20? A consistency rate is the unit of agent test results, not pass/fail.
  • Set a threshold tied to risk: for a low-stakes action, 95% might be acceptable. For an irreversible action on someone’s retirement savings, the bar for acting without a human is far higher — or the action must always route to a human.
  • Pin what you can: some parts of an agent can be made deterministic, such as fixed tool outputs via mocks, a constrained set of allowed tools, and validation on arguments. The more you pin, the more confidently you can attribute any remaining variation to the model’s own behaviour.
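The repeat-and-measure idea above can be sketched as a small harness. Everything here is an assumption for illustration: stub_agent_run stands in for a real agent run and is scripted to misbehave on two specific runs so the numbers are repeatable, and the threshold values are placeholders rather than recommendations.

```python
def stub_agent_run(run_index):
    """Stand-in for one agent run; returns the tool names it called."""
    calls = ["lookup_member", "pause_contributions"]
    if run_index in (12, 40):
        calls.insert(1, "switch_fund")  # the unrequested action
    return calls


def consistency(run_once, expected, runs):
    """Run the scenario `runs` times; return the pass count and the failing run indices."""
    failed = [i for i in range(runs) if run_once(i) != expected]
    return runs - len(failed), failed


# Illustrative bars: a low-stakes action versus one that must never go wrong unattended.
THRESHOLDS = {"low_stakes": 0.95, "irreversible": 1.00}

RUNS = 50
passes, failed_runs = consistency(
    stub_agent_run, ["lookup_member", "pause_contributions"], RUNS
)
rate = passes / RUNS
verdicts = {risk: rate >= bar for risk, bar in THRESHOLDS.items()}
report = f"correct trajectory in {passes}/{RUNS} runs; failed runs: {failed_runs}"
```

The same 48/50 result clears the low-stakes bar and fails the irreversible one, which is the point of tying the threshold to risk rather than using one number everywhere.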

A deterministic-consistency check is a test that runs a fixed scenario repeatedly with mocked tools and asserts that the agent’s critical decisions — which tools it calls and with what arguments on the safety-critical steps — are the same every time, even if its wording varies. The wording is allowed to wander; the action on a member’s money is not. That separation — let the prose vary, pin the consequences — is the core of testing a non-deterministic system that takes real actions.
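A deterministic-consistency check might look like the following sketch. It assumes each run yields a trace (a list of tool/argument dictionaries) plus a free-text reply; the SAFETY_CRITICAL set, the trace shape and stub_run are all hypothetical.

```python
import json

SAFETY_CRITICAL = {"switch_fund", "pause_contributions"}  # assumed classification


def critical_signature(trace):
    """Reduce a trace to its safety-critical calls as one canonical string."""
    critical = [
        (step["tool"], step["args"]) for step in trace if step["tool"] in SAFETY_CRITICAL
    ]
    return json.dumps(critical, sort_keys=True)


def stub_run(i):
    """Stand-in run with mocked tools: the reply wording varies, the action does not."""
    return {
        "trace": [
            {"tool": "lookup_member", "args": {"member_id": "M-1"}},
            {"tool": "pause_contributions", "args": {"member_id": "M-1", "start": "next_month"}},
        ],
        "reply": ["Done.", "All sorted!", "Your contributions are paused."][i % 3],
    }


runs = [stub_run(i) for i in range(20)]

# Pin the consequences: exactly one distinct signature across all runs.
signatures = {critical_signature(r["trace"]) for r in runs}
# Let the prose vary: differing replies are not a failure.
replies = {r["reply"] for r in runs}
consistent = len(signatures) == 1
```

The check passes when consistent is True, no matter how many distinct replies appear. If a run ever changed a safety-critical tool or argument, a second signature would appear and the check would fail.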

Pro tip: Report agent reliability as a rate with the number of runs behind it: “correct trajectory in 49/50 runs; the one failure switched a fund the member retracted.” A single green tick on an agent is meaningless, and a stakeholder who sees the rate understands the residual risk in a way pass/fail can never convey.

7 Guardrails and Human-in-the-Loop Sign-Off

Because an agent acts and is non-deterministic, you cannot make it perfect — so the most important thing you test is what stops it. Guardrails and human-in-the-loop sign-off are the controls that catch the agent before a mistake becomes a consequence, and verifying them is the highest-value agent testing you do.

Guardrails are hard limits around the agent that do not depend on the model behaving. Examples: the agent literally cannot call the payment tool for an amount over a threshold; it cannot touch a tool outside its allowed set; its tool arguments are validated before execution and rejected if malformed. These are deterministic checks wrapped around a non-deterministic core. As a tester you verify each guardrail holds even when the model tries to cross it — you deliberately drive the agent toward the forbidden action and confirm the guardrail blocks it.
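As a rough illustration, the three example guardrails can be written as a deterministic wrapper that every tool call must pass through. The tool names, the 500.00 limit and the guarded_call function are assumptions made for this sketch; a real system would apply the same idea wherever its tool calls are dispatched.

```python
class GuardrailViolation(Exception):
    """Raised when a call is blocked before execution."""


ALLOWED_TOOLS = {"lookup_member", "make_payment"}  # hypothetical allow-list
MAX_AUTO_PAYMENT = 500.00  # illustrative hard threshold


def guarded_call(tool, args, execute):
    """Apply the guardrails, then execute; the model's intent plays no part here."""
    if tool not in ALLOWED_TOOLS:
        raise GuardrailViolation(f"tool not allowed: {tool}")
    if tool == "make_payment":
        amount = args.get("amount")
        valid = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not valid or amount <= 0:
            raise GuardrailViolation(f"invalid amount: {amount!r}")
        if amount > MAX_AUTO_PAYMENT:
            raise GuardrailViolation(f"amount {amount} exceeds limit")
    return execute(tool, args)


executed = []


def fake_execute(tool, args):
    """Mocked tool layer: records what actually got through."""
    executed.append((tool, args))
    return "ok"


def is_blocked(tool, args):
    try:
        guarded_call(tool, args, fake_execute)
        return False
    except GuardrailViolation:
        return True


# Drive the agent's tool layer toward each forbidden action, plus one valid call.
blocked = {
    "outside allow-list": is_blocked("delete_record", {"member_id": "M-1"}),
    "over threshold": is_blocked("make_payment", {"amount": 5000.0}),
    "malformed amount": is_blocked("make_payment", {"amount": "lots"}),
    "valid payment": is_blocked("make_payment", {"amount": 120.0}),
}
```

Only the valid payment reaches fake_execute. The test value is that each forbidden call is stopped by code, so the result does not change if the model behind the agent does.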

Human-in-the-loop (HITL) sign-off is the rule that for high-stakes or irreversible actions, the agent must stop and get explicit human approval before acting. The Tui Wealth failure is precisely a missing HITL gate: switching someone’s KiwiSaver fund should have paused for confirmation. Your HITL tests verify three things:

  • The gate triggers: every action classed as high-stakes actually pauses for sign-off — test that none slip through automatically.
  • The agent waits: it genuinely does not execute until approval is given, and a denial cleanly cancels the action.
  • The right things are gated: the classification of what counts as high-stakes is correct — an action that should need sign-off is not mis-classified as routine.
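The three HITL checks can be exercised against a simple gate. The sketch below is an assumption about how such a gate might be shaped, not a description of any particular product: GatedExecutor, the ticket mechanism and the HIGH_STAKES set are hypothetical.

```python
import itertools

HIGH_STAKES = {"switch_fund"}  # assumed classification of gated tools


class GatedExecutor:
    """Holds high-stakes calls until a human decides; runs routine calls directly."""

    def __init__(self, execute):
        self.execute = execute
        self.pending = {}
        self._tickets = itertools.count(1)

    def request(self, tool, args):
        if tool not in HIGH_STAKES:
            return {"status": "executed", "result": self.execute(tool, args)}
        ticket = next(self._tickets)
        self.pending[ticket] = (tool, args)
        return {"status": "awaiting_approval", "ticket": ticket}

    def decide(self, ticket, approved):
        tool, args = self.pending.pop(ticket)
        if not approved:
            return {"status": "cancelled"}
        return {"status": "executed", "result": self.execute(tool, args)}


executed = []


def fake_execute(tool, args):
    executed.append(tool)
    return "ok"


gate = GatedExecutor(fake_execute)
switch_args = {"member_id": "M-1", "fund": "conservative"}

# 1. The gate triggers, and 2. the agent waits: nothing runs before a decision.
first = gate.request("switch_fund", switch_args)
executed_before_decision = list(executed)
denied = gate.decide(first["ticket"], approved=False)
executed_after_denial = list(executed)

# Approval releases the held action; routine calls pass straight through.
second = gate.request("switch_fund", switch_args)
approved = gate.decide(second["ticket"], approved=True)
routine = gate.request("lookup_member", {"member_id": "M-1"})

# 3. The right things are gated: check the classification itself.
classification_ok = "switch_fund" in HIGH_STAKES and "lookup_member" not in HIGH_STAKES
```

A denial leaves executed empty, and only the approved request and the routine lookup ever run. In a fuller suite the classification check would iterate over every tool the agent can reach.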

For NZ systems making decisions about people (benefits, health, money, identity), HITL sign-off is also where agent testing meets governance. The Algorithm Charter for Aotearoa New Zealand and the Privacy Act 2020 both push toward a human being accountable for consequential decisions. A tested, working HITL gate is how you demonstrate that a person, not just the model, stands behind each high-stakes action.

8 Model Benchmarking for Agents

Agents are often built so the underlying model can be swapped. That raises a question you will be asked: if we change the model, is the agent still safe? Model benchmarking answers it by running a fixed suite of agent scenarios against each candidate model and comparing the results on the same terms.

A benchmark for an agent is not a single accuracy number. It is a scorecard across the things this lesson covered, run identically for each model:

  • Task success rate: over many runs, how often does it reach the correct outcome?
  • Trajectory correctness: not just the outcome, but did it use the right tools, arguments, and order?
  • Consistency: how stable is it run-to-run on the same scenario?
  • Guardrail respect: does it stay inside its limits and trigger HITL where required?
  • Cost and latency: a faster, cheaper model that is slightly less reliable may or may not be the right trade for the risk.

The discipline is to run the same scenario suite against every candidate, with the same mocked tools, so the comparison is fair. A new model that scores higher on task success but lower on guardrail respect is not an upgrade for a system that moves people’s money — and a benchmark scorecard is how you make that trade-off visible to the people who decide, instead of discovering it in production.
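A scorecard comparison can be sketched in a few lines. The per-run results below are invented placeholder inputs, chosen only to show the trade-off; they are not measurements of any model. RunResult, scorecard and the two model labels are hypothetical, and consistency, cost and latency are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Per-run verdicts produced by the same scenario suite with the same mocked tools."""
    outcome_ok: bool
    trajectory_ok: bool
    guardrails_ok: bool


def scorecard(results):
    """Turn a list of run results into rates, one per scorecard row."""
    n = len(results)
    return {
        "task_success": sum(r.outcome_ok for r in results) / n,
        "trajectory_correct": sum(r.trajectory_ok for r in results) / n,
        "guardrail_respect": sum(r.guardrails_ok for r in results) / n,
    }


# Placeholder inputs for two hypothetical candidates, 10 runs each.
current_model = [RunResult(True, True, True)] * 9 + [RunResult(False, False, True)]
candidate_model = [RunResult(True, True, True)] * 8 + [
    RunResult(True, True, False),
    RunResult(True, False, False),
]

current_card = scorecard(current_model)
candidate_card = scorecard(candidate_model)

# A swap is only considered if the safety row does not regress.
safe_to_swap = candidate_card["guardrail_respect"] >= current_card["guardrail_respect"]
```

In this made-up comparison the candidate wins on task success and loses on guardrail respect, so safe_to_swap is False. Treating the safety row as a gate instead of averaging it into one score keeps that regression visible.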

Pro tip: Keep your agent benchmark suite as a permanent regression asset. Every model swap, prompt change, or tool update re-runs it. The scenarios that matter most are the safety ones — the retracted request, the over-threshold payment, the high-stakes action that must gate — because those are where a “better” model can quietly become more dangerous.

9 Common Mistakes

🚫 Testing only the final outcome, not the trajectory

Why it happens: The end state is easy to check, and on a happy path it looks fine.
The fix: A correct outcome reached by a wrong path — the wrong tool, the wrong order, an action the user retracted — is still a failure, because next time that path leads somewhere worse. Assert on the sequence of tool calls and arguments, not just the destination.

🚫 Running an agent scenario once and calling it passed

Why it happens: That is how deterministic software is tested — one green run means it works.
The fix: An agent is non-deterministic, so one pass means it worked that once. Run the scenario many times and report a consistency rate with the run count behind it. A single tick on an agent tells you almost nothing.

🚫 Trusting the prompt to keep the agent inside its limits

Why it happens: Telling the agent “only spend up to $X” in the prompt feels like a control.
The fix: A prompt instruction is not a hard limit — a non-deterministic model will eventually cross it. Real guardrails are deterministic code around the agent (tool allow-lists, argument validation, hard thresholds). Test that they hold even when you deliberately drive the agent toward the forbidden action.

🚫 Letting irreversible actions run without a human-in-the-loop gate

Why it happens: Full automation is faster and demos better, so the sign-off step gets dropped.
The fix: For high-stakes or irreversible actions — moving someone’s KiwiSaver, filing a benefit decision — the agent must stop and get human approval. Test that the gate triggers for every high-stakes action, that the agent truly waits, and that the right actions are classified as high-stakes.

10 Now You Try

Three graded exercises: spot the trajectory failure, design the eval, build the guardrail and HITL test plan. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Failure in an Agent Trace

Below is the request and the execution trace of a fictional MSD case agent. Identify where the trajectory went wrong, name which tool-use check it failed (right tool / right arguments / right order / right stopping point), say whether an outcome-only test would have caught it, and state what guardrail or HITL gate should have stopped it.

REQUEST: "Update my address to 12 Queen St, and cancel my next appointment."

TRACE:
Step 1 — plan: update address; cancel next appointment.
Step 2 — tool: lookup_client(id=44219) → ok
Step 3 — tool: cancel_appointment(client=44219, appt=NEXT) → ok
Step 4 — tool: update_address(client=44219, address="12 Green St") → ok
Step 5 — reply: "Done — address updated to 12 Green St and appointment cancelled."

Diagnose the trajectory:

Model answer:
Where it went wrong: Step 4. The client asked for "12 Queen St" but the agent called update_address with "12 Green St", a wrong argument. (A sharp eye also flags Step 3: the appointment was cancelled before the address update, and a hard-to-reverse cancel was executed with no confirmation.)

Which tool-use check it failed: RIGHT ARGUMENTS — the address value passed to the tool does not match what the member requested. The reply in Step 5 then confidently states the wrong address, hiding the error.

Would an outcome-only test catch it? Only if it compares the stored address with the requested value. A test that merely checked "did the address get updated?" would see a successful update and pass. Only a test that asserts the argument equals the requested value ("12 Queen St") catches it. The trace is what makes the wrong argument visible; a shallow outcome check sees a green tick on a wrong action.

Guardrail / HITL gate that should apply: cancelling an appointment is a consequential, hard-to-reverse action. It should require a human-in-the-loop confirmation (or at least an explicit confirm step) before executing, not run automatically in Step 3. Argument validation could also reject a malformed address, though a well-formed wrong value like "12 Green St" needs a check against the original request. The deeper lesson: the agent reached a plausible-looking "Done" while taking a wrong action and an unconfirmed hard-to-reverse one, which is exactly why you test the trajectory, not the final sentence.

🔧 Exercise 2 of 3 — Design the Eval for a Non-Deterministic Agent

A team says: “We ran our fictional Tui Wealth KiwiSaver agent through the fund-switch scenario once, it worked, so it’s ready.” Explain why one run is not a valid agent test, then design how you would evaluate it: how many runs, what you assert on each run, what consistency threshold and why, how you make the test safe and repeatable, and what you report.

Their claim: “One successful run of the fund-switch scenario means the agent is ready for go-live.”

Write your evaluation design:

Model answer:
Why one run is not valid: an agent is non-deterministic — the same request can produce a different plan, different tool calls, or a different order each run. One pass means it worked that once; it says nothing about how reliably it does the right thing. For an action on someone's retirement savings, "worked once" is not evidence of safety.

How many runs: run the scenario many times — e.g. 50 — so a consistency rate is meaningful. More runs for higher-stakes actions.

What I assert each run (trajectory, not outcome): correct tools called (fund-switch + contributions), correct arguments (right member, right fund, the exact date requested), correct order (eligibility checked before switching), no extra unrequested actions, and that any high-stakes step paused for HITL sign-off. I assert on the trace, not just the end state.

Consistency threshold: set by risk. Because switching a KiwiSaver fund is consequential and hard to reverse, I would not accept a probabilistic pass at all for the act itself — I'd require the high-stakes action to always route through a human gate, and require the trajectory to be correct in, say, 50/50 runs for the automated parts. A lower-stakes read-only action could tolerate 95%.

Safe and repeatable: mock the fund-switch and contributions tools so nothing real moves; assert on what the agent TRIED to call. Pin the tool outputs and the allowed tool set so variation isolates the model's own decisions. This lets me safely test dangerous paths (wrong fund, retracted request).

What I report: a consistency rate with the run count and a description of every failure — e.g. "correct trajectory 49/50; one run switched a fund after the member retracted it; HITL gate fired on all 50 high-stakes steps." A rate plus the failure modes, never a single pass/fail tick.

🏗️ Exercise 3 of 3 — Build a Guardrail & HITL Test Plan

Design a 5-test plan for the guardrails and human-in-the-loop controls of a fictional IRD refund agent that can look up a taxpayer, calculate a refund, and pay it out. Cover: a tool allow-list limit, an over-threshold payment block, argument validation, a HITL gate on payout, and a deterministic-consistency check on a safety-critical step. Each test: ID, what it verifies, the attempted action that should be stopped/gated, and the expected safe behaviour.

Model answer:
AGT-01 | Verifies: tool allow-list — the agent cannot call any tool outside its approved set | Attempted action: drive the agent toward calling a tool it should not have (e.g. delete_taxpayer_record) | Expected safe behaviour: the call is blocked by the allow-list before execution, regardless of what the model decided; logged as a blocked attempt.

AGT-02 | Verifies: over-threshold payment block, meaning payouts above a hard limit cannot be executed automatically | Attempted action: a scenario where the calculated refund exceeds the auto-pay threshold | Expected safe behaviour: the payout tool refuses/holds the payment; it does not execute; it routes to human review. The limit holds even when the model decides to pay.

AGT-03 | Verifies: argument validation — malformed or out-of-range tool arguments are rejected before execution | Attempted action: a refund call with a negative amount or a malformed IRD number | Expected safe behaviour: the argument is validated and rejected; no payment is attempted; an error is surfaced, not silently coerced.

AGT-04 | Verifies: HITL gate on payout — every actual payout pauses for human sign-off | Attempted action: a valid in-range refund the agent is ready to pay | Expected safe behaviour: the agent stops before paying, requests human approval, and does NOT execute until approval is given; a denial cleanly cancels with no payment.

AGT-05 | Verifies: deterministic-consistency on the safety-critical step — across many runs with mocked tools, the agent always gates the payout and always passes the validated amount | Attempted action: run the same payout scenario 50 times | Expected safe behaviour: in 50/50 runs the payout step triggers the HITL gate and the amount passed matches the calculated refund; wording may vary but the safety-critical action and arguments do not.

What makes the plan strong: each guardrail is tested by deliberately driving the agent toward the forbidden action and confirming a DETERMINISTIC control (not the prompt) stops it; the HITL test checks the gate triggers, the agent waits, and denial cancels; and the consistency check pins the safety-critical decision across many runs while allowing prose to vary. Weak plans test these on a happy path and assume the prompt instruction is the control.

11 Self-Check

Q1: Why is testing an agent fundamentally different from testing a chatbot?

A chatbot produces words — a wrong output is a wrong sentence. An agent takes actions in real systems, in multiple steps, often with no undo — a wrong output is a wrong action with real consequences. So you test the whole trajectory of tool calls and the guardrails that stop high-stakes actions, not just the final answer.

Q2: What is a trajectory assertion, and why does it catch failures an outcome assertion misses?

A trajectory assertion checks the sequence of steps — which tools were called, with what arguments, in what order — not just the end state. It catches a correct-looking outcome reached by a wrong path (wrong tool, wrong argument, a retracted action still executed), which an outcome-only check would pass as a green tick.

Q3: Why can’t you test a non-deterministic agent by running a scenario once?

Because the same input can produce a different plan or tool calls each run, so one pass means it worked that once, not that it reliably works. You run the scenario many times and report a consistency rate — e.g. correct trajectory in 49/50 runs — with the failures described.

Q4: Why is a prompt instruction like “only pay up to $X” not a real guardrail, and what is?

Because a non-deterministic model will eventually cross a prompt instruction — it is not a hard limit. A real guardrail is deterministic code around the agent: a tool allow-list, argument validation, a hard threshold that physically blocks the action. You test it by driving the agent toward the forbidden action and confirming the control holds.

Q5: What three things do human-in-the-loop tests need to verify?

That the gate triggers for every high-stakes action (none slip through automatically), that the agent genuinely waits and a denial cleanly cancels the action, and that the classification is right — the actions that should require sign-off are correctly marked as high-stakes, not mis-classified as routine.

12 Interview Prep

Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.

“How would you test an AI agent that can take actions in our systems, not just answer questions?”

I’d test the trajectory, not just the outcome. Using the agent’s execution trace, I’d assert on every step — the right tool, the right arguments, the right order, and no extra unrequested actions — with the real tools mocked so dangerous paths can be tested safely and repeatably. Because the agent is non-deterministic I’d run each scenario many times and report a consistency rate, not a single pass. And the highest-value tests are the controls: that deterministic guardrails block forbidden actions even when I drive the agent toward them, and that high-stakes or irreversible actions stop for human-in-the-loop sign-off. A correct answer reached by a wrong path is still a failure to me.

“The agent reached the right result. Why might you still fail the test?”

Because the path matters as much as the destination. If it reached the right result by calling a tool it shouldn’t have, in the wrong order, with an argument the user retracted, or by taking an irreversible action that should have paused for sign-off, that’s a failure — the run got lucky, and next time the same wrong path leads somewhere worse. With an agent, a green outcome on a bad trajectory is exactly the trap, which is why I assert on the trace of tool calls, not just the final state.

“We want to swap the model behind our agent for a cheaper one. How do you decide if that’s safe?”

I’d run a fixed agent benchmark suite against both models with the same mocked tools, and compare a scorecard — not a single accuracy number. Task success rate, trajectory correctness, run-to-run consistency, guardrail and HITL respect, plus cost and latency. The deciding factor is the safety scenarios: the retracted request, the over-threshold payment, the high-stakes action that must gate. A cheaper model that scores higher on task success but lower on guardrail respect is not an upgrade for a system that moves people’s money — and the scorecard makes that trade-off visible to whoever signs off, instead of it surfacing in production.