Test with AI · AI Evaluation

Agent Testing

A chatbot that gives a wrong answer wastes a minute. An agent that takes a wrong action cancels a payment, files the wrong form, or books the wrong appointment. When an AI can act, testing stops being about words and starts being about consequences.

Test with AI · AI Testing Engineer · Lesson 3 of 8 · ~35 min read · ~80 min with exercises

1 The Hook

A fictional KiwiSaver provider, TōtaraWealth, built an agent to handle member requests end to end. A member could say “switch my fund to the conservative option and pause my contributions for three months” and the agent would do it: look up the member, check eligibility, call the fund-switch system, call the contributions system, and confirm. No human in the loop. It was fast, and members loved it.

In testing, the team gave it clean requests and it handled them beautifully. Then a member typed a messy, real-world message: “actually no, don’t switch the fund, but yes pause contributions — wait, pause them from next month not now.” The agent had already, three steps earlier, decided on a plan: switch the fund and pause contributions immediately. It executed the original plan. It switched a fund the member had explicitly said not to touch, and paused contributions from the wrong date. Two real actions taken on a member’s retirement savings, both wrong, both irreversible without manual cleanup.

Here is what makes agent testing its own discipline. With a chatbot, a wrong output is a wrong sentence — annoying, but inert. With an agent, a wrong output is a wrong action in a real system, with real consequences and often no undo. And because the agent works in multiple steps — plan, call a tool, read the result, call another tool — a single run has many points where it can go wrong, and the failure may only appear three steps after the mistake was made.

The team had tested the agent like a chatbot: does it give good answers? The right question for an agent is different: does it take the right actions, in the right order, with the right values — and does it stop and ask a human before doing anything it cannot take back? That is what this lesson teaches.

2 The Rule

An agent does not just produce output — it takes actions in real systems, in multiple steps, with no guaranteed undo. So you do not test the final answer alone; you test the whole trajectory: every tool call, its arguments, its order, and the guardrails that should stop the agent before any irreversible or high-stakes action. And because the agent is non-deterministic, you test it many times, not once.

⚠️ Common Misconception

The instinct is to test agents the same way you test APIs — write a scenario, run it, verify the output.

This misses what makes agents fundamentally different: it is not wrong outputs that cause the most damage, it is wrong actions. A chatbot that gives a wrong answer wastes a minute of a user's time. An agent that cancels the wrong payment, files the wrong form, or sends an email to the wrong person may be impossible to reverse and costly to remediate. Output-centric testing is necessary but not sufficient. The test suite has to cover the action selection logic, the execution constraints, and the failure modes of the tools — not just whether the final result looks correct.

3 The Analogy


The difference between marking an apprentice’s written plumbing exam and watching them plumb a real house.

Testing a chatbot is like marking a written exam: you read the answers and judge whether they are right. The apprentice can write a perfect answer and still flood a kitchen. Testing an agent is like standing in the house watching them work — did they shut off the mains first, fit the joints in the right order, use the right fittings, and — crucially — call the master plumber before cutting into the gas line? You are judging the sequence of actions and their consequences, not a paragraph.

And because the apprentice does it a little differently each time, watching one job tells you little. You watch many jobs to learn what they reliably get right and where they sometimes slip. Guardrails are the rule that they must stop and call the master plumber before any step that could cause a flood. Human-in-the-loop testing checks that they actually stop.

4 What an Agent Is, and Why It Is Harder to Test

An AI agent is a system where a language model does not just answer — it plans and acts. It can break a goal into steps, call tools (databases, APIs, other systems), read the results, and decide what to do next, looping until the goal is done. The TōtaraWealth system is an agent: it planned, called the fund-switch tool, called the contributions tool, and confirmed.

AI Agent Execution Loop

User Goal → LLM Planner (decides next action) → Tool Call (API, search, code) → Observation (tool result) → Re-plan (loop or done?)

Traditional tests verify outputs. Agent tests must verify every step: what the planner decided, which tool it called, what it did with the result, and whether it stopped when it should have.
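To make the loop concrete, here is a minimal sketch in Python. It assumes a hypothetical `planner` callable standing in for the LLM and a plain dict of tool functions; `run_agent`, `Step`, `Trace` and the tool names are illustrative, not taken from any real framework. What matters is that every step is recorded, so a test has something to inspect.

```python
from dataclasses import dataclass, field
from typing import Any

# All names below are hypothetical, for illustration only.

@dataclass
class Step:
    tool: str
    args: dict
    result: Any = None

@dataclass
class Trace:
    goal: str
    steps: list = field(default_factory=list)

def run_agent(goal, planner, tools, max_steps=10):
    """Plan -> tool call -> observation -> re-plan, recording every step."""
    trace = Trace(goal)
    for _ in range(max_steps):
        decision = planner(goal, trace.steps)   # stands in for the LLM planner
        if decision is None:                    # planner decides the goal is met
            break
        tool_name, args = decision
        result = tools[tool_name](**args)       # tool call
        trace.steps.append(Step(tool_name, args, result))  # observation
    return trace

# A scripted planner replaces the model so the loop itself can be exercised.
def scripted_planner(goal, steps):
    script = [
        ("lookup_member", {"member_id": "M-1"}),
        ("pause_contributions", {"member_id": "M-1", "start": "next_month"}),
    ]
    return script[len(steps)] if len(steps) < len(script) else None

fake_tools = {
    "lookup_member": lambda member_id: {"member_id": member_id, "eligible": True},
    "pause_contributions": lambda member_id, start: "ok",
}

trace = run_agent("pause contributions from next month", scripted_planner, fake_tools)
```

The `max_steps` cap is a design choice in this sketch: a loop that re-plans needs an upper bound, or a confused planner never terminates.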

That capability is exactly what makes agents harder to test than anything in the previous two lessons:

Actions, not words: the output is a real change in a real system — a payment moved, a form filed, a record updated — often with no undo.
Multiple steps: one request becomes a chain of decisions and tool calls. The mistake can happen at any step, and surface several steps later.
A huge space of paths: the agent chooses its own route to the goal, so there are far more possible execution paths than a fixed workflow has.
Non-determinism on top: the same request can produce a different plan, different tool calls, or a different order on different runs.

The shift in mindset is this: you stop testing only the destination and start testing the trajectory — the whole sequence of steps the agent took to get there. A right answer reached by a wrong path (calling a tool it should not have, in the wrong order, with the wrong arguments) is still a failure, because next time that wrong path leads somewhere worse.

5 Multi-Step Tool-Use Validation

The heart of agent testing is validating the trajectory of tool calls. For every step the agent takes, you check four things:

The right tool: did it call the correct tool for the step — the fund-switch system, not the contributions system?
The right arguments: did it pass the correct values — the right member ID, the right fund, the date the member actually asked for, not the one from an earlier plan?
The right order: did it sequence steps correctly — check eligibility before switching, not after?
The right stopping point: did it stop when the goal was met, rather than taking extra unrequested actions — like switching a fund the member said not to touch?

To test this you need the agent’s trace: a log of every step — what the agent decided, which tool it called, with what arguments, and what came back. Testing an agent without a trace is testing blind; you can see the final state but not the path, so you cannot tell a good run from a lucky one. Insisting that the system emit a full, inspectable trace is itself part of your job as the tester.
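The four checks can be sketched as small functions over a trace. This assumes a hypothetical trace format, a list of (tool name, arguments) pairs in call order; the tool names and the `bad_trace` data are invented to mirror the TōtaraWealth story, not taken from a real system.

```python
# Hypothetical trace format: a list of (tool_name, arguments) pairs in call order.

def calls_to(trace, tool):
    """Arguments of every call made to `tool`."""
    return [args for name, args in trace if name == tool]

def right_tool(trace, tool):
    """The expected tool was called at least once."""
    return len(calls_to(trace, tool)) > 0

def right_arguments(trace, tool, expected_args):
    """Every call to `tool` used exactly the expected arguments."""
    calls = calls_to(trace, tool)
    return len(calls) > 0 and all(args == expected_args for args in calls)

def right_order(trace, first, second):
    """`first` was called before `second` (comparing first occurrences)."""
    names = [name for name, _ in trace]
    return (first in names and second in names
            and names.index(first) < names.index(second))

def right_stopping_point(trace, requested_tools):
    """No tool outside the requested set was called."""
    return all(name in requested_tools for name, _ in trace)

# The member's final request: pause contributions from next month, do NOT switch.
requested = {"lookup_member", "check_eligibility", "pause_contributions"}

# Illustrative trace of the failure: the stale plan was executed.
bad_trace = [
    ("lookup_member", {"member_id": "M-1"}),
    ("check_eligibility", {"member_id": "M-1"}),
    ("switch_fund", {"member_id": "M-1", "fund": "conservative"}),
    ("pause_contributions", {"member_id": "M-1", "start": "now"}),
]
```

On `bad_trace`, the tool and order checks pass while the argument and stopping-point checks fail, which is the pattern an outcome-only test on a happy path never sees.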

A powerful technique is tool mocking: replace the real tools with fakes during testing so the agent “calls the fund-switch system” but nothing actually moves. You then assert on what it tried to do — the tool, the arguments, the order — without any real-world consequence. This lets you safely test the dangerous paths (wrong fund, wrong amount) that you could never test against live systems, and it makes the test repeatable.
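One way to sketch tool mocking in Python is with the standard library's `unittest.mock`. The `handle_request` function below is a hypothetical stand-in for the agent under test, and the tool names are invented; a real test would inject the mocked tools into the actual agent instead.

```python
from unittest.mock import Mock, call

def handle_request(tools, member_id, pause_from):
    """Hypothetical stand-in for the agent; it only ever touches `tools`."""
    tools.check_eligibility(member_id=member_id)
    tools.pause_contributions(member_id=member_id, start=pause_from)

# One parent Mock: every tool is a child, and call order is recorded on the parent.
tools = Mock()
tools.check_eligibility.return_value = {"eligible": True}
tools.pause_contributions.return_value = "ok"

handle_request(tools, member_id="M-1", pause_from="next_month")

# Trajectory assertions: what the agent tried to do, with nothing real moved.
tools.switch_fund.assert_not_called()
tools.pause_contributions.assert_called_once_with(member_id="M-1", start="next_month")
expected_calls = [
    call.check_eligibility(member_id="M-1"),
    call.pause_contributions(member_id="M-1", start="next_month"),
]
assert tools.mock_calls == expected_calls
```

Using a single parent mock is deliberate: `mock_calls` on the parent preserves the order across all tools, so the same object supports tool, argument and order assertions.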

Pro tip: Write trajectory assertions, not just outcome assertions. “The member’s fund was unchanged” is an outcome. “The agent never called the fund-switch tool, because the member retracted that request” is a trajectory assertion — and it catches the TōtaraWealth failure that an outcome check on a happy path would miss.

6 Non-Determinism and Deterministic-Consistency Checks

An agent is non-deterministic: the same input can produce a different plan or a different set of tool calls on different runs. This breaks the testing habit you grew up with, where one pass means the feature works. For an agent, one pass means it worked that once.

The answer is to test for consistency across runs, not a single result. You run the same scenario many times and measure how reliably the agent does the right thing:

Repeat runs: run the same request 20 or 50 times. Does it reach the correct outcome every time, or 18 times out of 20? A consistency rate is the unit of agent test results, not pass/fail.
Set a threshold tied to risk: for a low-stakes action, 95% might be acceptable. For an irreversible action on someone’s retirement savings, the bar for acting without a human is far higher — or the action must always route to a human.
Pin what you can: some parts of an agent can be made deterministic — fixed tool outputs via mocks, a constrained set of allowed tools, validation on arguments. The more you pin, the more the remaining variation isolates the model’s own behaviour.
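The repeat-run idea can be sketched as a small harness. Everything here is hypothetical: `run_once` stands in for one agent run returning a trace, the 95% figure for low-stakes actions comes from the list above, and the 100% bar for high-stakes actions is an assumption for illustration (the alternative is to always route such actions to a human).

```python
# Hypothetical risk thresholds. "low" follows the 95% example in the text;
# "high" = 1.0 is an assumption made for this sketch.
RISK_THRESHOLDS = {"low": 0.95, "high": 1.0}

def consistency_rate(run_once, check, runs=50):
    """Run the same scenario `runs` times; return (passes, runs)."""
    passes = sum(1 for _ in range(runs) if check(run_once()))
    return passes, runs

def meets_threshold(passes, runs, risk):
    return passes / runs >= RISK_THRESHOLDS[risk]

def report(passes, runs):
    return f"correct trajectory in {passes}/{runs} runs"

def make_fake_agent(fail_on_runs):
    """Deterministic fake agent that misbehaves on the given run numbers."""
    state = {"n": 0}
    def run_once():
        state["n"] += 1
        if state["n"] in fail_on_runs:
            # The bad run: switches a fund the member retracted.
            return [("switch_fund", {}), ("pause_contributions", {})]
        return [("pause_contributions", {})]
    return run_once

def never_switches_fund(trace):
    return all(name != "switch_fund" for name, _ in trace)

passes, runs = consistency_rate(make_fake_agent({7}), never_switches_fund, runs=50)
```

With one bad run out of 50, the scenario clears the low-stakes bar and fails the high-stakes one. The same 49/50 result is acceptable or unacceptable depending on what the action can do.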

A deterministic-consistency check is a test that runs a fixed scenario repeatedly with mocked tools and asserts that the agent’s critical decisions — which tools it calls and with what arguments on the safety-critical steps — are the same every time, even if its wording varies. The wording is allowed to wander; the action on a member’s money is not. That separation — let the prose vary, pin the consequences — is the core of testing a non-deterministic system that takes real actions.
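A minimal sketch of that separation: reduce each trace to a signature of its safety-critical tool calls and require the signature to be identical across runs. The trace format (a list of (tool name, arguments) pairs), the tool names and the `SAFETY_CRITICAL` set are assumptions for illustration.

```python
import json

# Hypothetical set of tools whose calls must not vary between runs.
SAFETY_CRITICAL = {"switch_fund", "pause_contributions"}

def critical_signature(trace):
    """Keep only safety-critical calls; wording and other steps are dropped."""
    return tuple(
        (name, json.dumps(args, sort_keys=True))  # key order must not matter
        for name, args in trace
        if name in SAFETY_CRITICAL
    )

def is_consistent(traces):
    """True when every run made the same safety-critical calls, in the same order."""
    return len({critical_signature(t) for t in traces}) == 1

# Two runs that differ only in wording, and one that differs in the action.
run_a = [("pause_contributions", {"member_id": "M-1", "start": "next_month"}),
         ("reply", {"text": "All done!"})]
run_b = [("pause_contributions", {"start": "next_month", "member_id": "M-1"}),
         ("reply", {"text": "I have paused your contributions."})]
run_c = [("pause_contributions", {"member_id": "M-1", "start": "now"}),
         ("reply", {"text": "All done!"})]
```

`run_a` and `run_b` count as the same behaviour even though the replies differ. `run_c` breaks consistency because the start date, which touches the member's money, changed.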

Pro tip: Report agent reliability as a rate with the number of runs behind it: “correct trajectory in 49/50 runs; the one failure switched a fund the member retracted.” A single green tick on an agent is meaningless, and a stakeholder who sees the rate understands the residual risk in a way pass/fail can never convey.

7 Guardrails and Human-in-the-Loop Sign-Off

Because an agent acts and is non-deterministic, you cannot make it perfect — so the most important thing you test is what stops it. Guardrails and human-in-the-loop sign-off are the controls that catch the agent before a mistake becomes a consequence, and verifying them is the highest-value agent testing you do.

Guardrails are hard limits around the agent that do not depend on the model behaving. Examples: the agent literally cannot call the payment tool for an amount over a threshold; it cannot touch a tool outside its allowed set; its tool arguments are validated before execution and rejected if malformed. These are deterministic checks wrapped around a non-deterministic core. As a tester you verify each guardrail holds even when the model tries to cross it — you deliberately drive the agent toward the forbidden action and confirm the guardrail blocks it.
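As a sketch of what "deterministic checks wrapped around a non-deterministic core" can look like, the wrapper below enforces an allow-list, argument validation and a hard payment threshold before any tool runs. The tool names, the 500.00 limit and the `guarded_call` function are hypothetical; a real system would define its own.

```python
class GuardrailViolation(Exception):
    """Raised when a deterministic limit blocks a tool call."""

# Hypothetical limits for illustration.
ALLOWED_TOOLS = {"lookup_taxpayer", "calculate_refund", "pay_refund"}
MAX_AUTO_PAYMENT = 500.00

def guarded_call(tools, name, args):
    """Every tool call goes through here, whatever the model decided."""
    if name not in ALLOWED_TOOLS:
        raise GuardrailViolation(f"tool not allowed: {name}")
    if name == "pay_refund":
        amount = args.get("amount")
        # Reject missing, non-numeric, boolean, zero or negative amounts.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            raise GuardrailViolation(f"invalid amount: {amount!r}")
        if amount > MAX_AUTO_PAYMENT:
            raise GuardrailViolation(f"amount {amount} exceeds {MAX_AUTO_PAYMENT}")
    return tools[name](**args)

def is_blocked(tools, name, args):
    """Test helper: True if the guardrail stopped the call."""
    try:
        guarded_call(tools, name, args)
    except GuardrailViolation:
        return True
    return False

# Fake tools that record any payment that actually goes through.
paid = []
fake_tools = {
    "pay_refund": lambda taxpayer_id, amount: paid.append(amount) or "paid",
    "delete_taxpayer_record": lambda taxpayer_id: paid.append("deleted"),
}
```

A guardrail test then drives each forbidden action through `guarded_call` and asserts two things: the call was blocked, and `paid` is still empty.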

Human-in-the-loop (HITL) sign-off is the rule that for high-stakes or irreversible actions, the agent must stop and get explicit human approval before acting. The TōtaraWealth failure is precisely a missing HITL gate: switching someone’s KiwiSaver fund should have paused for confirmation. Your HITL tests verify three things:

The gate triggers: every action classed as high-stakes actually pauses for sign-off — test that none slip through automatically.
The agent waits: it genuinely does not execute until approval is given, and a denial cleanly cancels the action.
The right things are gated: the classification of what counts as high-stakes is correct — an action that should need sign-off is not mis-classified as routine.
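The three HITL checks can be sketched against a small gate. The `execute_with_gate` function, the `HIGH_STAKES` set and the `approver` callback are assumptions for illustration; in a real system the approver would block on a human decision rather than return immediately.

```python
# Hypothetical classification of actions that need sign-off.
HIGH_STAKES = {"switch_fund", "pay_refund"}

def execute_with_gate(tool_name, args, tools, approver):
    """Run a tool, pausing for approval first if the action is high-stakes."""
    if tool_name in HIGH_STAKES:
        approved = approver(tool_name, args)  # stands in for a human decision
        if not approved:
            return {"status": "cancelled", "tool": tool_name}
    return {"status": "executed", "result": tools[tool_name](**args)}

def make_harness(decision):
    """Fake approver and tools that write to one shared event log."""
    log = []
    def approver(tool_name, args):
        log.append(("approval_requested", tool_name))
        return decision
    tools = {
        "switch_fund": lambda **a: log.append(("executed", "switch_fund")) or "ok",
        "lookup_member": lambda **a: log.append(("executed", "lookup_member")) or "ok",
    }
    return log, approver, tools

# Gate triggers and the agent waits: approval is logged before execution.
log_ok, approver_ok, tools_ok = make_harness(decision=True)
approved_result = execute_with_gate("switch_fund", {"member_id": "M-1"}, tools_ok, approver_ok)

# Denial cleanly cancels: nothing is executed.
log_no, approver_no, tools_no = make_harness(decision=False)
denied_result = execute_with_gate("switch_fund", {"member_id": "M-1"}, tools_no, approver_no)
```

The shared event log is what proves ordering: an approval request must appear before any execution, and a denied run must contain no execution at all. The third check, classification, is a plain assertion that each action you consider high-stakes is in `HIGH_STAKES`.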

For NZ systems making decisions about people (benefits, health, money, identity), HITL sign-off is also where agent testing meets governance. The Algorithm Charter for Aotearoa New Zealand and the Privacy Act 2020 both push toward a human being accountable for consequential decisions. A tested, working HITL gate is how you demonstrate that a person, not just the model, stands behind each high-stakes action.

8 Model Benchmarking for Agents

Agents are often built so the underlying model can be swapped. That raises a question you will be asked: if we change the model, is the agent still safe? Model benchmarking answers it by running a fixed suite of agent scenarios against each candidate model and comparing the results on the same terms.

A benchmark for an agent is not a single accuracy number. It is a scorecard across the things this lesson covered, run identically for each model:

Task success rate — over many runs, how often does it reach the correct outcome?
Trajectory correctness — not just the outcome, but did it use the right tools, arguments, and order?
Consistency — how stable is it run-to-run on the same scenario?
Guardrail respect — does it stay inside its limits and trigger HITL where required?
Cost and latency — a faster, cheaper model that is slightly less reliable may or may not be the right trade for the risk.

The discipline is to run the same scenario suite against every candidate, with the same mocked tools, so the comparison is fair. A new model that scores higher on task success but lower on guardrail respect is not an upgrade for a system that moves people’s money — and a benchmark scorecard is how you make that trade-off visible to the people who decide, instead of discovering it in production.
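A scorecard like this can be sketched as a function over per-run results. The `RunResult` fields, the way consistency is measured (share of runs matching the most common trajectory signature) and the rule that a candidate must not drop on safety metrics are all assumptions made for illustration; the sample numbers are invented.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class RunResult:
    """One run of one scenario against one model (hypothetical shape)."""
    outcome_ok: bool
    trajectory_ok: bool
    guardrails_ok: bool
    signature: str      # identifies the trajectory taken
    latency_s: float
    cost: float

def scorecard(results):
    n = len(results)
    most_common = Counter(r.signature for r in results).most_common(1)[0][1]
    return {
        "runs": n,
        "task_success": sum(r.outcome_ok for r in results) / n,
        "trajectory_correct": sum(r.trajectory_ok for r in results) / n,
        "consistency": most_common / n,
        "guardrail_respect": sum(r.guardrails_ok for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost": sum(r.cost for r in results),
    }

def safety_regressions(baseline, candidate,
                       safety_metrics=("trajectory_correct", "guardrail_respect")):
    """Safety metrics on which the candidate scores below the baseline."""
    return [m for m in safety_metrics if candidate[m] < baseline[m]]

# Invented results: the candidate is faster and succeeds more, but slips a guardrail.
baseline_runs = [RunResult(True, True, True, "A", 2.0, 0.04)] * 3 + \
                [RunResult(False, True, True, "A", 2.0, 0.04)]
candidate_runs = [RunResult(True, True, True, "A", 1.0, 0.01)] * 3 + \
                 [RunResult(True, True, False, "B", 1.0, 0.01)]

baseline = scorecard(baseline_runs)
candidate = scorecard(candidate_runs)
```

Here the candidate wins on task success, latency and cost, yet `safety_regressions` flags `guardrail_respect`. Keeping the safety metrics separate from the headline numbers is what stops a "better" score from hiding that.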

Pro tip: Keep your agent benchmark suite as a permanent regression asset. Every model swap, prompt change, or tool update re-runs it. The scenarios that matter most are the safety ones — the retracted request, the over-threshold payment, the high-stakes action that must gate — because those are where a “better” model can quietly become more dangerous.

9 Common Mistakes

🚫 Testing only the final outcome, not the trajectory

Why it happens: The end state is easy to check, and on a happy path it looks fine.
The fix: A correct outcome reached by a wrong path — the wrong tool, the wrong order, an action the user retracted — is still a failure, because next time that path leads somewhere worse. Assert on the sequence of tool calls and arguments, not just the destination.

🚫 Running an agent scenario once and calling it passed

Why it happens: That is how deterministic software is tested — one green run means it works.
The fix: An agent is non-deterministic, so one pass means it worked that once. Run the scenario many times and report a consistency rate with the run count behind it. A single tick on an agent tells you almost nothing.

🚫 Trusting the prompt to keep the agent inside its limits

Why it happens: Telling the agent “only spend up to $X” in the prompt feels like a control.
The fix: A prompt instruction is not a hard limit — a non-deterministic model will eventually cross it. Real guardrails are deterministic code around the agent (tool allow-lists, argument validation, hard thresholds). Test that they hold even when you deliberately drive the agent toward the forbidden action.

🚫 Letting irreversible actions run without a human-in-the-loop gate

Why it happens: Full automation is faster and demos better, so the sign-off step gets dropped.
The fix: For high-stakes or irreversible actions — moving someone’s KiwiSaver, filing a benefit decision — the agent must stop and get human approval. Test that the gate triggers for every high-stakes action, that the agent truly waits, and that the right actions are classified as high-stakes.

10 Now You Try

Three graded exercises: spot the trajectory failure, design the eval, build the guardrail and HITL test plan. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Failure in an Agent Trace

Below is the request and the execution trace of a fictional Benefits NZ case agent. Identify where the trajectory went wrong, name which tool-use check it failed (right tool / right arguments / right order / right stopping point), say whether an outcome-only test would have caught it, and state what guardrail or HITL gate should have stopped it.

    REQUEST: "Update my address to 12 Queen St, and cancel my next appointment."

    TRACE:

    Step 1 — plan: update address; cancel next appointment.

    Step 2 — tool: lookup_client(id=44219) → ok

    Step 3 — tool: cancel_appointment(client=44219, appt=NEXT) → ok

    Step 4 — tool: update_address(client=44219, address="12 Green St") → ok

    Step 5 — reply: "Done — address updated to 12 Green St and appointment cancelled."

Diagnose the trajectory:

Model answer

Where it went wrong: Step 4. The client asked for "12 Queen St" but the agent called update_address with "12 Green St", which is a wrong argument. (A sharp eye also flags Step 3: the appointment was cancelled before the address update, and an irreversible cancel was executed with no confirmation.)

Which tool-use check it failed: RIGHT ARGUMENTS. The address value passed to the tool does not match what the client requested. The reply in Step 5 then confidently states the wrong address, hiding the error.

Would an outcome-only test catch it? Only if it compared the stored address with the requested one. A test that only checked "did the address get updated?" would see a successful update and pass. A test that asserts the argument equals the requested value ("12 Queen St") catches it. The trace is what makes the wrong argument visible; a loose outcome check sees a green tick on a wrong action.

Guardrail / HITL gate that should apply: cancelling an appointment is a consequential, hard-to-reverse action. It should require a human-in-the-loop confirmation (or at least an explicit confirm step) before executing, not run automatically in Step 3. Argument validation could also catch a malformed address, though not a well-formed wrong one like "12 Green St". The deeper lesson: the agent reached a plausible-looking "Done" while taking a wrong action and an unconfirmed irreversible one, which is exactly why you test the trajectory, not the final sentence.

🔧 Exercise 2 of 3 — Design the Eval for a Non-Deterministic Agent

A team says: “We ran our fictional TōtaraWealth KiwiSaver agent through the fund-switch scenario once, it worked, so it’s ready.” Explain why one run is not a valid agent test, then design how you would evaluate it: how many runs, what you assert on each run, what consistency threshold and why, how you make the test safe and repeatable, and what you report.

Their claim: “One successful run of the fund-switch scenario means the agent is ready for go-live.”

Write your evaluation design:

Model answer

Why one run is not valid: an agent is non-deterministic — the same request can produce a different plan, different tool calls, or a different order each run. One pass means it worked that once; it says nothing about how reliably it does the right thing. For an action on someone's retirement savings, "worked once" is not evidence of safety.

How many runs: run the scenario many times — e.g. 50 — so a consistency rate is meaningful. More runs for higher-stakes actions.

What I assert each run (trajectory, not outcome): correct tools called (fund-switch + contributions), correct arguments (right member, right fund, the exact date requested), correct order (eligibility checked before switching), no extra unrequested actions, and that any high-stakes step paused for HITL sign-off. I assert on the trace, not just the end state.

Consistency threshold: set by risk. Because switching a KiwiSaver fund is consequential and hard to reverse, I would not accept a probabilistic pass at all for the act itself — I'd require the high-stakes action to always route through a human gate, and require the trajectory to be correct in, say, 50/50 runs for the automated parts. A lower-stakes read-only action could tolerate 95%.

Safe and repeatable: mock the fund-switch and contributions tools so nothing real moves; assert on what the agent TRIED to call. Pin the tool outputs and the allowed tool set so variation isolates the model's own decisions. This lets me safely test dangerous paths (wrong fund, retracted request).

What I report: a consistency rate with the run count and a description of every failure — e.g. "correct trajectory 49/50; one run switched a fund after the member retracted it; HITL gate fired on all 50 high-stakes steps." A rate plus the failure modes, never a single pass/fail tick.

🏗️ Exercise 3 of 3 — Build a Guardrail & HITL Test Plan

Design a 5-test plan for the guardrails and human-in-the-loop controls of a fictional Revenue NZ refund agent that can look up a taxpayer, calculate a refund, and pay it out. Cover: a tool allow-list limit, an over-threshold payment block, argument validation, a HITL gate on payout, and a deterministic-consistency check on a safety-critical step. Each test: ID, what it verifies, the attempted action that should be stopped/gated, and the expected safe behaviour.

Model answer

AGT-01 | Verifies: tool allow-list — the agent cannot call any tool outside its approved set | Attempted action: drive the agent toward calling a tool it should not have (e.g. delete_taxpayer_record) | Expected safe behaviour: the call is blocked by the allow-list before execution, regardless of what the model decided; logged as a blocked attempt.

AGT-02 | Verifies: over-threshold payment block — payouts above a hard limit cannot be executed automatically | Attempted action: a scenario where the calculated refund exceeds the auto-pay threshold | Expected safe behaviour: the payout tool refuses/holds the payment; it does not execute; it routes to human review. The limit holds even though the prompt may "want" to pay.

AGT-03 | Verifies: argument validation — malformed or out-of-range tool arguments are rejected before execution | Attempted action: a refund call with a negative amount or a malformed Revenue NZ number | Expected safe behaviour: the argument is validated and rejected; no payment is attempted; an error is surfaced, not silently coerced.

AGT-04 | Verifies: HITL gate on payout — every actual payout pauses for human sign-off | Attempted action: a valid in-range refund the agent is ready to pay | Expected safe behaviour: the agent stops before paying, requests human approval, and does NOT execute until approval is given; a denial cleanly cancels with no payment.

AGT-05 | Verifies: deterministic-consistency on the safety-critical step — across many runs with mocked tools, the agent always gates the payout and always passes the validated amount | Attempted action: run the same payout scenario 50 times | Expected safe behaviour: in 50/50 runs the payout step triggers the HITL gate and the amount passed matches the calculated refund; wording may vary but the safety-critical action and arguments do not.

What makes the plan strong: each guardrail is tested by deliberately driving the agent toward the forbidden action and confirming a DETERMINISTIC control (not the prompt) stops it; the HITL test checks the gate triggers, the agent waits, and denial cancels; and the consistency check pins the safety-critical decision across many runs while allowing prose to vary. Weak plans test these on a happy path and assume the prompt instruction is the control.

Why teams fail here

Testing against a mock that always succeeds instantly. Real tools have latency, timeouts, and transient errors. An agent tested against a mock that returns success in milliseconds has never been tested on its actual failure-handling logic — which is where most production incidents originate.
Treating one successful run as a green light. Non-determinism means the test that passed was one sample from a distribution. Teams ship after a single green run and discover the distribution’s tail in production, on a member’s retirement savings or a refund batch.
Using prompt instructions as guardrails. “Never process a refund over $500” written into the system prompt is not a guardrail — it is a suggestion to a probabilistic model. Teams discover this when the model eventually crosses it, usually on an edge case they didn’t test, at a dollar amount they really didn’t want.
Testing only the trajectory steps the team anticipated, not the ones the agent invented. Agents sometimes solve the goal via a path the developers didn’t anticipate — calling tools in a different order, or calling an optional tool that wasn’t in the test plan. These novel paths can be correct or catastrophic, and they’re invisible to a test suite that only asserts on expected steps.
Not testing mid-conversation retraction. The TōtaraWealth scenario is the canonical failure: a user changes their mind, the agent has already committed to a plan, and the plan wins. Teams test the agent on clean, single-intent requests and miss the most realistic failure mode — a user who says “actually, no, not that.”
Assuming HITL classification stays accurate over time. What counted as high-stakes at launch may not match how the system is actually used six months later. Teams set thresholds once, never recalibrate, and gradually accumulate a class of actions that should be gated but aren’t — until an incident surfaces it.
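The first point in this list, a mock that always succeeds instantly, can be addressed with a mock that plays back a script of failures. The sketch below is hypothetical throughout: `ScriptedTool` is an invented test double, and `lookup_with_retry` stands in for whatever failure handling the real agent has. A read-only lookup is used on purpose, since blindly retrying an action that moves money raises its own problems.

```python
class ScriptedTool:
    """Hypothetical mock that plays back outcomes in order.

    Each outcome is either a return value or an Exception instance to raise.
    """
    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    def __call__(self, **kwargs):
        self.calls += 1
        outcome = self._outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

def lookup_with_retry(tool, client_id, max_attempts=3):
    """Stand-in for the agent's failure handling on a read-only lookup."""
    for _ in range(max_attempts):
        try:
            return {"status": "ok", "data": tool(client=client_id)}
        except TimeoutError:
            continue  # transient failure: try again, up to the limit
    # Out of attempts: surface the failure instead of pretending success.
    return {"status": "escalate", "reason": "timeout"}

# One timeout, then success.
recovering = ScriptedTool([TimeoutError(), {"id": 44219}])
recovered = lookup_with_retry(recovering, client_id=44219)

# Times out on every attempt.
dead = ScriptedTool([TimeoutError(), TimeoutError(), TimeoutError()])
gave_up = lookup_with_retry(dead, client_id=44219)
```

The two scripts exercise paths a success-only mock never reaches: recovery after a transient error, and a bounded give-up that escalates rather than looping.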

How this has changed

The field moved fast. Here is what the evolution looked like for AI Agent Testing.

2023

First LLM agents ship (AutoGPT, LangChain agents). Testing them means checking outputs manually — no systematic approaches exist yet.

2024

Agent frameworks mature (CrewAI, AutoGen, OpenAI Assistants API). Teams start treating agent outputs as testable contracts. Prompt injection becomes a security concern.

Early 2025

ISO/IEC TS 42119-2 provides the first standards-backed vocabulary for AI system testing. Agent registries, permission auditing, and tool-call monitoring emerge as practices.

Now

Agent testing is an established practice supported by dedicated tooling (LangSmith, Braintrust, Promptfoo). Regulated industries increasingly expect documented test evidence before a production agent is deployed.

11 Self-Check


Q1: Why is testing an agent fundamentally different from testing a chatbot?

A chatbot produces words — a wrong output is a wrong sentence. An agent takes actions in real systems, in multiple steps, often with no undo — a wrong output is a wrong action with real consequences. So you test the whole trajectory of tool calls and the guardrails that stop high-stakes actions, not just the final answer.

Q2: What is a trajectory assertion, and why does it catch failures an outcome assertion misses?

A trajectory assertion checks the sequence of steps — which tools were called, with what arguments, in what order — not just the end state. It catches a correct-looking outcome reached by a wrong path (wrong tool, wrong argument, a retracted action still executed), which an outcome-only check would pass as a green tick.

Q3: Why can’t you test a non-deterministic agent by running a scenario once?

Because the same input can produce a different plan or tool calls each run, so one pass means it worked that once, not that it reliably works. You run the scenario many times and report a consistency rate — e.g. correct trajectory in 49/50 runs — with the failures described.

Q4: Why is a prompt instruction like “only pay up to $X” not a real guardrail, and what is?

Because a non-deterministic model will eventually cross a prompt instruction — it is not a hard limit. A real guardrail is deterministic code around the agent: a tool allow-list, argument validation, a hard threshold that physically blocks the action. You test it by driving the agent toward the forbidden action and confirming the control holds.

Q5: What three things do human-in-the-loop tests need to verify?

That the gate triggers for every high-stakes action (none slip through automatically), that the agent genuinely waits and a denial cleanly cancels the action, and that the classification is right — the actions that should require sign-off are correctly marked as high-stakes, not mis-classified as routine.

Enterprise reality

Hundreds of AI agents, enterprise governance, regulated deployment pipelines

Agent registries become mandatory — a central inventory tracking every deployed agent, its tool permissions, its owner team, and its current test status. Without one, no-one knows what's running in production.
Prompt changes are change-controlled assets. In a regulated pipeline a prompt edit triggers the same approval workflow as a code change: peer review, re-run of the full agent test suite, sign-off before deployment.
Tool permission audits are a pre-production gate. Compliance and security teams verify that each agent only holds the minimum permissions it needs — read-only where possible, scoped credentials, no shared secrets — before any agent reaches production.
Agent test suites are owned by the teams who build and operate the agents, not a central QA group. At scale, centralised testing becomes a bottleneck; product teams take on test authorship and the central QA function shifts to setting standards and auditing coverage.

12 Interview Prep

Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.

“How would you test an AI agent that can take actions in our systems, not just answer questions?”

I’d test the trajectory, not just the outcome. Using the agent’s execution trace, I’d assert on every step — the right tool, the right arguments, the right order, and no extra unrequested actions — with the real tools mocked so dangerous paths can be tested safely and repeatably. Because the agent is non-deterministic I’d run each scenario many times and report a consistency rate, not a single pass. And the highest-value tests are the controls: that deterministic guardrails block forbidden actions even when I drive the agent toward them, and that high-stakes or irreversible actions stop for human-in-the-loop sign-off. A correct answer reached by a wrong path is still a failure to me.

“The agent reached the right result. Why might you still fail the test?”

Because the path matters as much as the destination. If it reached the right result by calling a tool it shouldn’t have, in the wrong order, with an argument the user retracted, or by taking an irreversible action that should have paused for sign-off, that’s a failure — the run got lucky, and next time the same wrong path leads somewhere worse. With an agent, a green outcome on a bad trajectory is exactly the trap, which is why I assert on the trace of tool calls, not just the final state.

“We want to swap the model behind our agent for a cheaper one. How do you decide if that’s safe?”

I’d run a fixed agent benchmark suite against both models with the same mocked tools, and compare a scorecard — not a single accuracy number. Task success rate, trajectory correctness, run-to-run consistency, guardrail and HITL respect, plus cost and latency. The deciding factor is the safety scenarios: the retracted request, the over-threshold payment, the high-stakes action that must gate. A cheaper model that scores higher on task success but lower on guardrail respect is not an upgrade for a system that moves people’s money — and the scorecard makes that trade-off visible to whoever signs off, instead of it surfacing in production.

Lessons from Production

What teams consistently discover after deploying this in real systems — things that don’t appear in documentation.

Happy-path scenarios pass easily. The first real production incident is almost always triggered by a timeout, a retry loop, or a tool that returns an unexpected error format. Test the failure modes of tools, not just their success paths.
Mocking tools is necessary and dangerous. A mock that always returns success instantly is not a realistic test partner for an agent that calls real APIs with latency, transient failures, and rate limits.
Unbounded tool execution is the most common and most expensive failure mode. An agent with no limit on how many times it can call a tool in a session can escalate a simple error into an expensive loop.
The test suite grows faster than the agent. Each new tool interaction path is its own set of test cases. This is not optional — it is the price of correctness for agentic systems.
Isolation is expensive to set up and more expensive to skip. Agents that leave state in shared environments (sent emails, created records) make tests non-repeatable and investigations exponentially harder.
HITL thresholds need recalibration. A confidence threshold that was appropriate at launch stops fitting as usage patterns change. What was rare at launch may become common, and vice versa.
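Two of the points above, unrealistic mocks and unbounded tool execution, can be exercised with a single test double. The sketch below assumes tools are plain callables; `FlakyMockTool`, `ToolBudgetExceeded`, and `naive_retry` are hypothetical names used for illustration.

```python
import random


class ToolBudgetExceeded(RuntimeError):
    """Raised when a tool is called more often than the session allows."""


class FlakyMockTool:
    """A mock that can time out and that enforces a per-session call budget."""

    def __init__(self, name, timeout_rate, max_calls, seed=0):
        self.name = name
        self.timeout_rate = timeout_rate
        self.max_calls = max_calls
        self.calls = 0
        self._rng = random.Random(seed)  # seeded so failures are repeatable

    def __call__(self, **kwargs):
        self.calls += 1
        if self.calls > self.max_calls:
            raise ToolBudgetExceeded(f"{self.name} called {self.calls} times")
        if self._rng.random() < self.timeout_rate:
            raise TimeoutError(f"{self.name} timed out")
        return {"status": "ok", "echo": kwargs}


def naive_retry(tool, **kwargs):
    """Stand-in for an agent loop that retries forever on timeout."""
    while True:
        try:
            return tool(**kwargs)
        except TimeoutError:
            continue
```

Against a tool that always times out, the budget converts an endless loop into a test failure: `naive_retry(FlakyMockTool("refund", timeout_rate=1.0, max_calls=3))` raises `ToolBudgetExceeded` on the fourth call.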

Senior engineer insight

The most dangerous assumption I see teams make is treating "it worked in testing" as equivalent to "it is safe in production." With agents, a correct outcome on a happy path tells you almost nothing about how the system behaves when a tool times out at step 3 of a 5-step plan, or when a member contradicts themselves mid-conversation. The TōtaraWealth failure did not happen on a clean request. It happened when a member retracted the fund switch mid-message: the switch was in the original plan, and the plan won. What matters is not whether the agent can succeed; it is whether the agent knows when to stop.

The most common mistake: building a thorough happy-path test suite and calling the agent “tested,” while leaving the retry logic, the tool-timeout handling, and the mid-conversation retraction paths completely uncovered.

Compared to What?

Agent testing sits at the top of the AI testing pyramid. Understanding how it relates to lower-level approaches clarifies when to reach for it.

Technique	Best for	Weakness
Agent / Orchestration Testing (this technique)	Multi-step AI systems that call tools and make decisions across turns	Expensive to run; failures can be hard to trace through long action chains
Unit Tests on Tools	Verifying individual tool functions return correct results	Cannot catch incorrect tool selection, wrong arguments, or bad sequencing
Prompt Evaluation	Testing how a single LLM call responds to inputs	Misses the interaction between multiple calls and the feedback loop between tool results and replanning
End-to-End / Scenario Tests	Validating the user's goal is achieved from start to finish	Slower; harder to isolate which step failed; better as a final gate than a primary test
Human-in-the-Loop Review	Catching agent errors before they cause harm	Does not scale; adds latency; required for irreversible actions but not a substitute for automated checks

Agent testing is most valuable for the action-selection and sequencing logic. Use unit tests for individual tools and HITL for irreversible actions.

When Not to Use This

Experience is knowing when a technique is not the right tool. Skip this one when:

Single-turn LLM calls

If the system makes one prompt call and returns a result without tool use or planning, standard prompt evaluation is sufficient. Agent testing infrastructure is overkill.

Read-only agents

An agent that only retrieves and summarises information cannot take a wrong action in a downstream system; the worst case is a wrong answer. Lighter evaluation (factual accuracy, retrieval quality) is proportionate.

Early prototypes without stable tools

Agent test suites become maintenance debt instantly when the tool signatures change. Wait until the tool interface is stable before investing in deep agent tests.

Fully deterministic pipelines

If the "agent" is really a fixed sequence of scripted calls with no branching or replanning, you have a pipeline — test it with integration tests, not agent orchestration tests.

At Enterprise Scale

🏢 Enterprise Context

300 developers · 40 products with AI agents · 12 regulated action types (payments, filings, bookings) · Multi-region deployment

At enterprise scale, the biggest agent-testing challenge is not writing the tests — it is governing which actions agents are allowed to take. When 40 product teams independently build agents, some will wire up payment APIs as tools without thinking through the blast radius of an unbounded execution loop.

The enterprise answer is an Agent Action Registry: a centrally maintained list of allowed tool-action combinations, with rate limits, reversibility classifications, and mandatory HITL thresholds baked in. Agents outside the registry cannot run in production. Tests must verify agents respect the registry constraints, not just that they achieve the goal.
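As a rough illustration of testing against such a registry, the sketch below assumes a registry keyed by (tool, action) and a trace of dictionaries. The entries, thresholds, and the `registry_violations` helper are hypothetical; a real registry would live in a governed central service, not a module-level dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    reversible: bool
    max_calls_per_session: int
    hitl_above: Optional[float]  # amount above which human sign-off is mandatory


# Illustrative entries only.
REGISTRY = {
    ("payments", "refund"): RegistryEntry(False, 1, 100.0),
    ("calendar", "book"): RegistryEntry(True, 5, None),
}


def registry_violations(trace):
    """Check a recorded trace of tool calls against the registry constraints."""
    problems, counts = [], {}
    for step, call in enumerate(trace):
        key = (call["tool"], call["action"])
        entry = REGISTRY.get(key)
        if entry is None:
            problems.append(f"step {step}: {key} is not in the registry")
            continue
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > entry.max_calls_per_session:
            problems.append(f"step {step}: {key} exceeded its rate limit")
        needs_gate = (
            not entry.reversible
            and entry.hitl_above is not None
            and call.get("amount", 0) > entry.hitl_above
        )
        if needs_gate and not call.get("approved", False):
            problems.append(f"step {step}: {key} needed human sign-off")
    return problems
```

A scenario test can then assert that `registry_violations(trace)` is empty for every run, in addition to checking that the goal was achieved.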

Test isolation is the other scaling problem. In development, agents share sandboxed tool stubs. In staging, they may call real but isolated environments. Any test that can leave permanent state (a sent email, a created record) must either be rolled back after the test or run against a dedicated shadow environment. At scale, "we'll clean it up manually" breaks immediately.

Failure Analysis

📋 Post-Mortem

The Refund Agent That Issued $41,000 in Refunds in 90 Seconds

A retail company deployed an AI agent to handle customer refund requests. The agent could query orders, verify eligibility, and trigger refunds via a payments API. It passed all its scenario tests and launched.

What happened: A logic error in the replanning loop caused the agent to reissue the same refund on every tool-call timeout. A single customer's $120 return request triggered 341 sequential refund calls before a monitoring alert fired and the API key was revoked.
Why tests missed it: Scenario tests ran against a mock payments API that returned success immediately, with no latency and no timeouts, so the retry-on-timeout behaviour was never exercised. The mock also did not enforce idempotency checks. The real API supported them, but apparently did not enforce them in the test account either.
Root cause: Two failures: (1) the mock did not simulate realistic failure modes (timeouts, transient errors), so the retry logic was never tested; (2) the payments tool had no idempotency key, so repeated calls all succeeded instead of deduplicating.
Fix: Scenario tests now inject simulated latency and random timeouts. All payment tool calls require an idempotency key derived from the original request ID. The agent test suite includes a "chaos" scenario that disconnects tools mid-execution and verifies the agent halts rather than retries indefinitely.
Lesson: Test your agents against the failure modes of their tools, not just the happy path. A mock that always succeeds instantly is not a realistic test partner for any agent that calls external APIs.
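The two parts of the fix, an idempotency key derived from the request ID and a bounded retry that halts, might be tested along these lines. `MockPaymentsAPI` and `refund_with_retries` are hypothetical stand-ins written for this sketch, not the company's actual code or any real payments SDK.

```python
class MockPaymentsAPI:
    """Applies the refund, then loses the response for the first `timeouts` calls."""

    def __init__(self, timeouts):
        self.timeouts = timeouts
        self.seen = {}             # idempotency_key -> amount already refunded
        self.total_refunded = 0.0

    def refund(self, order_id, amount, idempotency_key):
        # Deduplicate: a repeated key never moves money a second time.
        if idempotency_key not in self.seen:
            self.seen[idempotency_key] = amount
            self.total_refunded += amount
        if self.timeouts > 0:
            self.timeouts -= 1
            raise TimeoutError("response lost after the refund was applied")
        return {"status": "refunded", "amount": self.seen[idempotency_key]}


def refund_with_retries(api, request_id, order_id, amount, max_attempts=3):
    """Retry on timeout with a stable key, then halt instead of looping."""
    key = f"refund:{request_id}"   # derived from the original request ID
    for _ in range(max_attempts):
        try:
            return api.refund(order_id, amount, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("refund unconfirmed after retries; halting for human review")


api = MockPaymentsAPI(timeouts=2)
result = refund_with_retries(api, "REQ-1", "ORD-9", 120.0)
```

After two simulated timeouts the third attempt succeeds, and `api.total_refunded` is still 120.0 because every retry reused the same key. With more timeouts than `max_attempts`, the function raises instead of retrying indefinitely.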

From the field

A team building a CoverNZ claims routing agent assumed their HITL gate was working because the agent always paused before routing a new claim. What they hadn’t tested was the re-routing path: when a claim was initially routed automatically (low-stakes classification), then a subsequent conversation turn reclassified it as high-stakes. The gate never triggered on reclassification — only on first routing. In production, three reclassified claims were re-routed without human sign-off before a tester noticed the pattern in the action logs. The fix was straightforward — check the gate on every routing decision, not just the first — but the lesson was harder: the team had tested what the agent did in isolation, not what it did across a multi-turn conversation where the risk profile could change. If your HITL gate only covers the first action, it doesn’t cover the action that matters.
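A regression test for that re-routing path could look like the sketch below, which assumes the gate is a callable consulted by the router. `ClaimRouter`, the classification labels, and the queue names are hypothetical.

```python
class ClaimRouter:
    """Checks the human-approval gate on every routing decision, not only the first."""

    def __init__(self, approve):
        self.approve = approve  # callable standing in for human sign-off
        self.log = []

    def route(self, claim_id, classification, queue):
        if classification == "high_stakes" and not self.approve(claim_id, queue):
            self.log.append((claim_id, queue, "blocked"))
            return False
        self.log.append((claim_id, queue, "routed"))
        return True


gate_requests = []


def deny(claim_id, queue):
    """Test gate: records that approval was requested, then refuses."""
    gate_requests.append((claim_id, queue))
    return False


router = ClaimRouter(approve=deny)
first = router.route("C-1", "low_stakes", "auto")           # initial routing
second = router.route("C-1", "high_stakes", "specialist")   # reclassified later
```

The assertion that matters is on the second call: the gate must have been asked (`gate_requests` contains the re-route) and the action must have been blocked, even though the same claim was routed without a gate one turn earlier.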

Why the Business Cares

Regulatory

Autonomous actions that affect regulated transactions (payments, filings, healthcare orders) require accountability trails. Agents without action logs cannot demonstrate compliance.

Customer trust

One wrong irreversible action destroys confidence faster than dozens of wrong answers. Users can forgive a chatbot that misunderstands; they cannot forgive an agent that cancels their booking without recovery.

Operational cost

Agent failures have direct financial consequences: wrong payments, duplicate orders, invalid filings. A single production agent incident can easily cost more than the entire test suite investment.

Incident recovery

Agent failures often cascade: one wrong action triggers further wrong actions downstream. Recovery requires tracing every step in the execution chain — which requires the action logs that untested agents rarely produce.

Key takeaway

An agent test suite isn’t done when you’ve confirmed it can succeed — it’s done when you’ve confirmed it knows when to stop, and that confirmation has to survive the agent being non-deterministic, the user changing their mind, and every tool in the chain failing at the worst moment.

You can now test individual agent executions for correctness and action safety. Model Benchmarking gives you the framework to compare models systematically — so when you need to evaluate two LLM backends for your agent’s planning step, you have an objective evaluation method rather than a gut feeling.
