Test Lead · Autonomous Systems Quality

AI Agent Testing

AI agents are autonomous systems that take actions: they call APIs, query databases, make decisions, and interact with external systems. Testing them requires verifying goal achievement, preventing hallucinations, controlling costs, and maintaining audit trails. This is the frontier of testing.

Test Lead CT-GenAI Ch 5 — Agent Governance ~20 min read + exercises

1 The Hook

A company deploys an AI agent to automate customer support. The agent can: read customer emails, search a knowledge base, call the payment API to refund customers, and send replies. It works well for 95% of cases.

Then it happens: a customer emails "I have not been charged yet" (meaning they expect an invoice). The agent interprets this as "I was charged incorrectly" and processes a $50 refund without verification. Another email says "How do I cancel my subscription?" The agent says "I will cancel it" and does, without asking for confirmation. In 24 hours, the agent processes 40 unauthorised refunds and cancels 20 subscriptions the customers did not intend to lose.

The agent was not malicious. It was confident and wrong — hallucinating intent from ambiguous customer messages. But unlike a human support agent, it took action immediately without safety guardrails. It did not ask for confirmation, did not escalate to a human, and did not rate-limit itself.

AI agent testing is about preventing exactly this: verifying that agents achieve their goals safely, with guardrails, cost controls, and audit trails.

Senior engineer insight

The thing that changed how I think about AI agent testing was realising the failure surface is not the LLM — it is the trust boundary around it. An agent that calls "refund" and "cancel" with the same confidence whether it is 95% sure or 45% sure is far more dangerous than a slow one. Once I started testing for confidence calibration (does the agent ask for confirmation when it should?) and not just goal success, the defect density in production dropped significantly. The goal is not a fully autonomous agent — it is a correctly supervised one.

Most common mistake: teams test the happy path obsessively but never test what the agent does when it is uncertain — and uncertainty is exactly when disasters happen.

From the field

We were testing an agentic document-processing system for a government contractor bound by the NZ Algorithm Charter — the charter requires that automated decisions affecting people are explainable and subject to human review. The team assumed "human-in-the-loop" meant someone could review decisions after the fact. What we discovered during testing was that the agent was making irreversible downstream changes (writing to a case management system) before the review queue was ever consulted. The charter's meaningful oversight requirement was technically met on paper but completely bypassed in practice. We redesigned the workflow to enforce a hard gate: the agent drafts, a human approves, the agent commits — no exceptions for "low-risk" cases. The lesson that generalises: audit trails and review queues are not the same as actual oversight; test the sequence of events, not just the existence of a log.

2 The Rule

AI agents are tools that take real actions with real consequences. Testing them requires verification of: (1) goal achievement (did the agent accomplish what it was supposed to?), (2) safety (did it stay within boundaries?), (3) tool correctness (did it call the right APIs with the right parameters?), and (4) graceful degradation (what happens when the agent is wrong?). Cost control, audit trails, and human-in-the-loop approval are non-negotiable.

3 The Analogy

Analogy

Testing an AI agent is like testing a junior employee with access to dangerous tools.

A junior employee can be smart and well-meaning, but if you give them the keys to the bank vault, you need guardrails: supervision, approval limits, audit logs, and a manager who can override them. You do not let them loose and hope they do the right thing. Similarly, you do not deploy an AI agent with API access without: cost limits, per-action approval thresholds, an audit trail, and a human who can step in if something goes wrong.

4 What Are AI Agents?

Definition

An AI agent is an autonomous system that: (1) accepts a goal or instruction, (2) reasons about how to achieve it, (3) selects and calls tools or APIs to take action, (4) interprets results, and (5) iterates until the goal is achieved or abandoned. Unlike a traditional chatbot that responds to queries, an agent acts.

Example tools an agent might have

API calls (refund customer, update order status, send email)
Database queries (search customer history, check inventory)
Web searches or knowledge base lookups
File operations (read document, generate report)
Calculator or logic operations

Where agents differ from traditional software

Traditional software: input → deterministic logic → output. Agent: goal → LLM reasoning → tool selection (non-deterministic) → action → interpret result → loop. Agents are probabilistic and can fail in unexpected ways.
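The loop described above can be sketched in a few lines. This is a minimal illustration, not a real framework: `runAgent`, `scriptedPlanner`, and the `lookupOrder` tool are hypothetical names, and the planner is a plain function standing in for the LLM so that the control flow is visible.

```javascript
// Minimal agent loop sketch. `planner` stands in for the LLM (hypothetical):
// given the goal and history, it returns { tool, params } or { done: true }.
function runAgent(goal, planner, tools, maxSteps = 5) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const decision = planner(goal, history);
    if (decision.done) {
      return { status: "achieved", history };
    }
    const tool = tools.get(decision.tool);
    if (!tool) {
      // The planner named a tool that is not registered: stop, do not guess.
      return { status: "abandoned", reason: `unknown tool: ${decision.tool}`, history };
    }
    const result = tool(decision.params);
    history.push({ tool: decision.tool, params: decision.params, result });
  }
  // The step cap bounds cost when the planner never declares the goal done.
  return { status: "abandoned", reason: "step limit reached", history };
}

// Usage with a scripted planner that looks up an order, then stops.
const tools = new Map([["lookupOrder", ({ id }) => ({ id, total: 50 })]]);
const scriptedPlanner = (goal, history) =>
  history.length === 0 ? { tool: "lookupOrder", params: { id: "12345" } } : { done: true };
const outcome = runAgent("Check order 12345", scriptedPlanner, tools);
```

Swapping the scripted planner for a real model is what makes the loop probabilistic; the step cap and the unknown-tool check are the parts a test can pin down deterministically.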

5 Why Agent Testing Is Hard

Non-determinism

The same goal prompt might produce different tool sequences on different runs. One run calls API A then B. Another run calls B then A. Both achieve the goal but through different paths. Testing cannot assume a fixed execution path.
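One way to test without a fixed execution path is to assert on invariants of the recorded tool calls. The sketch below assumes the test harness can capture a trace of `{ tool, params }` objects; the tool names (`lookupOrder`, `refund`, `notifyCustomer`, `writeAudit`) are invented for illustration.

```javascript
// Assert on outcomes and invariants of a tool-call trace, not on its exact order.
// The trace shape ({ tool, params }) and tool names are assumptions.
function checkRefundTrace(trace, orderId) {
  const names = trace.map((call) => call.tool);
  const refunds = trace.filter((call) => call.tool === "refund");
  const problems = [];
  if (refunds.length !== 1) problems.push("expected exactly one refund call");
  if (refunds.some((call) => call.params.orderId !== orderId)) {
    problems.push("refund targeted the wrong order");
  }
  if (!names.includes("notifyCustomer")) problems.push("customer was not notified");
  // An ordering invariant that does matter: verification must precede the refund.
  const verifyAt = names.indexOf("lookupOrder");
  const refundAt = names.indexOf("refund");
  if (refundAt !== -1 && (verifyAt === -1 || verifyAt > refundAt)) {
    problems.push("order was not verified before the refund");
  }
  return problems;
}

// Two runs that take different paths but satisfy the same invariants.
const runA = [
  { tool: "lookupOrder", params: { orderId: "12345" } },
  { tool: "refund", params: { orderId: "12345" } },
  { tool: "notifyCustomer", params: {} },
  { tool: "writeAudit", params: {} },
];
const runB = [
  { tool: "lookupOrder", params: { orderId: "12345" } },
  { tool: "refund", params: { orderId: "12345" } },
  { tool: "writeAudit", params: {} },
  { tool: "notifyCustomer", params: {} },
];
const problemsA = checkRefundTrace(runA, "12345"); // empty: all invariants hold
const problemsB = checkRefundTrace(runB, "12345"); // empty: different order, same outcome
```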

Hallucination and false confidence

The agent might hallucinate a tool that does not exist ("I will call the discount-apply endpoint" — no such endpoint exists). Or it might misinterpret user intent ("They want to cancel" when they asked "How do I cancel?"). Or confidently choose the wrong tool for a goal.

Expensive operations

An agent with API access might call expensive endpoints repeatedly. A single goal that triggers 10 LLM inference calls at $0.10 each costs $1; a runaway agent looping over many goals could cost thousands in minutes. Testing must include cost budgeting and rate limiting.

State and multi-step workflows

Agents often require multiple tool calls to complete a goal. If step 2 fails but the agent does not notice and continues, the final outcome is wrong. Testing must verify not just the final state, but the sequence of actions and error recovery.

Audit and compliance requirements

Every action the agent takes must be logged and auditable. If an agent processes a refund, there must be a permanent record of who requested it, what the agent decided, and what action it took. This adds complexity to testing — you are testing not just functionality, but auditability.

6 Agent Behavior Testing

Happy path: goal achievement

Does the agent achieve the goal correctly? Test case: "Refund customer order 12345 because they requested it." Expected: agent calls payment API, processes refund, logs action, notifies customer. Run the test 10 times — does it succeed consistently?
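Running the same test several times and checking the pass rate can be wrapped in a small helper. This sketch is synchronous for brevity (a real agent run would normally be async), and `passRate` and `flakyAgentRun` are hypothetical names; the stub simply fails on its fourth run to show how the rate is computed.

```javascript
// Run a check repeatedly and report the fraction of runs that passed.
// `runOnce` returns true on success; a thrown error counts as a failure.
function passRate(runOnce, runs = 10) {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      if (runOnce(i) === true) passed++;
    } catch {
      // Treat exceptions as failed runs rather than aborting the whole batch.
    }
  }
  return passed / runs;
}

// Stand-in for "run the agent and verify the refund outcome".
const flakyAgentRun = (i) => i !== 3; // fails only on the fourth run
const rate = passRate(flakyAgentRun, 10);
const meetsBar = rate >= 0.9; // the acceptance bar is a team decision, not a standard
```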

Error recovery

What if a tool fails? Test case: "Refund the customer." Simulate the payment API timing out or returning 503 (service unavailable). Does the agent retry? Does it escalate to a human? Or does it give up silently? Expected: the agent should retry with backoff, log the failure, and escalate if retries fail.

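// Example: test agent error recovery (Jest-style).
// Assumes PaymentAgent and the shared auditLog it writes to are imported from
// the system under test; both names are illustrative.
test("agent escalates after repeated API timeouts", async () => {
  // Mock payment API whose refund call always rejects with a timeout.
  const mockPaymentAPI = {
    refund: jest.fn().mockRejectedValue(new Error("Timeout")),
  };
  const agent = new PaymentAgent(mockPaymentAPI);

  const result = await agent.refund("ORDER_123");

  expect(result.status).toBe("escalated_to_human");
  expect(result.retries).toBe(3);  // Tried 3 times
  expect(auditLog.lastEntry.action).toBe("escalation");
});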
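The behaviour that test expects (retry with backoff, log each failure, escalate when retries run out) might look roughly like the following. It is a sketch under assumptions: `callWithRetry`, its option names, and the event shape are invented here, and the delays are kept tiny so the example runs quickly.

```javascript
// Retry-then-escalate sketch. `callTool` is any async function.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry(callTool, { maxAttempts = 3, baseDelayMs = 10, onEvent = () => {} } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = await callTool();
      return { status: "ok", value, attempts: attempt };
    } catch (error) {
      // Log every failure so the audit trail shows what was tried.
      onEvent({ action: "tool_failure", attempt, message: error.message });
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff: 10 ms, 20 ms, ...
      }
    }
  }
  // Retries exhausted: hand over to a human instead of failing silently.
  onEvent({ action: "escalation", attempts: maxAttempts });
  return { status: "escalated_to_human", attempts: maxAttempts };
}
```

A test can pass an `onEvent` collector and a `callTool` that always rejects, then assert that the last recorded event is the escalation.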

Boundary conditions

When should the agent refuse a goal? Test case: "Refund $10,000." If the customer only spent $50, this is suspicious. Expected: agent should either refuse, ask for human approval, or escalate. It should not blindly process an unreasonable request.
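A boundary check like this can live in a small policy function that the agent must pass before any refund tool call. The sketch below is illustrative only: `evaluateRefund`, the decision labels, and the $100 approval threshold are assumptions, and amounts are held in cents to avoid floating-point rounding.

```javascript
// Policy check run before a refund is attempted.
// Amounts are in cents; the default threshold (10000 = $100) is an assumption.
function evaluateRefund(requestedCents, orderTotalCents, approvalThresholdCents = 10000) {
  if (!Number.isInteger(requestedCents) || requestedCents <= 0) {
    return { decision: "refuse", reason: "amount must be a positive whole number of cents" };
  }
  if (requestedCents > orderTotalCents) {
    return { decision: "escalate", reason: "refund exceeds what the customer paid" };
  }
  if (requestedCents > approvalThresholdCents) {
    return { decision: "needs_approval", reason: "amount is above the approval threshold" };
  }
  return { decision: "allow", reason: "within limits" };
}

// "Refund $10,000" against a $50 order is escalated, not processed.
const suspicious = evaluateRefund(1000000, 5000);
// A negative amount is rejected outright.
const negative = evaluateRefund(-5000, 5000);
// A full refund of a $50 order is within limits.
const ordinary = evaluateRefund(5000, 5000);
```

Keeping the policy outside the model means the test does not depend on the LLM noticing that the request is unreasonable.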

Tool correctness

Is the agent calling the right tool with the right parameters? Test case: "Send an email to customer@example.com saying 'Your order is ready.'" Expected: agent calls email API with correct recipient, subject, and body. Audit log shows exactly what was sent.

Idempotency

If the agent runs the same goal twice, does it process the action twice? Test case: "Refund order 12345." Run it twice. Expected: refund is processed once, not twice (idempotency key prevents duplication).
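An idempotency key can be sketched with an in-memory map. This is a simplification: `createRefundService` is a hypothetical wrapper, a real system would persist keys durably, and keying on the order ID alone assumes one refund per order (partial refunds would need a richer key).

```javascript
// In-memory idempotency guard around a payment API (illustrative only).
function createRefundService(paymentApi) {
  const completed = new Map(); // idempotency key -> result of the first call
  return {
    refund(orderId, amountCents) {
      const key = `refund:${orderId}`;
      if (completed.has(key)) {
        // Same goal seen again: return the stored result, do not call the API.
        return { ...completed.get(key), duplicate: true };
      }
      const result = paymentApi.refund(orderId, amountCents);
      completed.set(key, result);
      return { ...result, duplicate: false };
    },
  };
}

// Usage: run "Refund order 12345" twice and count real API calls.
let apiCalls = 0;
const service = createRefundService({
  refund: (orderId, amountCents) => {
    apiCalls++;
    return { orderId, amountCents, status: "refunded" };
  },
});
const first = service.refund("12345", 5000);
const second = service.refund("12345", 5000);
```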

7 Hallucination Testing

Test 1: Non-existent tool detection

Scenario: "Apply the premium_discount endpoint to the customer's account." There is no such endpoint. Expected: agent should recognise the tool does not exist, not hallucinate its existence, and either ask for clarification or refuse the goal.

Test case: Non-existent tool

Input: Goal requests a tool that is not defined.
Expected behavior: Agent logs "Tool not found" and either refuses the goal or asks for clarification.
Failure mode: Agent invents parameters and attempts the call, which fails silently.
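A simple guard for this test case is to resolve every requested tool against an explicit registry before executing anything. In the sketch below, `resolveToolCall`, the registry contents, and the returned field names are all hypothetical.

```javascript
// Resolve a requested tool against a registry before executing anything.
function resolveToolCall(registry, requested) {
  if (!registry.has(requested.tool)) {
    // Unknown tool: record it and ask, rather than inventing a call.
    return { ok: false, log: "Tool not found", action: "ask_for_clarification", tool: requested.tool };
  }
  return { ok: true, handler: registry.get(requested.tool) };
}

const registry = new Map([
  ["refund", () => "refunded"],
  ["send_email", () => "sent"],
]);
const hallucinated = resolveToolCall(registry, { tool: "premium_discount", params: {} });
const real = resolveToolCall(registry, { tool: "refund", params: {} });
```

Using a `Map` rather than a plain object avoids accidental matches on inherited property names such as `toString`.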

Test 2: Misinterpretation of intent

Scenario: Customer email: "I want to know how to cancel my subscription." The agent interprets "cancel" as permission to actually cancel. Expected: agent should clarify — "Would you like to cancel your subscription?" with explicit yes/no before taking action.

Test 3: Hallucinated facts in reasoning

Scenario: Agent says "The customer has been with us for 10 years (they have, according to the database), so I will apply an automatic loyalty discount." But the discount the agent applies does not exist in the discount API. Expected: agent should verify that the discount is available before using it.

Test 4: False output assertion

Scenario: Agent calls refund API (successful), then reports "Refund failed" because it misread the response. Expected: agent should correctly interpret API responses and not contradict its own actions.

8 Tool Integration Testing

Tool availability and versioning

Does the agent have access to the correct version of each tool? If an API is deprecated and replaced, does the agent know? Test: Create a tool registry and verify the agent is using current versions.

Parameter validation

Test case: Agent tries to refund with negative amount (-$50). Expected: tool rejects invalid parameters and agent catches the error.

Side effects and state changes

Tools often modify state. If an agent calls "charge customer" and then calls "cancel transaction," the final state must be consistent. Test: Verify state transitions are logical and audit logs show the full sequence.

Cascading failures

Test case: Agent calls tool A, which succeeds but returns unexpected data. Tool B is called with that data and fails. Expected: agent should detect the failure chain and escalate, not continue with broken data.
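One way to stop a failure chain is to validate tool A's output before tool B ever sees it. The sketch below uses invented names (`runPipeline`, `customerId`) and a hand-written validator; it is meant only to show where the check sits.

```javascript
// Validate tool A's output before handing it to tool B.
function runPipeline(toolA, validateA, toolB) {
  const a = toolA();
  const problems = validateA(a);
  if (problems.length > 0) {
    // Unexpected data from A: escalate instead of continuing with it.
    return { status: "escalated", stage: "toolA output", problems };
  }
  try {
    return { status: "ok", value: toolB(a) };
  } catch (error) {
    return { status: "escalated", stage: "toolB", problems: [error.message] };
  }
}

// Tool A "succeeds" but returns a null customer ID; tool B must not run.
let toolBCalls = 0;
const pipelineResult = runPipeline(
  () => ({ customerId: null }),
  (out) => (typeof out.customerId === "string" ? [] : ["customerId missing or not a string"]),
  (out) => {
    toolBCalls++;
    return `charged ${out.customerId}`;
  }
);
```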

9 Safety & Governance

Cost control and rate limiting

Test: Verify that the agent has cost limits per goal, per day, and per user. If a goal costs more than $1, does it require approval? If the agent has already spent $5,000 today, does it refuse new goals? Implementation: Add a cost-tracking middleware that blocks calls exceeding limits.

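// Middleware for cost control.
// estimateCost and agent.addMiddleware are illustrative; adapt to your agent framework.
const dailyBudget = 5000; // USD per day, matching the example above
let dailySpend = 0;       // running total; reset by a daily job

const costMiddleware = {
  beforeToolCall: (tool, params) => {
    const estimatedCost = estimateCost(tool, params);
    // Compare the projected daily total, not just this call, against the budget.
    if (dailySpend + estimatedCost > dailyBudget) {
      throw new Error("Exceeds daily budget");
    }
    dailySpend += estimatedCost;
  }
};
agent.addMiddleware(costMiddleware);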

Audit trails and logging

Test: For every tool call, the audit log must record: (1) who requested the goal, (2) what goal was requested, (3) what tool was called, (4) what parameters were passed, (5) what result was returned, (6) timestamp. Compliance: logs must be immutable and retained for legal/audit purposes.
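The six required fields can be enforced at the point where entries are written. The following is a sketch with assumed names (`createAuditLog`, the field names); `Object.freeze` is shallow and in-process only, so it illustrates the intent rather than providing real immutability, which would need append-only storage.

```javascript
// Append-only audit log sketch that rejects incomplete entries.
const REQUIRED_FIELDS = ["requestedBy", "goal", "tool", "params", "result", "timestamp"];

function createAuditLog() {
  const entries = [];
  return {
    append(entry) {
      const missing = REQUIRED_FIELDS.filter((field) => entry[field] === undefined);
      if (missing.length > 0) {
        throw new Error(`audit entry missing: ${missing.join(", ")}`);
      }
      // Freeze a copy so later code cannot rewrite top-level fields.
      entries.push(Object.freeze({ ...entry }));
    },
    get lastEntry() {
      return entries[entries.length - 1];
    },
    get size() {
      return entries.length;
    },
  };
}

const auditLog = createAuditLog();
auditLog.append({
  requestedBy: "user-42",
  goal: "Refund order 12345",
  tool: "refund",
  params: { orderId: "12345", amountCents: 5000 },
  result: { status: "refunded" },
  timestamp: new Date().toISOString(),
});
```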

Approval workflows for high-stakes actions

Refunds above a threshold, account deletions, and data exports should require human approval before the agent takes action. Test: Verify that the agent requests approval and waits before proceeding.
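The "request approval and wait" requirement can be modelled as a gate with three steps: draft, approve, commit. This sketch uses hypothetical names (`createApprovalGate`, `draft`, `approve`, `commit`) and keeps pending drafts in memory.

```javascript
// Hard approval gate: the agent drafts, a human approves, then the action commits.
function createApprovalGate(execute) {
  const pending = new Map(); // draft ID -> { action, approvedBy }
  let nextId = 1;
  return {
    draft(action) {
      const id = `draft-${nextId++}`;
      pending.set(id, { action, approvedBy: null });
      return id;
    },
    approve(id, reviewer) {
      const entry = pending.get(id);
      if (!entry) throw new Error(`unknown draft: ${id}`);
      entry.approvedBy = reviewer;
    },
    commit(id) {
      const entry = pending.get(id);
      if (!entry) throw new Error(`unknown draft: ${id}`);
      if (!entry.approvedBy) throw new Error("commit blocked: no human approval recorded");
      pending.delete(id); // a draft can be committed at most once
      return execute(entry.action, entry.approvedBy);
    },
  };
}

const executed = [];
const gate = createApprovalGate((action, reviewer) => {
  executed.push({ action, reviewer });
  return "committed";
});
const draftId = gate.draft({ type: "refund", orderId: "12345", amountCents: 25000 });
```

The test that matters is the negative one: calling `commit` before `approve` must throw and must leave `executed` empty.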

Rollback and reversal

If an agent makes a mistake (incorrect refund, wrong data), can it be undone? Test: Verify that critical actions are reversible and that reversal is logged.

Human-in-the-loop for ambiguity

If the agent is uncertain about intent (for example, confidence below 70%), it should ask a human before taking action. Test: Verify that unclear goals are escalated, not guessed at.
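A confidence gate can be expressed as a small routing function. The sketch assumes the agent exposes a numeric confidence score between 0 and 1, which many setups do not provide directly; `nextStep`, the action names, and the step labels are hypothetical.

```javascript
// Route an intended action based on confidence and how destructive it is.
const DESTRUCTIVE_ACTIONS = new Set(["refund", "cancel_subscription", "delete_account"]);

function nextStep({ action, confidence }, threshold = 0.7) {
  // Below the threshold the agent does not act at all.
  if (confidence < threshold) return "escalate_to_human";
  // Even when confident, destructive actions need an explicit yes from the customer.
  if (DESTRUCTIVE_ACTIONS.has(action)) return "ask_customer_to_confirm";
  return "proceed";
}

const unsure = nextStep({ action: "cancel_subscription", confidence: 0.45 });
const sureButDestructive = nextStep({ action: "cancel_subscription", confidence: 0.95 });
const harmless = nextStep({ action: "search_knowledge_base", confidence: 0.9 });
```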

10 Production Monitoring

Goal success rate

Track: percentage of goals achieved correctly, percentage escalated to human, percentage failed. If success rate drops below baseline, investigate.

Error patterns

Analyse errors: are certain types of goals failing more often (e.g., refunds over $100)? Are certain users experiencing higher failure rates (a possible security probe)? Errors signal where the agent needs retraining or its boundaries need adjusting.

Cost tracking

Monitor LLM API costs per goal. If average cost per refund jumps from $0.05 to $0.50, the agent is reasoning longer or making more calls — investigate.
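A cost jump like this can be flagged by comparing a recent window against a baseline. The sketch below is a deliberately simple ratio check with assumed names (`averageCost`, `costSpike`) and an assumed alert ratio of 3; production monitoring would more likely live in a metrics platform.

```javascript
// Compare recent average cost per goal against a baseline window.
function averageCost(costs) {
  if (costs.length === 0) throw new Error("no cost samples");
  return costs.reduce((sum, cost) => sum + cost, 0) / costs.length;
}

function costSpike(baselineCosts, recentCosts, maxRatio = 3) {
  const baseline = averageCost(baselineCosts);
  const recent = averageCost(recentCosts);
  return { baseline, recent, ratio: recent / baseline, alert: recent > baseline * maxRatio };
}

// Average cost per refund moves from about $0.05 to $0.50: roughly a 10x jump.
const spike = costSpike([0.05, 0.05, 0.05], [0.5, 0.5]);
// Small fluctuations stay below the alert ratio.
const steady = costSpike([0.05, 0.05, 0.05], [0.06, 0.05]);
```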

Tool call frequency

Which tools does the agent call most? Are certain tools being called unexpectedly often? High frequency of a destructive tool (like refund) might indicate a bug.

User feedback and appeals

Collect feedback: "The agent's decision was wrong" or "I did not want them to do that." Use this to identify hallucination patterns and retrain the agent.

Pro tip: Treat agent monitoring like fraud detection — look for anomalies. A sudden spike in refund volume, a new tool being called unexpectedly, or a drop in success rate are all signals to investigate and potentially roll back or disable the agent.

11 Common Mistakes

Mistake 1: No cost limits

Why it happens: Teams focus on functionality and defer cost control to "later."
The fix: Implement cost limits before launch. Cap spending per goal, per user, per day. Make the agent refuse expensive actions unless explicitly approved. A runaway agent can cost thousands in hours.

Mistake 2: No audit trail

Why it happens: Auditing feels like overhead. The team thinks "we can check the database if something goes wrong."
The fix: Build audit logging from day one. Every tool call must be logged immutably. You need this for compliance, debugging, and accountability. It is not optional.

Mistake 3: Trusting the agent without verification

Why it happens: The agent is confident and articulate, so teams assume it is correct.
The fix: Implement verification for critical actions. If the agent refunds $100, it should verify the amount with the database before claiming success. Do not trust confidence.

Mistake 4: No human approval for high-stakes actions

Why it happens: Automation is the goal, so teams skip the approval step.
The fix: Some actions (refunds, deletions, exports) require human sign-off. The agent can draft the action, but a human must approve. This slows things down but prevents disasters.

Mistake 5: No rollback or reversal mechanism

Why it happens: Teams assume the agent will never make mistakes.
The fix: Design every critical action to be reversible. Store transaction IDs and implement a reversal tool. If the agent makes a mistake, it can (under human authority) undo it.

Why teams fail here

Testing only the success path: agents behave well under ideal inputs but have never been tested on ambiguous, adversarial, or under-specified goals — which is exactly what production traffic delivers.
Conflating logging with governance: the audit trail records what happened, but no one defined who reviews it, how often, or what threshold triggers a rollback — so it is an audit trail that audits nothing.
Cost controls added after launch: runaway agent spend is treated as an ops problem, not a test coverage gap, so the first time anyone discovers the per-goal budget limit does not work is when the invoice arrives.
Non-determinism treated as a bug: testers expect reproducible output and declare the agent "flaky" rather than designing tests that assert on outcomes and invariants instead of fixed execution sequences.

Key takeaway

An AI agent without tested guardrails is not an automation — it is a liability with an API key; your job as test lead is to prove the guardrails hold before anyone else finds out they do not.

12 Self-Check

Q1: What is the difference between an agent and a chatbot?

A chatbot responds to queries with information. An agent takes action: it calls APIs, modifies data, makes decisions, and interacts with external systems. A chatbot tells you "Your balance is $100." An agent can refund your account, update your status, or cancel your subscription. Testing agents requires verification of actions and consequences, not just information accuracy.

Q2: How do you test an agent for hallucination?

Test for hallucinated tools (agent invents tools that do not exist), hallucinated facts (agent claims data that is false), and hallucinated intent (agent misinterprets a customer request as permission to take action). Use test cases with ambiguous or novel inputs and verify the agent either clarifies or refuses, rather than confidently hallucinating.

Q3: Why is cost control essential for AI agents?

Agents with API access can make expensive calls repeatedly. A runaway agent making 100 LLM inferences at $0.10 each costs $10. Scale this to many users and requests, and an agent can spend thousands of dollars per day. Cost control (rate limiting, per-goal budgets) is essential to prevent financial disasters. Testing must verify that cost limits work and that escalation triggers as spending approaches them.

Q4: What should an audit trail for an agent capture?

An audit trail must capture: (1) user/request origin, (2) goal request, (3) agent reasoning/decision, (4) tool calls with parameters, (5) tool responses, (6) final action taken, (7) timestamp. This enables post-incident analysis, compliance verification, and user accountability. The trail must be immutable and retained for legal purposes.

Q5: How do you test an agent for error recovery?

Simulate tool failures: API timeouts, 503 errors, invalid responses, missing data. Verify the agent retries with backoff, detects cascading failures, and escalates to human if the error cannot be resolved. The agent should not silently fail or continue with corrupted state. Every error should be logged and possibly trigger an alert.

13 Interview Prep

Real questions from test lead interviews with AI agent focus.

"Have you tested an AI agent? What was your approach?"

Yes, I tested a customer support agent that could refund orders and update statuses. My approach had three parts: (1) Functional testing — verify the agent achieves goals correctly, handles errors, and recovers from failures. (2) Safety testing — verify cost limits, approval workflows, and audit logging work correctly. (3) Hallucination testing — verify the agent does not misinterpret customer intent or invent tools. I ran the agent 50+ times on diverse goals and monitored for cost, latency, and error patterns. I also tested error scenarios (API failures, timeouts) to verify it did not take unintended actions.

"What is the most important thing to test in an AI agent?"

Preventing unintended actions. An agent that misinterprets a customer's email and processes a refund is worse than an agent that does nothing. So I test: (1) Does the agent have guardrails (cost limits, approval thresholds)? (2) Does it log everything (audit trail)? (3) Can mistakes be undone (reversibility)? (4) Does it escalate when uncertain (human-in-the-loop)? These safety features are more important than speed or polish.

"How would you design a test plan for an agent that makes financial transactions?"

I would test: (1) Goal achievement: the agent correctly interprets goals and executes transactions. (2) Boundary conditions: does it refuse invalid amounts (negative, zero, greater than the account balance)? (3) Idempotency: if a refund request is submitted twice (for example, due to a retry), is the refund issued only once? (4) Cost control: every transaction is logged and costs are tracked. (5) Approval workflows: transactions above a threshold require human approval. (6) Error recovery: if a payment API is down, does the agent retry or escalate? (7) Reversal: can an incorrect transaction be undone? I would also do production monitoring: track transaction success rate, error patterns, and user appeals to catch drift.
