AI Agent Testing
AI agents are autonomous systems that take actions: they call APIs, query databases, make decisions, and interact with external systems. Testing them requires verifying goal achievement, preventing hallucinations, controlling costs, and maintaining audit trails. This is the frontier of testing.
1 The Hook
A company deploys an AI agent to automate customer support. The agent can: read customer emails, search a knowledge base, call the payment API to refund customers, and send replies. It works well for 95% of cases.
Then it happens: a customer emails "I have not been charged yet" (meaning they expect an invoice). The agent interprets this as "I was charged incorrectly" and processes a $50 refund without verification. Another email says "How do I cancel my subscription?" The agent says "I will cancel it" and does, without asking for confirmation. In 24 hours, the agent processes 40 unauthorised refunds and cancels 20 subscriptions the customers did not intend to lose.
The agent was not malicious. It was confident and wrong — hallucinating intent from ambiguous customer messages. But unlike a human support agent, it took action immediately without safety guardrails. It did not ask for confirmation, did not escalate to a human, and did not rate-limit itself.
AI agent testing is about preventing exactly this: verifying that agents achieve their goals safely, with guardrails, cost controls, and audit trails.
2 The Rule
AI agents are tools that take real actions with real consequences. Testing them requires verification of: (1) goal achievement (did the agent accomplish what it was supposed to?), (2) safety (did it stay within boundaries?), (3) tool correctness (did it call the right APIs with the right parameters?), and (4) graceful degradation (what happens when the agent is wrong?). Cost control, audit trails, and human-in-the-loop approval are non-negotiable.
3 The Analogy
Testing an AI agent is like testing a junior employee with access to dangerous tools.
A junior employee can be smart and well-meaning, but if you give them the keys to the bank vault, you need guardrails: supervision, approval limits, audit logs, and a manager who can override them. You do not let them loose and hope they do the right thing. Similarly, you do not deploy an AI agent with API access without: cost limits, per-action approval thresholds, an audit trail, and a human who can step in if something goes wrong.
4 What Are AI Agents?
Definition
An AI agent is an autonomous system that: (1) accepts a goal or instruction, (2) reasons about how to achieve it, (3) selects and calls tools or APIs to take action, (4) interprets results, and (5) iterates until the goal is achieved or abandoned. Unlike a traditional chatbot that responds to queries, an agent acts.
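The five steps in this definition can be sketched as a loop. This is a minimal illustration, not a real framework: `plan_next_step` is a hypothetical stand-in for the LLM's reasoning, and `lookup_order` is an invented tool.

```python
from typing import Callable, Optional

# Hypothetical tools; a real agent would wrap API or database calls here.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: delivered",
}

def plan_next_step(goal: str, history: list) -> Optional[tuple[str, str]]:
    """Stand-in for LLM reasoning: returns (tool, argument), or None when done."""
    if not history:
        return ("lookup_order", "12345")
    return None  # this scripted planner treats the goal as achieved after one lookup

def run_agent(goal: str, max_steps: int = 5) -> list[tuple[str, str, str]]:
    history: list[tuple[str, str, str]] = []
    for _ in range(max_steps):               # hard cap so the loop always terminates
        step = plan_next_step(goal, history)  # reason about what to do next
        if step is None:                      # goal achieved or abandoned
            break
        tool, arg = step
        result = TOOLS[tool](arg)             # take the action
        history.append((tool, arg, result))   # keep the result for the next iteration
    return history
```

The `max_steps` cap is a design choice worth testing on its own: it is the simplest guard against an agent that never decides it is finished.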
Example tools an agent might have
- API calls (refund customer, update order status, send email)
- Database queries (search customer history, check inventory)
- Web searches or knowledge base lookups
- File operations (read document, generate report)
- Calculator or logic operations
Where agents differ from traditional software
Traditional software: input → deterministic logic → output. Agent: goal → LLM reasoning → tool selection (non-deterministic) → action → interpret result → loop. Agents are probabilistic and can fail in unexpected ways.
5 Why Agent Testing Is Hard
Non-determinism
The same goal prompt might produce different tool sequences on different runs. One run calls API A then B. Another run calls B then A. Both achieve the goal but through different paths. Testing cannot assume a fixed execution path.
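One way to avoid assuming a fixed path is to compare the set of tool calls rather than their order. The sketch below assumes each run produces a list of tool names; the names themselves are hypothetical.

```python
from collections import Counter

def same_calls_any_order(actual: list[str], expected: list[str]) -> bool:
    """True when both runs made the same tool calls, regardless of order."""
    return Counter(actual) == Counter(expected)

run_1 = ["get_customer", "get_order"]   # hypothetical tool names
run_2 = ["get_order", "get_customer"]
```

Where order does matter (for example, verifying a charge before refunding it), that ordering constraint is better asserted separately than by pinning the whole sequence.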
Hallucination and false confidence
The agent might hallucinate a tool that does not exist ("I will call the discount-apply endpoint" when no such endpoint exists). It might misinterpret user intent ("They want to cancel" when they asked "How do I cancel?"). Or it might confidently choose the wrong tool for a goal.
Expensive operations
An agent with API access might call expensive endpoints repeatedly. For example, 10 LLM inference calls at $0.10 each cost $1.00 for a single goal. A runaway agent could cost thousands in minutes. Testing must include cost budgeting and rate limiting.
State and multi-step workflows
Agents often require multiple tool calls to complete a goal. If step 2 fails but the agent does not notice and continues, the final outcome is wrong. Testing must verify not just the final state, but the sequence of actions and error recovery.
Audit and compliance requirements
Every action the agent takes must be logged and auditable. If an agent processes a refund, there must be a permanent record of who requested it, what the agent decided, and what action it took. This adds complexity to testing — you are testing not just functionality, but auditability.
6 Agent Behavior Testing
Happy path: goal achievement
Does the agent achieve the goal correctly? Test case: "Refund customer order 12345 because they requested it." Expected: agent calls payment API, processes refund, logs action, notifies customer. Run the test 10 times — does it succeed consistently?
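Because a single passing run proves little for a probabilistic system, the repeated-run check can be expressed as a success rate. This sketch assumes the agent run can be wrapped in a function returning True or False; `scripted_agent` is a hypothetical stand-in that fails on every fifth attempt.

```python
from typing import Callable

def success_rate(run_once: Callable[[], bool], runs: int = 10) -> float:
    """Run the agent repeatedly and return the fraction of successful runs."""
    return sum(1 for _ in range(runs) if run_once()) / runs

# Hypothetical stand-in for a real agent run: fails on every fifth attempt.
attempts = {"count": 0}

def scripted_agent() -> bool:
    attempts["count"] += 1
    return attempts["count"] % 5 != 0

rate = success_rate(scripted_agent, runs=10)  # 8 of 10 runs succeed
```

A test would then compare `rate` against a threshold the team has agreed on, rather than demanding that every run pass.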
Error recovery
What if a tool fails? Test case: "Refund the customer." Simulate the payment API returning 503 (service unavailable). Does the agent retry? Does it escalate to a human? Or does it give up silently? Expected: agent should retry with backoff, log each failure, and escalate if retries fail.
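The expected behaviour (retry with backoff, log, then escalate) might look like the following. `ServiceUnavailable` and the log format are assumptions made for illustration; a real payment client would raise its own error type.

```python
import time
from typing import Callable, Optional

class ServiceUnavailable(Exception):
    """Hypothetical error a payment client might raise on HTTP 503."""

def call_with_retry(tool: Callable[[], str], retries: int = 3,
                    base_delay: float = 0.5,
                    log: Optional[list[str]] = None) -> str:
    log = log if log is not None else []
    for attempt in range(retries):
        try:
            return tool()
        except ServiceUnavailable:
            log.append(f"attempt {attempt + 1} failed")
            if attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    log.append("escalated to human")   # never give up silently
    return "escalated"

def always_503() -> str:
    """Simulated payment API that is always unavailable."""
    raise ServiceUnavailable()
```

In a test, passing `base_delay=0.0` keeps the run fast while still exercising the retry and escalation path.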
Boundary conditions
When should the agent refuse a goal? Test case: "Refund $10,000." If the customer only spent $50, this is suspicious. Expected: agent should either refuse, ask for human approval, or escalate. It should not blindly process an unreasonable request.
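A boundary check of this kind is usually enforced outside the model, so it holds even when the agent's reasoning is wrong. The rule below (escalate when the refund exceeds what the customer has paid) is an illustrative assumption, not a recommended policy.

```python
def review_refund(requested: float, total_paid: float) -> str:
    """Decide whether a refund may proceed. The rules are illustrative only."""
    if requested <= 0:
        return "refuse"
    if requested > total_paid:
        return "escalate"   # more than the customer ever paid: needs a human
    return "allow"
```

For the test case above, `review_refund(10_000, 50)` returns `"escalate"`.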
Tool correctness
Is the agent calling the right tool with the right parameters? Test case: "Send an email to customer@example.com saying 'Your order is ready.'" Expected: agent calls email API with correct recipient, subject, and body. Audit log shows exactly what was sent.
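Tool correctness can be checked by replacing the real tool with a recording fake and asserting on the call it received. The sketch uses `unittest.mock.Mock` from the Python standard library; `notify_order_ready`, the `send` method, and the subject line are hypothetical.

```python
from unittest.mock import Mock

email_api = Mock()  # records calls instead of sending real email

def notify_order_ready(api, recipient: str) -> None:
    """Hypothetical agent action under test."""
    api.send(to=recipient, subject="Order update", body="Your order is ready.")

notify_order_ready(email_api, "customer@example.com")

# Fails with an AssertionError if the call count or any parameter differs.
email_api.send.assert_called_once_with(
    to="customer@example.com",
    subject="Order update",
    body="Your order is ready.",
)
```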
Idempotency
If the agent runs the same goal twice, does it process the action twice? Test case: "Refund order 12345." Run it twice. Expected: refund is processed once, not twice (idempotency key prevents duplication).
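The idempotency-key idea can be illustrated with an in-memory stub. `PaymentStub` and the key format are assumptions; a real payment provider defines its own idempotency mechanism.

```python
class PaymentStub:
    """Hypothetical in-memory payment service that honours idempotency keys."""

    def __init__(self) -> None:
        self.processed: dict[str, float] = {}

    def refund(self, idempotency_key: str, amount: float) -> str:
        if idempotency_key in self.processed:
            return "duplicate_ignored"       # same key seen before: do nothing
        self.processed[idempotency_key] = amount
        return "refunded"

payments = PaymentStub()
first = payments.refund("refund-order-12345", 50.0)
second = payments.refund("refund-order-12345", 50.0)  # retry of the same goal
```

The test then asserts that the total refunded is $50, not $100.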
7 Hallucination Testing
Test 1: Non-existent tool detection
Scenario: "Apply the premium_discount endpoint to the customer's account." There is no such endpoint. Expected: agent should recognise the tool does not exist, not hallucinate its existence, and either ask for clarification or refuse the goal.
Input: Goal requests a tool that is not defined.
Expected behavior: Agent logs "Tool not found" and either refuses the goal or asks for clarification.
Failure mode: Agent invents parameters and attempts the call, which fails silently.
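A dispatcher that checks a tool registry before every call makes this failure loud instead of silent. The registry contents and log format below are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical registry: the only tools the agent is allowed to call.
TOOL_REGISTRY: dict[str, Callable[[str], str]] = {
    "refund": lambda order_id: f"refunded {order_id}",
}

def dispatch(tool_name: str, argument: str, log: list[str]) -> Optional[str]:
    """Call a registered tool, or log and return None if it does not exist."""
    if tool_name not in TOOL_REGISTRY:
        log.append(f"Tool not found: {tool_name}")
        return None   # caller should refuse the goal or ask for clarification
    return TOOL_REGISTRY[tool_name](argument)
```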
Test 2: Misinterpretation of intent
Scenario: Customer email: "I want to know how to cancel my subscription." The agent interprets "cancel" as permission to actually cancel. Expected: agent should clarify — "Would you like to cancel your subscription?" with explicit yes/no before taking action.
Test 3: Hallucinated facts in reasoning
Scenario: The agent reasons, "The customer has been with us for 10 years, so I will apply an automatic loyalty discount." The tenure is correct according to the database, but the discount the agent applies does not exist in the discount API. Expected: agent should verify that the discount is available before using it.
Test 4: False output assertion
Scenario: Agent calls refund API (successful), then reports "Refund failed" because it misread the response. Expected: agent should correctly interpret API responses and not contradict its own actions.
8 Tool Integration Testing
Tool availability and versioning
Does the agent have access to the correct version of each tool? If an API is deprecated and replaced, does the agent know? Test: Create a tool registry and verify the agent is using current versions.
Parameter validation
Test case: Agent tries to refund with negative amount (-$50). Expected: tool rejects invalid parameters and agent catches the error.
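Both halves of this expectation (the tool rejects, the agent catches) can be covered in a few lines. `refund_tool` and `agent_refund` are hypothetical names used only for this sketch.

```python
def refund_tool(amount: float) -> str:
    """Hypothetical tool that validates its own input."""
    if amount <= 0:
        raise ValueError(f"refund amount must be positive, got {amount}")
    return f"refunded {amount:.2f}"

def agent_refund(amount: float) -> str:
    """Agent-side wrapper: surfaces the rejection instead of crashing or ignoring it."""
    try:
        return refund_tool(amount)
    except ValueError as err:
        return f"rejected: {err}"
```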
Side effects and state changes
Tools often modify state. If an agent calls "charge customer" and then calls "cancel transaction," the final state must be consistent. Test: Verify state transitions are logical and audit logs show the full sequence.
Cascading failures
Test case: Agent calls tool A, which succeeds but returns unexpected data. Tool B is called with that data and fails. Expected: agent should detect the failure chain and escalate, not continue with broken data.
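A common defence is to validate each tool's output before passing it downstream. In this sketch `tool_a` and `tool_b` are hypothetical, and the missing `order_id` field stands in for "unexpected data".

```python
def tool_a() -> dict:
    """Hypothetical lookup that 'succeeds' but omits the expected field."""
    return {"customer": "c-1"}          # missing "order_id"

def tool_b(order_id: str) -> str:
    return f"shipped {order_id}"

def run_chain() -> str:
    data = tool_a()
    if "order_id" not in data:          # validate before passing data downstream
        return "escalated: tool_a returned no order_id"
    return tool_b(data["order_id"])
```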
9 Safety & Governance
Cost control and rate limiting
Test: Verify that the agent has cost limits per goal, per day, and per user. If a goal costs more than $1, does it require approval? If the agent has already spent $5,000 today, does it refuse new goals? Implementation: Add a cost-tracking middleware that blocks calls exceeding limits.
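A cost-tracking middleware of the kind described could be sketched as follows. The class name, the limits, and the use of integer cents (to avoid floating-point rounding at the boundary) are all assumptions; per-user limits are omitted for brevity.

```python
class BudgetExceeded(Exception):
    """Raised before a call is made if it would pass a limit."""

class CostTracker:
    def __init__(self, per_goal_limit_cents: int, daily_limit_cents: int) -> None:
        self.per_goal_limit_cents = per_goal_limit_cents
        self.daily_limit_cents = daily_limit_cents
        self.spent_today_cents = 0

    def charge(self, goal_cost_cents: int, call_cost_cents: int) -> int:
        """Record a call's cost and return the goal's new running total."""
        if goal_cost_cents + call_cost_cents > self.per_goal_limit_cents:
            raise BudgetExceeded("per-goal limit")
        if self.spent_today_cents + call_cost_cents > self.daily_limit_cents:
            raise BudgetExceeded("daily limit")
        self.spent_today_cents += call_cost_cents
        return goal_cost_cents + call_cost_cents
```

With a $1.00 per-goal limit, ten $0.10 calls are allowed and the eleventh raises `BudgetExceeded`, which the agent runtime can turn into an approval request.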
Audit trails and logging
Test: For every tool call, the audit log must record: (1) who requested the goal, (2) what goal was requested, (3) what tool was called, (4) what parameters were passed, (5) what result was returned, (6) timestamp. Compliance: logs must be immutable and retained for legal/audit purposes.
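The six fields listed above map naturally onto a record type. The sketch below uses a frozen dataclass so entries cannot be modified in memory after creation; this is only an illustration, since real immutability and retention depend on the storage layer. Field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEntry:
    requested_by: str    # (1) who requested the goal
    goal: str            # (2) what goal was requested
    tool: str            # (3) what tool was called
    parameters: tuple    # (4) parameters, stored as sorted (name, value) pairs
    result: str          # (5) what result was returned
    timestamp: str       # (6) UTC timestamp, ISO 8601

def record(log: list, requested_by: str, goal: str, tool: str,
           parameters: dict, result: str) -> AuditEntry:
    entry = AuditEntry(requested_by, goal, tool,
                       tuple(sorted(parameters.items())), result,
                       datetime.now(timezone.utc).isoformat())
    log.append(entry)    # append-only by convention: never edit or delete entries
    return entry
```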
Approval workflows for high-stakes actions
Refunds above a threshold, account deletions, and data exports should require human approval before the agent takes action. Test: Verify that the agent requests approval and waits before proceeding.
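An approval gate can be as simple as refusing to execute above a threshold unless an approver is recorded. The $100 threshold and the function name are illustrative assumptions.

```python
from typing import Optional

APPROVAL_THRESHOLD = 100.0  # hypothetical limit, in dollars

def request_refund(amount: float, approved_by: Optional[str] = None) -> str:
    """Execute small refunds directly; hold large ones until a human approves."""
    if amount > APPROVAL_THRESHOLD and approved_by is None:
        return "pending_approval"   # agent must wait, not proceed
    return "refunded"
```

The test verifies both directions: a large refund without an approver stays pending, and the same refund with an approver goes through.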
Rollback and reversal
If an agent makes a mistake (incorrect refund, wrong data), can it be undone? Test: Verify that critical actions are reversible and that reversal is logged.
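Reversal is easiest to audit when it adds a compensating entry rather than deleting the original. The ledger below is a hypothetical sketch that assumes one refund per transaction ID.

```python
class RefundLedger:
    """Hypothetical append-only ledger: reversals add entries, nothing is deleted."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def refund(self, txn_id: str, amount: float) -> None:
        self.entries.append({"txn_id": txn_id, "type": "refund", "amount": amount})

    def reverse(self, txn_id: str, authorised_by: str) -> bool:
        """Reverse a refund once. Returns False if unknown or already reversed."""
        matching = [e for e in self.entries if e["txn_id"] == txn_id]
        if len(matching) != 1:       # 0 = unknown, 2 = already reversed
            return False
        self.entries.append({"txn_id": txn_id, "type": "reversal",
                             "amount": -matching[0]["amount"],
                             "authorised_by": authorised_by})  # reversal is logged
        return True

    def net_refunded(self) -> float:
        return sum(e["amount"] for e in self.entries)
```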
Human-in-the-loop for ambiguity
If the agent is uncertain about intent (confidence < 70%), it should ask a human before taking action. Test: Verify that unclear goals are escalated, not guessed at.
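The 70% rule can be written as a routing function. This assumes the agent exposes a usable confidence score for its intent classification, which many systems do not provide out of the box; the intent labels are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.70  # from the rule above

def route(intent: str, confidence: float) -> str:
    """Act only when confidence meets the threshold; otherwise ask a human."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "ask_human"
    return f"act:{intent}"
```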
10 Production Monitoring
Goal success rate
Track: percentage of goals achieved correctly, percentage escalated to human, percentage failed. If success rate drops below baseline, investigate.
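The three percentages can be computed from a list of per-goal outcomes. The outcome labels and the sample data below are invented for illustration.

```python
from collections import Counter

def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """Fraction of goals that were achieved, escalated, or failed."""
    counts = Counter(outcomes)
    return {k: counts[k] / len(outcomes)
            for k in ("achieved", "escalated", "failed")}

# Invented sample: 8 achieved, 1 escalated, 1 failed.
sample = ["achieved"] * 8 + ["escalated"] + ["failed"]
rates = outcome_rates(sample)
```

An alert would then fire when `rates["achieved"]` falls below whatever baseline the team has measured for its own agent.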
Error patterns
Analyse errors: are certain types of goals failing more often (e.g., refunds > $100)? Are certain users experiencing higher failure rates (a possible security probe)? Error patterns show where the agent needs retraining or adjusted boundaries.
Cost tracking
Monitor LLM API costs per goal. If average cost per refund jumps from $0.05 to $0.50, the agent is reasoning longer or making more calls — investigate.
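A simple drift check compares the recent average cost against a baseline. The 2x alert factor is an arbitrary assumption; the cost figures echo the example above.

```python
from statistics import mean

def cost_alert(recent_costs: list[float], baseline: float,
               factor: float = 2.0) -> bool:
    """True when the recent average cost exceeds the baseline by the given factor."""
    return mean(recent_costs) > baseline * factor
```

With a $0.05 baseline, a recent average near $0.50 triggers the alert, while normal variation around $0.05 does not.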
Tool call frequency
Which tools does the agent call most? Are certain tools being called unexpectedly often? High frequency of a destructive tool (like refund) might indicate a bug.
User feedback and appeals
Collect feedback: "The agent's decision was wrong" or "I did not want them to do that." Use this to identify hallucination patterns and retrain the agent.
11 Common Mistakes
Mistake 1: No cost limits
Why it happens: Teams focus on functionality and defer cost control to "later."
The fix: Implement cost limits before launch. Cap spending per goal, per user, per day. Make the agent refuse expensive actions unless explicitly approved. A runaway agent can cost thousands in hours.
Mistake 2: No audit trail
Why it happens: Auditing feels like overhead. The team thinks "we can check the database if something goes wrong."
The fix: Build audit logging from day one. Every tool call must be logged immutably. You need this for compliance, debugging, and accountability. It is not optional.
Mistake 3: Trusting the agent without verification
Why it happens: The agent is confident and articulate, so teams assume it is correct.
The fix: Implement verification for critical actions. If the agent refunds $100, it should verify the amount with the database before claiming success. Do not trust confidence.
Mistake 4: No human approval for high-stakes actions
Why it happens: Automation is the goal, so teams skip the approval step.
The fix: Some actions (refunds, deletions, exports) require human sign-off. The agent can draft the action, but a human must approve. This slows things down but prevents disasters.
Mistake 5: No rollback or reversal mechanism
Why it happens: Teams assume the agent will never make mistakes.
The fix: Design every critical action to be reversible. Store transaction IDs and implement a reversal tool. If the agent makes a mistake, it can (under human authority) undo it.
12 Self-Check
Q1: What is the difference between an agent and a chatbot?
A chatbot responds to queries with information. An agent takes action: it calls APIs, modifies data, makes decisions, and interacts with external systems. A chatbot tells you "Your balance is $100." An agent can refund your account, update your status, or cancel your subscription. Testing agents requires verification of actions and consequences, not just information accuracy.
Q2: How do you test an agent for hallucination?
Test for hallucinated tools (agent invents tools that do not exist), hallucinated facts (agent claims data that is false), and hallucinated intent (agent misinterprets a customer request as permission to take action). Use test cases with ambiguous or novel inputs and verify the agent either clarifies or refuses, rather than confidently hallucinating.
Q3: Why is cost control essential for AI agents?
Agents with API access can make expensive calls repeatedly. A runaway agent making 100 LLM inferences costing $0.10 each costs $10. Scale this to many users and requests, and an agent can spend thousands of dollars per day. Cost control (rate limiting, per-goal budgets) is essential to prevent financial disasters. Testing must verify that cost limits work and escalation triggers when limits are near.
Q4: What should an audit trail for an agent capture?
An audit trail must capture: (1) user/request origin, (2) goal request, (3) agent reasoning/decision, (4) tool calls with parameters, (5) tool responses, (6) final action taken, (7) timestamp. This enables post-incident analysis, compliance verification, and user accountability. The trail must be immutable and retained for legal purposes.
Q5: How do you test an agent for error recovery?
Simulate tool failures: API timeouts, 503 errors, invalid responses, missing data. Verify the agent retries with backoff, detects cascading failures, and escalates to a human if the error cannot be resolved. The agent should not silently fail or continue with corrupted state. Every error should be logged and, where severity warrants, trigger an alert.
13 Interview Prep
Real questions from test lead interviews with AI agent focus.
"Have you tested an AI agent? What was your approach?"
Yes, I tested a customer support agent that could refund orders and update statuses. My approach had three parts: (1) Functional testing — verify the agent achieves goals correctly, handles errors, and recovers from failures. (2) Safety testing — verify cost limits, approval workflows, and audit logging work correctly. (3) Hallucination testing — verify the agent does not misinterpret customer intent or invent tools. I ran the agent 50+ times on diverse goals and monitored for cost, latency, and error patterns. I also tested error scenarios (API failures, timeouts) to verify it did not take unintended actions.
"What is the most important thing to test in an AI agent?"
Preventing unintended actions. An agent that misinterprets a customer's email and processes a refund is worse than an agent that does nothing. So I test: (1) Does the agent have guardrails (cost limits, approval thresholds)? (2) Does it log everything (audit trail)? (3) Can mistakes be undone (reversibility)? (4) Does it escalate when uncertain (human-in-the-loop)? These safety features are more important than speed or polish.
"How would you design a test plan for an agent that makes financial transactions?"
I would test: (1) Goal achievement: the agent correctly interprets goals and executes transactions. (2) Boundary conditions: does it refuse invalid amounts (negative, zero, > account balance)? (3) Idempotency: if a refund request is submitted twice (due to a retry), is the customer refunded only once? (4) Cost control: every transaction is logged and costs are tracked. (5) Approval workflows: transactions above a threshold require human approval. (6) Error recovery: if a payment API is down, does the agent retry or escalate? (7) Reversal: can an incorrect transaction be undone? I would also do production monitoring: track transaction success rate, error patterns, and user appeals to catch drift.