Human-in-the-Loop Sign-Off
An AI can recommend a decision in milliseconds. It cannot be held accountable for one. When an AI decision shapes a person’s benefit, visa, or claim, someone has to be answerable for it — and “the model decided” is not an answer a tribunal, an Ombudsman, or the Privacy Act will accept.
1 The Hook
A government agency rolled out an assistant to help case managers decide a benefit-related entitlement. The design was reasonable on paper: the AI reads the file, scores the case, and recommends grant or decline; a human signs off before anything is sent. Everyone called it “human-in-the-loop” and felt covered. The model was accurate in testing. Sign-off was a tick-box on the screen.
Six months in, a declined applicant complained to the Ombudsman. The agency was asked a simple question: who made this decision, and on what basis? They went looking for the answer and could not give one. The case manager had clicked “approve recommendation” in two seconds, like the hundred before it. There was no record of why the model recommended decline — no inputs captured, no confidence score stored, no note from the reviewer. The “human in the loop” had become a human rubber stamp: present in the workflow, accountable for nothing, reviewing nothing.
Worse, the system escalated nothing. A high-confidence grant and a borderline, low-confidence decline went to the same one-click screen with no distinction. The cases that most needed a human had been treated exactly like the ones that did not. The loop existed; the judgement did not.
This is the failure that defines sign-off governance. A human in the workflow is not the same as a human accountable for the decision. Human-in-the-loop only means something if the human can actually see what they are approving, the riskiest cases are routed to them, and there is a record afterwards of who decided what, and why. Testing that is your job.
2 The Rule
A human in the workflow is not the same as a human accountable for the decision. Sign-off is real only when the reviewer can see the basis for the AI’s recommendation, the high-risk and low-confidence cases are routed to a human by design, and every decision leaves an audit trail of who approved what, on what evidence, and why. Anything less is a rubber stamp, and it will not meet the accountability the Privacy Act 2020 expects of the agency.
3 The Analogy
A co-pilot signing the captain’s flight log without reading the instruments.
In a cockpit, the autopilot flies most of the route, but a human is accountable for the flight. That accountability is real only because the pilot can see the instruments, the system alerts them when something is outside normal limits, and there is a flight recorder afterwards. Take those away — blank instruments, no alerts, no recorder — and the pilot’s signature on the log is meaningless. They were present, not in command.
An AI sign-off is the same. The reviewer must see the basis for the recommendation (the instruments), the system must escalate the risky and uncertain cases to them (the alerts), and every decision must be recorded (the flight recorder). Remove any one and “human-in-the-loop” becomes a co-pilot signing a log they never read.
4 When a Human Must Approve
Not every AI output needs a human signature, and pretending otherwise is its own failure — it buries reviewers under low-risk approvals until they stop reading any of them. The first governance question is which decisions require sign-off, and the answer is risk-based: the higher the consequence to a person, the stronger the case for a human in command.
A workable tiering for NZ public-sector and regulated work:
- Mandatory human approval: decisions that materially affect a person’s rights, money, or status — declining a benefit, refusing a visa, flagging someone for investigation. The AI may recommend; a human must decide. These are also where the Privacy Act’s expectations bite hardest.
- Human approval on exception: routine decisions that are auto-handled when the system is confident, but routed to a person when confidence is low, the case is unusual, or a guardrail trips. Most volume flows through; the hard cases surface.
- Fully automated, monitored: low-consequence, high-volume actions — categorising a message, drafting a non-binding reply — where a wrong answer is cheap to correct. No per-case sign-off, but sampled review and monitoring still apply.
The tester’s job is to confirm the system actually enforces this tiering — that a mandatory-approval decision cannot be auto-actioned by any path, and that the “exception” route triggers on the conditions it is supposed to. A governance policy that the code does not enforce is a policy that does not exist.
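One way to make that enforcement testable is to express the tiering as data and put a single guard in front of every auto-action path. The sketch below is a minimal illustration, not any agency's implementation: the `Tier` enum, the `TIER_BY_DECISION` mapping, the decision-type strings, and `may_auto_action` are all hypothetical names, and sending unknown decision types to mandatory approval is an assumption about what "fail closed" should mean here.

```python
from enum import Enum


class Tier(Enum):
    MANDATORY = "mandatory human approval"
    ON_EXCEPTION = "human approval on exception"
    AUTOMATED = "fully automated, monitored"


# Hypothetical mapping from decision type to tier. In practice this comes
# from governance policy, not from a tester's script.
TIER_BY_DECISION = {
    "decline_benefit": Tier.MANDATORY,
    "refuse_visa": Tier.MANDATORY,
    "flag_for_investigation": Tier.MANDATORY,
    "routine_grant": Tier.ON_EXCEPTION,
    "categorise_message": Tier.AUTOMATED,
}


def tier_for(decision_type: str) -> Tier:
    # Assumption: an unrecognised decision type gets the strictest tier.
    return TIER_BY_DECISION.get(decision_type, Tier.MANDATORY)


def may_auto_action(decision_type: str, exception_raised: bool) -> bool:
    """Return True only if the system may act without a human decision."""
    tier = tier_for(decision_type)
    if tier is Tier.MANDATORY:
        return False  # no path may bypass the human
    if tier is Tier.ON_EXCEPTION:
        return not exception_raised  # auto-handle only when nothing tripped
    return True  # automated tier: no per-case sign-off, monitoring applies
```

A tester would assert that `may_auto_action("decline_benefit", exception_raised=False)` is `False`, then look for any code path that actions a decision without passing through a guard like this one.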
5 Confidence Thresholds and Escalation
The exception tier needs a trigger, and the usual one is a confidence threshold: the system routes a case to a human when the model’s confidence falls below a set level, or when the consequence is high regardless of confidence. Below the line, a person decides; above it, the system may proceed. Setting that line is a governance decision, not a technical one — it trades reviewer workload against the risk of an unreviewed wrong decision.
Two failure modes sit on either side of the threshold, and a tester must probe both. Set it too low and almost nothing escalates — the human loop is decorative, and risky cases sail through unreviewed. Set it too high and everything escalates — reviewers are swamped, fall into rubber-stamping to clear the queue, and the threshold protects no one. The right line routes the genuinely uncertain and high-stakes cases to a human while letting the clearly-safe ones flow.
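Both failure modes show up as an escalation rate, which a tester can measure directly. The following is a small sketch under stated assumptions: `escalation_rate` is a hypothetical helper, the rule is "escalate when confidence is below the threshold", and the ten confidence scores are invented for illustration only.

```python
def escalation_rate(confidences, threshold):
    """Fraction of cases routed to a human when confidence < threshold."""
    if not confidences:
        return 0.0
    escalated = sum(1 for c in confidences if c < threshold)
    return escalated / len(confidences)


# Invented confidence scores for ten cases, purely for illustration.
sample = [0.55, 0.62, 0.71, 0.78, 0.83, 0.88, 0.91, 0.94, 0.97, 0.99]

for threshold in (0.50, 0.75, 0.995):
    rate = escalation_rate(sample, threshold)
    print(f"threshold {threshold}: {rate:.0%} of cases escalated")
```

With these made-up scores, a 0.50 threshold escalates nothing, 0.75 escalates three cases in ten, and 0.995 escalates all ten. Running the same measurement over real historical cases would show which side of the line a proposed threshold sits on.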
There is a subtler trap: a model’s confidence is not the same as its correctness. Generative systems are routinely confident and wrong — the whole RAG lesson turned on a confident, fluent, invented answer. So a confidence threshold alone is not enough. High-consequence decisions must escalate on consequence, not just on low confidence, precisely because the dangerous case is the one the model is sure about and wrong.
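The difference between a confidence-only trigger and a consequence-aware one is easy to see in code. This is a hedged sketch, not a reference design: `Case`, `escalation_reasons`, and `needs_human` are hypothetical names, and the 0.75 default threshold is an illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Case:
    confidence: float            # model confidence in its recommendation, 0..1
    high_consequence: bool       # materially affects rights, money, or status
    unusual: bool = False        # edge-case flag
    guardrail_tripped: bool = False


def escalation_reasons(case: Case, threshold: float = 0.75) -> list[str]:
    """List every reason this case must go to a human (empty means none)."""
    reasons = []
    if case.high_consequence:
        # Escalate on consequence regardless of how sure the model is.
        reasons.append("high consequence")
    if case.confidence < threshold:
        reasons.append("low confidence")
    if case.unusual:
        reasons.append("unusual case")
    if case.guardrail_tripped:
        reasons.append("guardrail tripped")
    return reasons


def needs_human(case: Case, threshold: float = 0.75) -> bool:
    return bool(escalation_reasons(case, threshold))
```

The case that matters is `Case(confidence=0.92, high_consequence=True)`: a confidence-only rule would wave it through, while this sketch escalates it with the reason "high consequence".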
6 Four-Eyes Approval and Reviewer Sampling
For the highest-consequence decisions, one reviewer may not be enough. Four-eyes approval requires two independent people to agree before a decision is actioned — a long-standing control in banking and safety-critical work, applied here to the AI’s most serious recommendations. It guards against a single rushed click and against one reviewer’s blind spot. The cost is throughput, so it is reserved for the cases where a wrong decision is hardest to undo.
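A four-eyes rule is only as strong as its independence check: two clicks from the same person are not two reviewers. A minimal sketch of that check follows, assuming approvals arrive as `(reviewer_id, decision)` pairs; the function name and data shape are hypothetical.

```python
from collections import defaultdict


def four_eyes_decision(approvals):
    """Return the agreed decision, or None if four-eyes is not yet satisfied.

    `approvals` is a list of (reviewer_id, decision) tuples. The rule is
    satisfied only when two distinct reviewers record the same decision
    and no other decision has also reached two reviewers.
    """
    reviewers_by_decision = defaultdict(set)
    for reviewer_id, decision in approvals:
        reviewers_by_decision[decision].add(reviewer_id)

    agreed = [
        decision
        for decision, reviewers in reviewers_by_decision.items()
        if len(reviewers) >= 2
    ]
    return agreed[0] if len(agreed) == 1 else None
```

Under this sketch, two approvals from the same reviewer ID return `None`, as do two reviewers who disagree; the decision is actioned only once a second, different reviewer confirms it.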
For the automated and exception tiers, you cannot review every case — that would defeat the automation — so you use reviewer sampling: a defined percentage of decisions are pulled for human review after the fact, to monitor that the system is still behaving. The sampling rate is a governance setting that should rise with risk and with any sign of drift: a 2% sample for a stable low-risk flow, a much higher rate for a new or high-consequence system, and a spike the moment monitoring shows the error rate climbing.
Crucially, sampling is not a substitute for sign-off on mandatory-approval decisions — those still need a human on every case. Sampling governs the cases you chose not to sign off individually. Confusing the two — using a 2% sample to “cover” decisions that legally require per-case approval — is one of the most common and most dangerous governance errors a tester can catch.
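The distinction between sampling and per-case sign-off can be written down as a rule a tester can probe. The sketch below is illustrative only: the function names and tier strings are hypothetical, and apart from the 2% baseline mentioned above, the rates are placeholder assumptions rather than recommendations. Hashing the decision ID is one way to make selection repeatable, so a reviewer cannot influence which cases are pulled.

```python
import hashlib


def sampling_rate(new_or_high_consequence: bool, error_rate_climbing: bool) -> float:
    """Illustrative rates: 2% is the stable baseline; the rest are assumptions."""
    if error_rate_climbing:
        return 0.50  # spike while monitoring shows drift
    if new_or_high_consequence:
        return 0.20
    return 0.02


def selected_for_review(decision_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of decisions by hashing the ID."""
    digest = hashlib.sha256(decision_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def review_required(tier: str, decision_id: str, rate: float) -> bool:
    # Mandatory-approval decisions are never sampled: every case gets a human.
    if tier == "mandatory":
        return True
    return selected_for_review(decision_id, rate)
```

The check worth automating is that `review_required("mandatory", ...)` returns `True` for every decision ID, whatever the sampling rate is set to.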
7 Audit Trails and the Privacy Act 2020
Everything above is unprovable without a record. An audit trail captures, for every AI-assisted decision: the inputs the model saw, the recommendation and its confidence, who reviewed it, what they decided, and when. It is what lets the agency answer the Ombudsman’s question — who decided this, and on what basis? — instead of going pale and silent like the agency in the hook.
In New Zealand this is not optional good practice; it is anchored in the Privacy Act 2020. The Act gives individuals the right to access personal information held about them (information privacy principle 6), expects agencies to check that personal information is accurate before they use it (principle 8), and makes the agency, not the model and not the vendor, responsible for how that information is handled. A person affected by an automated or AI-assisted decision can ask what information was used and how the decision was reached, and the agency must be able to answer. No audit trail, no answer, breach of accountability.
A single audit record for one escalated decision might read:
Inputs seen: [redacted PII refs], income field, residency flag
AI recommendation: Decline — confidence 0.62
Escalation: Routed to human (confidence below 0.75 threshold)
Reviewer: Case manager ID CM-204
Human decision: GRANT — overrode AI; note: “income evidence supports eligibility”
Four-eyes: Second approver ID CM-118 confirmed
Timestamp: 2026-08-14T09:41:22+12:00
Note what that record makes possible: the human overrode the AI and said why. A good audit trail does not just log that a human clicked — it captures their reasoning, including when they disagreed with the model. That override note is the single strongest piece of evidence that sign-off was real and not a rubber stamp.
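The same record can be modelled as a typed structure so that gaps are caught by a check rather than by an Ombudsman. This is a sketch under assumptions: the `AuditRecord` class, its field names, and the `audit_gaps` rules are hypothetical, and a real schema would follow the agency's own records policy. The example values are copied from the record above.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)  # frozen: a written audit record should not be mutated
class AuditRecord:
    inputs_seen: tuple[str, ...]
    ai_recommendation: str
    confidence: float
    escalation_reason: str
    reviewer_id: str
    human_decision: str
    reviewer_note: str
    timestamp: datetime
    second_approver_id: Optional[str] = None

    @property
    def overrode_ai(self) -> bool:
        return self.human_decision.lower() != self.ai_recommendation.lower()


def audit_gaps(record: AuditRecord) -> list[str]:
    """Return the reasons this record would fail an accountability check."""
    gaps = []
    if not record.inputs_seen:
        gaps.append("no inputs captured")
    if not record.reviewer_id.strip():
        gaps.append("no named reviewer")
    if record.overrode_ai and not record.reviewer_note.strip():
        gaps.append("override without a reason")
    if record.timestamp.tzinfo is None:
        gaps.append("timestamp has no time zone")
    return gaps


record = AuditRecord(
    inputs_seen=("[redacted PII refs]", "income field", "residency flag"),
    ai_recommendation="Decline",
    confidence=0.62,
    escalation_reason="confidence below 0.75 threshold",
    reviewer_id="CM-204",
    human_decision="Grant",
    reviewer_note="income evidence supports eligibility",
    timestamp=datetime.fromisoformat("2026-08-14T09:41:22+12:00"),
    second_approver_id="CM-118",
)

# The same override with the reviewer's reasoning stripped out.
silent_override = replace(record, reviewer_note="")
```

Here `audit_gaps(record)` returns an empty list, while `audit_gaps(silent_override)` reports "override without a reason": the click was logged, but the judgement was not.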
8 RACI for AI Decisions
The last governance question is the one the hook agency could not answer: who is accountable? A RACI — Responsible, Accountable, Consulted, Informed — makes it explicit for each type of AI decision, and the non-negotiable rule is that the Accountable role is always a named human, never the AI. The model can be the tool that produces a recommendation; it can never be the accountable party, because it cannot answer to a tribunal, hold a delegation, or be held responsible.
For an AI-assisted benefit decision, a clear RACI reads: the case manager is Responsible for making the decision; a named manager or the delegated decision-maker is Accountable for it; the AI system is a tool used by the Responsible person, not a role-holder; policy and privacy leads are Consulted; the applicant and audit function are Informed. The point is that every AI decision traces to a named human who owns the outcome.
The tester’s contribution is to confirm the RACI is real, not paper. Does the audit trail actually capture the named Responsible reviewer for each decision? Is there genuinely an Accountable human for this decision type, or does the chain quietly dead-end at “the system”? An AI governance review that cannot name the accountable human for a high-consequence decision has found the most important defect on the page.
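Part of that check can be mechanised. The sketch below treats a RACI as data and flags a Responsible or Accountable role that dead-ends at a tool. Everything in it is a hypothetical illustration: the `Raci` class, the `NOT_A_PERSON` labels, and the ID `DM-017` are invented, and a string check can only catch the obvious failures; it cannot prove that a named person actually holds the delegation.

```python
from dataclasses import dataclass, field

# Labels that signal the chain ends at a tool rather than a person.
# An illustrative assumption, not an exhaustive list.
NOT_A_PERSON = {
    "", "ai", "the ai", "model", "the model", "system", "the system", "vendor",
}


@dataclass
class Raci:
    decision_type: str
    responsible: str
    accountable: str
    consulted: list[str] = field(default_factory=list)
    informed: list[str] = field(default_factory=list)


def raci_defects(raci: Raci) -> list[str]:
    """Flag Responsible or Accountable roles not held by a named human."""
    defects = []
    for role in ("responsible", "accountable"):
        holder = getattr(raci, role).strip().lower()
        if holder in NOT_A_PERSON:
            defects.append(f"{role} role is not a named human")
    return defects


good = Raci(
    decision_type="AI-assisted benefit decision",
    responsible="CM-204",   # case manager
    accountable="DM-017",   # hypothetical delegated decision-maker
    consulted=["policy lead", "privacy lead"],
    informed=["applicant", "audit function"],
)
bad = Raci("AI-assisted benefit decision", "CM-204", "The system")
```

`raci_defects(good)` comes back empty; `raci_defects(bad)` reports that the accountable role is not a named human, which is the defect this section is about.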
9 Common Mistakes
🚫 Treating a human in the workflow as proof of human accountability
Why it happens: A sign-off button on the screen looks like a human is in control.
The fix: A reviewer who cannot see the basis for the recommendation, gets no escalation, and leaves no record is a rubber stamp, not a decision-maker. Test that the human can see the evidence, the risky cases are routed to them, and the decision is recorded.
🚫 Escalating only on low confidence
Why it happens: A confidence score is easy to threshold, so it becomes the only trigger.
The fix: Models are routinely confident and wrong, so a confidence-only trigger lets a high-confidence wrong decision walk straight through. High-consequence decisions must escalate on consequence, not just on low confidence.
🚫 Using a sampling rate to “cover” mandatory-approval decisions
Why it happens: Sampling feels like review, and reviewing every case is expensive.
The fix: Sampling governs the cases you chose not to sign off individually; it cannot replace per-case approval where a decision materially affects a person. Confirm mandatory-approval decisions get a human on every case, not a 2% glance.
🚫 A governance design where no named human is accountable
Why it happens: “The AI decided” quietly fills the gap where an accountable person should be.
The fix: The AI can never be the Accountable role — it cannot answer to a tribunal or hold a delegation. Every AI decision must trace to a named human who owns the outcome, and the audit trail must record them.
10 Now You Try
Three graded exercises: spot the rubber stamp, fix the escalation, build the sign-off framework. Write your own answer to each, then compare it with the model answer.
A fictional government benefit agency describes its “human-in-the-loop” design below. Identify why this is a rubber stamp, not real sign-off, name the specific governance failures, and say what the Privacy Act 2020 would expect that is missing.
Model answer:
Why it is a rubber stamp: The reviewer sees only the recommendation, not the basis for it, so they cannot actually review anything; they can only agree. No information to judge means no judgement, just a click. A person clicking Approve is presence in the workflow, not accountability for the decision.
Governance failures:
- Visibility: the case manager sees grant/decline with no inputs, no reasoning, no confidence; they are approving blind.
- Escalation: high-confidence and borderline cases go to the identical one-click screen, so the cases that most need human judgement are treated like the ones that don't. Nothing is routed by risk or confidence.
- Record: no reasoning stored, no reviewer note. There is no audit trail, so the agency cannot later say who decided what or why.
What the Privacy Act 2020 expects that's missing: An affected person can ask what information was used and how a decision affecting them was reached; the agency is accountable and must be able to answer. With no stored inputs, reasoning, or reviewer record, the agency cannot answer, which is a failure of accountability over personal information.
What must change:
- Show the reviewer the basis (inputs, recommendation, confidence).
- Route borderline/low-confidence and all high-consequence cases to genuine human decision (not a shared one-click screen).
- Capture an audit trail per decision (inputs seen, recommendation + confidence, reviewer, decision, note, timestamp), including any override and why.
A fictional immigration triage tool escalates to a human only when model confidence is below 0.50. Explain why this escalation rule is unsafe, including the confidence-is-not-correctness trap, then redesign the escalation so the right cases reach a human, and describe the test you would run to prove it.
Model answer:
Why it is unsafe: A visa refusal materially affects a person's status, so it belongs in the mandatory-approval tier: a human must decide it regardless of confidence. This rule auto-actions refusals at confidence ≥ 0.50, meaning the highest-consequence decision is made by the model alone most of the time. The 0.50 line is also so low that almost nothing escalates, so the human loop is decorative.
The confidence-is-not-correctness trap: A model's confidence is not its accuracy; generative systems are routinely confident and wrong. A refusal made at 0.92 confidence can be just as wrong as one at 0.40, but this rule waves the confident one straight through. The dangerous case is precisely the one the model is sure about and wrong, and a confidence-only trigger is blind to it.
Redesigned escalation: Tier by consequence first.
- All visa refusals (and any decision materially affecting status) → mandatory human approval, regardless of confidence; consider four-eyes for refusals.
- Within lower-consequence decisions, route to a human on low confidence OR unusual/edge-case flags OR a tripped guardrail.
- Auto-action only low-consequence, clearly-confident cases, with sampling on top.
The test: Feed the tool a high-confidence WRONG refusal case (not just a low-confidence one) and confirm it is still routed to a human and cannot be auto-actioned by any path; confirm every refusal hits mandatory approval; confirm the audit trail records the reviewer and decision. If a confident wrong refusal auto-actions, the escalation has failed.
Design a human-in-the-loop sign-off framework for a fictional ACC claim-decision assistant that recommends accept or decline. Cover: the approval tiers, the escalation triggers (beyond confidence), four-eyes and sampling, the audit-trail fields, and the RACI with a named Accountable human. Make it NZ-appropriate and Privacy-Act-aware.
Model answer:
1. Approval tiers: MANDATORY human approval for any claim decline and any decision materially affecting entitlement (the AI recommends, a human decides). ON-EXCEPTION for routine accepts, auto-handled when confident and routed to a human when confidence is low, the case is unusual, or a guardrail trips. AUTOMATED-MONITORED for low-consequence actions like categorising correspondence, with sampled review only.
2. Escalation triggers: low model confidence (below a set threshold) OR high consequence regardless of confidence (all declines escalate) OR an edge-case/unusual-pattern flag OR a tripped guardrail. Consequence-based escalation is essential because a model can be confidently wrong.
3. Four-eyes + sampling: Four-eyes (two independent approvers) for the hardest-to-undo declines. Reviewer sampling on the automated and exception tiers, e.g. ~2–5% baseline, raised for a new system and spiking automatically when monitoring shows the error rate climbing. Sampling never replaces mandatory per-case approval on declines.
4. Audit-trail fields per decision: decision ID; inputs the model saw (PII handled per Privacy Act); recommendation + confidence; escalation reason; reviewer ID(s); human decision; whether the human overrode the AI and the reason note; four-eyes confirmation where applicable; timestamp. The override-with-reason is the key evidence sign-off was real.
5. RACI: Responsible = the case manager who makes the decision; Accountable = a named delegated decision-maker / manager (never the AI); Consulted = clinical, policy, and privacy leads; Informed = the claimant and the audit/assurance function. Every decision traces to a named accountable human, and the test is: "when this is wrong, whose name is on it?"
11 Self-Check
Q1: What is the difference between a human in the workflow and a human accountable for the decision?
A human in the workflow is merely present — they click a button. A human accountable for the decision can see the basis for the recommendation, has the risky cases routed to them, and leaves a record of what they decided and why. Without visibility, escalation, and a record, the human is a rubber stamp, not a decision-maker.
Q2: Why is a low-confidence threshold not enough to decide what escalates to a human?
Because a model’s confidence is not its correctness — generative systems are routinely confident and wrong. A confidence-only trigger lets a high-confidence wrong decision walk straight through, which is the most dangerous case. High-consequence decisions must escalate on consequence, not just on low confidence.
Q3: How do four-eyes approval and reviewer sampling differ, and where does each apply?
Four-eyes approval requires two independent people to agree before a high-consequence decision is actioned — applied to the hardest-to-undo cases. Reviewer sampling pulls a defined percentage of decisions for after-the-fact review on the automated and exception tiers. Sampling monitors cases you chose not to sign off individually; it never replaces per-case approval on mandatory decisions.
Q4: What must an audit trail capture, and why does the Privacy Act 2020 make it non-optional?
It must capture the inputs the model saw, the recommendation and its confidence, who reviewed it, what they decided (including any override and why), and when. The Privacy Act 2020 lets a person ask what information was used and how a decision affecting them was reached, and holds the agency accountable — with no audit trail the agency cannot answer, which is a failure of accountability.
Q5: In a RACI for an AI decision, which role can the AI never hold, and why?
The AI can never be Accountable. The Accountable role must be a named human, because the model cannot answer to a tribunal, hold a delegation, or be held responsible for harm. The AI is a tool used by the Responsible person; every decision must trace to a named human who owns the outcome.
12 Interview Prep
Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.
“A team says their system is human-in-the-loop because a person approves every case. How do you test that claim?”
I’d test whether the human is actually accountable or just present. Three checks: can the reviewer see the basis for the recommendation — the inputs, the confidence, the reasoning — or are they approving blind? Are the high-consequence and low-confidence cases routed to them by design, or does everything go to the same one-click screen? And is there an audit trail recording who decided what and why, including overrides? If the reviewer can’t see the evidence, nothing is escalated by risk, and nothing is recorded, then “a person approves every case” describes a rubber stamp, not human-in-the-loop.
“How would you set and test an escalation threshold for an AI that makes decisions about people?”
I’d start from consequence, not confidence. Any decision that materially affects a person’s rights, money, or status goes to mandatory human approval regardless of how confident the model is — because confidence is not correctness, and the worst case is a confident, wrong decision. Below that, I’d escalate on low confidence, unusual cases, and tripped guardrails, tuned so the genuinely uncertain cases reach a human without swamping reviewers into rubber-stamping. To test it, I’d feed a high-confidence wrong case through and confirm it still escalates and can’t be auto-actioned — if it slips through, the threshold is protecting no one.
“An applicant complains and the agency can’t explain how the AI-assisted decision was made. What went wrong, and what’s your role?”
They have no audit trail, which under the Privacy Act 2020 is a failure of accountability — an affected person can ask what information was used and how the decision was reached, and the agency must be able to answer. My role as a tester is to catch that before go-live: confirm every AI-assisted decision records the inputs seen, the recommendation and confidence, the reviewer, the decision, any override and its reason, and a timestamp; and confirm the RACI names a real accountable human, not “the system”. The test I always run is “when this decision is wrong and someone is harmed, whose name is on it?” If the answer is the model’s or nobody’s, the governance has already failed.