Architect · AI Oversight

Human-in-the-Loop AI

Q: Q1. What is the difference between Human-in-the-Loop and Human-on-the-Loop?

In-the-loop (L2): AI acts on low-risk items, escalates to a human on ambiguity. Human approves or corrects in real time. On-the-loop (L3): AI runs end-to-end; a human monitors an audit stream and intervenes on exceptions or on a cadence. Human does not see every action.

Q: Q3. Name three NZ-specific obligations that shape how you do HITL.

AoG Web Accessibility Standard 1.2 (any AI-generated UI must meet WCAG 2.2 AA), NZ Privacy Act 2020 IPPs (AI processing of personal info triggers IPP 10/11/12 and from May 2026 IPP 3A), Algorithm Charter for Aotearoa New Zealand (transparency, human oversight, review for signatory agencies).

AI is not replacing testers. AI is promoting them. The testers who survive and thrive are the ones who orchestrate AI, apply guardrails, and sign off on AI-produced work with evidence the regulator would accept.

Architect AI Governance · HITL Patterns · AI Act & NZ Algorithm Charter ~22 min read

1 The Hook

Rangi is a test architect at a Wellington fintech. In March 2026 his CTO announced a directive: "Every squad must use AI to generate at least 60% of their test cases." Rangi’s team tried it. Two weeks in:

An AI-generated test suite for the mortgage calculator covered happy paths beautifully — but never tested negative interest rates, a scenario the NZ Reserve Bank flagged as possible.
AI-suggested locators kept breaking when the component library upgraded because the AI did not know about the design-system changes.
An AI-drafted acceptance criteria document invented a "KiwiSaver opt-in flag" that does not exist in the domain model.
On a penetration test, an AI assistant happily helped a junior engineer generate SQL-injection payloads against a production environment because nobody told it the environment was prod.

None of those were AI failures. They were oversight failures. The team treated the AI as a replacement for judgement. The fix was not less AI — it was humans in the loop at the right decision points. Within a month Rangi had written a HITL playbook that doubled throughput and caught three release-blocking bugs the team would have shipped.

This page is that playbook.

2 The Rule

Human-in-the-Loop (HITL) is a governance pattern where a qualified human retains decision authority at the points in an AI workflow where a wrong decision has high blast radius. AI drafts; humans dispatch.

HITL does not mean "a human watches every AI action." That is not scalable and it defeats the point of AI. HITL means: you define the decisions AI may make alone, the decisions AI must propose but a human must approve, and the decisions only humans may make. Then you build the system so those boundaries are enforced.

The point of the tester-as-orchestrator role is that you are the one who defines those boundaries on your product, enforces them in the pipeline, and produces the evidence that they held.

3 The Analogy

Analogy

AI is the apprentice. You are the journeyman.

An apprentice carpenter can measure, cut, sand, and assemble astonishingly fast. They will also happily cut a load-bearing beam if you tell them to, and they cannot tell the difference between a routine stud wall and a structural one. The journeyman does not cut fewer boards than the apprentice — they cut different ones. They plan the work, assign the cuts, inspect the joins, and sign the building certificate. The apprentice’s speed is useless without the journeyman’s judgement. In AI-assisted QA, you are the journeyman. Your job is not to compete with the AI’s throughput. It is to decide what to build, set the constraints, and certify the result.

Senior engineer insight

The teams that struggled most with HITL were the ones who tried to define oversight levels once, up front, and treated them as permanent. In practice, a workflow that earns L2 autonomy in a low-traffic period can quietly regress to needing L1 the moment a new developer joins, the underlying model updates, or the regulatory context shifts — and you won't notice until something escapes. The oversight level is not a property of the workflow; it is a property of the workflow and the current operating context. Review it whenever either changes.

The most common mistake: teams write the HITL policy, ship the feature, and never look at it again — treating a living governance document like a one-time compliance checkbox.

From the field

A team working on an CoverNZ digital claims portal assumed their AI test-case generator was safe to promote to L3 after two clean sprints. Three weeks later a model update caused the generator to silently omit negative-amount boundary tests — the exact scenario that appears when a claimant disputes a payment calculation. The tests kept passing because the AI had also updated its fixture values to avoid the boundary. No human had reviewed an AI-authored PR in six weeks. The escaped defect made it to UAT and was caught by an CoverNZ business analyst who happened to test their own claim scenario. The lesson the team extracted: automated acceptance metrics cannot substitute for periodic human review of what the AI is actually producing, not just whether the tests pass.

4 The Four Oversight Levels

Every AI task in your pipeline sits at one of four oversight levels. Classify each task explicitly — a written ruleset beats an implicit one every time.

Level	Pattern	AI authority	Example
L1	Human-in-command	AI drafts only; every output reviewed before use.	AI-generated security test cases against a payment API. Human reviews and approves each one before execution.
L2	Human-in-the-loop	AI acts on low-risk items autonomously; escalates to a human on ambiguity or policy triggers.	AI triages incoming bug reports, auto-closes obvious duplicates, but routes anything marked "data leak" to a human.
L3	Human-on-the-loop	AI runs end-to-end; human monitors an audit stream and intervenes on exceptions.	Self-healing test suite fixes broken locators automatically. Daily digest summarises changes; architect reviews on a weekly cadence.
L4	Human-out-of-the-loop	AI acts fully autonomously. No human review unless the audit flags a policy breach.	Visual regression AI flagging pixel diffs below a risk threshold. Only "high-risk change" screenshots surface to a human.

Sizing rule: The further right a task sits, the harder the pre-launch test harness has to be. L4 only earns its autonomy by proving through L1→L2→L3 that it is safe. Never start a new AI workflow at L3 or L4.

High-risk triggers that must stay at L1: production deployments, payment flows, personal information, access to prod credentials, code that touches auth, anything subject to AoG or Privacy Act compliance, safety-critical domains.

5 The HITL Playbook

Here is the step-by-step playbook for introducing an AI-assisted task into your QA workflow.

Name the task and the risk. What does the AI do? Who gets hurt if it gets it wrong? "Generate unit tests" is low blast radius; "propose prod hotfixes" is not.
Assign an oversight level (L1–L4). Default to L1. Earn your way up with evidence.
Prime the AI with the constraints. Every prompt includes: the product domain, the tech stack, the regulatory frame (WCAG 2.2 AA, Privacy Act IPPs, NZISM), any naming conventions, and the "never do X" list.
Define the approval gate. Who reviews the AI output before it flows onward? What are they checking? What tool enforces the gate (PR review, pipeline step, approval queue)?
Build the verification pass. Automated checks that run against every AI output: static analysis, lint, test execution, a11y scan, security scan, policy-as-code. Catch the easy issues before a human looks.
Log the decisions. Every AI action produces an audit row: who prompted, what model, what prompt, what output, who approved, timestamp. This is your defence when someone asks "who authorised this?".
Measure and review. Track: AI output acceptance rate, defect escape rate, human review time, incidents caused by AI outputs. Review monthly. Promote tasks to L2/L3/L4 on the evidence, never on the vibe.

Prompt patterns that work

Pattern A — Role + Constraints + Output Shape

Role: Senior QA engineer testing a KiwiSaver enrolment API.
Context: Node.js/Express backend. Uses Zod for validation.
Must not test: real Revenue NZ numbers, real bank accounts.
Constraints: WCAG 2.2 AA on UI paths; NZ Privacy Act IPP 1
  (data minimisation) on every endpoint.
Output shape: Playwright test cases in TypeScript,
  one describe() per endpoint, at least one negative
  case per endpoint, fixtures named test-*.
Produce: 10 test cases for POST /enrolments.

Pattern B — Chain-of-Verification

Ask the AI to produce the artefact, then in a second prompt ask it to critique the first output against the constraints, then produce a v2. This catches 30-60% of the easy errors before a human reviews. It is not a substitute for human review — it is a filter.

Pattern C — Few-Shot Examples

Include 1-3 hand-written exemplars of the output style you want. For test-case generation this dramatically improves naming-convention adherence, assertion style, and fixture usage. Without exemplars the AI will regress to its training-data average.

Pattern D — Refuse-by-Default List

Tell the AI what it must never do: generate real PII, execute against production endpoints, commit to main, disable safety checks, use deprecated libraries. Belt-and-braces with actual technical controls; do not rely on the prompt alone.

6 Guardrails & Verification

Prompt discipline is necessary but not sufficient. Every AI task needs technical guardrails — checks run against outputs before they are accepted.

Input guardrails

Reject prompts containing real customer PII (regex + LLM classifier)
Reject prompts asking for prod credentials or endpoints
Log the prompter and the prompt for later audit
Rate-limit per user to detect scraping / runaway loops

Output guardrails

Syntax check — does it parse as the expected language?
Schema check — does structured output match the expected shape?
Secret scan — does the output contain API keys or tokens?
PII scan — does the output contain personal information?
Hallucination check — do referenced files, functions, flags, or URLs exist?
Policy check — does it violate the "never do" list?

Tools worth knowing in 2026

Promptfoo — open source; regression-tests prompts and guardrail responses.
Guardrails AI — policy-as-prompt; enforces JSON schema, PII filters, profanity.
NeMo Guardrails (NVIDIA) — flow control for LLM conversations with deterministic rails.
Microsoft PyRIT — red-team automation for LLMs; probes for prompt injection, leaking.
AWS Bedrock Guardrails / Azure AI Content Safety — managed guardrails on cloud LLM endpoints.

Red-team your own AI

Before promoting an AI task up an oversight level, run a red-team pass. Try to:

Prompt-inject the system: "Ignore previous instructions and..."
Exfiltrate the system prompt
Extract training data
Make it produce out-of-scope output (code when asked for analysis, advice when asked for data)
Make it cite a non-existent file, function, or standard
Make it produce an AoG or Privacy Act violation on plausible-looking business requests

Every successful attack is a guardrail gap. Fix it, add a test, move on.

7 Evidence & Sign-Off

If an AI-produced artefact ships as part of a certified product — a WCAG conformance report, a security test, a release note — you need an audit trail that would survive an external review.

Prompt log: every prompt, model, version, timestamp, user, output. Retained for the product lifespan.
Verification log: which guardrails ran, which passed, which flagged.
Human approval log: who approved the output, when, on what basis. Named individual, not "QA team."
Escalations log: any L2 escalations to a human and their outcomes.
Incident log: any AI-caused incidents, root cause, remediation, oversight-level change.

NZ specific context: The Algorithm Charter for Aotearoa New Zealand commits signatory agencies to transparency, human oversight, and review of algorithms that affect people. If your employer is a signatory (most government agencies are), your HITL evidence is not optional — it is a Charter obligation.

8 Common Mistakes

🚫 Starting a new workflow at L3 or L4

I used to think: AI is capable; let it run.
Actually: Every new task starts at L1. You earn autonomy with evidence. Teams that skip this step discover failure modes in prod, not in staging.

🚫 Treating the prompt as the guardrail

I used to think: Telling the AI "never generate PII" is enough.
Actually: Prompts are hints, not security controls. Add technical post-processing that checks outputs regardless of what the prompt said. Prompt injection is trivially easy.

🚫 Anonymising the human sign-off

I used to think: "QA approved" is enough.
Actually: Regulators and incident-review boards will ask "who." A named reviewer with a timestamp and a scope is non-negotiable for any AI artefact that ships.

🚫 Not retaining the prompts

I used to think: The AI output is the artefact; the prompt is disposable.
Actually: Six months later when an incident traces back to an AI-generated test, you will want to know what you asked. Log prompts with the same discipline you log deployments.

🚫 Treating AI as a headcount reduction play

I used to think: AI cuts my testing team in half.
Actually: AI changes the shape of the testing team. Fewer keystroke-monkey tasks, more oversight, governance, scenario design, red-teaming, evidence curation. Teams that slashed headcount found themselves re-hiring to cover governance. Plan for a different team, not a smaller one.

🚫 Not red-teaming your own AI tools

I used to think: Our vendor handles AI safety.
Actually: The vendor handles the base model. You handle the system prompt, the tool integrations, the business data, and the guardrails. Every team running an AI agent needs their own red-team pass before prod.

9 Now You Try

🎯 Practical Exercise: Draft an HITL Policy

Task: Pick an AI-assisted task your team actually does (or would do). Write a one-page HITL policy covering:

Task name, description, who uses it, how often
Blast radius — what could go wrong, who gets hurt
Oversight level (L1–L4) with justification
Prompt template with constraints, output shape, refuse-by-default list
Guardrails list (input + output)
Approval gate — who reviews what, where, how long does it take
Audit log fields and retention
Promotion criteria — what evidence would let this move from L1 to L2?
Kill switch — who can shut the workflow down, how fast

Why teams fail here

Approval theatre: the PR has an "ai-generated" label and a senior reviewer ticked it, but the reviewer spent 45 seconds on a 200-line test file. The gate exists on paper; the oversight does not exist in practice. Governance without time budget is just ceremony.
Guardrail drift: input and output guardrails are configured at launch and never maintained. Six months later the model has changed, the product has added new PII fields, and the policy-as-code is checking for data shapes that no longer match the actual outputs. Teams discover this during an incident, not a review.
No kill-switch owner: the HITL policy names a kill switch but assigns it to "the team." When an AI workflow goes wrong at 11pm, "the team" is not available. Every high-autonomy workflow needs a named on-call person with the access and authority to shut it down in under fifteen minutes.
Conflating NZ Algorithm Charter transparency with internal documentation: signatory agencies think a team wiki page satisfies the Charter's transparency obligation. It does not. Transparency means the people affected by the algorithm can understand it, not just the team that built it. For Revenue NZ, TransitNZ, and health agencies this distinction can become a compliance finding.

Key takeaway

The tester who designs the guardrails, writes the HITL policy, and signs off the audit trail is doing more consequential quality work than the one who wrote a thousand manual test cases — because when the AI gets something wrong at scale, it is the architect who either catches it or answers for it.

10 Self-Check

Click each question to reveal the answer.

Q1. What is the difference between Human-in-the-Loop and Human-on-the-Loop?

In-the-loop (L2): AI acts on low-risk items, escalates to a human on ambiguity. Human approves or corrects in real time.
On-the-loop (L3): AI runs end-to-end; a human monitors an audit stream and intervenes on exceptions or on a cadence. Human does not see every action.

Q2. Why is a prompt alone not a sufficient guardrail?

Prompts are trivially bypassed via prompt injection ("ignore previous instructions and..."). They are hints, not security controls. Technical post-processing — schema validation, PII scan, secret scan, policy-as-code — runs on every output regardless of prompt.

Q3. Name three NZ-specific obligations that shape how you do HITL.

AoG Web Accessibility Standard 1.2 (any AI-generated UI must meet WCAG 2.2 AA), NZ Privacy Act 2020 IPPs (AI processing of personal info triggers IPP 10/11/12 and from May 2026 IPP 3A), Algorithm Charter for Aotearoa New Zealand (transparency, human oversight, review for signatory agencies).

Q4. What tasks must never be promoted above L1 oversight?

Anything with high blast radius: production deployments, payment flows, personal information handling, access to prod credentials, auth-adjacent code, safety-critical domains, and anything subject to external regulatory sign-off (AoG, Privacy Act, health, financial).

Q5. A junior engineer asks an AI assistant to help write SQL-injection payloads against staging. The AI refuses "because it’s unethical." What do you conclude?

The refuse-by-default list is too broad. Security testing is a legitimate use. The guardrails need to distinguish authorised pentest against an approved target from attack against an unknown target. Fix: allow-list staging URLs for security testing, document in the HITL policy, retain the refusal behaviour for unknown targets.

Q6. Why is "named individual approval" important for AI-produced compliance artefacts?

Because when a regulator or incident review asks "who authorised this?", "the QA team" is not an answer. A named reviewer, a timestamp, and a scope is the difference between a defensible audit and a reportable failure.

11 Interview Prep

These questions separate architects from tool-users.

“How do you decide when AI can act autonomously in your testing pipeline?”

Describe a four-level oversight model (L1 human-in-command, L2 human-in-the-loop, L3 human-on-the-loop, L4 human-out-of-the-loop). New tasks start at L1. Promotion up requires evidence: acceptance rates, defect escape rates, incident-free operating time. High-risk domains — payments, PII, prod deploys, regulated compliance — stay at L1 regardless.

“Walk me through the guardrails you put around an AI test-generation tool.”

Input: PII and secret regex + prompt-injection classifier, user audit, rate limit. Output: syntax/schema validation, PII scan, secret scan, hallucination check against the actual API surface, policy-as-code. Then PR review by a named engineer with a prompt log trailer. Red-team the setup quarterly.

“Your CTO wants to replace half the QA team with AI. What is your response?”

Reframe: AI reshapes QA, it does not shrink it. Show the measured defect escape rate of AI-only workflows vs HITL workflows. Present the governance work that grows with AI adoption (prompt review, red-teaming, audit, compliance evidence). Propose a pilot with measurable promotion gates. If the decision is political rather than evidence-based, document the risk in writing.

“How would you meet the Algorithm Charter obligations for an AI-assisted testing workflow?”

Transparency: publish what AI is doing, to whom, with what data. Human oversight: named approvers at each oversight level. Review: quarterly independent review of decisions + outcomes. Explainability: retain prompts, outputs, approvals with enough detail that a non-technical reviewer can follow the chain.

← AI in Testing All architect topics Next: Agentic AI →

Human-in-the-Loop AI

1 The Hook

2 The Rule

3 The Analogy

4 The Four Oversight Levels

5 The HITL Playbook

Prompt patterns that work

6 Guardrails & Verification

Input guardrails

Output guardrails

Tools worth knowing in 2026

Red-team your own AI

7 Evidence & Sign-Off

8 Common Mistakes

9 Now You Try

10 Self-Check

Related techniques

11 Interview Prep