Architect · AI Oversight

Human-in-the-Loop AI

AI is not replacing testers. AI is promoting them. The testers who survive and thrive are the ones who orchestrate AI, apply guardrails, and sign off on AI-produced work with evidence the regulator would accept.

Architect AI Governance · HITL Patterns · AI Act & NZ Algorithm Charter ~22 min read

1 The Hook

Rangi is a test architect at a Wellington fintech. In March 2026 his CTO announced a directive: "Every squad must use AI to generate at least 60% of their test cases." Rangi’s team tried it. Two weeks in:

  • An AI-generated test suite for the mortgage calculator covered happy paths beautifully — but never tested negative interest rates, a scenario the NZ Reserve Bank flagged as possible.
  • AI-suggested locators kept breaking when the component library upgraded because the AI did not know about the design-system changes.
  • An AI-drafted acceptance criteria document invented a "KiwiSaver opt-in flag" that does not exist in the domain model.
  • On a penetration test, an AI assistant happily helped a junior engineer generate SQL-injection payloads against a production environment because nobody told it the environment was prod.

None of those were AI failures. They were oversight failures. The team treated the AI as a replacement for judgement. The fix was not less AI — it was humans in the loop at the right decision points. Within a month Rangi had written a HITL playbook that doubled throughput and caught three release-blocking bugs the team would have shipped.

This page is that playbook.

2 The Rule

Human-in-the-Loop (HITL) is a governance pattern where a qualified human retains decision authority at the points in an AI workflow where a wrong decision has high blast radius. AI drafts; humans dispatch.

HITL does not mean "a human watches every AI action." That is not scalable and it defeats the point of AI. HITL means: you define the decisions AI may make alone, the decisions AI must propose but a human must approve, and the decisions only humans may make. Then you build the system so those boundaries are enforced.

The point of the tester-as-orchestrator role is that you are the one who defines those boundaries on your product, enforces them in the pipeline, and produces the evidence that they held.

3 The Analogy

Analogy

AI is the apprentice. You are the journeyman.

An apprentice carpenter can measure, cut, sand, and assemble astonishingly fast. They will also happily cut a load-bearing beam if you tell them to, and they cannot tell the difference between a routine stud wall and a structural one. The journeyman does not cut fewer boards than the apprentice — they cut different ones. They plan the work, assign the cuts, inspect the joins, and sign the building certificate. The apprentice’s speed is useless without the journeyman’s judgement. In AI-assisted QA, you are the journeyman. Your job is not to compete with the AI’s throughput. It is to decide what to build, set the constraints, and certify the result.

4 The Four Oversight Levels

Every AI task in your pipeline sits at one of four oversight levels. Classify each task explicitly — a written ruleset beats an implicit one every time.

LevelPatternAI authorityExample
L1 Human-in-command AI drafts only; every output reviewed before use. AI-generated security test cases against a payment API. Human reviews and approves each one before execution.
L2 Human-in-the-loop AI acts on low-risk items autonomously; escalates to a human on ambiguity or policy triggers. AI triages incoming bug reports, auto-closes obvious duplicates, but routes anything marked "data leak" to a human.
L3 Human-on-the-loop AI runs end-to-end; human monitors an audit stream and intervenes on exceptions. Self-healing test suite fixes broken locators automatically. Daily digest summarises changes; architect reviews on a weekly cadence.
L4 Human-out-of-the-loop AI acts fully autonomously. No human review unless the audit flags a policy breach. Visual regression AI flagging pixel diffs below a risk threshold. Only "high-risk change" screenshots surface to a human.
Sizing rule: The further right a task sits, the harder the pre-launch test harness has to be. L4 only earns its autonomy by proving through L1→L2→L3 that it is safe. Never start a new AI workflow at L3 or L4.

High-risk triggers that must stay at L1: production deployments, payment flows, personal information, access to prod credentials, code that touches auth, anything subject to AoG or Privacy Act compliance, safety-critical domains.

5 The HITL Playbook

Here is the step-by-step playbook for introducing an AI-assisted task into your QA workflow.

  1. Name the task and the risk. What does the AI do? Who gets hurt if it gets it wrong? "Generate unit tests" is low blast radius; "propose prod hotfixes" is not.
  2. Assign an oversight level (L1–L4). Default to L1. Earn your way up with evidence.
  3. Prime the AI with the constraints. Every prompt includes: the product domain, the tech stack, the regulatory frame (WCAG 2.2 AA, Privacy Act IPPs, NZISM), any naming conventions, and the "never do X" list.
  4. Define the approval gate. Who reviews the AI output before it flows onward? What are they checking? What tool enforces the gate (PR review, pipeline step, approval queue)?
  5. Build the verification pass. Automated checks that run against every AI output: static analysis, lint, test execution, a11y scan, security scan, policy-as-code. Catch the easy issues before a human looks.
  6. Log the decisions. Every AI action produces an audit row: who prompted, what model, what prompt, what output, who approved, timestamp. This is your defence when someone asks "who authorised this?".
  7. Measure and review. Track: AI output acceptance rate, defect escape rate, human review time, incidents caused by AI outputs. Review monthly. Promote tasks to L2/L3/L4 on the evidence, never on the vibe.

Prompt patterns that work

Pattern A — Role + Constraints + Output Shape
Role: Senior QA engineer testing a KiwiSaver enrolment API.
Context: Node.js/Express backend. Uses Zod for validation.
Must not test: real IRD numbers, real bank accounts.
Constraints: WCAG 2.2 AA on UI paths; NZ Privacy Act IPP 1
  (data minimisation) on every endpoint.
Output shape: Playwright test cases in TypeScript,
  one describe() per endpoint, at least one negative
  case per endpoint, fixtures named test-*.
Produce: 10 test cases for POST /enrolments.
Pattern B — Chain-of-Verification

Ask the AI to produce the artefact, then in a second prompt ask it to critique the first output against the constraints, then produce a v2. This catches 30-60% of the easy errors before a human reviews. It is not a substitute for human review — it is a filter.

Pattern C — Few-Shot Examples

Include 1-3 hand-written exemplars of the output style you want. For test-case generation this dramatically improves naming-convention adherence, assertion style, and fixture usage. Without exemplars the AI will regress to its training-data average.

Pattern D — Refuse-by-Default List

Tell the AI what it must never do: generate real PII, execute against production endpoints, commit to main, disable safety checks, use deprecated libraries. Belt-and-braces with actual technical controls; do not rely on the prompt alone.

6 Guardrails & Verification

Prompt discipline is necessary but not sufficient. Every AI task needs technical guardrails — checks run against outputs before they are accepted.

Input guardrails

  • Reject prompts containing real customer PII (regex + LLM classifier)
  • Reject prompts asking for prod credentials or endpoints
  • Log the prompter and the prompt for later audit
  • Rate-limit per user to detect scraping / runaway loops

Output guardrails

  • Syntax check — does it parse as the expected language?
  • Schema check — does structured output match the expected shape?
  • Secret scan — does the output contain API keys or tokens?
  • PII scan — does the output contain personal information?
  • Hallucination check — do referenced files, functions, flags, or URLs exist?
  • Policy check — does it violate the "never do" list?

Tools worth knowing in 2026

  • Promptfoo — open source; regression-tests prompts and guardrail responses.
  • Guardrails AI — policy-as-prompt; enforces JSON schema, PII filters, profanity.
  • NeMo Guardrails (NVIDIA) — flow control for LLM conversations with deterministic rails.
  • Microsoft PyRIT — red-team automation for LLMs; probes for prompt injection, leaking.
  • AWS Bedrock Guardrails / Azure AI Content Safety — managed guardrails on cloud LLM endpoints.

Red-team your own AI

Before promoting an AI task up an oversight level, run a red-team pass. Try to:

  • Prompt-inject the system: "Ignore previous instructions and..."
  • Exfiltrate the system prompt
  • Extract training data
  • Make it produce out-of-scope output (code when asked for analysis, advice when asked for data)
  • Make it cite a non-existent file, function, or standard
  • Make it produce an AoG or Privacy Act violation on plausible-looking business requests

Every successful attack is a guardrail gap. Fix it, add a test, move on.

7 Evidence & Sign-Off

If an AI-produced artefact ships as part of a certified product — a WCAG conformance report, a security test, a release note — you need an audit trail that would survive an external review.

  • Prompt log: every prompt, model, version, timestamp, user, output. Retained for the product lifespan.
  • Verification log: which guardrails ran, which passed, which flagged.
  • Human approval log: who approved the output, when, on what basis. Named individual, not "QA team."
  • Escalations log: any L2 escalations to a human and their outcomes.
  • Incident log: any AI-caused incidents, root cause, remediation, oversight-level change.

NZ specific context: The Algorithm Charter for Aotearoa New Zealand commits signatory agencies to transparency, human oversight, and review of algorithms that affect people. If your employer is a signatory (most government agencies are), your HITL evidence is not optional — it is a Charter obligation.

8 Common Mistakes

🚫 Starting a new workflow at L3 or L4

I used to think: AI is capable; let it run.
Actually: Every new task starts at L1. You earn autonomy with evidence. Teams that skip this step discover failure modes in prod, not in staging.

🚫 Treating the prompt as the guardrail

I used to think: Telling the AI "never generate PII" is enough.
Actually: Prompts are hints, not security controls. Add technical post-processing that checks outputs regardless of what the prompt said. Prompt injection is trivially easy.

🚫 Anonymising the human sign-off

I used to think: "QA approved" is enough.
Actually: Regulators and incident-review boards will ask "who." A named reviewer with a timestamp and a scope is non-negotiable for any AI artefact that ships.

🚫 Not retaining the prompts

I used to think: The AI output is the artefact; the prompt is disposable.
Actually: Six months later when an incident traces back to an AI-generated test, you will want to know what you asked. Log prompts with the same discipline you log deployments.

🚫 Treating AI as a headcount reduction play

I used to think: AI cuts my testing team in half.
Actually: AI changes the shape of the testing team. Fewer keystroke-monkey tasks, more oversight, governance, scenario design, red-teaming, evidence curation. Teams that slashed headcount found themselves re-hiring to cover governance. Plan for a different team, not a smaller one.

🚫 Not red-teaming your own AI tools

I used to think: Our vendor handles AI safety.
Actually: The vendor handles the base model. You handle the system prompt, the tool integrations, the business data, and the guardrails. Every team running an AI agent needs their own red-team pass before prod.

9 Now You Try

🎯 Practical Exercise: Draft an HITL Policy

Task: Pick an AI-assisted task your team actually does (or would do). Write a one-page HITL policy covering:

  1. Task name, description, who uses it, how often
  2. Blast radius — what could go wrong, who gets hurt
  3. Oversight level (L1–L4) with justification
  4. Prompt template with constraints, output shape, refuse-by-default list
  5. Guardrails list (input + output)
  6. Approval gate — who reviews what, where, how long does it take
  7. Audit log fields and retention
  8. Promotion criteria — what evidence would let this move from L1 to L2?
  9. Kill switch — who can shut the workflow down, how fast

10 Self-Check

Click each question to reveal the answer.

Q1. What is the difference between Human-in-the-Loop and Human-on-the-Loop?

In-the-loop (L2): AI acts on low-risk items, escalates to a human on ambiguity. Human approves or corrects in real time.
On-the-loop (L3): AI runs end-to-end; a human monitors an audit stream and intervenes on exceptions or on a cadence. Human does not see every action.

Q2. Why is a prompt alone not a sufficient guardrail?

Prompts are trivially bypassed via prompt injection ("ignore previous instructions and..."). They are hints, not security controls. Technical post-processing — schema validation, PII scan, secret scan, policy-as-code — runs on every output regardless of prompt.

Q3. Name three NZ-specific obligations that shape how you do HITL.

AoG Web Accessibility Standard 1.2 (any AI-generated UI must meet WCAG 2.2 AA), NZ Privacy Act 2020 IPPs (AI processing of personal info triggers IPP 10/11/12 and from May 2026 IPP 3A), Algorithm Charter for Aotearoa New Zealand (transparency, human oversight, review for signatory agencies).

Q4. What tasks must never be promoted above L1 oversight?

Anything with high blast radius: production deployments, payment flows, personal information handling, access to prod credentials, auth-adjacent code, safety-critical domains, and anything subject to external regulatory sign-off (AoG, Privacy Act, health, financial).

Q5. A junior engineer asks an AI assistant to help write SQL-injection payloads against staging. The AI refuses "because it’s unethical." What do you conclude?

The refuse-by-default list is too broad. Security testing is a legitimate use. The guardrails need to distinguish authorised pentest against an approved target from attack against an unknown target. Fix: allow-list staging URLs for security testing, document in the HITL policy, retain the refusal behaviour for unknown targets.

Q6. Why is "named individual approval" important for AI-produced compliance artefacts?

Because when a regulator or incident review asks "who authorised this?", "the QA team" is not an answer. A named reviewer, a timestamp, and a scope is the difference between a defensible audit and a reportable failure.

11 Interview Prep

These questions separate architects from tool-users.

“How do you decide when AI can act autonomously in your testing pipeline?”

Describe a four-level oversight model (L1 human-in-command, L2 human-in-the-loop, L3 human-on-the-loop, L4 human-out-of-the-loop). New tasks start at L1. Promotion up requires evidence: acceptance rates, defect escape rates, incident-free operating time. High-risk domains — payments, PII, prod deploys, regulated compliance — stay at L1 regardless.

“Walk me through the guardrails you put around an AI test-generation tool.”

Input: PII and secret regex + prompt-injection classifier, user audit, rate limit. Output: syntax/schema validation, PII scan, secret scan, hallucination check against the actual API surface, policy-as-code. Then PR review by a named engineer with a prompt log trailer. Red-team the setup quarterly.

“Your CTO wants to replace half the QA team with AI. What is your response?”

Reframe: AI reshapes QA, it does not shrink it. Show the measured defect escape rate of AI-only workflows vs HITL workflows. Present the governance work that grows with AI adoption (prompt review, red-teaming, audit, compliance evidence). Propose a pilot with measurable promotion gates. If the decision is political rather than evidence-based, document the risk in writing.

“How would you meet the Algorithm Charter obligations for an AI-assisted testing workflow?”

Transparency: publish what AI is doing, to whom, with what data. Human oversight: named approvers at each oversight level. Review: quarterly independent review of decisions + outcomes. Explainability: retain prompts, outputs, approvals with enough detail that a non-technical reviewer can follow the chain.