Test with AI · AI Evaluation

Deterministic-Consistency Testing

Run the same prompt twice and a generative system can give two different answers. An exact-match test then fails on the second run even though both answers were correct. The bug is not in the model; it is in the assertion. Testing non-deterministic systems means asserting on meaning and stability, not on exact strings.

Test with AI · AI Testing Engineer · Lesson 5 of 8 · ~30 min read · ~75 min with exercises

1 The Hook

A CoverNZ team built an assistant that drafts a plain-language summary of a claim decision for a claimant. They tested it the way they test everything else: write the prompt, capture the perfect output once, save it as the expected result, assert the next run matches it exactly. Green build. They shipped the test into the nightly pipeline and went home happy.

The next morning the build was red. The morning after, green. Then red again, twice in one day, then green for three days. Nobody had touched the code. The team did what tired teams do with a test that fails for no reason: they re-ran it until it passed, and eventually they tagged it known-flaky and stopped looking. A test that is sometimes red and sometimes green for the same input teaches the team to ignore red — and an ignored red build is worse than no test at all.

What was actually happening: the model is non-deterministic. Asked to summarise the same decision, one run wrote “Your claim has been approved” and the next wrote “We have approved your claim.” Both are correct. Both say the same thing to the claimant. But character-for-character they differ, so an exact-match assertion fails one and passes the other at random. The model was behaving perfectly. The test was broken, because it asserted on the exact words of output that was never going to be identical twice.

This is the central problem of testing generative systems. The output is non-deterministic by design, so the deterministic test you reach for instinctively — assert the string equals the saved string — is guaranteed to be flaky. The skill is not making the model deterministic. It is writing assertions that hold across the variation the model will always produce.

2 The Rule

A generative system is non-deterministic by design, so an exact-match assertion on its output is a flaky test, not a real one. Assert on meaning and on stability: pin the randomness you can, then test that the output is semantically equivalent and consistent across repeats — never that it is character-for-character identical.

⚠️ Common Misconception

The common assumption: consistency testing is an advanced topic you add after you have accuracy testing working.

In practice, this is backwards. An AI system that gives different answers to the same question is unusable regardless of how accurate any individual answer is. Consistency is the floor, not the ceiling. Teams that treat it as a polish-up item almost always ship a flaky test suite first and discover the problem when a real regression hides among the noise.

3 The Analogy

Marking an essay exam with a word-for-word answer key.

Imagine marking an open-essay exam by checking each script matches one perfect essay word-for-word. Every student who wrote a correct answer in their own words would fail, because no two good essays use identical wording. The marking scheme is broken, not the students. A real essay marker uses a rubric: did the answer make the right points, in a sensible structure, without errors — regardless of the exact sentences.

Exact-match testing of a generative system is the word-for-word answer key. It fails correct outputs for not matching a string. Semantic-equivalence testing is the essay rubric: it asks “does this answer mean the right thing”, which is the only question that actually maps to whether the output is correct.

4 Why Exact-Match Goes Flaky

To test these systems you first have to understand why the instinctive assertion fails. A traditional function is deterministic: same input, same output, every time — so “assert output equals expected” is exactly right. A generative model is built to sample from many plausible continuations, so the same input legitimately produces different wording each run. Exact-match assumes a property the system does not have.

The damage is not just a failing test — it is a flaky test, one that passes and fails on the same input for no code change. Flaky tests are uniquely corrosive: the team learns that red does not mean broken, so they stop trusting red. When a real regression finally turns the build red, it hides among the flakes and ships to production. The CoverNZ team’s “known-flaky, just re-run it” tag is the exact failure mode — they had trained themselves to ignore the alarm.
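A minimal TypeScript sketch of that failure mode, not the team's actual pipeline. `fakeSummarise` is a hypothetical stand-in for the model: it alternates between two equally correct phrasings instead of sampling, so the demonstration itself is repeatable.

```typescript
// Hypothetical stand-in for a generative model. It cycles through two
// equally correct phrasings so that this demo is repeatable.
const PHRASINGS: string[] = [
  "Your claim has been approved",
  "We have approved your claim",
];

function fakeSummarise(runIndex: number): string {
  return PHRASINGS[runIndex % PHRASINGS.length];
}

// The instinctive assertion: compare against one captured string.
function exactMatch(output: string, expected: string): boolean {
  return output === expected;
}

// Count how many of `runs` nightly builds would have been green.
function greenBuilds(runs: number, expected: string): number {
  let green = 0;
  for (let i = 0; i < runs; i++) {
    if (exactMatch(fakeSummarise(i), expected)) green++;
  }
  return green;
}

const greenCount = greenBuilds(10, PHRASINGS[0]);
console.log(`${greenCount}/10 builds green with no code change`);
// prints "5/10 builds green with no code change"
```

Every one of the ten outputs is correct, yet half the builds are red. That is the signature of a flaky assertion rather than a faulty system.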

The fix has two halves, and you need both. First, reduce the randomness you can control so output is as stable as the system allows. Second, assert on what is invariant — the meaning and the required facts — rather than on the wording, which will never be stable. The rest of this lesson is those two halves.

Pro tip: Whenever a test on an AI output is flaky, the first suspect is the assertion, not the model. Ask “were both the passing and failing outputs actually correct?” If yes, the assertion is wrong — it is testing wording, not behaviour.

5 Temperature and Seed Control

Before you change your assertions, reduce the variation at the source. Two controls matter most.

Temperature governs how much the model samples versus picks the single most likely next token. High temperature means more varied, creative wording; low temperature (at or near zero) means the model takes the most likely path almost every time, so output is far more stable. For anything you want to test — a claim summary, a routing decision, an extraction — turn temperature down. You are not testing the model’s creativity; you are testing that it does the job consistently, and low temperature removes variation you do not need.
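To make the effect concrete, here is a small sketch of the standard temperature-scaled softmax, the usual way token scores are turned into sampling probabilities. The three candidate tokens and their scores are made up for illustration, and the sketch assumes a temperature above zero. How a given provider handles a temperature of exactly zero is not modelled here.

```typescript
// Turn raw next-token scores (logits) into probabilities at a given temperature.
// Assumes temperature > 0; a temperature of exactly zero is not modelled.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be greater than zero in this sketch");
  }
  const scaled = logits.map((logit) => logit / temperature);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - maxScaled));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Made-up scores for three candidate next tokens.
const candidateTokens = ["approved", "accepted", "granted"];
const exampleLogits = [2.0, 1.0, 0.5];

for (const temperature of [0.1, 1.0, 2.0]) {
  const probs = softmaxWithTemperature(exampleLogits, temperature);
  const line = candidateTokens
    .map((tok, i) => `${tok}=${probs[i].toFixed(3)}`)
    .join("  ");
  console.log(`T=${temperature}: ${line}`);
}
```

At the lowest temperature almost all of the probability lands on the top-scoring token, which is why low-temperature output repeats itself far more often. At higher temperatures the lower-ranked tokens get a real share of the draws.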

A seed, where the system exposes one, fixes the random starting point of sampling, so the same input and seed will usually reproduce the same output. That repeatability is invaluable for debugging: when something fails, a fixed seed lets you try to reproduce the exact output instead of chasing a ghost.
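The idea of a seed can be sketched without any provider API. The generator below is a small linear congruential generator chosen only to keep the example self-contained; it is an assumption of this sketch, not how any particular model service seeds its sampler. The probabilities are also illustrative.

```typescript
// Small linear congruential generator, used here only so the sketch is
// self-contained. Real providers seed their own samplers internally.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Pick an index according to the given probabilities.
function sampleIndex(probs: number[], rng: () => number): number {
  const r = rng();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guards against floating-point rounding
}

// Hypothetical "generation": choose one of three phrasings at each step.
function sampleRun(seed: number, steps: number): number[] {
  const rng = makeRng(seed);
  const probs = [0.5, 0.3, 0.2];
  const picks: number[] = [];
  for (let step = 0; step < steps; step++) {
    picks.push(sampleIndex(probs, rng));
  }
  return picks;
}

const firstRun = sampleRun(42, 10);
const replayRun = sampleRun(42, 10);
console.log(firstRun.join(",") === replayRun.join(",")); // true: same seed, same picks
```

Within one process and one implementation, the replay is exact. A hosted model adds moving parts outside your control, which is why a seed is best treated as a debugging aid rather than a guarantee.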

The honest caveat: these reduce non-determinism, they do not eliminate it. Even at temperature zero with a fixed seed, output can still vary across model versions, hardware, or batching on the provider’s side. So temperature and seed control make testing easier — they do not let you go back to exact-match and trust it. You still need assertions that tolerate variation, because variation can still appear.

Pro tip: Also test at the temperature you ship at. If production runs warm for friendlier replies, testing only at temperature zero hides the very variation users will see. Pin what you can, then assert robustly for the rest.

6 Semantic-Equivalence Assertions

This is the core technique that replaces exact-match. Instead of asking “does the output equal this exact string?” you ask “does the output mean the right thing?” — which is what you actually care about. “Your claim has been approved” and “We have approved your claim” are semantically equivalent and both must pass.

There are several assertion styles, from cheapest to richest, and good suites mix them:

Required-fact / contains assertions: the output must contain the claim number, the decision (“approved”), and the review-rights line. You assert the invariant facts are present, not the wording around them. Cheap, deterministic, and catches the failures that matter most.
Must-not-contain assertions: the output must not contain a different claimant’s name, a dollar figure that was never provided, or an internal code. Negative invariants catch dangerous variation that a “contains” check misses.
Structural assertions: valid JSON, the required fields present, a date in the right format. The shape is invariant even when the prose is not.
Semantic-similarity / judge assertions: for free-text meaning, compare the output to a reference answer by meaning, using an embedding-similarity threshold or an LLM acting as a judge against your rubric. Richest, but you must validate that the judge agrees with humans on a sample before trusting it.
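The first three styles can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the JSON envelope and its field names (`decision`, `claimNumber`, `decisionDate`, `summary`) are hypothetical, and the keyword patterns are deliberately coarse.

```typescript
// Hypothetical response envelope: these field names are assumptions.
const REQUIRED_FIELDS: string[] = ["decision", "claimNumber", "decisionDate", "summary"];

// Structural assertions: valid JSON, required string fields, date format.
function checkStructure(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["output is not a JSON object"];
  }
  const fields = parsed as { [key: string]: unknown };
  const failures: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof fields[field] !== "string") failures.push(`missing field: ${field}`);
  }
  const date = fields["decisionDate"];
  if (typeof date === "string" && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    failures.push("decisionDate is not YYYY-MM-DD");
  }
  return failures;
}

// Required-fact and must-not-contain assertions on the free-text summary.
// `decision` is assumed to be a plain word such as "approved" or "declined".
function checkSummaryText(summary: string, claimNumber: string, decision: string): string[] {
  const failures: string[] = [];
  // Required facts: present in any wording.
  if (summary.indexOf(claimNumber) === -1) failures.push("claim number missing");
  if (!new RegExp("\\b" + decision + "\\b", "i").test(summary)) failures.push("decision missing");
  if (!/\breview\b.*\bthree months\b/i.test(summary)) failures.push("review-rights notice missing");
  // Forbidden content: other claim numbers, dollar figures nobody supplied.
  const claimNumbers: string[] = summary.match(/CoverNZ-\d+/g) || [];
  if (claimNumbers.some((c) => c !== claimNumber)) failures.push("another claim number present");
  if (/\$\s?\d/.test(summary)) failures.push("dollar figure present but none was provided");
  return failures;
}

const runA = "Your claim CoverNZ-4471 has been approved. You have the right to a review within three months.";
const runB = "Good news: we have approved your claim CoverNZ-4471. You can ask for a review within three months if you disagree.";
console.log(checkSummaryText(runA, "CoverNZ-4471", "approved").length); // 0
console.log(checkSummaryText(runB, "CoverNZ-4471", "approved").length); // 0
```

Both phrasings pass because neither check looks at the wording around the facts. A keyword check like this would also accept "not approved", so it is assumed to sit alongside a structured decision field or a meaning-level check rather than replace one.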

The discipline: assert the invariants, ignore the variation. For the CoverNZ summary, the invariants are the decision, the claim number, and the review-rights notice; the variation is the exact phrasing of the opening sentence. Test the first, tolerate the second.
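For the meaning-level check, the comparison itself is plain cosine similarity between two embedding vectors. The sketch below assumes the vectors have already been produced by whatever embedding model the team uses. The three-dimensional vectors and the 0.9 threshold are toy values for illustration, not recommendations.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The threshold is an assumption: calibrate it on pairs humans have labelled.
function meansTheSame(candidate: number[], reference: number[], threshold: number): boolean {
  return cosineSimilarity(candidate, reference) >= threshold;
}

// Toy 3-dimensional vectors standing in for real embeddings.
const referenceVec = [0.9, 0.1, 0.2];
const paraphraseVec = [0.85, 0.15, 0.25];
const unrelatedVec = [0.1, 0.9, 0.2];

console.log(meansTheSame(paraphraseVec, referenceVec, 0.9)); // true
console.log(meansTheSame(unrelatedVec, referenceVec, 0.9)); // false
```

One design caution: a summary with the opposite decision can share most of its wording with the reference, so real embeddings can score it as close. That is a reason to keep the required-fact checks as the gate on the decision itself and use similarity for the surrounding prose.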

7 Repeat-N Stability and Tolerance Bands

A single run tells you the output was acceptable once. It does not tell you the system is consistent — and consistency is its own requirement. A claim assistant that approves a borderline case 7 times out of 10 and declines it 3 times is dangerous even if each individual answer reads fine, because the same claimant could get a different outcome on a different day.

The technique is the repeat-N stability check: run the same input N times (say 10 or 20) and measure how consistent the outputs are on the dimension that matters. For a decision, you want the decision identical every time even though the wording varies — 10/10 “approved”. For an extraction, you want the extracted value identical every time. Variation in phrasing is fine; variation in the substance is a defect.

For tasks where some numeric variation is genuinely acceptable, define a tolerance band rather than demanding an exact figure. If an assistant estimates a processing time, “10–14 days” and “about two weeks” are within tolerance; “3 days” and “six weeks” are not. The test asserts the answer falls inside the agreed band, not that it hits one exact value.
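A tolerance band can be sketched as a parser plus a range check. The parser below is a hypothetical helper that only understands simple phrasings (digits or the words one to ten, in days or weeks), and the 10 to 14 day band is taken from the example above. Anything it cannot parse is treated as a failure rather than a pass.

```typescript
const NUMBER_WORDS: { [word: string]: number } = {
  one: 1, two: 2, three: 3, four: 4, five: 5,
  six: 6, seven: 7, eight: 8, nine: 9, ten: 10,
};
const NUM = "\\d+|" + Object.keys(NUMBER_WORDS).join("|");
// Matches "10–14 days", "10 to 14 days", "3 days", "about two weeks", ...
const ESTIMATE = new RegExp(
  "\\b(" + NUM + ")(?:\\s*(?:-|–|to)\\s*(" + NUM + "))?\\s+(day|week)s?\\b",
  "i"
);

interface DayRange {
  min: number;
  max: number;
}

function toNumber(token: string): number {
  return /^\d+$/.test(token) ? parseInt(token, 10) : NUMBER_WORDS[token.toLowerCase()];
}

// Convert the first time estimate found in the text to a range in days.
function parseEstimate(text: string): DayRange | null {
  const m = ESTIMATE.exec(text);
  if (m === null) return null;
  const daysPerUnit = m[3].toLowerCase() === "week" ? 7 : 1;
  const low = toNumber(m[1]) * daysPerUnit;
  const high = m[2] ? toNumber(m[2]) * daysPerUnit : low;
  return { min: low, max: high };
}

// Pass only if the whole estimate sits inside the agreed band.
function withinBand(text: string, band: DayRange): boolean {
  const estimate = parseEstimate(text);
  if (estimate === null) return false; // unparseable is a failure, not a pass
  return estimate.min >= band.min && estimate.max <= band.max;
}

const agreedBand: DayRange = { min: 10, max: 14 };
for (const reply of ["10–14 days", "about two weeks", "3 days", "six weeks"]) {
  console.log(`${reply}: ${withinBand(reply, agreedBand) ? "within band" : "outside band"}`);
}
```

The first two replies land inside the band and the last two fall outside it, matching the judgement a human reviewer would make.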

Repeat-N stability check — CoverNZ borderline claim summary, N=20:
Decision field: 20/20 “declined” → PASS (substance is stable)
Claim number present: 20/20 → PASS (required fact always there)
Review-rights line present: 18/20 → FAIL (dropped twice — a real defect)
Opening wording identical: 4/20 → IGNORED (variation is expected and fine)

The stability check is what catches the dangerous case the single run hides: an output that is correct on the run you happened to capture, and quietly inconsistent on the substance the other times.
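A sketch of a repeat-N harness that produces a report in that shape. `fakeAssistant` is a hypothetical stand-in that deliberately drops the review-rights line on two of twenty runs so the defect is reproducible here. A real harness would call the assistant (almost certainly asynchronously) in its place.

```typescript
interface AssistantOutput {
  decision: string;
  text: string;
}

// Hypothetical stand-in for the assistant: wording alternates, and the
// review-rights line is dropped on runs 6 and 14 to simulate the defect.
function fakeAssistant(run: number): AssistantOutput {
  const opening = run % 2 === 0
    ? "We have declined your claim CoverNZ-4471."
    : "Your claim CoverNZ-4471 has been declined.";
  const dropsReviewLine = run === 6 || run === 14;
  const reviewLine = dropsReviewLine ? "" : " You can ask for a review within three months.";
  return { decision: "declined", text: opening + reviewLine };
}

interface StabilityRow {
  label: string;
  passes: number;
  total: number;
  stable: boolean; // true only when every run passed
}

type InvariantChecks = { [label: string]: (o: AssistantOutput) => boolean };

// Run the generator n times and count, per invariant, how many runs held.
function repeatN(n: number, generate: (run: number) => AssistantOutput, checks: InvariantChecks): StabilityRow[] {
  const outputs: AssistantOutput[] = [];
  for (let run = 0; run < n; run++) outputs.push(generate(run));
  return Object.keys(checks).map((label) => {
    const passes = outputs.filter(checks[label]).length;
    return { label, passes, total: n, stable: passes === n };
  });
}

const CHECKS: InvariantChecks = {
  "decision is declined": (o) => o.decision === "declined",
  "claim number present": (o) => o.text.indexOf("CoverNZ-4471") !== -1,
  "review-rights line present": (o) => /\breview within three months\b/.test(o.text),
};

for (const row of repeatN(20, fakeAssistant, CHECKS)) {
  console.log(`${row.label}: ${row.passes}/${row.total} ${row.stable ? "PASS" : "FAIL"}`);
}
```

Note what is absent: there is no check on the opening wording at all. The harness only counts the invariants, so phrasing can vary freely without touching the result.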

8 Snapshot Drift

Even with semantic assertions, teams keep a captured “known-good” output — a snapshot — as a reference. Snapshots are useful, but they carry a specific trap for generative systems: snapshot drift. Because no two runs are identical, an exact snapshot comparison fails constantly, so teams get into the habit of re-recording the snapshot whenever the test goes red — “update the snapshot, it’s just wording”.

The danger is that re-recording blindly bakes in regressions. The day the model quietly starts dropping the review-rights line, the snapshot test fails, someone re-records the snapshot to make it green, and now the broken output is the new “expected”. The safety net has been silently lowered. Snapshot drift is how a generative test slowly stops testing anything.

The fix is to snapshot the invariants, not the prose. Instead of snapshotting the whole text, snapshot the structured facts you extracted from it — decision, claim number, presence of the review-rights line, output shape. Those are stable run to run, so a change is a real signal, not noise. And never let a snapshot be re-recorded automatically on red: every update must be a reviewed, deliberate decision that the new behaviour is actually correct.
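As a sketch of what "snapshot the invariants" can look like: extract a small structured record from the prose and compare records, not text. The extraction rules here are simple keyword patterns chosen for illustration, and the field names are assumptions of this example.

```typescript
interface InvariantSnapshot {
  decision: "approved" | "declined" | "unknown";
  claimNumber: string | null;
  hasReviewRights: boolean;
}

// Reduce a free-text summary to the facts that must not change.
function extractInvariants(text: string): InvariantSnapshot {
  const claim = /CoverNZ-\d+/.exec(text);
  let decision: InvariantSnapshot["decision"] = "unknown";
  if (/\bdeclined\b/i.test(text)) decision = "declined";
  else if (/\bapproved\b/i.test(text)) decision = "approved";
  return {
    decision,
    claimNumber: claim ? claim[0] : null,
    hasReviewRights: /\breview\b.*\bthree months\b/i.test(text),
  };
}

// Names of the invariants that differ from the reviewed snapshot.
function changedInvariants(expected: InvariantSnapshot, actual: InvariantSnapshot): string[] {
  const changed: string[] = [];
  if (expected.decision !== actual.decision) changed.push("decision");
  if (expected.claimNumber !== actual.claimNumber) changed.push("claimNumber");
  if (expected.hasReviewRights !== actual.hasReviewRights) changed.push("hasReviewRights");
  return changed;
}

const reviewed = extractInvariants(
  "Your claim CoverNZ-4471 has been approved. You have the right to a review within three months."
);
const paraphrase = extractInvariants(
  "We have approved your claim CoverNZ-4471. You can ask for a review within three months if you disagree."
);
const regression = extractInvariants("We have approved your claim CoverNZ-4471.");

console.log(changedInvariants(reviewed, paraphrase).length); // 0
console.log(changedInvariants(reviewed, regression).join(", ")); // hasReviewRights
```

A reworded summary produces an identical record, so the snapshot stays green. A dropped review-rights line changes the record, so the red build points at a named invariant that a reviewer has to approve or reject.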

Pro tip: If your snapshot is the full free-text output, you will re-record it so often that it stops meaning anything. Snapshot the extracted invariants instead, and treat every snapshot update as a code review, not a keystroke.

9 Common Mistakes

🚫 Asserting exact-match on generative output

Why it happens: Exact-match is how we test everything else, so it is the instinctive first assertion.
The fix: The system is non-deterministic by design, so exact-match is guaranteed flaky — it fails correct outputs for different wording. Assert on meaning: required facts present, forbidden content absent, structure valid, semantic equivalence to a reference.

🚫 Tagging a flaky AI test “known-flaky” and re-running until green

Why it happens: The test fails for no code change, so it looks like noise to be worked around.
The fix: A test that passes and fails on the same input has trained the team to ignore red, so a real regression will hide among the flakes. Fix the assertion to tolerate variation instead of suppressing the alarm.

🚫 Trusting a single run to prove the system is consistent

Why it happens: One green run feels like proof, the way it is for deterministic code.
The fix: A single run only shows the output was acceptable once. Run the input N times and check the substance is stable — a borderline decision that flips across repeats is a defect a single run will never reveal.

🚫 Re-recording a snapshot automatically whenever it goes red

Why it happens: The snapshot fails constantly on wording, so re-recording feels like routine maintenance.
The fix: Blind re-recording bakes in regressions — a dropped review-rights line becomes the new “expected”. Snapshot the extracted invariants, not the prose, and treat every snapshot update as a reviewed decision.

Senior engineer insight

The moment that changed how I think about this was watching a well-regarded QA engineer spend three days chasing a “model regression” — new deployment, sudden increase in red builds — only to discover the model was identical and the flakiness had always been there, just masked by a run-until-green CI script. We had not tested the AI system at all; we had been testing whether we could find one run that matched a string. Once you see that distinction, you cannot unsee it: the assertion is a hypothesis about what the system should guarantee, and exact-match is the wrong hypothesis for any non-deterministic output.

The most common mistake: teams invest weeks defining semantic-equivalence checks for the happy path, then leave the edge-case and error-state tests on exact-match because “those outputs do not vary much” — which is exactly when a real failure goes undetected.

10 Now You Try

Three graded exercises: spot the flaky test, fix the assertion, build the consistency-test plan. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot Why the Test Is Flaky

A fictional CoverNZ claim-summary assistant has a test that passes some nights and fails others with no code change. The test and two captured outputs are below. Explain why it is flaky (not a model bug), say whether both outputs are actually correct, and identify what should be asserted instead.

The assertion: assert(output === "Your claim CoverNZ-4471 has been approved. You have the right to a review within three months.")

Run A output: “Your claim CoverNZ-4471 has been approved. You have the right to a review within three months.”

Run B output: “Good news — we’ve approved your claim CoverNZ-4471. You can ask for a review within three months if you disagree.”

Diagnose it:

Model answer:

Why it is flaky: The model is non-deterministic by design, so the same input legitimately produces different wording each run. The assertion demands a character-for-character match against one captured string, so any run that phrases the summary differently fails — even when it is correct. It passes only on runs that happen to reproduce the exact saved wording, which is why it is red some nights and green others with no code change. The flakiness is in the assertion, not the model.

Are both correct? Yes. Both state the same decision (approved), the same claim number (CoverNZ-4471), and the same review right (within three months). They are semantically equivalent; only the phrasing differs. Run B is just as correct as Run A.

What should be asserted (invariants): the output CONTAINS the claim number CoverNZ-4471; it states the decision "approved"; it includes the review-rights notice (a review within three months). Optionally a must-not-contain check (no other claim number, no invented dollar figure) and a structure check.

What should be tolerated (variation): the exact opening phrasing, greetings like "Good news", and word order. None of that changes the meaning, so none of it should fail the test.

🔧 Exercise 2 of 3 — Rewrite the Flaky Assertion

A team “fixed” the flaky CoverNZ test with the approach below. Explain why it is still wrong (it suppresses the alarm rather than fixing the assertion), then write a proper assertion strategy using semantic-equivalence, required-fact, and must-not-contain checks, plus how you would control randomness.

Their fix: “We marked the test @flaky so it auto-retries up to 5 times and only fails the build if it fails all 5. Problem solved — the build is green again.”

Write your critique and the real fix:

Model answer:

Why auto-retry is wrong: It hides the symptom instead of fixing the cause. The assertion still demands exact-match against output that is non-deterministic, so a "pass" just means one of five tries happened to reproduce the saved string. It also masks real regressions — if the model genuinely starts dropping the review-rights line, a retry can still stumble onto a good run and turn the build green, so the broken behaviour ships. Retrying a wrong assertion makes the alarm quieter, not the system safer.
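The masking effect is easy to quantify under one simplifying assumption, namely that each attempt shows the defect independently with the same probability. With a retry-until-green rule the build is red only if every attempt fails, so the chance of a green build is one minus the failure rate raised to the number of attempts. The function name and the example rates below are illustrative.

```typescript
// Probability that a retry-until-green build ends up green, assuming each
// attempt fails independently with probability `failureRate`.
function probabilityGreenWithRetries(failureRate: number, attempts: number): number {
  if (failureRate < 0 || failureRate > 1 || attempts < 1) {
    throw new Error("failureRate must be in [0, 1] and attempts at least 1");
  }
  return 1 - Math.pow(failureRate, attempts);
}

// Illustrative defect rates: the line is dropped on 10%, 50%, or 90% of runs.
for (const failureRate of [0.1, 0.5, 0.9]) {
  const green = probabilityGreenWithRetries(failureRate, 5);
  console.log(`defect on ${failureRate * 100}% of runs, 5 attempts: green ${(green * 100).toFixed(3)}% of the time`);
}
```

Under that assumption, a defect that appears on one run in ten is almost never caught with five attempts, and even a defect present on half of all runs leaves the build green roughly 97% of the time.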

Randomness controls: Set temperature low (near zero) so the model takes the most likely path and output is more stable; fix a seed if the provider exposes one so failures are reproducible for debugging. Caveat: these reduce, not eliminate, non-determinism (output can still vary across model versions/hardware), so you still need tolerant assertions — you cannot go back to exact-match.

New assertion strategy:
- Required-fact / contains: output contains the claim number; states the correct decision ("approved"/"declined"); includes the review-rights notice (review within three months).
- Must-not-contain: no other claimant's claim number or name; no dollar figure that was not provided; no internal code.
- Semantic-equivalence: compare the summary to a reference by meaning (embedding-similarity threshold or a validated LLM-as-judge against a rubric) so correct paraphrases pass and a meaning change fails. Add a repeat-N check that the decision field is identical across 20 runs.

🏗️ Exercise 3 of 3 — Build a Consistency-Test Plan

Design a deterministic-consistency test plan for a fictional Benefits NZ benefit-letter assistant that drafts a plain-language outcome letter (decision + reasons + next steps). Cover: randomness control, the invariant assertions, a repeat-N stability check with what must be stable vs may vary, any tolerance band, and how you handle snapshots safely.

Model answer:

1. Randomness control: Run the test at low temperature (near zero) for stability and fix a seed if available so failures reproduce. Caveat: this reduces but does not remove non-determinism (model-version/hardware variation remains), so assertions must still tolerate wording differences. Where production runs warmer, test at the production temperature too.

2. Invariant assertions:
- Contains: the correct decision (granted/declined), the specific benefit named, the reason for the decision, and the next-steps/appeal-rights line.
- Must-not-contain: another person's name or client number, a dollar amount that was not supplied, internal-only codes.
- Structure: the letter has the required sections (decision, reasons, next steps) and any required reference number is well-formed.

3. Repeat-N stability: Run the same case 20 times. MUST be identical every run: the decision, the benefit named, presence of the appeal-rights line. MAY vary: greeting, sentence order, exact phrasing of the explanation. A decision that flips across the 20 runs is a defect, not noise.

4. Tolerance band: If the letter estimates a processing or payment timeframe, accept any value inside an agreed band (e.g. "10–15 working days" or "about three weeks") and fail values outside it; do not demand one exact figure.

5. Snapshot strategy: Snapshot the extracted invariants (decision, benefit, appeal-line present, structure) rather than the full prose, so a change is a real signal. Never auto-re-record on red — every snapshot update is a reviewed decision confirming the new behaviour is genuinely correct, so a dropped appeal-rights line can never silently become the new expected.

From the field

A team building a TransitNZ permit-approval assistant assumed that because they were using a structured output format (JSON with a fixed decision field), the AI outputs were effectively deterministic and could be verified with exact-match on the whole response object.

A repeat-N sweep before go-live showed otherwise. The reason field (a plain-English explanation of why a permit was declined) varied substantially across runs for borderline cases, and on 4 out of 50 runs for one specific scenario it produced a reason that, while factually defensible, contradicted the agency's standard messaging on that edge case. The structured envelope had hidden the variation.

They moved to invariant assertions on the decision and reference fields, a must-not-contain check for contradictory policy language, and a repeat-N=30 run for every borderline scenario in the test suite before each release. The lesson generalises: a structured output schema reduces the surface area for variation but does not eliminate it. The fields that hold prose are still non-deterministic, and those are often the fields that matter most to the person reading the decision.

11 Self-Check

Q1: Why is an exact-match assertion on a generative output a flaky test rather than a real one?

Because the system is non-deterministic by design — the same input legitimately produces different wording each run — so exact-match fails correct outputs whenever the phrasing differs. It passes only when a run happens to reproduce the saved string, so it goes red and green on the same input with no code change. The fault is in the assertion, not the model.

Q2: What does temperature control do, and why is it not enough on its own?

Low temperature makes the model take the most likely path almost every time, so output is far more stable. It is not enough because it reduces but does not eliminate non-determinism — output can still vary across model versions, hardware, or batching — so you still need assertions that tolerate variation rather than reverting to exact-match.

Q3: What is a semantic-equivalence assertion, and what does it assert on instead of the exact string?

It asserts the output means the right thing rather than matching an exact string. In practice you assert the invariants — required facts present, forbidden content absent, structure valid, and meaning close to a reference (embedding similarity or a validated judge) — while tolerating the wording that will always vary.

Q4: Why run an input N times, and what should be stable versus allowed to vary?

A single run only shows the output was acceptable once; a repeat-N check tests consistency. The substance must be stable across every run — the decision, the required facts, the extracted value — while phrasing and word order may vary. A decision that flips across repeats is a defect a single run never reveals.

Q5: What is snapshot drift, and how do you guard against it?

Snapshot drift is when teams re-record a snapshot every time it goes red because wording always differs, which silently bakes in regressions — a dropped review-rights line becomes the new “expected”. Guard against it by snapshotting the extracted invariants instead of the prose, and treating every snapshot update as a reviewed, deliberate decision rather than an automatic re-record.

Why teams fail here

Conflating temperature=0 with “now it's deterministic”: Engineers set temperature to zero, see stable outputs in local testing, and revert to exact-match assertions — then get bitten when a silent provider-side model update produces different output from the same call with the same seed.
Defining invariants from the happy path only: The team lists the facts that matter for a correct approval and writes contains-checks for those. Declined decisions, partial-approval edge cases, and mandatory boilerplate (appeal rights, privacy notices) never make it into the invariant set — because nobody wrote a test for a declined case first.
Treating a flaky test as a CI problem rather than an assertion problem: The solution is a longer retry budget, a quarantine label, or a “run twice and pass if either passes” rule — all of which suppress the alarm while leaving the underlying assertion broken. Real regressions then hide among the noise.
Running repeat-N at a single temperature and assuming that covers production: Teams run their stability sweep at temperature=0 to get clean results, but production runs at temperature=0.7 for more natural language. The variance the user actually experiences is never measured, because the test and production configurations differ.
Allowing snapshot re-records without a review gate: A CI script auto-updates snapshots on red so the build stays green. Within weeks, a model change that quietly drops a mandatory section (a Benefits NZ benefit detail or a TransitNZ review-rights paragraph) is silently baked in as the new expected output, and the suite is green throughout.
Using a single run to sign off a new model version: Before promoting a new underlying model, the team runs the test suite once and checks it is green. But a single pass at each test case cannot reveal the borderline decisions that flip 3 times in 20 runs — exactly the behaviour that makes a benefit or permit-approval system unsafe to deploy.

12 Interview Prep

Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.

“How do you write a stable test for a system that gives a different answer every time?”

I do two things. First I reduce the randomness I can control — run at low temperature, fix a seed if the provider exposes one — knowing that only reduces, not removes, non-determinism. Then, crucially, I stop asserting on the exact string and assert on what is invariant: the required facts are present, forbidden content is absent, the structure is valid, and the meaning matches a reference by semantic similarity or a validated judge. The exact wording will always vary, so I tolerate it; the substance must not vary, so I test it. Exact-match on generative output is a flaky test, not a real one.

“A teammate marked an AI test as known-flaky with auto-retry. What would you say?”

That it hides the problem instead of fixing it. The test fails because it asserts exact-match on non-deterministic output, so retrying just waits for a run that happens to reproduce the saved string. The real danger is that it also masks regressions — if the model genuinely starts dropping a required line, a retry can still land a good run and turn the build green, so the broken output ships. I’d rewrite the assertion to check invariants and semantic equivalence so it passes correct paraphrases and fails real meaning changes, and remove the retry so red means something again.

“How would you test that an AI assistant is consistent, not just correct once?”

With a repeat-N stability check. I run the same input 10 or 20 times and measure consistency on the dimension that matters — for a decision, the decision must be identical every run even though the wording varies; for an extraction, the extracted value must be identical. Phrasing variation is fine; substance variation is a defect. Where some numeric variation is genuinely acceptable, like an estimated timeframe, I use a tolerance band and assert the answer falls inside it rather than hitting one exact value. That catches the dangerous case a single run hides — an output that is right once and quietly inconsistent the rest of the time.

Lessons from Production

What teams consistently discover after deploying this in real systems — things that don’t appear in documentation.

The first version always uses exact-match. You inherit the flakiness that comes with it. Plan to rewrite the assertions, not just add new ones.
Temperature=0 is not the same as deterministic. Even pinned to zero, model-version updates and batching on the provider's side produce different outputs. This surprises engineers at 2am when a "deterministic" test goes red after a silent provider update.
Snapshot maintenance is never free. Someone has to review every re-record. Teams that automate re-recording discover that a broken output becomes the new "expected" within weeks.
Consistency is a separate requirement from correctness. Teams conflate them until a stable wrong answer ships and nobody notices because the test has been green every night.
The semantic-equivalence judge needs its own validation. An LLM-as-judge that disagrees with humans 20% of the time is not a trustworthy CI gate. Validate it against human ratings before you trust it.
"Known-flaky" is a cultural problem, not a technical one. Once a team accepts that some tests are allowed to be red for no reason, the threshold for taking red seriously permanently drops.

Compared to What?

Several techniques attempt to verify AI output quality. This one specifically targets non-determinism — the property that the same prompt produces different wording each run.

Technique	Best for	Weakness
Deterministic-Consistency Testing (this technique)	Non-deterministic generative systems	Requires defining invariants carefully; does not test output quality, only stability
Exact-match Assertions	Deterministic code where output is identical every run	Guaranteed flaky on generative output — fails correct paraphrases
LLM-as-Judge Evaluation	Assessing answer quality or correctness semantically	More expensive; you must validate the judge agrees with humans first
Repeat-N Stability (standalone)	Catching output variance on a single input	Misses systematic errors if all repeated runs are wrong in the same way
Golden Dataset Testing	Regression testing across a set of known input/output pairs	Brittle when outputs evolve; requires ongoing curation

Use this technique alongside — not instead of — quality evaluation. Consistency without quality is a consistent fail.

When Not to Use This

Experience is knowing when a technique is not the right tool. Skip this one when:

Truly deterministic outputs

If your AI component extracts structured data into a fixed schema and a repeat-N run confirms it produces the same JSON for the same input every time, exact-match is fine. Consistency testing beyond that adds ceremony with no benefit.

Prototype or proof-of-concept stage

Before you have defined what the invariants are, consistency testing is premature. First stabilise what "correct" means, then write stability checks.

Creative or generative tasks

A blog-post-drafting assistant where variation is the point does not need stability checks on wording. Verify tone and topic coverage instead.

Short-lived outputs

If the AI generates one-off artefacts (draft emails a human reviews before sending), consistency across runs is irrelevant — the human is the quality gate.

At Enterprise Scale

🏢 Enterprise Context

300 developers · 8,000 automated tests · 600+ AI-assisted outputs per day · Regulated sector (banking, health, government)

At this scale, flaky AI tests are not an inconvenience — they are an organisational credibility problem. When 8,000 tests run nightly and 200 of them are "known-flaky AI tests," engineers stop trusting the suite entirely. The noise drowns every real regression.

The fix is the same, but governance matters more. You need a policy: no AI test may be tagged flaky without a root-cause comment and an owner. Invariant definitions must be reviewed like code — because when a model updates silently (as managed LLM APIs do), invariants that were correct can become wrong overnight.

At enterprise scale you also run consistency checks across model versions, not just across runs of the same version. When your LLM provider updates the underlying model, your repeat-N suite is the first thing that shows whether existing behaviour is preserved or whether you have a silent regression across the fleet.
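One way to sketch a cross-version check: collect the repeat-N decisions per scenario for the current model and the candidate, collapse each set to a settled decision (or flag it as unstable), and report every scenario where the two differ. The data shapes and scenario ids below are hypothetical.

```typescript
// Maps a scenario id to the decisions observed across N repeats.
type DecisionRuns = { [scenarioId: string]: string[] };

interface VersionDiff {
  scenarioId: string;
  baseline: string;
  candidate: string;
}

// The single decision if every repeat agrees, otherwise "UNSTABLE".
function settledDecision(decisions: string[]): string {
  if (decisions.length === 0) return "UNSTABLE";
  return decisions.every((d) => d === decisions[0]) ? decisions[0] : "UNSTABLE";
}

// Scenarios whose settled decision differs between the two model versions.
function compareVersions(baseline: DecisionRuns, candidate: DecisionRuns): VersionDiff[] {
  const diffs: VersionDiff[] = [];
  for (const scenarioId of Object.keys(baseline)) {
    const before = settledDecision(baseline[scenarioId]);
    const after = settledDecision(candidate[scenarioId] || []);
    if (before !== after) diffs.push({ scenarioId, baseline: before, candidate: after });
  }
  return diffs;
}

// Hypothetical repeat-3 results for two scenarios under each model version.
const currentModel: DecisionRuns = {
  "clear-approval": ["approved", "approved", "approved"],
  "borderline-case": ["declined", "declined", "declined"],
};
const candidateModel: DecisionRuns = {
  "clear-approval": ["approved", "approved", "approved"],
  "borderline-case": ["declined", "approved", "declined"],
};

for (const diff of compareVersions(currentModel, candidateModel)) {
  console.log(`${diff.scenarioId}: ${diff.baseline} -> ${diff.candidate}`);
}
// prints "borderline-case: declined -> UNSTABLE"
```

A scenario missing from the candidate results is treated as unstable here, on the assumption that an unmeasured scenario should block promotion rather than pass silently.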

Failure Analysis

📋 Post-Mortem

The Compliance Letter That Passed Its Tests, Then Failed Its Audit

A government agency deployed an AI assistant to draft benefit-entitlement letters. The QA team built a test suite with 40 test cases, each asserting that the output contained the correct decision and claimant reference. All 40 passed. The system went live.

What happened: Six weeks later, an internal audit sampled 200 live letters and found that 23 of them had dropped the mandatory appeal-rights paragraph — the legally required notice telling recipients how to challenge a decision.
Why tests still passed: The tests asserted the decision was present and the claimant number was present. Neither checked for the appeal-rights paragraph. The model had learned to occasionally omit it when the decision was straightforward, since the training data disproportionately included "simple" decision letters that sometimes omitted boilerplate.
Root cause: Incomplete invariant definition. The team defined invariants from "what is important to us" (the decision, the reference number) rather than "what is legally required" (every mandatory paragraph in the regulatory template).
Fix: Must-contain assertions were added for every required paragraph, keyed to a checklist derived from the regulatory template — not from the team's intuitions. A repeat-N stability check at N=50 also revealed the omission rate, allowing a severity threshold to be set.
Lesson: The invariant checklist for regulated outputs must come from the compliance requirements, not the development team's sense of what is important. If you do not know all the invariants before you launch, launch in shadow mode until you do.
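A checklist-driven version of the must-contain assertions might look like the sketch below. The checklist entries and their patterns are hypothetical placeholders. In the scenario described, each entry would be derived from the regulatory template and reviewed by the compliance owner.

```typescript
interface RequiredParagraph {
  id: string;
  pattern: RegExp;
}

// Hypothetical checklist. In practice each entry would be derived from the
// regulatory template; the patterns below are illustrative only.
const CHECKLIST: RequiredParagraph[] = [
  { id: "decision", pattern: /\b(granted|declined)\b/i },
  { id: "claimant-reference", pattern: /\breference\s+\S+/i },
  { id: "appeal-rights", pattern: /\bright to (appeal|a review)\b/i },
];

// Ids of the required paragraphs a letter does not contain.
function missingParagraphs(letter: string, checklist: RequiredParagraph[]): string[] {
  return checklist
    .filter((item) => !item.pattern.test(letter))
    .map((item) => item.id);
}

// Share of sampled letters that omit the paragraph with the given id.
function omissionRate(letters: string[], checklist: RequiredParagraph[], id: string): number {
  if (letters.length === 0) throw new Error("no letters sampled");
  const omitted = letters.filter(
    (letter) => missingParagraphs(letter, checklist).indexOf(id) !== -1
  ).length;
  return omitted / letters.length;
}

const sampleLetters: string[] = [
  "Your benefit has been granted. Your reference AB-1234. You have the right to appeal this decision.",
  "Your benefit has been declined. Your reference AB-1235. You have the right to a review of this decision.",
  "Your benefit has been granted. Your reference AB-1236.",
  "Your benefit has been declined. Your reference AB-1237. You have the right to appeal this decision.",
];

console.log(missingParagraphs(sampleLetters[2], CHECKLIST).join(", ")); // appeal-rights
console.log(omissionRate(sampleLetters, CHECKLIST, "appeal-rights")); // 0.25
```

Applied to the audit sample in the post-mortem, 23 omissions in 200 letters is an omission rate of 11.5%. The point of keeping the checklist as data is that adding a newly discovered mandatory paragraph is a one-line, reviewable change.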

Why the Business Cares

Regulatory

Regulated decisions — credit, benefits, insurance — must be reproducible. A system that gives different outcomes to identical inputs cannot meet audit requirements for consistent treatment.

Customer trust

Users who notice the same question getting different answers lose confidence in the entire system, not just the inconsistent output. Consistency is the first prerequisite for trust.

Operational

Flaky AI tests corrode the credibility of the entire CI pipeline. When the team stops trusting red builds, real regressions ship. The cost is measured in production incidents, not failed tests.

Audit evidence

A test suite that passes intermittently cannot serve as audit evidence that a system behaves consistently. Compliance teams require reproducible, explainable consistency data.

Key takeaway

Your job is not to make the model output the same string twice, because you cannot rely on that ever happening. Your job is to define exactly what must be true about every output, and then test that, not the words.

You can now write assertions that hold across non-deterministic output. Human-in-the-Loop Sign-Off addresses what to do when consistency alone is not sufficient — when a decision affects a person and a human must review and approve before any downstream action is triggered, regardless of how confident the model is.
