Deterministic-Consistency Testing
Run the same prompt twice and a generative system can give two differently worded answers. Your exact-match test fails on the second run even though both answers were correct. The bug is not in the model; it is in the assertion. Testing non-deterministic systems means asserting on meaning and stability, not on exact strings.
1 The Hook
An ACC team built an assistant that drafts a plain-language summary of a claim decision for a claimant. They tested it the way they test everything else: write the prompt, capture the perfect output once, save it as the expected result, assert the next run matches it exactly. Green build. They shipped the test into the nightly pipeline and went home happy.
The next morning the build was red. The morning after, green. Then red again, twice in one day, then green for three days. Nobody had touched the code. The team did what tired teams do with a test that fails for no reason: they re-ran it until it passed, and eventually they tagged it known-flaky and stopped looking. A test that is sometimes red and sometimes green for the same input teaches the team to ignore red — and an ignored red build is worse than no test at all.
What was actually happening: the model is non-deterministic. Asked to summarise the same decision, one run wrote “Your claim has been approved” and the next wrote “We have approved your claim.” Both are correct. Both say the same thing to the claimant. But character-for-character they differ, so an exact-match assertion fails one and passes the other at random. The model was behaving perfectly. The test was broken, because it asserted on the exact words of output that was never going to be identical twice.
This is the central problem of testing generative systems. The output is non-deterministic by design, so the deterministic test you reach for instinctively — assert the string equals the saved string — is guaranteed to be flaky. The skill is not making the model deterministic. It is writing assertions that hold across the variation the model will always produce.
2 The Rule
A generative system is non-deterministic by design, so an exact-match assertion on its output is a flaky test, not a real one. Assert on meaning and on stability: pin the randomness you can, then test that the output is semantically equivalent and consistent across repeats — never that it is character-for-character identical.
3 The Analogy
Marking an essay exam with a word-for-word answer key.
Imagine marking an open-essay exam by checking each script matches one perfect essay word-for-word. Every student who wrote a correct answer in their own words would fail, because no two good essays use identical wording. The marking scheme is broken, not the students. A real essay marker uses a rubric: did the answer make the right points, in a sensible structure, without errors — regardless of the exact sentences.
Exact-match testing of a generative system is the word-for-word answer key. It fails correct outputs for not matching a string. Semantic-equivalence testing is the essay rubric: it asks “does this answer mean the right thing”, which is the only question that actually maps to whether the output is correct.
4 Why Exact-Match Goes Flaky
To test these systems you first have to understand why the instinctive assertion fails. A traditional function is deterministic: same input, same output, every time — so “assert output equals expected” is exactly right. A generative model is built to sample from many plausible continuations, so the same input legitimately produces different wording each run. Exact-match assumes a property the system does not have.
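A minimal sketch of the mismatch, using the two wordings from the ACC example. The helper names and the simple substring check are illustrative assumptions, not the team's actual test code.

```python
# Two correct outputs for the same prompt, taken from the ACC example.
RUNS = [
    "Your claim has been approved",
    "We have approved your claim",
]

SAVED_OUTPUT = "Your claim has been approved"  # the string captured once


def exact_match(output: str) -> bool:
    """The instinctive assertion: character-for-character equality."""
    return output == SAVED_OUTPUT


def states_approval(output: str) -> bool:
    """An invariant check: the decision is present, whatever the wording."""
    text = output.lower()
    return "claim" in text and "approved" in text


exact_results = [exact_match(run) for run in RUNS]          # [True, False]
invariant_results = [states_approval(run) for run in RUNS]  # [True, True]
print(exact_results, invariant_results)
```

The exact-match result depends on which wording the run happened to produce; the invariant result does not.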
The damage is not just a failing test — it is a flaky test, one that passes and fails on the same input for no code change. Flaky tests are uniquely corrosive: the team learns that red does not mean broken, so they stop trusting red. When a real regression finally turns the build red, it hides among the flakes and ships to production. The ACC team’s “known-flaky, just re-run it” tag is the exact failure mode — they had trained themselves to ignore the alarm.
The fix has two halves, and you need both. First, reduce the randomness you can control so output is as stable as the system allows. Second, assert on what is invariant — the meaning and the required facts — rather than on the wording, which will never be stable. The rest of this lesson is those two halves.
5 Temperature and Seed Control
Before you change your assertions, reduce the variation at the source. Two controls matter most.
Temperature governs how much the model samples versus picks the single most likely next token. High temperature means more varied, creative wording; low temperature (at or near zero) means the model takes the most likely path almost every time, so output is far more stable. For anything you want to test — a claim summary, a routing decision, an extraction — turn temperature down. You are not testing the model’s creativity; you are testing that it does the job consistently, and low temperature removes variation you do not need.
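The effect can be illustrated with the common softmax-with-temperature formulation. This is a sketch under assumptions: the logit values are invented for three hypothetical candidate tokens, and real providers may layer other sampling controls on top.

```python
import math


def token_probabilities(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical scores for three candidate next tokens.
LOGITS = [2.0, 1.5, 0.5]

warm = token_probabilities(LOGITS, temperature=1.0)
cold = token_probabilities(LOGITS, temperature=0.1)

# At temperature 1.0 the top token gets about 0.55 of the probability mass;
# at 0.1 it gets more than 0.99, so sampling almost always picks it.
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

Dividing by a small temperature stretches the gaps between scores, which is why low temperature behaves almost like always taking the most likely token.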
A seed, where the system exposes one, fixes the random starting point of sampling, so the same input, settings, and seed should reproduce the same output. A seed makes a run repeatable, which is invaluable for debugging: when something fails, a fixed seed lets you reproduce the failing output instead of chasing a ghost.
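A toy stand-in shows the idea. It uses Python's own random number generator rather than any model API, and the candidate wordings and the `toy_generate` name are hypothetical.

```python
import random

# Hypothetical candidate wordings the toy "model" can choose between.
WORDINGS = [
    "Your claim has been approved.",
    "We have approved your claim.",
    "Good news: your claim is approved.",
]


def toy_generate(seed=None) -> str:
    """Pick a wording at random; a fixed seed fixes the pick."""
    rng = random.Random(seed)  # seed=None draws from system entropy instead
    return rng.choice(WORDINGS)


first = toy_generate(seed=1234)
second = toy_generate(seed=1234)
print(first == second)  # same seed, same pick
```

Calling `toy_generate()` without a seed may return any of the three wordings, which is the situation a test faces when no seed is pinned.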
The honest caveat: these reduce non-determinism, they do not eliminate it. Even at temperature zero with a fixed seed, output can still vary across model versions, hardware, or batching on the provider’s side. So temperature and seed control make testing easier — they do not let you go back to exact-match and trust it. You still need assertions that tolerate variation, because variation can still appear.
6 Semantic-Equivalence Assertions
This is the core technique that replaces exact-match. Instead of asking “does the output equal this exact string?” you ask “does the output mean the right thing?” — which is what you actually care about. “Your claim has been approved” and “We have approved your claim” are semantically equivalent and both must pass.
There are several assertion styles, from cheapest to richest, and good suites mix them:
- Required-fact / contains assertions: the output must contain the claim number, the decision (“approved”), and the review-rights line. You assert the invariant facts are present, not the wording around them. Cheap, deterministic, and catches the failures that matter most.
- Must-not-contain assertions: the output must not contain a different claimant’s name, a dollar figure that was never provided, or an internal code. Negative invariants catch dangerous variation that a “contains” check misses.
- Structural assertions: valid JSON, the required fields present, a date in the right format. The shape is invariant even when the prose is not.
- Semantic-similarity / judge assertions: for free-text meaning, compare the output to a reference answer by meaning, using an embedding-similarity threshold or an LLM acting as a judge against your rubric. Richest, but you must validate that the judge agrees with humans on a sample before trusting it.
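The judge-validation step in the last bullet can be sketched as a simple agreement rate. The verdict lists and the 0.9 threshold are invented for illustration; a real suite would pick its own sample size and threshold.

```python
def agreement_rate(judge_verdicts, human_verdicts) -> float:
    """Fraction of sample items where the judge and the human reviewer agree."""
    if not human_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


# Hypothetical pass/fail verdicts on the same ten outputs.
HUMAN = [True, True, False, True, False, True, True, False, True, True]
JUDGE = [True, True, False, True, True, True, True, False, True, True]

MIN_AGREEMENT = 0.9  # assumed threshold, agreed by the team in advance

rate = agreement_rate(JUDGE, HUMAN)  # 9 of 10 verdicts match
judge_is_trusted = rate >= MIN_AGREEMENT
print(rate, judge_is_trusted)
```

The one disagreement here is the judge passing an output the human failed, which is the direction worth inspecting first because it lets defects through.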
The discipline: assert the invariants, ignore the variation. For the ACC summary, the invariants are the decision, the claim number, and the review-rights notice; the variation is the exact phrasing of the opening sentence. Test the first, tolerate the second.
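The cheaper assertion styles might be combined along these lines. This is a sketch, not a definitive implementation: the function name, the regular expressions, and the sample outputs are assumptions made for the example.

```python
import re

CLAIM_NUMBER = re.compile(r"\bACC-\d+\b")
REVIEW_RIGHTS = re.compile(r"\breview\b.*\bthree months\b", re.IGNORECASE)
DOLLAR_FIGURE = re.compile(r"\$\s?\d")


def check_summary(output: str, claim_number: str, decision: str) -> list[str]:
    """Return the names of failed invariants; an empty list means pass."""
    failures = []
    # Required facts. The decision check is a naive substring match and
    # would be fooled by "not approved"; a real suite needs something stricter.
    if claim_number not in output:
        failures.append("claim number missing")
    if decision.lower() not in output.lower():
        failures.append("decision missing")
    if not REVIEW_RIGHTS.search(output):
        failures.append("review-rights line missing")
    # Must-not-contain checks.
    if set(CLAIM_NUMBER.findall(output)) - {claim_number}:
        failures.append("foreign claim number present")
    if DOLLAR_FIGURE.search(output):
        failures.append("unexpected dollar figure")
    return failures


good_a = "Your claim ACC-4471 has been approved. You have the right to a review within three months."
good_b = "We have approved your claim ACC-4471. You can ask for a review within three months."
bad = "Your claim ACC-4471 has been approved. We will pay $250."

for text in (good_a, good_b, bad):
    print(check_summary(text, "ACC-4471", "approved"))
```

Both good outputs return an empty list despite different wording; the bad output is flagged for the missing review-rights line and the invented dollar figure.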
7 Repeat-N Stability and Tolerance Bands
A single run tells you the output was acceptable once. It does not tell you the system is consistent — and consistency is its own requirement. A claim assistant that approves a borderline case 7 times out of 10 and declines it 3 times is dangerous even if each individual answer reads fine, because the same claimant could get a different outcome on a different day.
The technique is the repeat-N stability check: run the same input N times (say 10 or 20) and measure how consistent the outputs are on the dimension that matters. For a decision, you want the decision identical every time even though the wording varies — 10/10 “approved”. For an extraction, you want the extracted value identical every time. Variation in phrasing is fine; variation in the substance is a defect.
For tasks where some numeric variation is genuinely acceptable, define a tolerance band rather than demanding an exact figure. If an assistant estimates a processing time, “10–14 days” and “about two weeks” are within tolerance; “3 days” and “six weeks” are not. The test asserts the answer falls inside the agreed band, not that it hits one exact value.
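A tolerance-band check could look like the sketch below. It assumes an earlier step has already turned the free-text estimate into a (low, high) range in days, and it assumes a band of 10 to 14 days, which the examples above imply but do not state.

```python
BAND_DAYS = (10, 14)  # assumed agreed band, in days


def in_band(estimate_days: tuple[int, int], band: tuple[int, int] = BAND_DAYS) -> bool:
    """True when the whole estimated range sits inside the agreed band."""
    low, high = estimate_days
    return band[0] <= low and high <= band[1]


# Estimates from the text, converted to day ranges by hand for this example.
ESTIMATES = {
    "10-14 days": (10, 14),
    "about two weeks": (14, 14),
    "3 days": (3, 3),
    "six weeks": (42, 42),
}

for wording, days in ESTIMATES.items():
    print(wording, in_band(days))
```

The parsing from wording to days is the hard part in practice and is deliberately left out here.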
An illustrative 20-run report for one claim summary:
Decision field: 20/20 “declined” → PASS (substance is stable)
Claim number present: 20/20 → PASS (required fact always there)
Review-rights line present: 18/20 → FAIL (dropped twice — a real defect)
Opening wording identical: 4/20 → IGNORED (variation is expected and fine)
The stability check is what catches the dangerous case the single run hides: an output that is correct on the run you happened to capture, and quietly inconsistent on the substance the other times.
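A repeat-N check can be sketched as follows. The fake model is a stand-in so the example runs without any provider; it varies its opening wording and drops the review-rights line on every tenth call. All names here are hypothetical.

```python
def make_fake_model():
    """Stand-in for a real model call, with a deliberate intermittent defect."""
    calls = {"n": 0}

    def generate(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] % 2:
            opening = "Your claim ACC-4471 has been declined."
        else:
            opening = "We have declined your claim ACC-4471."
        # Every tenth call drops the review-rights line.
        rights = "" if calls["n"] % 10 == 0 else " You can ask for a review within three months."
        return opening + rights

    return generate


def pass_counts(outputs, checks):
    """Count how many outputs satisfy each named check."""
    return {name: sum(1 for out in outputs if check(out)) for name, check in checks.items()}


CHECKS = {
    "decision is declined": lambda out: "declined" in out.lower(),
    "claim number present": lambda out: "ACC-4471" in out,
    "review-rights line present": lambda out: "review within three months" in out.lower(),
}

N = 20
generate = make_fake_model()
outputs = [generate("Summarise the decision for claim ACC-4471") for _ in range(N)]

counts = pass_counts(outputs, CHECKS)
unstable = [name for name, count in counts.items() if count < N]
print(counts)
print("FAIL:", unstable)
```

Opening wording is not checked at all, so its variation never fails the run; only the substance checks must hold on all 20 runs.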
8 Snapshot Drift
Even with semantic assertions, teams keep a captured “known-good” output — a snapshot — as a reference. Snapshots are useful, but they carry a specific trap for generative systems: snapshot drift. Because no two runs are identical, an exact snapshot comparison fails constantly, so teams get into the habit of re-recording the snapshot whenever the test goes red — “update the snapshot, it’s just wording”.
The danger is that re-recording blindly bakes in regressions. The day the model quietly starts dropping the review-rights line, the snapshot test fails, someone re-records the snapshot to make it green, and now the broken output is the new “expected”. The safety net has been silently lowered. Snapshot drift is how a generative test slowly stops testing anything.
The fix is to snapshot the invariants, not the prose. Instead of snapshotting the whole text, snapshot the structured facts you extracted from it — decision, claim number, presence of the review-rights line, output shape. Those are stable run to run, so a change is a real signal, not noise. And never let a snapshot be re-recorded automatically on red: every update must be a reviewed, deliberate decision that the new behaviour is actually correct.
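Snapshotting invariants rather than prose might look like this sketch. The field names, the extraction rules, and the JSON layout are assumptions for the example; the snapshot is loaded from an inline string here where a real suite would read a reviewed file.

```python
import json
import re


def extract_invariants(output: str) -> dict:
    """Reduce a summary to the structured facts that should not vary."""
    text = output.lower()
    claim = re.search(r"\bACC-\d+\b", output)
    if "approved" in text:
        decision = "approved"
    elif "declined" in text:
        decision = "declined"
    else:
        decision = None
    return {
        "decision": decision,
        "claim_number": claim.group(0) if claim else None,
        "review_rights_present": "review within three months" in text,
    }


def changed_invariants(output: str, snapshot: dict) -> dict:
    """Return only the fields whose value differs from the snapshot."""
    current = extract_invariants(output)
    return {
        key: {"expected": expected, "actual": current.get(key)}
        for key, expected in snapshot.items()
        if current.get(key) != expected
    }


SNAPSHOT = json.loads(
    '{"decision": "approved", "claim_number": "ACC-4471", "review_rights_present": true}'
)

reworded = "We have approved your claim ACC-4471. You can ask for a review within three months."
regressed = "We have approved your claim ACC-4471."

print(changed_invariants(reworded, SNAPSHOT))
print(changed_invariants(regressed, SNAPSHOT))
```

A reworded but correct output produces no diff, so the only time the snapshot test goes red is when a fact changed, and that is the moment a human should review it.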
9 Common Mistakes
🚫 Asserting exact-match on generative output
Why it happens: Exact-match is how we test everything else, so it is the instinctive first assertion.
The fix: The system is non-deterministic by design, so exact-match is guaranteed flaky — it fails correct outputs for different wording. Assert on meaning: required facts present, forbidden content absent, structure valid, semantic equivalence to a reference.
🚫 Tagging a flaky AI test “known-flaky” and re-running until green
Why it happens: The test fails for no code change, so it looks like noise to be worked around.
The fix: A test that passes and fails on the same input has trained the team to ignore red, so a real regression will hide among the flakes. Fix the assertion to tolerate variation instead of suppressing the alarm.
🚫 Trusting a single run to prove the system is consistent
Why it happens: One green run feels like proof, the way it is for deterministic code.
The fix: A single run only shows the output was acceptable once. Run the input N times and check the substance is stable — a borderline decision that flips across repeats is a defect a single run will never reveal.
🚫 Re-recording a snapshot automatically whenever it goes red
Why it happens: The snapshot fails constantly on wording, so re-recording feels like routine maintenance.
The fix: Blind re-recording bakes in regressions — a dropped review-rights line becomes the new “expected”. Snapshot the extracted invariants, not the prose, and treat every snapshot update as a reviewed decision.
10 Now You Try
Three graded exercises: spot the flaky test, fix the assertion, build the consistency-test plan. Write your answer, then compare it to the model answer.
A fictional ACC claim-summary assistant has a test that passes some nights and fails others with no code change. The test and two captured outputs are below. Explain why it is flaky (not a model bug), say whether both outputs are actually correct, and identify what should be asserted instead.
Run A output: “Your claim ACC-4471 has been approved. You have the right to a review within three months.”
Run B output: “Good news — we’ve approved your claim ACC-4471. You can ask for a review within three months if you disagree.”
Model answer:
Why it is flaky: The model is non-deterministic by design, so the same input legitimately produces different wording each run. The assertion demands a character-for-character match against one captured string, so any run that phrases the summary differently fails, even when it is correct. It passes only on runs that happen to reproduce the exact saved wording, which is why it is red some nights and green others with no code change. The flakiness is in the assertion, not the model.
Are both correct? Yes. Both state the same decision (approved), the same claim number (ACC-4471), and the same review right (within three months). They are semantically equivalent; only the phrasing differs. Run B is just as correct as Run A.
What should be asserted (invariants): the output CONTAINS the claim number ACC-4471; it states the decision "approved"; it includes the review-rights notice (a review within three months). Optionally a must-not-contain check (no other claim number, no invented dollar figure) and a structure check.
What should be tolerated (variation): the exact opening phrasing, greetings like "Good news", and word order. None of that changes the meaning, so none of it should fail the test.
A team “fixed” the flaky ACC test with the approach below. Explain why it is still wrong (it suppresses the alarm rather than fixing the assertion), then write a proper assertion strategy using semantic-equivalence, required-fact, and must-not-contain checks, plus how you would control randomness.
Model answer:
Why auto-retry is wrong: It hides the symptom instead of fixing the cause. The assertion still demands exact-match against output that is non-deterministic, so a "pass" just means one of five tries happened to reproduce the saved string. It also masks real regressions — if the model genuinely starts dropping the review-rights line, a retry can still stumble onto a good run and turn the build green, so the broken behaviour ships. Retrying a wrong assertion makes the alarm quieter, not the system safer.
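The masking effect is easy to quantify. The sketch below assumes runs are independent and uses an invented regression rate; the five attempts match the five tries mentioned above.

```python
def pass_probability(single_run_pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs passes."""
    return 1 - (1 - single_run_pass_rate) ** attempts


# Assumed regression: the check now fails on 40% of runs,
# so a single run passes only 60% of the time.
single_run = pass_probability(0.6, attempts=1)
with_retries = pass_probability(0.6, attempts=5)
print(f"{single_run:.2f} -> {with_retries:.4f}")
```

Under these assumptions a defect present in four runs out of ten still produces a green build about 99% of the time once five retries are allowed.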
Randomness controls: Set temperature low (near zero) so the model takes the most likely path and output is more stable; fix a seed if the provider exposes one so failures are reproducible for debugging. Caveat: these reduce, not eliminate, non-determinism (output can still vary across model versions/hardware), so you still need tolerant assertions — you cannot go back to exact-match.
New assertion strategy:
- Required-fact / contains: output contains the claim number; states the correct decision ("approved"/"declined"); includes the review-rights notice (review within three months).
- Must-not-contain: no other claimant's claim number or name; no dollar figure that was not provided; no internal code.
- Semantic-equivalence: compare the summary to a reference by meaning (embedding-similarity threshold or a validated LLM-as-judge against a rubric) so correct paraphrases pass and a meaning change fails. Add a repeat-N check that the decision field is identical across 20 runs.
Design a deterministic-consistency test plan for a fictional MSD benefit-letter assistant that drafts a plain-language outcome letter (decision + reasons + next steps). Cover: randomness control, the invariant assertions, a repeat-N stability check with what must be stable vs may vary, any tolerance band, and how you handle snapshots safely.
Model answer:
1. Randomness control: Run the test at low temperature (near zero) for stability and fix a seed if available so failures reproduce. Caveat: this reduces but does not remove non-determinism (model-version/hardware variation remains), so assertions must still tolerate wording differences. Where production runs warmer, test at the production temperature too.
2. Invariant assertions:
- Contains: the correct decision (granted/declined), the specific benefit named, the reason for the decision, and the next-steps/appeal-rights line.
- Must-not-contain: another person's name or client number, a dollar amount that was not supplied, internal-only codes.
- Structure: the letter has the required sections (decision, reasons, next steps) and any required reference number is well-formed.
3. Repeat-N stability: Run the same case 20 times. MUST be identical every run: the decision, the benefit named, presence of the appeal-rights line. MAY vary: greeting, sentence order, exact phrasing of the explanation. A decision that flips across the 20 runs is a defect, not noise.
4. Tolerance band: If the letter estimates a processing or payment timeframe, accept any value inside an agreed band (e.g. "10–15 working days" or "about three weeks") and fail values outside it; do not demand one exact figure.
5. Snapshot strategy: Snapshot the extracted invariants (decision, benefit, appeal-line present, structure) rather than the full prose, so a change is a real signal. Never auto-re-record on red: every snapshot update is a reviewed decision confirming the new behaviour is genuinely correct, so a dropped appeal-rights line can never silently become the new expected.
11 Self-Check
Q1: Why is an exact-match assertion on a generative output a flaky test rather than a real one?
Because the system is non-deterministic by design — the same input legitimately produces different wording each run — so exact-match fails correct outputs whenever the phrasing differs. It passes only when a run happens to reproduce the saved string, so it goes red and green on the same input with no code change. The fault is in the assertion, not the model.
Q2: What does temperature control do, and why is it not enough on its own?
Low temperature makes the model take the most likely path almost every time, so output is far more stable. It is not enough because it reduces but does not eliminate non-determinism — output can still vary across model versions, hardware, or batching — so you still need assertions that tolerate variation rather than reverting to exact-match.
Q3: What is a semantic-equivalence assertion, and what does it assert on instead of the exact string?
It asserts the output means the right thing rather than matching an exact string. In practice you assert the invariants — required facts present, forbidden content absent, structure valid, and meaning close to a reference (embedding similarity or a validated judge) — while tolerating the wording that will always vary.
Q4: Why run an input N times, and what should be stable versus allowed to vary?
A single run only shows the output was acceptable once; a repeat-N check tests consistency. The substance must be stable across every run — the decision, the required facts, the extracted value — while phrasing and word order may vary. A decision that flips across repeats is a defect a single run never reveals.
Q5: What is snapshot drift, and how do you guard against it?
Snapshot drift is when teams re-record a snapshot every time it goes red because wording always differs, which silently bakes in regressions — a dropped review-rights line becomes the new “expected”. Guard against it by snapshotting the extracted invariants instead of the prose, and treating every snapshot update as a reviewed, deliberate decision rather than an automatic re-record.
12 Interview Prep
Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.
“How do you write a stable test for a system that gives a different answer every time?”
I do two things. First I reduce the randomness I can control — run at low temperature, fix a seed if the provider exposes one — knowing that only reduces, not removes, non-determinism. Then, crucially, I stop asserting on the exact string and assert on what is invariant: the required facts are present, forbidden content is absent, the structure is valid, and the meaning matches a reference by semantic similarity or a validated judge. The exact wording will always vary, so I tolerate it; the substance must not vary, so I test it. Exact-match on generative output is a flaky test, not a real one.
“A teammate marked an AI test as known-flaky with auto-retry. What would you say?”
That it hides the problem instead of fixing it. The test fails because it asserts exact-match on non-deterministic output, so retrying just waits for a run that happens to reproduce the saved string. The real danger is that it also masks regressions — if the model genuinely starts dropping a required line, a retry can still land a good run and turn the build green, so the broken output ships. I’d rewrite the assertion to check invariants and semantic equivalence so it passes correct paraphrases and fails real meaning changes, and remove the retry so red means something again.
“How would you test that an AI assistant is consistent, not just correct once?”
With a repeat-N stability check. I run the same input 10 or 20 times and measure consistency on the dimension that matters — for a decision, the decision must be identical every run even though the wording varies; for an extraction, the extracted value must be identical. Phrasing variation is fine; substance variation is a defect. Where some numeric variation is genuinely acceptable, like an estimated timeframe, I use a tolerance band and assert the answer falls inside it rather than hitting one exact value. That catches the dangerous case a single run hides — an output that is right once and quietly inconsistent the rest of the time.