Test with AI · AI Evaluation

Model Benchmarking

A vendor benchmark says a model scores 92% on a famous public test. Your council triage assistant then misroutes a third of real ratepayer messages. The benchmark was not lying — it was answering a different question than the one you needed answered. Model benchmarking is how you measure the model on your task, not someone else’s.

AI Testing Engineer · Lesson 4 of 8 · ~30 min read · ~75 min with exercises

1 The Hook

A regional council was choosing a model to power a triage assistant: ratepayers send a message — a pothole, a noise complaint, a rates query — and the assistant routes it to the right team. The procurement team did the sensible-looking thing. They pulled the public leaderboard, found the model with the highest headline score, and signed off on it. It topped a famous reasoning benchmark, so surely it would top this.

In the first fortnight of the pilot, roughly a third of messages landed on the wrong desk. Noise complaints went to roading. A burst water main sat in the general-enquiries queue for two days because it was filed as “feedback”. The model that aced the public benchmark was middling at the one task the council actually had: reading a short, messy, te-reo-sprinkled message from a real resident and picking one of nine local team categories.

A tester on the project ran a quiet experiment. She collected 120 real, anonymised messages by hand, tagged each with the correct team, and ran three candidate models against that set, including a cheaper, lower-leaderboard model the procurement team had dismissed. The cheaper model routed 91% correctly. The leaderboard champion managed 78%. The benchmark everyone trusted had measured general reasoning; her eval set measured the actual job.

This is the lesson of model benchmarking. A score is only meaningful against a task. The public number told the council how a model does on someone else’s exam. It told them nothing about routing Palmerston North’s mail. The benchmark that matters is the one you build from your own task.

2 The Rule

A leaderboard tells you how a model does on someone else’s task. It cannot tell you how it does on yours. Benchmark every model and prompt against a task-specific eval set built from your own data, scored with a rubric you wrote, and decide on the full picture — accuracy, latency, and cost together, never accuracy alone.

⚠️ Common Misconception

The common assumption: the model with the highest benchmark score is the best choice for your task.

Benchmark scores measure performance on the benchmark. If the benchmark does not represent your task, the score does not predict performance on your task. Published benchmarks are also often contaminated: the models being evaluated may have been trained on data that includes the benchmark itself, making high scores partly an artefact of memorisation rather than capability. The right use of a published benchmark is to shortlist candidates for further evaluation, not to make the final selection. The right benchmark for selection is one you build yourself, from your own task data, with your own scoring criteria. Even then, it measures only what you put in it.

3 The Analogy


Hiring a chef on their cooking-show trophy instead of a tasting against your own menu.

A chef wins a national competition and the trophy is real — they can cook. But your restaurant serves a fixed menu to a particular crowd: fast, consistent, to a local taste. The smart move is not to read the trophy; it is to put your actual menu in front of the candidates and taste what they produce, against the standard your kitchen has to hit, at the speed your service demands, at a food cost you can afford.

A model leaderboard is the trophy. Your eval set is the tasting. The trophy says “this chef is good at something”. The tasting says “this chef is good at my dishes, at my pace, at my price” — which is the only question that decides who you hire.

4 Your Own Eval, Not a Leaderboard

Public leaderboards rank models on standard academic tests — broad reasoning, general knowledge, coding puzzles. They are useful for one thing: a rough shortlist of which models are even in the running. They are useless for the decision that follows, because your task is not on the leaderboard.

There are three reasons a leaderboard cannot make your call. First, task mismatch: a model strong at maths proofs may be weak at classifying a terse council message into one of nine local categories. Second, data mismatch: public benchmarks are written in clean, formal English; your inputs are messy, abbreviated, bilingual, and full of local context a global test never saw. Third, contamination: popular benchmarks leak into training data, so a high score can reflect memorisation rather than capability.

The fix is a task-specific eval set built from your own data: real inputs, the correct outputs you expect, and a scoring method that reflects what “good” means for your task. The council tester’s 120 tagged messages are exactly this. It does not need to be large to be decisive — 100 to 300 representative, correctly-labelled items will separate candidate models far better than any public score.
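
How much can a set of this size be trusted? One way to check is to put an interval around each accuracy score. The sketch below uses the Wilson score interval, a standard method for proportions. The function name and the 95% level are assumptions for illustration, and the counts are rounded from the 91% and 78% in the council example.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Counts rounded from the council example: 91% and 78% of 120 messages.
for name, correct in [("cheaper model", 109), ("leaderboard champion", 94)]:
    low, high = wilson_interval(correct, 120)
    print(f"{name}: {correct / 120:.1%} (interval {low:.1%} to {high:.1%})")
```

On a set of this size each interval is several points wide, so a gap of two or three points between candidates is within the noise, while a gap of thirteen points is much harder to attribute to chance. Because every candidate sees the same items, a paired comparison would be stronger again.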

Pro tip: Use leaderboards to pick which three models to test, never to pick the winner. The leaderboard makes the shortlist; your eval set makes the decision.

5 Golden Datasets and Scoring Rubrics

A benchmark is only as trustworthy as its answer key. The answer key is your golden dataset: a set of inputs each paired with the known-correct output, agreed by someone who actually knows the domain. For the council, the golden dataset is the 120 messages with their human-verified correct team. Get the golden answers wrong and every score after that is measuring against a lie.

How you score depends on the task, and there are two broad shapes:

Closed tasks (classification, routing, extraction) have one right answer, so you score with exact-match metrics — accuracy, precision and recall per category. The council routing task is closed: the message goes to the right team or it does not.
Open tasks (summarising, drafting a reply, explaining) have many acceptable answers, so exact-match is useless. You score these with a rubric: a written scale defining what a 1, 3, and 5 look like on each quality you care about — correctness, completeness, tone, safety.
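
For a closed task, the exact-match metrics above can be computed in a few lines. This is a minimal sketch: the function names and the six toy messages are made up for illustration, and a category with no predictions (or no true items) is scored 0.0 by convention here.

```python
from collections import Counter

def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Share of items where the predicted team equals the golden team."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)

def per_category_metrics(expected: list[str], predicted: list[str]) -> dict[str, dict[str, float]]:
    """Precision and recall for every team seen in either list."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    true_count = Counter(expected)
    pred_count = Counter(predicted)
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    metrics = {}
    for team in sorted(set(expected) | set(predicted)):
        precision = hits[team] / pred_count[team] if pred_count[team] else 0.0
        recall = hits[team] / true_count[team] if true_count[team] else 0.0
        metrics[team] = {"precision": precision, "recall": recall}
    return metrics

# Toy data, not real council results.
expected = ["roading", "noise", "noise", "water", "water", "rates"]
predicted = ["roading", "roading", "noise", "water", "general", "rates"]

print(f"accuracy: {accuracy(expected, predicted):.2f}")
for team, m in per_category_metrics(expected, predicted).items():
    print(f"{team}: precision {m['precision']:.2f}, recall {m['recall']:.2f}")
```

The per-team breakdown is what exposes a team that is routinely misrouted even when overall accuracy looks acceptable.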

A rubric turns “the answer felt good” into a repeatable judgement. Write down, before scoring, what each level means: “5 = every fact correct and the resident’s next step is clear; 3 = correct but vague on next steps; 1 = a factual error or wrong tone.” Two people scoring the same answer with the same rubric should land within a point. If they cannot, the rubric is too loose — tighten it before you trust any number it produces.
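
The "within a point" check can be made mechanical. The sketch below assumes two scorers have rated the same answers on a 1 to 5 rubric; the function names and the eight example scores are hypothetical.

```python
def within_one_point_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Share of answers where two scorers differ by at most one rubric point."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= 1)
    return close / len(scores_a)

def disagreements(scores_a: list[int], scores_b: list[int]) -> list[int]:
    """Indexes of answers where the scorers are more than one point apart."""
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b)) if abs(a - b) > 1]

# Hypothetical scores from two people using the same rubric.
rater_1 = [5, 3, 4, 1, 3, 5, 2, 4]
rater_2 = [4, 3, 4, 3, 3, 5, 2, 2]

print(f"within one point: {within_one_point_rate(rater_1, rater_2):.0%}")
print(f"answers to discuss: {disagreements(rater_1, rater_2)}")
```

The answers listed as disagreements are the useful output: reading them together usually shows which rubric level is ambiguous. The same comparison can be reused to check an automated judge against human scores.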

Pro tip: For open tasks at scale, an LLM can apply your rubric as a judge, but only after you have checked that it agrees with human scores on a sample. An unchecked AI judge just moves the trust problem somewhere you can no longer see it.

6 A/B Comparison of Models and Prompts

Once you have a golden dataset and a rubric, benchmarking becomes a controlled comparison. You run each candidate against the same eval set and compare the scores. The discipline that makes this valid is the one testers already know: change one thing at a time.

There are two distinct comparisons, and teams routinely confuse them:

Model A/B: same prompt, same eval set, different model. This isolates the model’s contribution — the council’s cheaper model vs the leaderboard champion on identical inputs and instructions.
Prompt A/B: same model, same eval set, different prompt. This isolates the prompt’s contribution — often a better prompt closes most of the gap between two models for a fraction of the cost.

If you change the model and the prompt at once and the score moves, you have no idea which change caused it. Run them as separate experiments. And keep the eval set fixed across every run — the moment you tweak the test set between candidates, you are no longer comparing models, you are comparing tests.
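
A small harness can enforce the one-variable rule by construction. In this sketch the two "models" are keyword stubs standing in for real model API calls, and every name (Router, run_eval, model_ab, prompt_ab, the stub functions) is an assumption for illustration.

```python
from typing import Callable, Sequence

# A router takes (prompt, message) and returns a team name. In a real run it
# would wrap a model API call; here it is a hypothetical stand-in.
Router = Callable[[str, str], str]
EvalSet = Sequence[tuple[str, str]]  # (message, correct team)

def run_eval(route: Router, prompt: str, eval_set: EvalSet) -> float:
    """Routing accuracy of one model and prompt pair on a fixed eval set."""
    correct = sum(1 for message, team in eval_set if route(prompt, message) == team)
    return correct / len(eval_set)

def model_ab(models: dict[str, Router], prompt: str, eval_set: EvalSet) -> dict[str, float]:
    """Same prompt, same eval set, different model."""
    return {name: run_eval(route, prompt, eval_set) for name, route in models.items()}

def prompt_ab(route: Router, prompts: dict[str, str], eval_set: EvalSet) -> dict[str, float]:
    """Same model, same eval set, different prompt."""
    return {name: run_eval(route, prompt, eval_set) for name, prompt in prompts.items()}

# A tuple, so the eval set cannot be edited between candidates.
EVAL_SET = (
    ("Pothole on Main St outside the dairy", "roading"),
    ("Neighbours playing loud music till 2am", "noise"),
    ("Water gushing from a burst pipe on the berm", "water"),
    ("Why did my rates bill go up?", "rates"),
)
KEYWORDS = {"pothole": "roading", "music": "noise", "pipe": "water", "rates": "rates"}

def stub_model_a(prompt: str, message: str) -> str:
    return "roading" if "pothole" in message.lower() else "general"

def stub_model_b(prompt: str, message: str) -> str:
    text = message.lower()
    for keyword, team in KEYWORDS.items():
        if keyword in text:
            return team
    return "general"

PROMPT = "Route this message to one council team."
print(model_ab({"A": stub_model_a, "B": stub_model_b}, PROMPT, EVAL_SET))
```

Because each function accepts only one varying argument, it is not possible to change the model and the prompt in the same call.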

Pro tip: Before paying for a bigger model, run a prompt A/B on the cheaper one. A clearer prompt often beats a more expensive model on a narrow task — the council’s routing job is exactly the kind that rewards a tight prompt over raw model size.

From the field

A central government agency in Wellington was running a NZISM-aligned procurement process for an AI model to process OIA requests: triaging complexity, redaction scope, and likely response time. The team assumed the model with the highest accuracy on their 200-item golden dataset was the obvious pick; it had scored 94% versus 88% for the runner-up. When they finally benchmarked latency and per-request cost at production volume (roughly 4,000 OIA requests per month), the 94%-accurate model would have cost six times more annually than the 88% model. At that volume the six-point accuracy difference translated to about 240 additional misclassifications per month, roughly eight a day, which could be caught in human review. They shipped the cheaper model. The lesson that generalises: always calculate the cost of your accuracy gap in real-world terms before paying the premium for the higher-scoring model.

7 Accuracy vs Latency vs Cost

Accuracy is one axis of three, and the highest-accuracy model is often the wrong choice. Every benchmark should report three numbers per candidate, not one: quality (accuracy or rubric score), latency (how long a response takes), and cost (price per request, multiplied by your real volume).
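
The three numbers can be produced from the same benchmark run. This is a sketch under stated assumptions: the function name, the sample latencies, the price per request, and the monthly volume are all invented for illustration, and the 95th-percentile latency is an extra figure that the text above does not ask for.

```python
import statistics

def summarise_candidate(
    name: str,
    correct_flags: list[bool],
    latencies_s: list[float],
    price_per_request: float,
    monthly_requests: int,
) -> dict[str, object]:
    """Collapse one benchmark run into quality, latency, and cost figures."""
    return {
        "model": name,
        "accuracy": sum(correct_flags) / len(correct_flags),
        "median_latency_s": statistics.median(latencies_s),
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        "p95_latency_s": statistics.quantiles(latencies_s, n=20, method="inclusive")[-1],
        "monthly_cost": price_per_request * monthly_requests,
    }

# Hypothetical run: 10 items, 5 timed calls, made-up price and volume.
summary = summarise_candidate(
    name="candidate-x",
    correct_flags=[True] * 9 + [False],
    latencies_s=[0.8, 0.9, 1.0, 1.1, 2.0],
    price_per_request=0.002,
    monthly_requests=50_000,
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Multiplying price by real volume matters because a per-request price that looks negligible can dominate the decision at production scale.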

These trade against each other in ways a single score hides. The council’s leaderboard champion was the most accurate on open-ended explanations — but it was three times the cost and twice the latency of the cheaper model that won the routing task. For a high-volume routing job where a one-second response keeps the queue moving, a model that is two points more accurate but twice as slow and three times the price is the wrong answer, not the right one.

Benchmark scorecard — council triage, 120-message eval set:
Model A (leaderboard champion): routing 78%, latency 2.1s, cost 3.0×
Model B (cheaper, shortlisted): routing 91%, latency 0.9s, cost 1.0×
Model C (mid-tier): routing 86%, latency 1.4s, cost 1.7×
The headline score would have picked A. The full scorecard picks B on every axis.

The right model is the cheapest, fastest one that clears your quality bar — the minimum acceptable score for the task’s risk. Set that bar first, from the consequence of a wrong answer, then choose the most efficient candidate above it. Chasing the top score regardless of cost and speed is how teams ship a slow, expensive assistant that is two points better at a job nobody complained about.
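
The selection rule can be written down directly. The sketch below uses the three rows of the council scorecard; the 0.88 bar is an illustrative value, and ranking by cost first and latency second is an assumption about how to break ties between "cheapest" and "fastest".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float       # routing accuracy on the eval set
    latency_s: float      # seconds per response
    relative_cost: float  # cost as a multiple of the cheapest model

SCORECARD = [
    Candidate("Model A", 0.78, 2.1, 3.0),
    Candidate("Model B", 0.91, 0.9, 1.0),
    Candidate("Model C", 0.86, 1.4, 1.7),
]

def pick_model(candidates: list[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Return the cheapest (then fastest) candidate at or above the bar, if any."""
    eligible = [c for c in candidates if c.accuracy >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.relative_cost, c.latency_s))

choice = pick_model(SCORECARD, quality_bar=0.88)
print(choice.name if choice else "no candidate clears the bar")
```

Returning None when nothing clears the bar is deliberate: in that case the answer is to improve the prompt or widen the shortlist, not to lower the bar after the fact.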

8 Regression Gating on Model Upgrades

The benchmark is not a one-off for procurement — it becomes a permanent gate. Generative AI models change underneath you. A provider pushes a new version, deprecates the one you tested, or silently retunes it. The model that passed last quarter is not guaranteed to be the model answering today, and an “upgrade” can quietly regress on your specific task even as its public scores rise.

This is why your eval set earns its keep long after go-live. Regression gating means: before any model or prompt change reaches production, you re-run the full eval set and compare against the current baseline. If the new version drops below the quality bar — or regresses on any category that matters — the change does not ship, however shiny the release notes. The council would run their 120 messages against any proposed model swap and block it if routing accuracy fell below, say, 88%.

Treat it exactly like a regression test suite in traditional software, because that is what it is. The eval set is your AI regression pack; the quality bar is your pass condition; the gate is automated in the deployment pipeline so no one can wave through a “minor” model bump that breaks the one task you ship.
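
As a pass condition, the gate could look something like the sketch below. The function name, the score dictionaries, the "overall" key, and the tolerance parameter are assumptions for illustration; the numbers are invented to show an upgrade that improves overall but regresses on a critical team.

```python
def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    quality_bar: float,
    critical: set[str],
    tolerance: float = 0.0,
) -> list[str]:
    """Return the reasons to block a change; an empty list means it may ship."""
    reasons = []
    if candidate["overall"] < quality_bar:
        reasons.append(
            f"overall accuracy {candidate['overall']:.2f} is below the bar {quality_bar:.2f}"
        )
    for team in sorted(critical):
        if candidate[team] < baseline[team] - tolerance:
            reasons.append(
                f"{team} regressed from {baseline[team]:.2f} to {candidate[team]:.2f}"
            )
    return reasons

# Hypothetical scores for the current production model and a proposed upgrade.
baseline = {"overall": 0.91, "water": 0.95, "roading": 0.90}
upgrade = {"overall": 0.92, "water": 0.85, "roading": 0.93}

blockers = gate(baseline, upgrade, quality_bar=0.88, critical={"water"})
print("BLOCK" if blockers else "PASS", blockers)
```

In a deployment pipeline, a non-empty list would translate into a failed step, in the same way a failing regression test does.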

Pro tip: Record the model version, prompt version, and full scorecard with every benchmark run. When something regresses three months later, that history is the difference between “the model changed on this date” and a week of guessing.
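
One lightweight way to keep that history is a line of JSON per run, appended to a log file. The field names and example values below are hypothetical, including the version labels and the date.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class BenchmarkRun:
    run_date: str
    model_version: str
    prompt_version: str
    eval_set_version: str
    accuracy: float
    median_latency_s: float
    relative_cost: float

def to_log_line(run: BenchmarkRun) -> str:
    """Serialise one run as a single JSON line with stable key order."""
    return json.dumps(asdict(run), sort_keys=True)

# Hypothetical values for illustration only.
run = BenchmarkRun(
    run_date="2025-01-15",
    model_version="model-b-v2",
    prompt_version="routing-prompt-v3",
    eval_set_version="council-120-v1",
    accuracy=0.91,
    median_latency_s=0.9,
    relative_cost=1.0,
)
print(to_log_line(run))
```

Recording the eval set version alongside the model and prompt versions is an addition here; it guards against comparing scores from two different test sets.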

LLM Evaluation Pipeline

A model selection decision is only as credible as the pipeline that produced it. Each stage catches a different failure class: dataset construction catches representation gaps, scoring catches average-case performance gaps, and the deployment gate catches regressions against the current production baseline.

[Diagram: Golden Dataset (task-specific) → Model A and Model B → Scoring (accuracy · latency · cost) → Leaderboard (ranked comparison) → Deployment Gate (vs production baseline) → Deploy ✓ or Rollback ✕]

Both models run against the same golden dataset so results are comparable. The gate compares the winner against the current production model — a candidate that scores highest in isolation but regresses on a specific task category should not deploy.

9 Common Mistakes

🚫 Choosing a model on its public leaderboard score

Why it happens: A high headline number looks like proof of quality and saves the work of building an eval set.
The fix: A leaderboard measures someone else’s task on clean academic data, and popular benchmarks leak into training. Use it only to shortlist; decide with a task-specific eval set built from your own data.

🚫 Reporting accuracy alone and ignoring latency and cost

Why it happens: Accuracy is the easy number to quote and feels like the only thing that matters.
The fix: The most accurate model is often too slow or too expensive for the real volume. Report quality, latency, and cost together, set a quality bar from the task’s risk, and pick the most efficient candidate above it.

🚫 Changing the model and the prompt at the same time

Why it happens: It feels efficient to improve everything in one pass.
The fix: If both change and the score moves, you cannot tell which change caused it. Run model A/B and prompt A/B as separate experiments, one variable at a time, against a fixed eval set.

🚫 Benchmarking once for procurement, then never again

Why it happens: The model was chosen, the project moved on, and the eval set was filed away.
The fix: Providers update and deprecate models, and an upgrade can regress on your task. Keep the eval set as a regression gate and re-run it before any model or prompt change reaches production.

Senior engineer insight

The turning point for me was realising that a benchmark is a contract between you and the task — not a report card for the model. When we benchmarked a document-classification system for a government agency, three models were within two percentage points of each other on our golden set. The decision came down entirely to latency under concurrent load, which we almost didn’t measure at all. The model that “won” on accuracy was third on every production-relevant metric.

What changed my thinking: I now run the latency and cost benchmarks before the accuracy benchmark, so I never fall in love with a model that I’ll have to talk the team out of on cost grounds later.

The most common mistake: teams spend weeks curating a beautiful golden dataset, then score accuracy only — and end up with a technically rigorous argument for the wrong model.

10 Now You Try

Three graded exercises: spot the flawed selection, fix the rubric, build the benchmark plan. Write your answer, then compare it to the model answer.

🔍 Exercise 1 of 3 — Spot the Flawed Model Selection

A fictional Revenue NZ chatbot team posts the rationale below for picking its model. Identify what is wrong with this selection process, name the specific benchmarking errors, and say what the team should have done instead.

Team rationale: “We chose Model X because it ranks #1 on a public reasoning leaderboard with 94%. We did not build a separate test set — the leaderboard already proves it is the best model. It is also our most accurate option in general, so even though it is the slowest and most expensive of the three we tried, accuracy wins. We will lock it in now and not re-test, since the model is already the best available.”

Diagnose it:

Model answer:

Errors: (1) Deciding on a leaderboard. A public reasoning score measures a different task on clean academic data, and popular benchmarks leak into training, so the 94% may reflect memorisation, not capability on Revenue NZ's actual job. (2) No task-specific eval set, so the team has zero evidence about the one task that matters — answering real Revenue NZ questions. (3) Accuracy-only thinking. Picking the slowest, most expensive model because it is "most accurate in general" ignores that for a high-volume help chatbot, latency and cost are first-class constraints, and the accuracy advantage may not even hold on Revenue NZ's task. (4) No re-testing/regression gate, so a future model update could silently regress and no one would know.

What they should have done: Use the leaderboard only to shortlist three candidates. Build a golden dataset of real anonymised Revenue NZ questions with verified correct answers, write a scoring rubric (or exact-match for closed questions). Run a model A/B with the same prompt across all three, report quality, latency, and cost per candidate, set a quality bar from the risk of a wrong tax answer, and pick the cheapest, fastest model above the bar. Then keep the eval set as a regression gate and re-run it before any model or prompt change ships.

🔧 Exercise 2 of 3 — Fix a Weak Scoring Rubric

A team benchmarking a fictional council triage assistant’s plain-language reply quality wrote the rubric below. Explain why it will produce unreliable, non-repeatable scores, then rewrite it as a proper rubric with defined levels and the qualities that matter.

Their rubric: “Score each reply out of 10 based on how good it is. Higher is better. Use your judgement.”

Write your critique and the rewritten rubric:

Model answer:

Why it is unreliable: "How good it is, use your judgement" defines nothing, so two scorers will disagree wildly and the same scorer will drift over a long session. A 10-point scale with no anchors invents false precision — nobody can say what separates a 6 from a 7. The scores are not repeatable, so they cannot compare models or gate a release.

Qualities to measure (for a council reply): factual correctness, completeness (does it give the resident's next step), tone/clarity in plain language, and safety (no invented policy or wrong contact).

Rewritten rubric (a tight 1–5 with anchors):
  5 = Every fact correct, the resident's next step is clear and actionable, tone is plain and respectful.
  3 = Facts correct but vague on the next step, or tone slightly off; usable but not great.
  1 = A factual error, an invented policy/contact, or a tone that would upset a resident.
(2 and 4 are the half-steps between these anchors.) Score each quality separately, then combine, so a safety failure cannot be hidden by good tone.

Checking tightness: have two people independently score the same 20 replies. If they agree within one point on most, the rubric is tight enough; if not, sharpen the anchors and re-test before trusting any benchmark number.
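
The "score each quality separately, then combine" step can be sketched as follows. The capping rule is one possible way to stop a safety failure being averaged away, not the only one; the function name and the safety floor of 3 are assumptions.

```python
QUALITIES = ("correctness", "completeness", "tone", "safety")

def combine_scores(scores: dict[str, int], safety_floor: int = 3) -> float:
    """Average the four 1-5 scores, capped at the safety score if safety fails."""
    missing = [q for q in QUALITIES if q not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    mean = sum(scores[q] for q in QUALITIES) / len(QUALITIES)
    if scores["safety"] < safety_floor:
        # A reply that invents policy cannot be rescued by good tone.
        return float(min(mean, scores["safety"]))
    return mean

polished_but_unsafe = {"correctness": 5, "completeness": 5, "tone": 5, "safety": 1}
solid_reply = {"correctness": 4, "completeness": 3, "tone": 5, "safety": 4}

print(combine_scores(polished_but_unsafe))
print(combine_scores(solid_reply))
```

A plain average would give the first reply 4.0; with the cap it scores 1.0, which matches how a resident would experience an invented contact or policy.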

🏗️ Exercise 3 of 3 — Build a Benchmark Plan

Design a model-benchmarking plan for a fictional council triage assistant that routes ratepayer messages to one of nine teams. Cover: the golden dataset, the scoring method, the A/B design, the three trade-off axes with a quality bar, and the regression gate. Be specific and NZ-appropriate.

Model answer:

1. Golden dataset: 150–250 real anonymised ratepayer messages (potholes, noise, rates, water, consents, te-reo-sprinkled and abbreviated ones included), each tagged with the human-verified correct team. Cover all nine teams and the messy edge cases, not just the tidy ones.

2. Scoring method: Routing is a CLOSED task — one correct team — so score with exact-match accuracy, plus per-team precision and recall so a team that is always misrouted shows up. (If you also benchmark the plain-language reply, that part needs a 1–5 rubric.)

3. A/B design: Fix the eval set and the prompt; vary only the model (model A/B) across the three shortlisted candidates. Separately, fix the model and vary the prompt (prompt A/B) to see if a tighter prompt closes the gap cheaply. Never change model and prompt together.

4. Trade-off axes + quality bar: Report routing accuracy, median latency, and cost-per-message × monthly volume for each candidate. Set the quality bar from the consequence of a misroute (e.g. a burst main sitting in the wrong queue) — say 88% routing accuracy minimum — then pick the cheapest, fastest model above the bar, not the single most accurate one.

5. Regression gate: Re-run the full eval set before any model or prompt change ships; block the change if accuracy falls below 88% or any critical team (e.g. water/emergency) regresses. Record model version, prompt version, date, and the full scorecard for every run so a future regression can be traced to the exact change.

Why teams fail here

Treating the golden dataset as done once it passes a spot-check. If the domain expert who labelled the data made systematic errors — miscategorised edge cases, applied ambiguous rules inconsistently — every number produced by the benchmark is wrong in the same direction. Teams rarely re-validate the golden set after the first pass.
Running model A/B and prompt A/B in the same experiment. A score improvement that could be the model, the prompt, or their interaction is not an improvement you can act on — you don’t know what to keep when something changes.
Using an LLM as a judge without calibrating it first. An AI-graded rubric that was never checked against human scores on a representative sample is just moving the uncertainty somewhere less visible. The judge model can have its own biases, including preferring outputs that resemble its own style.
Benchmarking the average and ignoring the tail. A model that scores 91% overall but routes emergency-maintenance requests correctly only 65% of the time is not a 91%-accurate model for operational purposes. Aggregate scores hide the categories where failure has the highest consequence.
Letting the eval set drift out of sync with production inputs. A benchmark built from data sampled 18 months ago measures the model on a distribution that may no longer exist. If your input mix has shifted — new request types, seasonal patterns, language changes — the benchmark is measuring the past, not the present.
Skipping the regression gate for “minor” model version bumps. Provider release notes describe improvements and rarely mention regressions. Every model version change, however minor, must pass the full eval set before reaching production, because regressions on narrow tasks are exactly what release notes tend to leave out.

11 Self-Check


Q1: Why can a public leaderboard not decide which model to ship for your task?

Because it measures a different task on clean academic data, your real inputs are messier and more local, and popular benchmarks leak into training so a high score can be memorisation. A leaderboard is only good for shortlisting candidates; a task-specific eval set built from your own data makes the actual decision.

Q2: What is a golden dataset, and why does its quality cap the quality of everything that follows?

A golden dataset is a set of inputs each paired with the known-correct output, verified by someone who knows the domain. It is the answer key. If the golden answers are wrong, every benchmark score is measured against a lie, so no later comparison can be trusted — the answer key’s accuracy is the ceiling on the benchmark’s accuracy.

Q3: When do you score with exact-match metrics, and when do you need a rubric?

Closed tasks — classification, routing, extraction — have one right answer, so you use exact-match metrics like accuracy, precision, and recall. Open tasks — summarising, drafting, explaining — have many acceptable answers, so exact-match fails and you score with a written rubric defining what each quality level looks like.

Q4: Why is reporting accuracy alone a mistake, and what should a benchmark report instead?

The most accurate model can be too slow or too expensive for the real volume, so accuracy alone can pick the wrong model. A benchmark should report quality, latency, and cost together; set a quality bar from the task’s risk; then choose the cheapest, fastest candidate that clears the bar.

Q5: What is regression gating, and why does an eval set still matter after go-live?

Regression gating means re-running the full eval set before any model or prompt change reaches production and blocking the change if it drops below the quality bar. It still matters because providers update and deprecate models, and an “upgrade” can silently regress on your specific task even as its public scores rise.

12 Interview Prep

Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.

“How would you choose between two models for a production feature?”

I would not choose on a leaderboard — that measures someone else’s task on clean data and can be contaminated by training leakage. I’d build a task-specific eval set from real inputs with verified correct outputs, score it with exact-match for a closed task or a written rubric for an open one, and run a model A/B with the same prompt across both candidates. Then I’d report quality, latency, and cost for each, set a quality bar from the risk of a wrong answer, and pick the cheapest, fastest model above the bar. The leaderboard only ever decides my shortlist, never the winner.

“Our provider is upgrading the model next month. What do you do?”

I treat the upgrade as a change that must pass the regression gate, not a free improvement. Before it reaches production I re-run the full eval set against the new version and compare to the current baseline on every metric and every category. If the new version drops below the quality bar, or regresses on any category that matters — even if its public scores went up — the upgrade does not ship until it is fixed. I’d also confirm we recorded the model and prompt versions with each run, so if something regresses later we can trace it to the exact change rather than guessing.

“A stakeholder wants the highest-accuracy model regardless of cost. How do you respond?”

I’d reframe the decision around three axes, not one. Accuracy matters, but latency and cost are real constraints — for a high-volume assistant, a model that is two points more accurate but twice as slow and three times the price can be the wrong choice. I’d set a quality bar from the consequence of a wrong answer, show the scorecard for each candidate, and demonstrate that the cheaper, faster model already clears the bar. Often a tighter prompt on the cheaper model closes most of the accuracy gap, so I’d run that prompt A/B before agreeing to pay for size we do not need.

Lessons from Production

What teams consistently discover after deploying this in real systems — things that don’t appear in documentation.

The benchmark that earns buy-in from leadership is rarely the benchmark that predicts production performance. Accuracy on a curated test set is easy to present. Latency under load, cost per query, and degradation on edge cases are not.
Benchmark contamination is assumed to be under control. It almost never is. Models trained on internet-scale data have almost certainly seen the evaluation datasets — sometimes verbatim.
The second benchmark run is used to confirm the decision already made. If the team has already committed to a vendor, subsequent evaluation is motivated reasoning. Build governance into benchmark design before the selection pressure is on.
Benchmark datasets drift faster than teams expect. Annual refresh is the minimum for production systems. A benchmark from 18 months ago may no longer represent your current input distribution.
Whoever builds the benchmark has enormous influence over the result. This is a governance issue, not a technical one. Cross-team review of benchmark construction is as important as cross-team review of model selection.
Cost and latency benchmarks are usually run later, after the accuracy benchmark has already selected the model. Run them at the same time: a model with slightly lower accuracy and half the latency at a quarter of the cost often wins in production.

Compared to What?

Benchmarking is one of several ways to evaluate whether an AI model is fit for purpose. The right choice depends on what "fit for purpose" means for your specific task.

Technique	Best for	Weakness
Task-Specific Benchmarking (this technique)	Comparing models on your actual production task with real or representative data	Requires ground-truth data curation; results are not generalisable beyond your task
Published Academic Benchmarks (MMLU, HumanEval, etc.)	Initial model shortlisting and vendor comparison	Benchmark contamination; may not correlate with your task performance
A/B Testing in Production	Measuring real user impact of model changes	Requires live traffic; statistical significance takes time; risky for harmful outputs
Human Evaluation	Assessing subjective quality (tone, helpfulness, creativity)	Expensive; slow; inter-rater variability; cannot scale to large test sets
Red Teaming	Discovering failure modes rather than measuring average performance	Not a measurement of typical quality; complementary to, not a replacement for, benchmarking

Published benchmarks are a starting shortlist, not a buying decision. Always run your own task-specific evaluation before selecting a model for production.

When Not to Use This

Experience is knowing when a technique is not the right tool. Skip this one when:

No ground truth

Benchmarking requires a reference answer or human judgements to measure against. If you cannot define what "correct" looks like for your task, benchmarking cannot tell you which model is better — only which model is different.

Fully stable tasks that never change

If your task has not changed in 12 months and your model is performing adequately, benchmarking runs add overhead without insight. Revisit when the model, the task, or the data distribution changes.

Provider already fixed by policy

If your organisation's policy is to use a specific provider for data sovereignty reasons, benchmarking other providers is academic. Spend the evaluation budget on fine-tuning and prompt engineering within the allowed model.

Very early prototyping

Before you understand the task requirements well enough to write evaluation criteria, benchmarks will tell you nothing reliable. Define criteria first, then measure.

At Enterprise Scale

🏢 Enterprise Context

40+ models and fine-tunes under evaluation · 8 task verticals · 3 model providers · Monthly re-evaluation cadence

At enterprise scale, benchmarking becomes an asset management problem. When 40 product teams each run their own evaluation, you end up with 40 different benchmark datasets, 40 different scoring rubrics, and no way to compare results across teams. The first enterprise investment is a shared evaluation harness: a standard format for benchmark datasets, a centralised scoring service, and a leaderboard that all teams write to.

The second problem is benchmark freshness. A benchmark dataset curated 18 months ago against an older model may have leaked into newer models' training data, or may no longer represent the current distribution of production inputs. At enterprise scale, benchmark maintenance (regular sampling from live traffic, annotation review, drift monitoring) needs to be a dedicated function, not an afterthought.

The third problem is benchmark gaming. When product teams are measured on "benchmark improvement," the pressure to select evaluation sets that happen to suit the model being evaluated is real. Enterprise governance requires held-out test sets that product teams cannot access during development, and cross-team review of benchmark construction before results are published.

Failure Analysis

📋 Post-Mortem

The Model That Aced Benchmarks and Failed in Production

An insurance company evaluated three LLMs for a claims triage assistant using a benchmark dataset of 500 historical claims with human-labelled urgency scores. Model B scored highest on precision and recall and was selected. It went live two months later.

What happened: Claims handlers reported that Model B was frequently misclassifying a specific new category of claim — property damage from severe weather events — that had spiked following an unusual storm season. Accuracy on this category was below 60%, compared to 89% on the benchmark.
Why benchmarks missed it: The 500-claim benchmark dataset had been sampled from claims filed in the previous 18 months, before the storm-season spike. The new category represented less than 0.5% of the benchmark but 22% of current live volume. Model B had never been evaluated on a representative sample of this claim type.
Root cause: The benchmark dataset was not refreshed before selection, and it was not stratified to ensure representation of rare claim types. The model that performed best on the historical distribution performed poorly on the emerging one.
Fix: The benchmark process was redesigned: (1) datasets are refreshed quarterly with live samples; (2) rare-but-important categories are deliberately oversampled; (3) all evaluation results are broken down by category, not just reported as aggregate scores.
Lesson: A benchmark tells you how a model performs on the data in the benchmark. If the benchmark does not represent your current input distribution, the score means nothing. Production distributions shift — your benchmark must shift with them.
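
A check like the one below could have flagged the gap before selection. It compares each category's share of the benchmark with its share of live traffic. The storm-damage shares follow the post-mortem (0.5% is used as a stand-in for "less than 0.5%"); the other categories, the function name, and the ratio threshold of 2.0 are invented for illustration.

```python
def under_represented(
    benchmark_share: dict[str, float],
    live_share: dict[str, float],
    max_ratio: float = 2.0,
) -> dict[str, tuple[float, float]]:
    """Categories whose live share exceeds max_ratio times their benchmark share."""
    flagged = {}
    for category, live in live_share.items():
        bench = benchmark_share.get(category, 0.0)
        if live > 0 and (bench == 0 or live / bench > max_ratio):
            flagged[category] = (bench, live)
    return flagged

# Storm-damage figures follow the post-mortem; the rest are hypothetical.
benchmark_share = {"storm damage": 0.005, "motor": 0.60, "contents": 0.395}
live_share = {"storm damage": 0.22, "motor": 0.45, "contents": 0.33}

for category, (bench, live) in under_represented(benchmark_share, live_share).items():
    print(f"{category}: {bench:.1%} of benchmark vs {live:.1%} of live traffic")
```

Any flagged category is a candidate for deliberate oversampling at the next refresh, and for its own line in the per-category results.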

Why the Business Cares

Procurement

Model selection is a procurement decision with multi-year cost implications. Rigorous benchmarking is the evidence that selection was objective, not vendor-driven — a requirement in most public sector and large enterprise procurement frameworks.

Regulatory

When an AI system is challenged, the organisation must demonstrate that the selected model was fit for purpose. A defensible benchmarking process is the first line of that evidence.

Operational cost

A model with two points higher accuracy that costs 5× more per query may be the wrong choice at scale. Benchmarking that omits cost and latency gives an incomplete picture.

Competitive risk

Model capabilities change rapidly. A benchmark run once at selection and never repeated means the organisation may be running an inferior model for years without knowing it.

Key takeaway

A public benchmark score tells you which model is best at someone else’s exam — your task-specific eval set, scored on accuracy, latency, and cost together, is the only evidence that tells you which model is best at your job.

Benchmarking tells you how a model performs on average across your evaluation set. Deterministic-Consistency Testing asks a different question: does it perform reliably on every individual run? A model that scores highest on your benchmark but gives different answers to the same input on different days is the wrong choice for production — especially in regulated contexts.
