Model Benchmarking
A vendor benchmark says a model scores 92% on a famous public test. Your council triage assistant then misroutes a third of real ratepayer messages. The benchmark was not lying — it was answering a different question than the one you needed answered. Model benchmarking is how you measure the model on your task, not someone else’s.
1 The Hook
A regional council was choosing a model to power a triage assistant: ratepayers send a message — a pothole, a noise complaint, a rates query — and the assistant routes it to the right team. The procurement team did the sensible-looking thing. They pulled the public leaderboard, found the model with the highest headline score, and signed off on it. It topped a famous reasoning benchmark, so surely it would top this.
In the first fortnight of the pilot, roughly a third of messages landed on the wrong desk. Noise complaints went to roading. A burst water main sat in the general-enquiries queue for two days because it was filed as “feedback”. The model that aced the public benchmark was middling at the one task the council actually had: reading a short, messy, te-reo-sprinkled message from a real resident and picking one of nine local team categories.
A tester on the project ran a quiet experiment. She collected 120 real, anonymised messages, tagged each with the correct team, and ran three candidate models against that set, including a cheaper, lower-leaderboard model the procurement team had dismissed. The cheaper model routed 91% correctly. The leaderboard champion managed 78%. The benchmark everyone trusted had measured general reasoning; her eval set measured the actual job.
This is the lesson of model benchmarking. A score is only meaningful against a task. The public number told the council how a model does on someone else’s exam. It told them nothing about routing Palmerston North’s mail. The benchmark that matters is the one you build from your own task.
2 The Rule
A leaderboard tells you how a model does on someone else’s task. It cannot tell you how it does on yours. Benchmark every model and prompt against a task-specific eval set built from your own data, scored with a rubric you wrote, and decide on the full picture — accuracy, latency, and cost together, never accuracy alone.
3 The Analogy
Hiring a chef on their cooking-show trophy instead of a tasting against your own menu.
A chef wins a national competition and the trophy is real — they can cook. But your restaurant serves a fixed menu to a particular crowd: fast, consistent, to a local taste. The smart move is not to read the trophy; it is to put your actual menu in front of the candidates and taste what they produce, against the standard your kitchen has to hit, at the speed your service demands, at a food cost you can afford.
A model leaderboard is the trophy. Your eval set is the tasting. The trophy says “this chef is good at something”. The tasting says “this chef is good at my dishes, at my pace, at my price” — which is the only question that decides who you hire.
4 Your Own Eval, Not a Leaderboard
Public leaderboards rank models on standard academic tests — broad reasoning, general knowledge, coding puzzles. They are useful for one thing: a rough shortlist of which models are even in the running. They are useless for the decision that follows, because your task is not on the leaderboard.
There are three reasons a leaderboard cannot make your call. First, task mismatch: a model strong at maths proofs may be weak at classifying a terse council message into one of nine local categories. Second, data mismatch: most public benchmarks are written in clean, formal English, while your inputs are messy, abbreviated, bilingual, and full of local context a global test never saw. Third, contamination: popular benchmarks can leak into training data, so a high score can reflect memorisation rather than capability.
The fix is a task-specific eval set built from your own data: real inputs, the correct outputs you expect, and a scoring method that reflects what “good” means for your task. The council tester’s 120 tagged messages are exactly this. An eval set does not need to be large to be decisive: 100 to 300 representative, correctly labelled items will usually tell you more about which candidate suits your task than any public score.
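One way to judge how much weight a small eval set can carry is to put a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval, a standard method for proportions. The function name, the 95% z-value, and the counts of 109 and 94 correct (roughly 91% and 78% of 120) are illustrative assumptions, not figures recorded by the council project.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# Assumed counts: about 91% and about 78% correct on a 120-item eval set.
cheap_low, cheap_high = wilson_interval(109, 120)
champ_low, champ_high = wilson_interval(94, 120)
print(f"cheaper model:        {cheap_low:.3f} to {cheap_high:.3f}")
print(f"leaderboard champion: {champ_low:.3f} to {champ_high:.3f}")
```

At 120 items each interval spans roughly ten to fifteen percentage points, so a gap of a couple of points between candidates is within the noise. If two candidates score that close together, adding more labelled items is likely a better next step than picking a winner.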
5 Golden Datasets and Scoring Rubrics
A benchmark is only as trustworthy as its answer key. The answer key is your golden dataset: a set of inputs each paired with the known-correct output, agreed by someone who actually knows the domain. For the council, the golden dataset is the 120 messages with their human-verified correct team. Get the golden answers wrong and every score after that is measuring against a lie.
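A golden dataset can be held as simple records and checked before any model is scored against it. This is a minimal sketch: the team labels, the `GoldenItem` fields, and the `validate_golden` checks are all hypothetical, chosen to illustrate the idea rather than to describe the council's actual answer key.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical team labels; the real list would come from the council.
TEAMS = {"roading", "noise_control", "rates", "water", "general_enquiries"}

@dataclass(frozen=True)
class GoldenItem:
    message: str        # anonymised ratepayer message
    correct_team: str   # human-verified label
    verified_by: str    # who signed off the label

def validate_golden(items: list[GoldenItem], teams: set[str]) -> list[str]:
    """Return a list of problems; an empty list means these basic checks passed."""
    problems = []
    counts = Counter(item.correct_team for item in items)
    for i, item in enumerate(items):
        if item.correct_team not in teams:
            problems.append(f"item {i}: unknown team {item.correct_team!r}")
        if not item.verified_by.strip():
            problems.append(f"item {i}: label has no named verifier")
    for team in sorted(teams):
        if counts[team] == 0:
            problems.append(f"no examples for team {team!r}")
    return problems

sample = [
    GoldenItem("Huge pothole outside the dairy on Main St", "roading", "tester_a"),
    GoldenItem("Neighbour's party still going at 2am", "noise_control", "tester_a"),
    GoldenItem("water main burst on our street!!", "feedback", ""),
]
for problem in validate_golden(sample, TEAMS):
    print(problem)
```

Checks like these catch mechanical faults (a label outside the agreed list, an unverified item, a team with no examples). They cannot tell you whether a label is actually right; that still needs someone who knows the domain.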
How you score depends on the task, and there are two broad shapes:
- Closed tasks (classification, routing, extraction) have one right answer, so you score with exact-match metrics — accuracy, precision and recall per category. The council routing task is closed: the message goes to the right team or it does not.
- Open tasks (summarising, drafting a reply, explaining) have many acceptable answers, so exact-match is useless. You score these with a rubric: a written scale defining what a 1, 3, and 5 look like on each quality you care about — correctness, completeness, tone, safety.
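For a closed task, the exact-match metrics mentioned above can be computed in a few lines. The following is a sketch using only the standard library; the function name, the result layout, and the five sample labels are assumptions made for illustration.

```python
from collections import Counter

def score_closed_task(expected: list[str], predicted: list[str]) -> dict:
    """Exact-match accuracy plus per-category precision and recall."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    if not expected:
        raise ValueError("eval set is empty")
    expected_count = Counter(expected)
    predicted_count = Counter(predicted)
    true_pos = Counter(want for want, got in zip(expected, predicted) if want == got)
    per_category = {}
    for category in sorted(set(expected) | set(predicted)):
        tp = true_pos[category]
        # Precision: of the messages sent to this team, how many belonged there.
        precision = tp / predicted_count[category] if predicted_count[category] else None
        # Recall: of the messages that belonged to this team, how many arrived.
        recall = tp / expected_count[category] if expected_count[category] else None
        per_category[category] = {"precision": precision, "recall": recall}
    accuracy = sum(true_pos.values()) / len(expected)
    return {"accuracy": accuracy, "per_category": per_category}

# Illustrative labels only.
expected = ["water", "noise", "roading", "water", "rates"]
predicted = ["feedback", "roading", "roading", "water", "rates"]
result = score_closed_task(expected, predicted)
print(f"accuracy: {result['accuracy']:.0%}")
for category, scores in result["per_category"].items():
    print(category, scores)
```

Per-category recall is what exposes a problem like the burst water main: overall accuracy can look acceptable while one team's messages are routinely going elsewhere. A value of `None` means the metric is undefined for that category, for example precision for a team the model never chose.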
A rubric turns “the answer felt good” into a repeatable judgement. Write down, before scoring, what each level means: “5 = every fact correct and the resident’s next step is clear; 3 = correct but vague on next steps; 1 = a factual error or wrong tone.” Two people scoring the same answer with the same rubric should land within a point. If they cannot, the rubric is too loose — tighten it before you trust any number it produces.
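The "within a point" check can be made concrete by having two people score the same answers and counting how often they agree closely. This sketch assumes integer scores on a 1 to 5 scale; the function name and the sample scores are invented for illustration, and the text does not fix a pass threshold, so none is hard-coded here.

```python
def within_one_point_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Share of answers where two scorers' rubric scores differ by at most one point."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= 1)
    return close / len(scores_a)

# Illustrative scores: two people rating the same ten replies on a 1-5 rubric.
scorer_a = [5, 4, 3, 5, 2, 1, 4, 3, 5, 2]
scorer_b = [5, 3, 3, 4, 4, 1, 4, 5, 5, 2]
rate = within_one_point_rate(scorer_a, scorer_b)
print(f"within one point on {rate:.0%} of replies")
```

The replies where the two scorers differ by two or more points are the useful output: reading those together usually shows which rubric level is worded too loosely.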
6 A/B Comparison of Models and Prompts
Once you have a golden dataset and a rubric, benchmarking becomes a controlled comparison. You run each candidate against the same eval set and compare the scores. The discipline that makes this valid is the one testers already know: change one thing at a time.
There are two distinct comparisons, and teams routinely confuse them:
- Model A/B: same prompt, same eval set, different model. This isolates the model’s contribution — the council’s cheaper model vs the leaderboard champion on identical inputs and instructions.
- Prompt A/B: same model, same eval set, different prompt. This isolates the prompt’s contribution — often a better prompt closes most of the gap between two models for a fraction of the cost.
If you change the model and the prompt at once and the score moves, you have no idea which change caused it. Run them as separate experiments. And keep the eval set fixed across every run — the moment you tweak the test set between candidates, you are no longer comparing models, you are comparing tests.
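The two comparisons can be kept apart in code by giving each its own function, so that only one input varies per run. This is a sketch under stated assumptions: a candidate is modelled as any function taking a prompt and a message and returning a team label, and the stand-in models below are hypothetical keyword matchers, not real provider calls.

```python
from typing import Callable, Sequence

# A candidate takes (prompt, message) and returns a team label.
# Real candidates would wrap a provider's API; these types are assumptions.
Candidate = Callable[[str, str], str]
EvalSet = Sequence[tuple[str, str]]

def accuracy_on(eval_set: EvalSet, prompt: str, model: Candidate) -> float:
    """Fraction of eval items where the model returns the expected team."""
    correct = sum(1 for message, team in eval_set if model(prompt, message) == team)
    return correct / len(eval_set)

def model_ab(eval_set: EvalSet, prompt: str, models: dict[str, Candidate]) -> dict[str, float]:
    """Same prompt, same eval set, different model."""
    return {name: accuracy_on(eval_set, prompt, m) for name, m in models.items()}

def prompt_ab(eval_set: EvalSet, prompts: dict[str, str], model: Candidate) -> dict[str, float]:
    """Same model, same eval set, different prompt."""
    return {name: accuracy_on(eval_set, p, model) for name, p in prompts.items()}

# A tuple keeps the eval set fixed: it cannot be edited between runs by accident.
EVAL_SET = (
    ("pothole on main st", "roading"),
    ("loud party next door", "noise"),
    ("burst water main", "water"),
    ("rates bill query", "rates"),
)
KEYWORDS = (("pothole", "roading"), ("party", "noise"), ("water", "water"), ("rates", "rates"))

def keyword_model(prompt: str, message: str) -> str:
    """Hypothetical stand-in: routes on the first matching keyword."""
    for word, team in KEYWORDS:
        if word in message:
            return team
    return "general"

def always_roading(prompt: str, message: str) -> str:
    """Hypothetical stand-in: a weak baseline that ignores its input."""
    return "roading"

def prompt_sensitive_model(prompt: str, message: str) -> str:
    """Hypothetical stand-in: only routes properly when the prompt lists the teams."""
    return keyword_model(prompt, message) if "teams:" in prompt else "general"

model_scores = model_ab(EVAL_SET, "Route this message.",
                        {"keyword_model": keyword_model, "always_roading": always_roading})
prompt_scores = prompt_ab(EVAL_SET,
                          {"bare": "Route this message.",
                           "with_teams": "Route to one of these teams: roading, noise, water, rates."},
                          prompt_sensitive_model)
print(model_scores)
print(prompt_scores)
```

Because neither function accepts both a set of models and a set of prompts, it is not possible to vary both in a single call, which is the discipline the section describes.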
7 Accuracy vs Latency vs Cost
Accuracy is one axis of three, and the highest-accuracy model is often the wrong choice. Every benchmark should report three numbers per candidate, not one: quality (accuracy or rubric score), latency (how long a response takes), and cost (price per request, multiplied by your real volume).
These trade against each other in ways a single score hides. The council’s leaderboard champion was the most accurate on open-ended explanations, but it cost three times as much as the cheaper model that won the routing task and took more than twice as long to respond. For a high-volume routing job where a one-second response keeps the queue moving, even a model that is two points more accurate is the wrong choice if it is twice as slow and three times the price.
Model A (leaderboard champion): routing 78%, latency 2.1s, cost 3.0×
Model B (cheaper, shortlisted): routing 91%, latency 0.9s, cost 1.0×
Model C (mid-tier): routing 86%, latency 1.4s, cost 1.7×
The headline score would have picked A. The full scorecard picks B on every axis.
The right model is the cheapest, fastest one that clears your quality bar — the minimum acceptable score for the task’s risk. Set that bar first, from the consequence of a wrong answer, then choose the most efficient candidate above it. Chasing the top score regardless of cost and speed is how teams ship a slow, expensive assistant that is two points better at a job nobody complained about.
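The selection rule can be written down as a filter followed by a sort. The sketch below uses the three scorecard rows above; the `Scorecard` type, the `pick_model` name, the 0.88 quality bar, and the choice to rank by cost first and latency second are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Scorecard:
    name: str
    accuracy: float       # routing accuracy on the eval set
    latency_s: float      # seconds per response
    relative_cost: float  # cost relative to the cheapest candidate

CANDIDATES = [
    Scorecard("Model A", 0.78, 2.1, 3.0),
    Scorecard("Model B", 0.91, 0.9, 1.0),
    Scorecard("Model C", 0.86, 1.4, 1.7),
]

def pick_model(cards: list[Scorecard], quality_bar: float) -> Optional[Scorecard]:
    """Cheapest, then fastest, candidate at or above the quality bar; None if nothing clears it."""
    eligible = [card for card in cards if card.accuracy >= quality_bar]
    if not eligible:
        return None
    # Assumed priority: cost first, latency as the tie-breaker.
    return min(eligible, key=lambda card: (card.relative_cost, card.latency_s))

choice = pick_model(CANDIDATES, quality_bar=0.88)
print(choice.name if choice else "no candidate clears the bar")
```

With these figures only Model B clears a bar of 0.88. Returning `None` when nothing clears the bar is deliberate: the answer in that case is to improve the prompt or widen the shortlist, not to lower the bar until something passes.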
8 Regression Gating on Model Upgrades
The benchmark is not a one-off for procurement — it becomes a permanent gate. Generative AI models change underneath you. A provider pushes a new version, deprecates the one you tested, or silently retunes it. The model that passed last quarter is not guaranteed to be the model answering today, and an “upgrade” can quietly regress on your specific task even as its public scores rise.
This is why your eval set earns its keep long after go-live. Regression gating means: before any model or prompt change reaches production, you re-run the full eval set and compare against the current baseline. If the new version drops below the quality bar — or regresses on any category that matters — the change does not ship, however shiny the release notes. The council would run their 120 messages against any proposed model swap and block it if routing accuracy fell below, say, 88%.
Treat it exactly like a regression test suite in traditional software, because that is what it is. The eval set is your AI regression pack; the quality bar is your pass condition; the gate is automated in the deployment pipeline so no one can wave through a “minor” model bump that breaks the one task you ship.
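A regression gate of this kind reduces to a function that compares a candidate's eval results with the current baseline and returns a pass or fail with reasons. This is a minimal sketch: the result layout, the `regression_gate` name, the 0.88 bar, and every number in the example are hypothetical.

```python
def regression_gate(baseline: dict, candidate: dict, quality_bar: float,
                    critical: set[str]) -> tuple[bool, list[str]]:
    """Return (passed, reasons). Each result dict holds 'accuracy' and per-category 'recall'."""
    reasons = []
    if candidate["accuracy"] < quality_bar:
        reasons.append(
            f"accuracy {candidate['accuracy']:.2f} is below the bar {quality_bar:.2f}")
    for category in sorted(critical):
        before = baseline["recall"][category]
        # A category missing from the candidate's results is treated as a failure.
        after = candidate["recall"].get(category, 0.0)
        if after < before:
            reasons.append(f"{category} recall fell from {before:.2f} to {after:.2f}")
    return (not reasons, reasons)

# Illustrative numbers: the new version is slightly better overall but worse on water.
baseline = {"accuracy": 0.91, "recall": {"water": 0.95, "roading": 0.90}}
new_version = {"accuracy": 0.92, "recall": {"water": 0.85, "roading": 0.94}}

passed, reasons = regression_gate(baseline, new_version, quality_bar=0.88, critical={"water"})
print("PASS" if passed else "BLOCK")
for reason in reasons:
    print(" -", reason)
```

In a deployment pipeline the calling script would typically exit with a non-zero status when `passed` is false, so the model or prompt change cannot ship. The example also shows why the overall number is not enough: the new version has higher accuracy and is still blocked.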
9 Common Mistakes
🚫 Choosing a model on its public leaderboard score
Why it happens: A high headline number looks like proof of quality and saves the work of building an eval set.
The fix: A leaderboard measures someone else’s task on clean academic data, and popular benchmarks leak into training. Use it only to shortlist; decide with a task-specific eval set built from your own data.
🚫 Reporting accuracy alone and ignoring latency and cost
Why it happens: Accuracy is the easy number to quote and feels like the only thing that matters.
The fix: The most accurate model is often too slow or too expensive for the real volume. Report quality, latency, and cost together, set a quality bar from the task’s risk, and pick the most efficient candidate above it.
🚫 Changing the model and the prompt at the same time
Why it happens: It feels efficient to improve everything in one pass.
The fix: If both change and the score moves, you cannot tell which change caused it. Run model A/B and prompt A/B as separate experiments, one variable at a time, against a fixed eval set.
🚫 Benchmarking once for procurement, then never again
Why it happens: The model was chosen, the project moved on, and the eval set was filed away.
The fix: Providers update and deprecate models, and an upgrade can regress on your task. Keep the eval set as a regression gate and re-run it before any model or prompt change reaches production.
10 Now You Try
Three graded exercises: spot the flawed selection, fix the rubric, build the benchmark plan. Write your answer, then compare it to the model answer.
A fictional IRD chatbot team posts the rationale below for picking its model. Identify what is wrong with this selection process, name the specific benchmarking errors, and say what the team should have done instead.
Diagnose it:
Model answer:
Errors:
(1) Deciding on a leaderboard. A public reasoning score measures a different task on clean academic data, and popular benchmarks leak into training, so the 94% may reflect memorisation, not capability on IRD's actual job.
(2) No task-specific eval set, so the team has zero evidence about the one task that matters: answering real IRD questions.
(3) Accuracy-only thinking. Picking the slowest, most expensive model because it is "most accurate in general" ignores that for a high-volume help chatbot, latency and cost are first-class constraints, and the accuracy advantage may not even hold on IRD's task.
(4) No re-testing or regression gate, so a future model update could silently regress and no one would know.
What they should have done: Use the leaderboard only to shortlist three candidates. Build a golden dataset of real anonymised IRD questions with verified correct answers, and write a scoring rubric (or use exact-match for closed questions). Run a model A/B with the same prompt across all three, report quality, latency, and cost per candidate, set a quality bar from the risk of a wrong tax answer, and pick the cheapest, fastest model above the bar. Then keep the eval set as a regression gate and re-run it before any model or prompt change ships.
A team benchmarking a fictional council triage assistant’s plain-language reply quality wrote the rubric below. Explain why it will produce unreliable, non-repeatable scores, then rewrite it as a proper rubric with defined levels and the qualities that matter.
Write your critique and the rewritten rubric:
Model answer:
Why it is unreliable: "How good it is, use your judgement" defines nothing, so two scorers will disagree wildly and the same scorer will drift over a long session. A 10-point scale with no anchors invents false precision: nobody can say what separates a 6 from a 7. The scores are not repeatable, so they cannot compare models or gate a release.
Qualities to measure (for a council reply): factual correctness, completeness (does it give the resident's next step), tone/clarity in plain language, and safety (no invented policy or wrong contact).
Rewritten rubric (a tight 1–5 with anchors):
- 5 = Every fact correct, the resident's next step is clear and actionable, tone is plain and respectful.
- 3 = Facts correct but vague on the next step, or tone slightly off; usable but not great.
- 1 = A factual error, an invented policy/contact, or a tone that would upset a resident.
(2 and 4 are the half-steps between these anchors.) Score each quality separately, then combine, so a safety failure cannot be hidden by good tone.
Checking tightness: have two people independently score the same 20 replies. If they agree within one point on most, the rubric is tight enough; if not, sharpen the anchors and re-test before trusting any benchmark number.
Design a model-benchmarking plan for a fictional council triage assistant that routes ratepayer messages to one of nine teams. Cover: the golden dataset, the scoring method, the A/B design, the three trade-off axes with a quality bar, and the regression gate. Be specific and NZ-appropriate.
Model answer:
1. Golden dataset: 150–250 real anonymised ratepayer messages (potholes, noise, rates, water, consents, te-reo-sprinkled and abbreviated ones included), each tagged with the human-verified correct team. Cover all nine teams and the messy edge cases, not just the tidy ones.
2. Scoring method: Routing is a CLOSED task with one correct team, so score with exact-match accuracy, plus per-team precision and recall so a team that is always misrouted shows up. (If you also benchmark the plain-language reply, that part needs a 1–5 rubric.)
3. A/B design: Fix the eval set and the prompt; vary only the model (model A/B) across the three shortlisted candidates. Separately, fix the model and vary the prompt (prompt A/B) to see if a tighter prompt closes the gap cheaply. Never change model and prompt together.
4. Trade-off axes + quality bar: Report routing accuracy, median latency, and cost-per-message × monthly volume for each candidate. Set the quality bar from the consequence of a misroute (e.g. a burst main sitting in the wrong queue), say 88% routing accuracy minimum, then pick the cheapest, fastest model above the bar, not the single most accurate one.
5. Regression gate: Re-run the full eval set before any model or prompt change ships; block the change if accuracy falls below 88% or any critical team (e.g. water/emergency) regresses. Record model version, prompt version, date, and the full scorecard for every run so a future regression can be traced to the exact change.
11 Self-Check
Q1: Why can a public leaderboard not decide which model to ship for your task?
Because it measures a different task on clean academic data, your real inputs are messier and more local, and popular benchmarks leak into training so a high score can be memorisation. A leaderboard is only good for shortlisting candidates; a task-specific eval set built from your own data makes the actual decision.
Q2: What is a golden dataset, and why does its quality cap the quality of everything that follows?
A golden dataset is a set of inputs each paired with the known-correct output, verified by someone who knows the domain. It is the answer key. If the golden answers are wrong, every benchmark score is measured against a lie, so no later comparison can be trusted — the answer key’s accuracy is the ceiling on the benchmark’s accuracy.
Q3: When do you score with exact-match metrics, and when do you need a rubric?
Closed tasks — classification, routing, extraction — have one right answer, so you use exact-match metrics like accuracy, precision, and recall. Open tasks — summarising, drafting, explaining — have many acceptable answers, so exact-match fails and you score with a written rubric defining what each quality level looks like.
Q4: Why is reporting accuracy alone a mistake, and what should a benchmark report instead?
The most accurate model can be too slow or too expensive for the real volume, so accuracy alone can pick the wrong model. A benchmark should report quality, latency, and cost together; set a quality bar from the task’s risk; then choose the cheapest, fastest candidate that clears the bar.
Q5: What is regression gating, and why does an eval set still matter after go-live?
Regression gating means re-running the full eval set before any model or prompt change reaches production and blocking the change if it drops below the quality bar. It still matters because providers update and deprecate models, and an “upgrade” can silently regress on your specific task even as its public scores rise.
12 Interview Prep
Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.
“How would you choose between two models for a production feature?”
I would not choose on a leaderboard — that measures someone else’s task on clean data and can be contaminated by training leakage. I’d build a task-specific eval set from real inputs with verified correct outputs, score it with exact-match for a closed task or a written rubric for an open one, and run a model A/B with the same prompt across both candidates. Then I’d report quality, latency, and cost for each, set a quality bar from the risk of a wrong answer, and pick the cheapest, fastest model above the bar. The leaderboard only ever decides my shortlist, never the winner.
“Our provider is upgrading the model next month. What do you do?”
I treat the upgrade as a change that must pass the regression gate, not a free improvement. Before it reaches production I re-run the full eval set against the new version and compare to the current baseline on every metric and every category. If the new version drops below the quality bar, or regresses on any category that matters — even if its public scores went up — the upgrade does not ship until it is fixed. I’d also confirm we recorded the model and prompt versions with each run, so if something regresses later we can trace it to the exact change rather than guessing.
“A stakeholder wants the highest-accuracy model regardless of cost. How do you respond?”
I’d reframe the decision around three axes, not one. Accuracy matters, but latency and cost are real constraints — for a high-volume assistant, a model that is two points more accurate but twice as slow and three times the price can be the wrong choice. I’d set a quality bar from the consequence of a wrong answer, show the scorecard for each candidate, and demonstrate that the cheaper, faster model already clears the bar. Often a tighter prompt on the cheaper model closes most of the accuracy gap, so I’d run that prompt A/B before agreeing to pay for size we do not need.