RAG Evaluation
A retrieval system can give a fluent, confident, well-written answer that is completely made up. The model sounds right because sounding right is what it was built to do. RAG evaluation is how you tell a grounded answer from a convincing one.
1 The Hook
Te Whatu Ora ran a pilot: a patient-facing assistant that answered questions about medications by retrieving from the official medicine data sheets. Ask it “can I take this with ibuprofen?” and it would fetch the relevant data sheet, read it, and answer in plain language. In the demo it was superb. Clear, calm, well-written answers. Everyone in the room was impressed.
A pharmacist on the review panel asked a question off the demo script: a query about a drug interaction for a medicine whose data sheet did not actually cover that interaction. The assistant answered anyway — a fluent, confident, specific paragraph about how the two drugs interact and what dose to watch for. It sounded exactly like the good answers. The pharmacist went pale. The interaction it described was wrong, and the data sheet it had retrieved said nothing about it at all. The model had filled the gap with something that read like medical advice.
This is the failure that defines RAG testing. The retrieval part worked — it fetched a real document. The generation part failed — it answered beyond what the document supported, and did it so smoothly that nobody could tell from the wording alone. A human reading the answer cannot see whether it came from the source or from the model’s imagination. The two look identical.
The team had tested the assistant the way you test a chatbot: ask questions, read answers, see if they sound right. Every answer sounded right. That is exactly why it passed. RAG evaluation throws out “does it sound right” and replaces it with a harder question: is every claim in this answer actually supported by what the system retrieved?
2 The Rule
In a RAG system, a fluent answer and a grounded answer are not the same thing — and you cannot tell them apart by reading. Evaluate the answer against the retrieved context, not against your gut. Every claim in the output must trace to something the system actually retrieved. If it does not, the answer is ungrounded, no matter how good it sounds.
3 The Analogy
A law student in a closed-book exam who keeps writing when they have run out of case law.
Picture a sharp student answering a question about the Privacy Act 2020. Where they can cite a real section, the answer is excellent — clear, on point, correct. Then they hit a part of the question the Act does not cover. A weak student would stop. This student keeps writing in the same confident voice, inventing a section that sounds plausible. The marker, reading quickly, cannot tell the invented part from the cited part — it is all written in the same authoritative tone.
A RAG system is that student. The retrieved documents are the case law it is allowed to cite. Faithfulness testing is the marker going line by line asking “show me where in the source this came from” — and failing every sentence that cannot point to a real citation, however well it reads.
4 How RAG Works, and Where It Breaks
Retrieval-augmented generation has two stages, and each can fail on its own. You cannot test RAG well until you can see which stage broke.
Stage one — retrieval. The user’s question is turned into a search, and the system pulls back the chunks of source documents it thinks are most relevant. For the Te Whatu Ora assistant, this is fetching the right paragraphs of the right medicine data sheet. If retrieval fails, the model is answering from the wrong documents, or with the right answer simply not present.
Stage two — generation. The retrieved chunks are handed to the model along with the question, and the model writes the answer. If generation fails, the model has the right context in front of it but answers beyond it, contradicts it, or ignores it.
The single most important idea in RAG evaluation is that these two failures need different fixes. A retrieval failure is fixed by improving search, chunking, or the document set. A generation failure is fixed by changing the prompt, changing the model, or adding grounding constraints. If your test only says “the answer was wrong”, you cannot tell an engineer which half to fix. A good RAG eval separates the two.
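The split between the two stages can be sketched as a small triage step. This is a minimal illustration, assuming each test case records the passage that should have been retrieved; the function name and the verbatim-containment check are hypothetical stand-ins for a real relevance judgement, not part of any library.

```python
def triage_failure(expected_context: str, retrieved_chunks: list[str]) -> str:
    """Attribute a wrong answer to the retrieval or the generation stage.

    Assumption: a chunk "contains" the needed evidence if the expected
    passage appears in it verbatim (case-insensitive). A real pipeline
    would use a fuzzier match or a human/LLM judgement here.
    """
    needle = expected_context.strip().lower()
    found = any(needle in chunk.lower() for chunk in retrieved_chunks)
    # Evidence was fetched but the answer is still wrong: the model misused it.
    # Evidence was never fetched: the model never had a fair chance.
    return "generation" if found else "retrieval"


chunks = ["Provisional tax under the standard option is paid in three instalments."]
print(triage_failure("paid in three instalments", chunks))  # generation
print(triage_failure("late-payment penalty", chunks))       # retrieval
```

The point of the design is that the function never looks at the answer text: once an answer is known to be wrong, the stage is decided purely by what retrieval handed over.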
5 Faithfulness — the Core Metric
Faithfulness asks the central RAG question: is every claim in the answer supported by the retrieved context? It is the metric that catches the Te Whatu Ora failure. An answer is faithful if you can take each statement it makes and point to the line in the retrieved source that backs it. It is unfaithful — ungrounded — the moment it asserts something the source does not.
You measure faithfulness by breaking the answer into individual claims and checking each one against the context:
- Decompose: split the answer into atomic factual claims. “Paracetamol is safe with ibuprofen and the maximum daily dose is 4g” is two claims, not one.
- Check each claim: for every claim, is it supported by the retrieved context — yes or no?
- Score: faithfulness is the share of claims that are supported. In a high-stakes domain, treat that share as a gate rather than a grade: one unsupported claim in a medical answer is a fail, not a 90%.
The key discipline: faithfulness is measured against the retrieved context, not against the truth. An answer can be factually true and still unfaithful, if the truth was not in what the system retrieved — because then the model got lucky, not grounded, and next time it will get unlucky. You are testing whether the system stays inside its evidence, not whether it happened to be right.
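The decompose, check, score steps can be sketched as follows. This is a minimal illustration, assuming the claim splitting and the per-claim verdicts have already been produced by a reviewer or a judge model; the `ClaimVerdict` type, the function names, and the example verdicts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # judged against the retrieved context, not against the truth


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Share of claims that the retrieved context supports."""
    if not verdicts:
        # Assumed convention: an answer with no factual claims
        # (for example a decline) asserts nothing unsupported.
        return 1.0
    return sum(v.supported for v in verdicts) / len(verdicts)


def passes_strict_gate(verdicts: list[ClaimVerdict]) -> bool:
    """High-stakes gate: a single unsupported claim fails the answer."""
    return all(v.supported for v in verdicts)


# Illustrative verdicts only: suppose the retrieved data sheet covered
# the combination but said nothing about a maximum daily dose.
verdicts = [
    ClaimVerdict("Paracetamol is safe with ibuprofen", supported=True),
    ClaimVerdict("The maximum daily dose is 4g", supported=False),
]
print(faithfulness(verdicts))        # 0.5
print(passes_strict_gate(verdicts))  # False
```

Note that the second claim fails even if it happens to be true: the verdict records whether the context backed it, which is exactly the distinction the section draws.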
6 Answer Relevance — Did It Answer the Question?
Faithfulness checks that the answer does not invent. Answer relevance checks the opposite risk: that the answer actually addresses what was asked. A RAG system can be perfectly faithful and still useless — it can recite three true, well-grounded paragraphs that do not answer the user’s question.
Consider an IRD help assistant. A user asks “when is my provisional tax due?” The system retrieves a grounded, accurate paragraph about how provisional tax is calculated and answers with that. Every claim is faithful. The user still does not know when to pay. That is an answer-relevance failure: on-topic, grounded, and beside the point.
Answer relevance is measured by asking how directly the answer responds to the specific question: not whether it is true, but whether it is responsive. A common technique is to take the generated answer, ask a model which question that answer best responds to, and compare the reconstructed question with the original one. A large gap means the answer drifted off the actual ask.
Faithfulness and relevance together form the floor of RAG quality: the answer must be grounded (faithfulness) and on-target (relevance). Either one alone is not enough.
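The reverse-question comparison can be sketched as below. This is a rough illustration, assuming the reconstructed question has already been produced by a model and is passed in as plain text. Word-set overlap (Jaccard) is used here only as a dependency-free stand-in; an embedding-based similarity would handle paraphrases far better. All names are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def question_similarity(original: str, reconstructed: str) -> float:
    """Jaccard overlap of word sets: 1.0 = same words, 0.0 = none shared."""
    a, b = _tokens(original), _tokens(reconstructed)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "When is my provisional tax due?"
# Hypothetical reconstructions a model might produce from two different answers.
on_target = "When is provisional tax due?"
drifted = "How is provisional tax calculated?"

print(question_similarity(original, on_target))
print(question_similarity(original, drifted))
```

The on-target reconstruction scores clearly higher than the drifted one, which mirrors the IRD example: the calculation paragraph is grounded, but the question it best answers is not the one the user asked.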
7 Context Precision and Recall — Testing the Retrieval
Faithfulness and relevance judge the answer. Context precision and recall judge the retrieval — the first stage — before the model ever writes a word. They tell you whether the model was even given a fair chance.
Context recall: of all the information needed to answer the question correctly, how much did retrieval actually fetch? Low recall means the answer the user needed was never put in front of the model. No amount of prompt engineering fixes this — the model cannot ground an answer in a document it never received. For the Te Whatu Ora interaction question, recall was zero: the data sheet did not contain the interaction, so it could not be retrieved.
Context precision: of the chunks retrieval did fetch, how many were actually relevant? Low precision means the model was handed a pile of mostly-irrelevant text and had to find the needle. This is where models get distracted, latch onto the wrong chunk, and answer from a related-but-wrong document — a real risk in NZ systems where, say, an old and a current version of a policy both sit in the document store.
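Both retrieval metrics reduce to simple ratios once the labelling is done. The sketch below assumes a reviewer has listed the facts needed to answer the question, the facts actually present in the retrieved chunks, and a relevant/irrelevant label per chunk. The function names and labels are illustrative, and this is the plain unranked form; some evaluation tools use rank-aware variants instead.

```python
def context_recall(needed_facts: set[str], facts_in_retrieved: set[str]) -> float:
    """Of the facts needed to answer, what fraction did retrieval fetch?"""
    if not needed_facts:
        raise ValueError("label at least one needed fact before scoring recall")
    return len(needed_facts & facts_in_retrieved) / len(needed_facts)


def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Of the chunks retrieval fetched, what fraction was relevant?"""
    if not chunk_is_relevant:
        return 0.0  # nothing fetched, so nothing relevant was fetched
    return sum(chunk_is_relevant) / len(chunk_is_relevant)


# Illustrative labels for a stand-down question.
needed = {"stand-down length", "factors affecting stand-down"}
fetched = {"stand-down length"}
print(context_recall(needed, fetched))  # 0.5

# One current-policy chunk, plus an outdated version and two unrelated chunks.
print(context_precision([True, False, False, False]))  # 0.25
```

Under this framing the medicine-assistant case scores zero recall naturally: the needed interaction fact appears in no retrieved chunk, because it appears in no document at all.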
Context recall & context precision → test the retrieval stage (was the right context fetched, and only the right context?).
Faithfulness & answer relevance → test the generation stage (did the model stay grounded, and stay on-topic?).
This is why the four together are diagnostic, not just a score. Low context recall plus low faithfulness tells a clear story: retrieval missed the answer, so the model filled the gap by inventing — exactly the medicine-assistant failure. The metrics do not just tell you it broke; they tell you where.
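Reading the four scores as a diagnosis can be sketched as a simple mapping from low scores to stages. This is only an illustration: the 0.8 threshold, the function name, and the example scores are assumptions, and a real team would set thresholds per metric and per domain.

```python
def diagnose(
    *,
    context_recall: float,
    context_precision: float,
    faithfulness: float,
    answer_relevance: float,
    threshold: float = 0.8,  # assumed cut-off, not a standard value
) -> list[str]:
    """Turn four metric scores into stage-level findings."""
    findings = []
    if context_recall < threshold:
        findings.append("retrieval: needed information was not fetched")
    if context_precision < threshold:
        findings.append("retrieval: too much irrelevant context was fetched")
    if faithfulness < threshold:
        findings.append("generation: answer goes beyond the retrieved context")
    if answer_relevance < threshold:
        findings.append("generation: answer does not address the question")
    return findings or ["no stage flagged"]


# Hypothetical scores for the medicine-assistant failure:
# the fetched data sheet was on-topic, but the needed fact was absent
# and the model invented it.
for finding in diagnose(
    context_recall=0.0,
    context_precision=1.0,
    faithfulness=0.0,
    answer_relevance=1.0,
):
    print(finding)
```

With those scores the output names one retrieval finding and one generation finding, which is the "retrieval missed it, so the model invented" story in machine-readable form.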
8 Building a RAG Eval Set
You cannot measure any of this without a test set built for it. A RAG eval set is not a list of questions — each item carries the question, the ground-truth answer, and the ground-truth context it should come from. That third field is what makes it a RAG eval set rather than a chatbot script.
A solid eval item for an MSD benefits assistant looks like this:
Question: What is the stand-down period before Jobseeker Support starts?
Ground-truth answer: Up to two weeks, depending on income and circumstances.
Source document: Jobseeker Support — eligibility policy, section “Stand-down”.
Expected context: The paragraph defining stand-down length and what affects it.
Question type: Factual lookup (single-document)
Notes: Negative variant RAG-MSD-031b asks a stand-down question the policy does NOT cover; correct behaviour is to decline, not invent.
Two things make an eval set strong. First, coverage of question types: single-document lookups, multi-document questions that need two sources combined, and edge cases. Second — and this is the one teams skip — negative or unanswerable questions: questions whose answer is deliberately not in the document store. These are how you test for the confident-but-ungrounded failure. The correct answer to an unanswerable question is “I don’t have that information”, and a RAG system that invents instead of declining fails the most important test on the sheet.
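One way to hold such items in code, and to check that a set covers the question types described above, is sketched here. The field names, the type labels, the `EX-` IDs, and the `coverage_gaps` helper are all hypothetical; the first item reuses the stand-down example and the second is a placeholder for a negative variant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalItem:
    item_id: str
    question: str
    ground_truth_answer: Optional[str]  # None when the correct behaviour is to decline
    expected_context: Optional[str]     # None when the document store has no answer
    question_type: str                  # e.g. "single-document", "multi-document", "unanswerable"


def coverage_gaps(
    items: list[EvalItem],
    required_types: tuple[str, ...] = ("single-document", "multi-document", "unanswerable"),
) -> list[str]:
    """Return the required question types that no item in the set covers."""
    present = {item.question_type for item in items}
    return [t for t in required_types if t not in present]


items = [
    EvalItem(
        item_id="EX-1",
        question="What is the stand-down period before Jobseeker Support starts?",
        ground_truth_answer="Up to two weeks, depending on income and circumstances.",
        expected_context="Jobseeker Support eligibility policy, section 'Stand-down'",
        question_type="single-document",
    ),
    EvalItem(
        item_id="EX-2",
        question="(a stand-down question the policy does not cover)",
        ground_truth_answer=None,
        expected_context=None,
        question_type="unanswerable",
    ),
]
print(coverage_gaps(items))  # ['multi-document']
```

Running a check like this before any scoring makes the skipped category visible early: a set with no "unanswerable" items reports that gap instead of quietly passing.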
9 Common Mistakes
🚫 Judging RAG answers by whether they “sound right”
Why it happens: The answers are fluent and confident, so reading them feels like enough.
The fix: Fluency is what the model is built for — it is not evidence of grounding. A made-up answer sounds exactly like a real one. Measure faithfulness against the retrieved context, claim by claim, instead of trusting the tone.
🚫 Treating a wrong answer as a single failure without finding the stage
Why it happens: The output is wrong, so “the AI got it wrong” feels like the whole bug.
The fix: A wrong answer is either a retrieval failure (the right context was never fetched) or a generation failure (it had the context and answered beyond it). Ask “was the answer in the retrieved context?” first — the two need completely different fixes.
🚫 Scoring faithfulness against the truth instead of the retrieved context
Why it happens: If the answer is factually correct, it feels like it should pass.
The fix: An answer that is true but not supported by what was retrieved is a lucky guess, not a grounded answer — and luck does not repeat. Faithfulness checks the answer against the context the system actually had, so you are testing the behaviour, not the coincidence.
🚫 An eval set with no unanswerable questions in it
Why it happens: It feels natural to write questions the documents can answer.
The fix: The worst RAG failure is inventing an answer when the documents do not contain one. You can only catch it with questions whose answer is deliberately absent. The right behaviour is to decline; a system that invents instead must fail.
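Grading an unanswerable item can be sketched as a check that the system declined. The phrase list and function names below are assumptions for illustration only. Phrase matching is a crude heuristic: an answer that declines and then goes on to invent detail would still pass it, so a real harness would pair it with a faithfulness check or a judge.

```python
# Hypothetical decline phrases; extend to match the assistant's actual wording.
DECLINE_MARKERS = (
    "i don't have that information",
    "i do not have that information",
    "i can't answer",
    "i cannot answer",
)


def declined(answer: str) -> bool:
    """True if the answer contains a recognised decline phrase."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in DECLINE_MARKERS)


def grade_unanswerable(answer: str) -> str:
    """For a negative item, declining passes and anything else fails."""
    return "pass" if declined(answer) else "fail"


print(grade_unanswerable("I don't have that information in the policy documents."))  # pass
print(grade_unanswerable("A 10% late-payment penalty applies if you miss it."))      # fail
```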
10 Now You Try
Three graded exercises: spot the failure, fix the metric, build the eval set. Write your answer, then compare it to the model answer.
Below is a question, the context a fictional IRD help assistant retrieved, and the answer it produced. Identify whether this is a retrieval failure or a generation failure, which RAG metric catches it (faithfulness, answer relevance, context precision, or context recall), and which specific claim in the answer is ungrounded.
Retrieved context: “Provisional tax under the standard option is paid in three instalments. The instalment dates for a standard 31 March balance date are 28 August, 15 January, and 7 May.”
Answer given: “Your second provisional tax instalment is due on 15 January. A 10% late-payment penalty applies if you miss it, and interest accrues daily from the due date.”
Diagnose it:
Model answer:
Failure stage: Generation. The retrieved context was correct and contained the answer to the actual question (15 January is the second instalment date), so retrieval did its job.
Metric that catches it: Faithfulness. The answer added claims the context does not support.
The ungrounded claims: "A 10% late-payment penalty applies" and "interest accrues daily from the due date." Neither penalties nor interest appear anywhere in the retrieved context. The first sentence (15 January) is faithful; everything after it is invented.
Correct behaviour: Answer only the grounded part ("Your second instalment is due on 15 January") and either stop, or explicitly say it does not have information about penalties in the retrieved material.
This is the classic confident-but-ungrounded pattern: a true, grounded opening sentence followed by fluent invented detail that reads just as authoritatively. Note also that the invented penalty figure could be wrong, which in a tax context is a real harm.
A team testing a fictional Te Whatu Ora medicines assistant says: “We read 50 answers and they all sounded accurate, so the RAG system passes.” Explain why that is not a valid RAG evaluation, then specify the four metrics you would measure instead and what each one would catch that reading answers misses.
Write your critique and the four-metric plan:
Model answer:
Why it is not valid: "Sounded accurate" measures fluency, which is exactly what the model is built to produce; it is not evidence of grounding. A confidently invented answer reads identically to a real one, so reading 50 answers that all sound right is consistent with a system that invents. The team also only tested questions the documents could answer; they never tested the unanswerable case, which is where the worst failure lives.
Metric 1 (Faithfulness): catches answers that assert claims the retrieved data sheet does not support, such as the invented drug-interaction paragraph. Measured claim-by-claim against the retrieved context, not against the clinician's gut.
Metric 2 (Answer relevance): catches answers that are grounded but do not address the actual question, e.g. answering "how the drug works" when asked "can I take it with X".
Metric 3 (Context recall): catches the case where the information needed was never retrieved (the data sheet did not cover the interaction). This is the root of the medicine-assistant failure and no prompt fix touches it.
Metric 4 (Context precision): catches retrieval handing over mostly-irrelevant chunks, including the risk of fetching an old vs current data sheet, which distracts the model into the wrong document.
Add to the test set: unanswerable/negative questions where the correct behaviour is to decline, multi-document interaction questions, and items with a recorded expected-context field so faithfulness and recall can actually be scored.
Design a 5-item RAG eval set for a fictional MSD benefits help assistant that answers from published benefit-eligibility policy documents. Each item needs: an ID, the question, the ground-truth answer, the expected source/context, and the question type. At least one item must be an unanswerable (negative) question where the correct behaviour is to decline.
Model answer:
RAG-01 | Question: How long is the stand-down period before Jobseeker Support starts? | Ground-truth answer: Up to two weeks, depending on income and circumstances. | Expected context: Jobseeker Support eligibility policy, "Stand-down" section. | Type: Factual lookup (single-document)
RAG-02 | Question: Can I get an Accommodation Supplement if I already receive Jobseeker Support? | Ground-truth answer: Yes, the Accommodation Supplement can be paid alongside Jobseeker Support subject to income and asset tests. | Expected context: Accommodation Supplement policy + Jobseeker interaction note. | Type: Multi-document (two policies combined)
RAG-03 | Question: What income limit applies to the Accommodation Supplement for a single person with no children? | Ground-truth answer: [the specific threshold from the current policy table]. | Expected context: Accommodation Supplement income thresholds table, single/no-children row. | Type: Factual lookup with a specific value (tests precise retrieval)
RAG-04 | Question: I'm on Jobseeker Support and want to start part-time study. Does that affect my benefit? | Ground-truth answer: Part-time study is generally allowed; full-time study may move you to a different support type. | Expected context: Jobseeker obligations / study rules section. | Type: Conditional / reasoning across one document
RAG-05 (UNANSWERABLE) | Question: What is the exact dollar amount I personally will receive next week? | Ground-truth answer: The assistant cannot answer this; it depends on individual circumstances not in the policy documents. Correct behaviour is to decline and direct the person to their MyMSD account or a case manager. | Expected context: none (the documents do not contain personalised payment amounts). | Type: Negative / unanswerable (must decline, not invent)
What makes this strong: a mix of single-document, multi-document, specific-value, and reasoning questions; every item carries an expected-context field so faithfulness and recall can be scored; and RAG-05 deliberately has no answer in the document store, so it tests the confident-but-ungrounded failure. A weak eval set is five lookups the documents can all answer.
11 Self-Check
Q1: What does faithfulness measure, and why measure it against the retrieved context rather than the truth?
Faithfulness measures whether every claim in the answer is supported by the retrieved context. You measure against the context, not the truth, because an answer that is true but unsupported by what was retrieved is a lucky guess, not grounded behaviour — and luck does not repeat. You are testing whether the system stays inside its evidence.
Q2: A RAG answer is wrong. What is the first question you ask, and why?
“Was the correct information in the retrieved context?” If yes, it is a generation failure (fix the prompt/model/grounding). If no, it is a retrieval failure (fix search, chunking, or the document set). The two need completely different fixes, so this question splits the debugging in two.
Q3: How do context recall and context precision differ, and which stage do they test?
Both test the retrieval stage. Context recall asks: of the information needed, how much did retrieval fetch? (Low recall = the answer was never put in front of the model.) Context precision asks: of what was fetched, how much was actually relevant? (Low precision = the model was distracted by irrelevant or wrong-version chunks.)
Q4: Why can a perfectly faithful RAG answer still fail evaluation?
Because it can fail answer relevance — it can be fully grounded yet not address the question that was asked. Grounded but off-target is still a failure. The floor of RAG quality is both grounded (faithfulness) and on-target (relevance).
Q5: What is the one type of question a RAG eval set must include to catch the confident-but-ungrounded failure?
Unanswerable (negative) questions — questions whose answer is deliberately not in the document store. The correct behaviour is to decline; a system that invents an answer instead fails the most important test. Without these, you are not testing the worst RAG failure at all.
12 Interview Prep
Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.
“How would you test a retrieval-augmented chatbot beyond just reading its answers?”
Reading answers only tests fluency, which is what the model is built for — a made-up answer reads exactly like a real one. I’d build a RAG eval set where each item carries the question, the ground-truth answer, and the expected source context, then measure four things: faithfulness (is every claim supported by what was retrieved), answer relevance (does it address the actual question), context recall (was the needed information fetched at all), and context precision (was the retrieved set actually relevant). Crucially I’d include unanswerable questions, because the worst failure is the system inventing an answer the documents do not contain. Those four metrics also tell me which stage broke, not just that it broke.
“A RAG answer was factually correct but you marked it as failing faithfulness. Explain.”
Faithfulness is measured against the retrieved context, not against the truth. If the answer was correct but the supporting fact was not in what the system retrieved, the model produced a correct answer by chance, not by grounding — it reached outside its evidence and happened to land on the truth. That behaviour is unreliable: next time it reaches outside its evidence it will be wrong, and it will sound just as confident. I mark it as failing because I’m testing whether the system stays grounded, and this one did not.
“Our RAG system gives bad answers about half the time. Where do you start?”
I separate retrieval failures from generation failures before touching anything, because they need different fixes. For a sample of the bad answers I ask: was the correct information actually in the retrieved context? If recall is low (the right context was never fetched), that’s a retrieval problem, and I’d look at chunking, search, and the document set. If recall is fine but faithfulness is low (the context was there and the model answered beyond it), that’s a generation problem, and I’d look at the prompt and grounding constraints. Throwing prompt changes at a retrieval failure burns a sprint and fixes nothing, so the diagnosis comes first.