ISO/IEC 42119 — Part 8 Practical Lab

This lesson is a practical lab fulfilling Part 8 of ISO/IEC 42119: quality assessment guidelines for text-to-text generative AI systems. RAG evaluation directly addresses the grounding, faithfulness, and retrieval quality characteristics the standard requires. Exercises here generate usable audit artefacts.

RAG Evaluation

A retrieval system can give a fluent, confident, well-written answer that is completely made up. The model sounds right because sounding right is what it was built to do. RAG evaluation is how you tell a grounded answer from a convincing one.

Test with AI · AI Testing Engineer · Lesson 1 of 8 · ~30 min read · ~70 min with exercises · ISO/IEC 42119 Part 8 lab

1 The Hook

HealthNZ ran a pilot: a patient-facing assistant that answered questions about medications by retrieving from the official medicine data sheets. Ask it “can I take this with ibuprofen?” and it would fetch the relevant data sheet, read it, and answer in plain language. In the demo it was superb. Clear, calm, well-written answers. Everyone in the room was impressed.

A pharmacist on the review panel asked a question off the demo script: a query about a drug interaction for a medicine whose data sheet did not actually cover that interaction. The assistant answered anyway — a fluent, confident, specific paragraph about how the two drugs interact and what dose to watch for. It sounded exactly like the good answers. The pharmacist went pale. The interaction it described was wrong, and the data sheet it had retrieved said nothing about it at all. The model had filled the gap with something that read like medical advice.

This is the failure that defines RAG testing. The retrieval part worked — it fetched a real document. The generation part failed — it answered beyond what the document supported, and did it so smoothly that nobody could tell from the wording alone. A human reading the answer cannot see whether it came from the source or from the model’s imagination. The two look identical.

The team had tested the assistant the way you test a chatbot: ask questions, read answers, see if they sound right. Every answer sounded right. That is exactly why it passed. RAG evaluation throws out “does it sound right” and replaces it with a harder question: is every claim in this answer actually supported by what the system retrieved?

2 The Rule

In a RAG system, a fluent answer and a grounded answer are not the same thing — and you cannot tell them apart by reading. Evaluate the answer against the retrieved context, not against your gut. Every claim in the output must trace to something the system actually retrieved. If it does not, the answer is ungrounded, no matter how good it sounds.

⚠️ Common Misconception

The common conclusion: we measured retrieval recall at 92% — our RAG system is working.

Retrieval recall tells you whether the right document was found. It does not tell you whether the model faithfully used what it found. Take a system with 92% retrieval recall and 70% faithful synthesis, meaning the model correctly used the retrieved content 70% of the time. It answers incorrectly on roughly 30% of the queries where the right document was retrieved, and if the two rates are independent, only about 64% of all queries (0.92 × 0.70) get both the right document and a faithful answer. Teams that celebrate retrieval metrics without measuring faithfulness and answer relevance are measuring the easy half of the problem and ignoring the expensive half. The failure that damages customer trust, the confidently wrong answer that cites the right document incorrectly, is a faithfulness failure, not a retrieval failure.

3 The Analogy

A law student in a closed-book exam who keeps writing when they have run out of case law.

Picture a sharp student answering a question about the Privacy Act 2020. Where they can cite a real section, the answer is excellent — clear, on point, correct. Then they hit a part of the question the Act does not cover. A weak student would stop. This student keeps writing in the same confident voice, inventing a section that sounds plausible. The marker, reading quickly, cannot tell the invented part from the cited part — it is all written in the same authoritative tone.

A RAG system is that student. The retrieved documents are the case law it is allowed to cite. Faithfulness testing is the marker going line by line asking “show me where in the source this came from” — and failing every sentence that cannot point to a real citation, however well it reads.

4 How RAG Works, and Where It Breaks

Retrieval-augmented generation has two stages, and each can fail on its own. You cannot test RAG well until you can see which stage broke.

RAG Pipeline Architecture

User Query → Embedding Model (query → vector) → Vector Store (similarity search) → Retrieved Chunks (top-K passages) → LLM + Context (grounded response) → Answer

Each arrow is a test boundary. QA must verify: embedding quality, retrieval relevance, chunk faithfulness, and final grounding — not just the end-to-end answer.

Stage one — retrieval. The user’s question is turned into a search, and the system pulls back the chunks of source documents it thinks are most relevant. For the HealthNZ assistant, this is fetching the right paragraphs of the right medicine data sheet. If retrieval fails, the model is answering from the wrong documents, or with the right answer simply not present.

Stage two — generation. The retrieved chunks are handed to the model along with the question, and the model writes the answer. If generation fails, the model has the right context in front of it but answers beyond it, contradicts it, or ignores it.

The single most important idea in RAG evaluation is that these two failures need different fixes. A retrieval failure is fixed by improving search, chunking, or the document set. A generation failure is fixed by changing the prompt, the model, or adding grounding constraints. If your test only says “the answer was wrong”, you cannot tell an engineer which half to fix. A good RAG eval separates the two.

Pro tip: The first question to ask of any wrong RAG answer is “was the correct information in the retrieved context?” If yes, it is a generation failure. If no, it is a retrieval failure. That one question splits your whole debugging effort in two.
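That triage question can be sketched in a few lines. This is a minimal illustration, not a production check: the `FailedCase` structure and the verbatim substring match are assumptions made for the example, and a real pipeline would more likely use human labels or a validated judge to decide whether a needed fact is present in the context.

```python
from dataclasses import dataclass


@dataclass
class FailedCase:
    """One wrong answer under investigation (hypothetical structure)."""
    question: str
    needed_facts: list[str]      # facts a correct answer requires, hand-labelled
    retrieved_chunks: list[str]  # what the retriever actually returned


def fact_in_context(fact: str, chunks: list[str]) -> bool:
    # Naive stand-in: the fact appears verbatim (case-insensitive) in some chunk.
    needle = fact.lower()
    return any(needle in chunk.lower() for chunk in chunks)


def triage(case: FailedCase) -> str:
    """Return 'generation' if every needed fact was retrieved, else 'retrieval'."""
    if all(fact_in_context(f, case.retrieved_chunks) for f in case.needed_facts):
        return "generation"
    return "retrieval"


context = ["The instalment dates are 28 August, 15 January, and 7 May."]
print(triage(FailedCase("Second instalment date?", ["15 January"], context)))
print(triage(FailedCase("Late payment penalty?", ["late-payment penalty"], context)))
```

The first case prints `generation` because the needed fact was in the context, and the second prints `retrieval` because it was not. The label tells you which team gets the bug report.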

5 Faithfulness — the Core Metric

Faithfulness asks the central RAG question: is every claim in the answer supported by the retrieved context? It is the metric that catches the HealthNZ failure. An answer is faithful if you can take each statement it makes and point to the line in the retrieved source that backs it. It is unfaithful — ungrounded — the moment it asserts something the source does not.

You measure faithfulness by breaking the answer into individual claims and checking each one against the context:

Decompose: split the answer into atomic factual claims. “Paracetamol is safe with ibuprofen and the maximum daily dose is 4g” is two claims, not one.
Check each claim: for every claim, is it supported by the retrieved context — yes or no?
Score: faithfulness is the share of claims that are supported. One unsupported claim in a medical answer is a fail, not a 90%.

The key discipline: faithfulness is measured against the retrieved context, not against the truth. An answer can be factually true and still unfaithful, if the truth was not in what the system retrieved — because then the model got lucky, not grounded, and next time it will get unlucky. You are testing whether the system stays inside its evidence, not whether it happened to be right.
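The scoring step is simple arithmetic once each claim has a verdict. The sketch below assumes the decomposition and the per-claim checks have already been done (by a person or a validated judge) and arrive as a list of booleans; the function names and the `zero_tolerance` flag are illustrative, not taken from any particular library.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Share of claims that the retrieved context supports."""
    if not verdicts:
        raise ValueError("no checkable claims were extracted from the answer")
    return sum(verdicts) / len(verdicts)


def passes_faithfulness(verdicts: list[bool],
                        zero_tolerance: bool = True,
                        threshold: float = 0.9) -> bool:
    """High-stakes domains fail on any unsupported claim; others may use a threshold."""
    score = faithfulness_score(verdicts)  # also rejects an empty claim list
    if zero_tolerance:
        return all(verdicts)
    return score >= threshold


# Three claims: one supported by the context, two not.
verdicts = [True, False, False]
print(round(faithfulness_score(verdicts), 2))   # 0.33
print(passes_faithfulness(verdicts))            # False
```

The `zero_tolerance` switch encodes the point made in the scoring step: in a medical or tax assistant, nine supported claims out of ten is still a fail.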

6 Answer Relevance — Did It Answer the Question?

Faithfulness checks that the answer does not invent. Answer relevance checks the opposite risk: that the answer actually addresses what was asked. A RAG system can be perfectly faithful and still useless — it can recite three true, well-grounded paragraphs that do not answer the user’s question.

Consider a Revenue NZ help assistant. A user asks “when is my provisional tax due?” The system retrieves a grounded, accurate paragraph about how provisional tax is calculated and answers with that. Every claim is faithful. The user still does not know when to pay. That is an answer-relevance failure: on-topic, grounded, and beside the point.

Answer relevance is measured by asking how directly the answer responds to the specific question — not whether it is true, but whether it is responsive. A common technique is to take the generated answer, ask the model what question this answer best responds to, and compare that back to the original question. A large gap means the answer drifted off the actual ask.
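A rough sketch of the comparison step follows. It assumes the "what question does this answer respond to?" step has already been run and its output is passed in as a string. Word-overlap (Jaccard) similarity is used here only as a dependency-free stand-in; in practice teams often compare embeddings instead.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def question_similarity(original: str, regenerated: str) -> float:
    """Jaccard overlap between the user's question and the question the
    answer appears to respond to. 1.0 means identical word sets."""
    a, b = tokens(original), tokens(regenerated)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


asked = "When is my provisional tax due?"
# Hypothetical outputs of the reverse-question step for two different answers:
on_target = "When is provisional tax due?"
drifted = "How is provisional tax calculated?"

print(round(question_similarity(asked, on_target), 3))  # 0.833
print(round(question_similarity(asked, drifted), 3))    # 0.375
```

The absolute numbers mean little; what matters is the gap between an answer that maps back to the original question and one that maps to a neighbouring question.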

Faithfulness and relevance together form the floor of RAG quality: the answer must be grounded (faithfulness) and on-target (relevance). Either one alone is not enough.

7 Context Precision and Recall — Testing the Retrieval

Faithfulness and relevance judge the answer. Context precision and recall judge the retrieval — the first stage — before the model ever writes a word. They tell you whether the model was even given a fair chance.

Context recall: of all the information needed to answer the question correctly, how much did retrieval actually fetch? Low recall means the answer the user needed was never put in front of the model. No amount of prompt engineering fixes this — the model cannot ground an answer in a document it never received. For the HealthNZ interaction question, recall was zero: the data sheet did not contain the interaction, so it could not be retrieved.

Context precision: of the chunks retrieval did fetch, how many were actually relevant? Low precision means the model was handed a pile of mostly-irrelevant text and had to find the needle. This is where models get distracted, latch onto the wrong chunk, and answer from a related-but-wrong document — a real risk in NZ systems where, say, an old and a current version of a policy both sit in the document store.

The four metrics, mapped to the two stages:
Context recall & context precision → test the retrieval stage (was the right context fetched, and only the right context?).
Faithfulness & answer relevance → test the generation stage (did the model stay grounded, and stay on-topic?).
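For the retrieval pair, a set-based sketch is enough to show the arithmetic. It assumes each eval item lists the IDs of the chunks that contain the needed information and that the retriever returns chunk IDs; the ID format is made up for the example. Rank-aware variants of context precision exist and are not shown here.

```python
from typing import Optional


def context_recall(needed: set[str], retrieved: list[str]) -> Optional[float]:
    """Of the chunks that hold the needed information, what share was fetched?"""
    if not needed:
        return None  # unanswerable item: there is nothing to recall
    return len(needed & set(retrieved)) / len(needed)


def context_precision(needed: set[str], retrieved: list[str]) -> float:
    """Of the chunks that were fetched, what share was actually needed?"""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in needed)
    return hits / len(retrieved)


needed = {"datasheet-12#3", "datasheet-12#4"}                    # hypothetical chunk IDs
retrieved = ["datasheet-12#3", "datasheet-07#1", "datasheet-12#9"]

print(context_recall(needed, retrieved))               # 0.5
print(round(context_precision(needed, retrieved), 2))  # 0.33
```

Here half of the needed information reached the model, and two of the three chunks it was handed were noise.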

This is why the four together are diagnostic, not just a score. Low context recall plus low faithfulness tells a clear story: retrieval missed the answer, so the model filled the gap by inventing — exactly the medicine-assistant failure. The metrics do not just tell you it broke; they tell you where.

Pro tip: When faithfulness is low, always read context recall first. If recall is also low, the model invented because retrieval gave it nothing to ground in — fix retrieval, not the prompt. Punishing the model for a retrieval failure wastes a sprint.

8 Building a RAG Eval Set

You cannot measure any of this without a test set built for it. A RAG eval set is not a list of questions — each item carries the question, the ground-truth answer, and the ground-truth context it should come from. That third field is what makes it a RAG eval set rather than a chatbot script.

A solid eval item for a Benefits NZ benefits assistant looks like this:

ID:                  RAG-Benefits NZ-031
Question:            What is the stand-down period before Jobseeker Support starts?
Ground-truth answer: Up to two weeks, depending on income and circumstances.
Source document:     Jobseeker Support eligibility policy, section “Stand-down”.
Expected context:    The paragraph defining stand-down length and what affects it.
Question type:       Factual lookup (single-document)
Notes:               Negative variant RAG-Benefits NZ-031b asks a stand-down question the
                     policy does NOT cover; correct behaviour is to decline, not invent.

Two things make an eval set strong. First, coverage of question types: single-document lookups, multi-document questions that need two sources combined, and edge cases. Second — and this is the one teams skip — negative or unanswerable questions: questions whose answer is deliberately not in the document store. These are how you test for the confident-but-ungrounded failure. The correct answer to an unanswerable question is “I don’t have that information”, and a RAG system that invents instead of declining fails the most important test on the sheet.
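One way to keep an eval set honest is to give each item a fixed structure and check the set's shape before running it. The sketch below is an assumption about how that might look; the `RagEvalItem` fields mirror the example item above, and the two helper functions are illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RagEvalItem:
    item_id: str
    question: str
    ground_truth_answer: Optional[str]  # None when the correct behaviour is to decline
    expected_context: tuple[str, ...]   # empty for unanswerable items
    question_type: str

    @property
    def answerable(self) -> bool:
        return len(self.expected_context) > 0


def has_unanswerable(items: list[RagEvalItem]) -> bool:
    """True if at least one item tests the decline-instead-of-invent behaviour."""
    return any(not item.answerable for item in items)


def type_coverage(items: list[RagEvalItem]) -> Counter:
    """Count items per question type to spot an all-lookups eval set."""
    return Counter(item.question_type for item in items)


items = [
    RagEvalItem("RAG-01", "How long is the stand-down period?",
                "Up to two weeks, depending on income and circumstances.",
                ("Jobseeker Support eligibility policy, Stand-down section",),
                "single-document"),
    RagEvalItem("RAG-05", "What exact amount will I receive next week?",
                None, (), "unanswerable"),
]
print(has_unanswerable(items))  # True
print(type_coverage(items))
```

A check like `has_unanswerable` can run as a gate before any scoring starts, so a set with no negative cases is rejected rather than quietly producing a flattering score.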

Pro tip: If your RAG eval set has no unanswerable questions in it, you are not testing the failure that the HealthNZ assistant had. Half your value as an AI testing engineer is in the negative cases nobody else thought to write.

9 Common Mistakes

🚫 Judging RAG answers by whether they “sound right”

Why it happens: The answers are fluent and confident, so reading them feels like enough.
The fix: Fluency is what the model is built for — it is not evidence of grounding. A made-up answer sounds exactly like a real one. Measure faithfulness against the retrieved context, claim by claim, instead of trusting the tone.

🚫 Treating a wrong answer as a single failure without finding the stage

Why it happens: The output is wrong, so “the AI got it wrong” feels like the whole bug.
The fix: A wrong answer is either a retrieval failure (the right context was never fetched) or a generation failure (it had the context and answered beyond it). Ask “was the answer in the retrieved context?” first — the two need completely different fixes.

🚫 Scoring faithfulness against the truth instead of the retrieved context

Why it happens: If the answer is factually correct, it feels like it should pass.
The fix: An answer that is true but not supported by what was retrieved is a lucky guess, not a grounded answer — and luck does not repeat. Faithfulness checks the answer against the context the system actually had, so you are testing the behaviour, not the coincidence.

🚫 An eval set with no unanswerable questions in it

Why it happens: It feels natural to write questions the documents can answer.
The fix: The worst RAG failure is inventing an answer when the documents do not contain one. You can only catch it with questions whose answer is deliberately absent. The right behaviour is to decline; a system that invents instead must fail.

Senior engineer insight

The thing that changed how I think about RAG evaluation was realising that faithfulness and retrieval recall live in completely different organisational lanes. On one project — a NZ Parliamentary library AI for bill research — the search team hit 94% context recall and declared the pipeline sound. Three weeks later we found the generation layer was inventing legislative intent at around a 20% rate on complex multi-clause queries. They had measured everything they owned and nothing they did not. Good RAG testing requires someone who owns the seam between the two stages, not just each stage independently.

The most common mistake is starting faithfulness evaluation with LLM-as-judge before validating the judge against human annotations — you end up with an expensive metric that correlates poorly with actual user harm.
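Validating a judge can start with something as small as the sketch below: collect per-claim verdicts from human annotators and from the judge on the same claims, then compare them. Raw agreement is shown alongside Cohen's kappa, which corrects for agreement expected by chance. The sample verdicts are invented for illustration.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of claims where the judge's verdict matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for binary supported/unsupported labels."""
    observed = agreement_rate(human, judge)
    n = len(human)
    p_human = sum(human) / n   # share the humans marked "supported"
    p_judge = sum(judge) / n   # share the judge marked "supported"
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters used a single label throughout
    return (observed - expected) / (1 - expected)


# Invented data: humans found 3 unsupported claims; the judge approved everything.
human = [True] * 7 + [False] * 3
judge = [True] * 10

print(agreement_rate(human, judge))  # 0.7
print(cohens_kappa(human, judge))    # 0.0
```

A judge that approves every claim still agrees with the annotators 70% of the time here, yet its kappa is zero: it catches none of the unsupported claims, which is the only thing it was hired to do.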

10 Now You Try

Three graded exercises: spot the failure, fix the metric, build the eval set. Write your answer, run it for AI feedback, then compare to the model answer.

🔍 Exercise 1 of 3 — Spot the Failure in a RAG Output

Below is a question, the context a fictional Revenue NZ help assistant retrieved, and the answer it produced. Identify whether this is a retrieval failure or a generation failure, which RAG metric catches it (faithfulness, answer relevance, context precision, or context recall), and which specific claim in the answer is ungrounded.

Question: “What is the due date for my second provisional tax instalment?”

Retrieved context: “Provisional tax under the standard option is paid in three instalments. The instalment dates for a standard 31 March balance date are 28 August, 15 January, and 7 May.”

Answer given: “Your second provisional tax instalment is due on 15 January. A 10% late-payment penalty applies if you miss it, and interest accrues daily from the due date.”

Diagnose it:

Model answer

Failure stage: Generation. The retrieved context was correct and contained the answer to the actual question (15 January is the second instalment date), so retrieval did its job.

Metric that catches it: Faithfulness. The answer added claims the context does not support.

The ungrounded claims: "A 10% late-payment penalty applies" and "interest accrues daily from the due date." Neither penalties nor interest appear anywhere in the retrieved context. The first sentence (15 January) is faithful; everything after it is invented.

Correct behaviour: Answer only the grounded part — "Your second instalment is due on 15 January" — and either stop, or explicitly say it does not have information about penalties in the retrieved material. This is the classic confident-but-ungrounded pattern: a true, grounded opening sentence followed by fluent invented detail that reads just as authoritatively. Note also that the invented penalty figure could be wrong, which in a tax context is a real harm.

🔧 Exercise 2 of 3 — Choose and Justify the Metrics

A team testing a fictional HealthNZ medicines assistant says: “We read 50 answers and they all sounded accurate, so the RAG system passes.” Explain why that is not a valid RAG evaluation, then specify the four metrics you would measure instead and what each one would catch that reading answers misses.

Their claim: “50 answers read accurate to a clinician on the team, therefore the RAG system is validated for go-live.”

Write your critique and the four-metric plan:

Model answer

Why it is not valid: "Sounded accurate" measures fluency, which is exactly what the model is built to produce — it is not evidence of grounding. A confidently invented answer reads identically to a real one, so reading 50 answers that all sound right is consistent with a system that invents. The team also only tested questions the documents could answer; they never tested the unanswerable case, which is where the worst failure lives.

Metric 1 — Faithfulness: catches answers that assert claims the retrieved data sheet does not support — the invented drug-interaction paragraph. Measured claim-by-claim against the retrieved context, not against the clinician's gut.

Metric 2 — Answer relevance: catches answers that are grounded but do not address the actual question — e.g. answering "how the drug works" when asked "can I take it with X".

Metric 3 — Context recall: catches the case where the information needed was never retrieved (the data sheet did not cover the interaction). This is the root of the medicine-assistant failure and no prompt fix touches it.

Metric 4 — Context precision: catches retrieval handing over mostly-irrelevant chunks, including the risk of fetching an old vs current data sheet, which distracts the model into the wrong document.

Add to the test set: unanswerable/negative questions where the correct behaviour is to decline, multi-document interaction questions, and items with a recorded expected-context field so faithfulness and recall can actually be scored.

🏗️ Exercise 3 of 3 — Build a RAG Eval Set

Design a 5-item RAG eval set for a fictional Benefits NZ benefits help assistant that answers from published benefit-eligibility policy documents. Each item needs: an ID, the question, the ground-truth answer, the expected source/context, and the question type. At least one item must be an unanswerable (negative) question where the correct behaviour is to decline.

Model answer

RAG-01 | Question: How long is the stand-down period before Jobseeker Support starts? | Ground-truth answer: Up to two weeks, depending on income and circumstances. | Expected context: Jobseeker Support eligibility policy, "Stand-down" section. | Type: Factual lookup (single-document)

RAG-02 | Question: Can I get an Accommodation Supplement if I already receive Jobseeker Support? | Ground-truth answer: Yes, the Accommodation Supplement can be paid alongside Jobseeker Support subject to income and asset tests. | Expected context: Accommodation Supplement policy + Jobseeker interaction note. | Type: Multi-document (two policies combined)

RAG-03 | Question: What income limit applies to the Accommodation Supplement for a single person with no children? | Ground-truth answer: [the specific threshold from the current policy table]. | Expected context: Accommodation Supplement income thresholds table, single/no-children row. | Type: Factual lookup with a specific value (tests precise retrieval)

RAG-04 | Question: I'm on Jobseeker Support and want to start part-time study — does that affect my benefit? | Ground-truth answer: Part-time study is generally allowed; full-time study may move you to a different support type. | Expected context: Jobseeker obligations / study rules section. | Type: Conditional / reasoning across one document

RAG-05 (UNANSWERABLE) | Question: What is the exact dollar amount I personally will receive next week? | Ground-truth answer: The assistant cannot answer this — it depends on individual circumstances not in the policy documents; correct behaviour is to decline and direct the person to their MyMSD account or a case manager. | Expected context: none — the documents do not contain personalised payment amounts. | Type: Negative / unanswerable (must decline, not invent)

What makes this strong: a mix of single-document, multi-document, specific-value, and reasoning questions, every item carries an expected-context field so faithfulness and recall can be scored, and RAG-05 deliberately has no answer in the document store so it tests the confident-but-ungrounded failure. A weak eval set is five lookups the documents can all answer.

From the field

A team building a CoverNZ policy assistant assumed their chunking strategy was solid: 512-token chunks with 50-token overlap, a standard setup, and context recall of 89% on their golden set. What they did not test was precision under temporal ambiguity. The vector index held both the pre-2022 and post-2022 CoverNZ treatment injury policy, and similarity search kept retrieving both for the same query, sometimes within the same top-3. The LLM, faced with two contradictory chunks, would synthesise a plausible-sounding middle ground that matched neither. The faithfulness score still looked acceptable because some claims traced to the old chunk and some to the new one. Once they added document effective-date metadata to the retrieval filter and retired superseded chunks, precision jumped from 61% to 88% and the synthesised-contradiction failure disappeared entirely. The lesson that generalises: low precision is often a document-lifecycle problem disguised as a search problem.
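The effective-date filter described above might look something like this sketch. The `Chunk` fields, the document IDs, and the cut-over date are all invented for illustration; the story only says the policy changed in 2022. Many vector stores can apply such a filter inside the query itself, which is usually preferable to filtering after retrieval.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    effective_from: date
    superseded_on: Optional[date] = None  # None means still in force


def in_force(chunk: Chunk, on: date) -> bool:
    """True if the chunk's source document applied on the given date."""
    started = chunk.effective_from <= on
    not_retired = chunk.superseded_on is None or on < chunk.superseded_on
    return started and not_retired


def filter_current(chunks: list[Chunk], on: date) -> list[Chunk]:
    """Drop superseded or not-yet-effective chunks before they reach the model."""
    return [c for c in chunks if in_force(c, on)]


cutover = date(2022, 1, 1)  # hypothetical date, for the example only
top_k = [
    Chunk("treatment-injury-v1", "old wording", date(2015, 1, 1), superseded_on=cutover),
    Chunk("treatment-injury-v2", "current wording", cutover),
]
print([c.doc_id for c in filter_current(top_k, date(2024, 5, 1))])
# ['treatment-injury-v2']
```

Passing the date explicitly, rather than always using today, also lets the same filter answer questions about what the policy said at an earlier point in time.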

Why teams fail here

Measuring retrieval recall and calling it done. Context recall tells you whether the right document was found — it says nothing about whether the model used it faithfully. Teams celebrate 90% recall while 25% of answers contain ungrounded claims the eval never touches.
No unanswerable questions in the eval set. Every item in the set can be answered from the documents, so the “decline gracefully” behaviour is never exercised. The worst RAG failure — confident invention when the documents are silent — passes every test undetected.
Scoring faithfulness against ground-truth rather than the retrieved context. If the answer happens to be factually correct, teams mark it passing. An answer that is true but not supported by what was retrieved is a lucky guess — it will be wrong the next time the model reaches outside its evidence, and it will sound equally confident both times.
Not separating retrieval failures from generation failures in bug reports. “The AI got it wrong” goes to the ML team who tunes the prompt. The prompt change does nothing because the real failure was retrieval — the right context was never fetched. Two sprints later the answer is still wrong and the team has no idea why.
Using LLM-as-judge for faithfulness without calibrating against human ratings first. Automated faithfulness scorers are convenient but not neutral — they share the same tendency to find claims “plausible” rather than verifiably grounded. A faithfulness judge that agrees with human annotators 70% of the time is not a faithfulness metric; it is noise with a name.
Static golden datasets in a changing knowledge base. The eval set was built against version 1 of the documents. Six months later 30% of the source documents have changed or been superseded, but the golden dataset has not. High scores now reflect the system’s memory of old policy, not current grounding quality.
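The last failure in that list, a golden dataset drifting away from the documents it was built on, can be detected mechanically. The sketch below assumes each eval item records the ID of its source document and a hash of that document's text at authoring time. That storage layout is an assumption for the example, not something the lesson prescribes.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_items(golden: dict[str, tuple[str, str]],
                current_docs: dict[str, str]) -> list[str]:
    """Return IDs of eval items whose source document changed or disappeared.

    golden maps item_id -> (doc_id, fingerprint recorded when the item was written).
    current_docs maps doc_id -> current document text.
    """
    stale = []
    for item_id, (doc_id, recorded) in golden.items():
        text = current_docs.get(doc_id)
        if text is None or fingerprint(text) != recorded:
            stale.append(item_id)
    return sorted(stale)


v1 = "Stand-down is up to two weeks."
golden = {
    "RAG-01": ("jobseeker-policy", fingerprint(v1)),
    "RAG-02": ("accommodation-policy", fingerprint("Original supplement rules.")),
}
current = {
    "jobseeker-policy": v1,                                 # unchanged
    "accommodation-policy": "Revised supplement rules.",    # edited since authoring
}
print(stale_items(golden, current))  # ['RAG-02']
```

A stale item is not automatically wrong, but it needs a human to re-confirm the ground-truth answer before its score counts for anything.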

11 Self-Check

Q1: What does faithfulness measure, and why measure it against the retrieved context rather than the truth?

Faithfulness measures whether every claim in the answer is supported by the retrieved context. You measure against the context, not the truth, because an answer that is true but unsupported by what was retrieved is a lucky guess, not grounded behaviour — and luck does not repeat. You are testing whether the system stays inside its evidence.

Q2: A RAG answer is wrong. What is the first question you ask, and why?

“Was the correct information in the retrieved context?” If yes, it is a generation failure (fix the prompt/model/grounding). If no, it is a retrieval failure (fix search, chunking, or the document set). The two need completely different fixes, so this question splits the debugging in two.

Q3: How do context recall and context precision differ, and which stage do they test?

Both test the retrieval stage. Context recall asks: of the information needed, how much did retrieval fetch? (Low recall = the answer was never put in front of the model.) Context precision asks: of what was fetched, how much was actually relevant? (Low precision = the model was distracted by irrelevant or wrong-version chunks.)

Q4: Why can a perfectly faithful RAG answer still fail evaluation?

Because it can fail answer relevance — it can be fully grounded yet not address the question that was asked. Grounded but off-target is still a failure. The floor of RAG quality is both grounded (faithfulness) and on-target (relevance).

Q5: What is the one type of question a RAG eval set must include to catch the confident-but-ungrounded failure?

Unanswerable (negative) questions — questions whose answer is deliberately not in the document store. The correct behaviour is to decline; a system that invents an answer instead fails the most important test. Without these, you are not testing the worst RAG failure at all.

How this has changed

The field moved fast. Here is what the evolution looked like for RAG System Evaluation.

2023

Retrieval-Augmented Generation popularised by LangChain. Testing RAG means checking whether answers were correct — no systematic evaluation framework exists.

2024

RAGAS framework published — first systematic metrics for RAG evaluation (faithfulness, answer relevance, context precision, context recall). Becomes the de facto standard.

2025

Agentic RAG systems (where the agent decides what to retrieve) require more complex evaluation. Evaluation datasets become first-class test artefacts requiring governance.

Now

RAG evaluation is a continuous process, not a one-time check. Production RAG systems have monitoring dashboards tracking faithfulness and hallucination rates in real time.

12 Interview Prep

Real questions asked in NZ QA interviews for AI testing roles. Read the model answers, then practise your own version.

“How would you test a retrieval-augmented chatbot beyond just reading its answers?”

Reading answers only tests fluency, which is what the model is built for — a made-up answer reads exactly like a real one. I’d build a RAG eval set where each item carries the question, the ground-truth answer, and the expected source context, then measure four things: faithfulness (is every claim supported by what was retrieved), answer relevance (does it address the actual question), context recall (was the needed information fetched at all), and context precision (was the retrieved set actually relevant). Crucially I’d include unanswerable questions, because the worst failure is the system inventing an answer the documents do not contain. Those four metrics also tell me which stage broke, not just that it broke.

“A RAG answer was factually correct but you marked it as failing faithfulness. Explain.”

Faithfulness is measured against the retrieved context, not against the truth. If the answer was correct but the supporting fact was not in what the system retrieved, the model produced a correct answer by chance, not by grounding — it reached outside its evidence and happened to land on the truth. That behaviour is unreliable: next time it reaches outside its evidence it will be wrong, and it will sound just as confident. I mark it as failing because I’m testing whether the system stays grounded, and this one did not.

“Our RAG system gives bad answers about half the time. Where do you start?”

I separate retrieval failures from generation failures before touching anything, because they have opposite fixes. For a sample of the bad answers I ask: was the correct information actually in the retrieved context? If recall is low — the right context was never fetched — that’s a retrieval problem, and I’d look at chunking, search, and the document set. If recall is fine but faithfulness is low — the context was there and the model answered beyond it — that’s a generation problem, and I’d look at the prompt and grounding constraints. Throwing prompt changes at a retrieval failure burns a sprint and fixes nothing, so the diagnosis comes first.

Lessons from Production

What teams consistently discover after deploying this in real systems — things that don’t appear in documentation.

The first implementation retrieves too many chunks. Teams tune chunk size and overlap for months after launch. Build the chunking strategy evaluation into pre-launch, not post-launch.
Embedding model quality matters more than retrieval strategy for most use cases. Teams spend weeks optimising BM25 vs vector vs hybrid retrieval while using a suboptimal embedding model. Start with the embedding.
Document staleness is never caught until it causes a visible error in a high-profile case. A document-lifecycle process — how documents are added, updated, and retired — must be designed before launch, not retrofitted after an incident.
Answer faithfulness is the metric that matters and the one nobody measured. Retrieval recall is easy to compute. Faithfulness requires an LLM judge or human annotation, which is why it gets deferred — and then never added.
Re-indexing the entire knowledge base on a schema change takes much longer than expected. What seems like a 4-hour job in development is a 48-hour job in production with 2 million documents. Model this before committing to a schema.
"Retrieval worked" and "answer is correct" need separate ownership. The same team measuring both will conflate them. One team owns the index; another owns the generation evaluation.

Compared to What?

RAG evaluation is distinct from general LLM evaluation because the failure modes span both the retrieval layer and the generation layer. Each has different testing approaches.

Technique	Best for	Weakness
RAG Evaluation (end-to-end; this technique)	Complete RAG pipelines where retrieval quality affects answer quality	Requires golden QA datasets; RAGAS-style metrics need their own validation
Traditional Information Retrieval Evaluation (NDCG, MRR)	Assessing the retrieval component in isolation	Does not capture how well the LLM uses retrieved context; misses hallucination
LLM Benchmark / Closed-Book Evaluation	Measuring the model's pre-trained knowledge without retrieval	Cannot evaluate the retrieval augmentation — you're testing the model, not the RAG system
Human Q&A Review	Final-quality checks on answer helpfulness and accuracy	Slow; expensive; cannot scale; best used to validate automated metrics
Fine-Tuning Instead of RAG	Embedding domain knowledge into the model weights	Expensive; knowledge becomes stale between fine-tuning runs; not suitable for rapidly-changing content

RAGAS-style metrics (faithfulness, answer relevance, context precision) are useful starting points but they use an LLM as the judge — validate them against human ratings before trusting them in CI.
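One way to validate an LLM judge against human ratings is to have both label the same answers and compute chance-corrected agreement. The sketch below uses Cohen's kappa in plain Python. The labels, the sample data, and the 0.6 acceptance threshold are illustrative assumptions, not values taken from RAGAS or from the standard.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of items where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Agreement expected if both raters labelled at random with their own base rates.
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness labels for the same eight answers.
judge = ["faithful", "faithful", "unfaithful", "faithful",
         "unfaithful", "faithful", "unfaithful", "faithful"]
human = ["faithful", "unfaithful", "unfaithful", "faithful",
         "unfaithful", "faithful", "faithful", "faithful"]

MIN_KAPPA = 0.6  # assumed acceptance threshold; choose your own
kappa = cohen_kappa(judge, human)
judge_trusted = kappa >= MIN_KAPPA
print(f"kappa={kappa:.2f} trusted_in_ci={judge_trusted}")
```

Raw agreement here is 6 of 8, which looks acceptable, but kappa is well below the assumed threshold because both raters say "faithful" most of the time. That gap is the reason to prefer a chance-corrected statistic over simple percent agreement.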

When Not to Use This

Experience is knowing when a technique is not the right tool. Skip this one when:

Small, stable knowledge bases

If your knowledge base has fewer than 200 documents and changes rarely, simple keyword search with exact-match evaluation may be more reliable than embedding-based RAG. The overhead of RAG evaluation is disproportionate to the benefit.

When fine-tuning outperforms retrieval for your task

For highly specialised domains where the required knowledge has a very specific structure (medical coding, legal citation), a fine-tuned model may outperform RAG. Evaluate both before committing to RAG architecture.

Fully deterministic lookups

If the question always maps to one exact record (e.g., "what is the interest rate for product X?"), a structured database query with a thin formatting layer is more reliable than RAG. Reserve RAG for queries that genuinely require synthesising across multiple documents.

When you cannot build a golden dataset

RAG evaluation depends on knowing what the correct answer should be. If your domain is too complex or too new to produce annotated QA pairs, you cannot evaluate RAG quality rigorously — and "feels right" is not a safe evaluation strategy for production systems.

At Enterprise Scale

🏢 Enterprise Context

10 million documents in the knowledge base · 30 product teams using shared RAG infrastructure · 6 document types (policies, contracts, FAQs, forms, case notes, legislation)

At enterprise scale, RAG becomes an infrastructure and governance problem. A shared RAG platform serving 30 product teams must handle document-level access control (team A's documents must never appear in team B's retrieval results), versioning (when a policy document is updated, old versions must be retired cleanly from the index), and staleness detection (a retrieved document that was accurate six months ago may now be superseded).
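The access-control requirement can be turned into an automated check: for each test query, assert that nothing in the retrieval results belongs to another team. This is a minimal sketch. `RetrievedChunk`, its `owner_team` field, and the sample results are hypothetical; a real platform would read ownership from its own index metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    doc_id: str
    owner_team: str  # hypothetical metadata field recording document ownership
    text: str


def find_access_leaks(results, requesting_team):
    """Return every retrieved chunk owned by a team other than the requester."""
    return [chunk for chunk in results if chunk.owner_team != requesting_team]


# Hypothetical retrieval results for a query issued by team-b.
results_for_team_b = [
    RetrievedChunk("pol-104", "team-b", "Leave policy text"),
    RetrievedChunk("con-220", "team-a", "Supplier contract text"),
]

leaks = find_access_leaks(results_for_team_b, "team-b")
for chunk in leaks:
    print(f"LEAK: {chunk.doc_id} belongs to {chunk.owner_team}")
```

In a test suite the assertion would be that `leaks` is empty for every query and every team, so a single leaked chunk fails the build.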

The evaluation challenge at scale is golden dataset maintenance. With 10 million documents across 6 types, a static golden QA dataset becomes unrepresentative within months as documents are added, updated, and retired. Enterprise RAG needs a continuous evaluation pipeline: randomly sampled queries from live traffic are annotated by domain experts, scored against the current retrieval results, and tracked over time for drift.
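The sampling and drift-tracking steps of such a pipeline might look like the sketch below. Expert annotation and scoring are left out. The function names, the window of three runs, the 0.05 tolerance, and the score history are all illustrative assumptions.

```python
import random
from statistics import mean


def sample_queries(live_queries, k, seed):
    """Draw a reproducible random sample of live queries for annotation."""
    rng = random.Random(seed)
    return rng.sample(live_queries, min(k, len(live_queries)))


def detect_drift(history, window=3, max_drop=0.05):
    """Flag drift when the mean of the last `window` runs falls more than
    `max_drop` below the mean of all earlier runs."""
    if len(history) <= window:
        return False  # not enough runs to form a baseline
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return (baseline - recent) > max_drop


# Hypothetical faithfulness score per evaluation run, oldest first.
faithfulness_by_run = [0.91, 0.90, 0.92, 0.89, 0.84, 0.82, 0.80]
drifting = detect_drift(faithfulness_by_run)

batch = sample_queries([f"query-{i}" for i in range(1000)], k=200, seed=7)
print(f"sampled={len(batch)} drifting={drifting}")
```

Fixing the seed per run keeps each sample reproducible for audit purposes, while changing it between runs keeps the samples independent.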

The most common failure mode at enterprise scale is retrieval success hiding answer failure. Teams measure "did we retrieve a relevant document?" and celebrate high recall. They do not measure "did the model correctly synthesise across three conflicting retrieved documents?" or "did the model hallucinate despite low-relevance retrieval?" At scale these gaps compound: a system with 92% retrieval recall and 70% faithful synthesis gives an unfaithful answer on 30% of the queries where the right document was retrieved, so at most about 64% of all queries (0.92 × 0.70) are both correctly retrieved and faithfully answered.
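The compounding of retrieval recall and faithfulness can be broken down by failure source. This is a minimal sketch with two simplifying assumptions: faithfulness is measured only on queries where the right document was retrieved, and a retrieval miss never produces a correct answer. `failure_breakdown` is a hypothetical helper.

```python
def failure_breakdown(retrieval_recall, faithful_given_retrieved):
    """Split all queries into three outcomes, as fractions that sum to 1."""
    return {
        # Right document retrieved and the answer stayed faithful to it.
        "retrieved_and_faithful": retrieval_recall * faithful_given_retrieved,
        # Right document retrieved, but the answer went beyond or against it.
        "retrieved_but_unfaithful": retrieval_recall * (1 - faithful_given_retrieved),
        # Right document never retrieved.
        "not_retrieved": 1 - retrieval_recall,
    }


breakdown = failure_breakdown(0.92, 0.70)
for outcome, share in breakdown.items():
    print(f"{outcome}: {share:.1%}")
```

With these inputs, retrieval misses account for about 8% of queries and unfaithful answers for about 28%, so a dashboard showing only recall reports the smaller of the two problems.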

Failure Analysis

📋 Post-Mortem

The Policy Chatbot That Cited Superseded Legislation

A government agency deployed a RAG-based chatbot to help the public navigate benefit entitlements. The knowledge base was populated with policy documents, legislation, and FAQs. At launch, the team measured retrieval quality and found 88% of queries retrieved at least one relevant document.

What happened: Three months after launch, a policy change came into effect. New documents were added to the knowledge base, but the old documents were not removed or flagged as superseded. For the next two months, the chatbot cited both the old and new policy, sometimes giving contradictory advice in the same response.
Why tests missed it: The RAG evaluation tested retrieval recall and answer faithfulness at a single point in time — the launch date. No ongoing evaluation ran after launch. No document-lifecycle process existed: old policy documents were not retired from the index when new ones replaced them.
Root cause: Two gaps: (1) no document lifecycle management — superseded documents remained in the vector index; (2) no continuous evaluation — quality was checked at launch, not maintained over time.
Fix: A document metadata schema was added including effective_date and superseded_by fields. The retrieval pipeline filters out documents where superseded_by is set. A monthly evaluation run automatically samples 200 live queries and checks for citation of known-superseded documents.
Lesson: A RAG system's knowledge quality degrades as the real world changes. Evaluating at launch is necessary but not sufficient. The knowledge lifecycle — how documents are added, updated, retired, and verified — is as important to system quality as the embedding model or the chunking strategy.
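The fix can be sketched in a few lines. The `effective_date` and `superseded_by` field names come from the post-mortem. The `PolicyDoc` class, both helper functions, and the sample documents are hypothetical illustrations rather than the agency's implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyDoc:
    doc_id: str
    effective_date: date
    superseded_by: Optional[str] = None  # doc_id of the replacement, if any


def filter_current(docs):
    """Retrieval-time filter: drop documents where superseded_by is set."""
    return [d for d in docs if d.superseded_by is None]


def superseded_citations(cited_doc_ids, docs_by_id):
    """Evaluation-time check: which cited documents are known to be superseded?"""
    return [
        doc_id for doc_id in cited_doc_ids
        if doc_id in docs_by_id and docs_by_id[doc_id].superseded_by is not None
    ]


# Hypothetical index contents after a policy change.
docs = [
    PolicyDoc("benefit-2023", date(2023, 4, 1), superseded_by="benefit-2024"),
    PolicyDoc("benefit-2024", date(2024, 4, 1)),
]
docs_by_id = {d.doc_id: d for d in docs}

current = filter_current(docs)
stale = superseded_citations(["benefit-2023", "benefit-2024"], docs_by_id)
print([d.doc_id for d in current], stale)
```

The two functions cover different layers: the filter prevents stale documents from reaching the model, and the citation check catches cases where the filter was bypassed or the metadata was never set.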

Why the Business Cares

Accuracy and liability

A RAG system that cites real documents incorrectly or uses superseded policy creates legal and compliance exposure — particularly in regulated domains where cited guidance must be current and accurate.

Customer trust

Users who discover that a system confidently cited the wrong section of a document — or an outdated version — lose trust in everything the system has told them, not just the specific error.

Operational cost

Re-indexing large knowledge bases, re-annotating golden datasets, and remediating staleness incidents are expensive. A document lifecycle process prevents most of that cost.

Competitive differentiation

RAG systems that stay current and cite accurately are a genuine differentiator in knowledge-intensive domains. Many deployed RAG systems accumulate staleness debt within six months of launch.

Key takeaway

A fluent answer and a grounded answer are indistinguishable by reading — the entire discipline of RAG evaluation exists because your gut cannot tell them apart, and in regulated domains the cost of getting that wrong falls on the person who trusted the answer.

You can now measure whether an AI’s answer is grounded in what it retrieved. The next lesson takes the attacker’s view of the same retrieval path — Prompt Injection Testing tests what happens when the content the system reads contains hostile instructions designed to override its behaviour.
