When a RAG answer is wrong, two failures look identical from the outside. The retriever did not surface the relevant chunk and the generator made something up. Or the retriever surfaced the right chunk and the generator ignored it. End-to-end scoring cannot distinguish those cases, which means end-to-end scoring cannot tell you what to fix. The two-layer split is the diagnostic that recovers the signal.
Why one number is the wrong number
A pass-rate of 0.74 on end-to-end RAG answers is an aggregate over at least three independent failure modes (retrieval miss, generation ignoring context, generation inventing facts). Move the retriever's top-K from three to five and the pass rate goes up by two points. Was that because the retriever found more relevant chunks, or because the generator got more shots at one of them, or because adding noise pushed the generator to hedge instead of fabricate? You cannot tell. You ship the change anyway. Next quarter you discover the generator got more verbose and stopped citing sources, and you cannot tell whether that started with the K change or with the model upgrade in week six.
The fix is to score retrieval and generation as separate experiments with separate metrics, and only combine them into an end-to-end number once each layer is independently passing its own bar [1]. The RAG-as-a-pipeline literature (Ragas, ARES, Self-RAG) converges on this split because every team that tried to skip it eventually had to come back and do it [2] [3].
The two layers
```mermaid
flowchart LR
    subgraph Retrieval["Retrieval layer"]
        Q[Query] --> EMB[Embed/encode]
        EMB --> IDX[(Index search)]
        IDX --> RANK[Top-K chunks]
        RANK --> RM["Metrics:<br/>Recall@K · MRR · NDCG<br/>Hit-Rate · Context Precision"]
    end
    subgraph Generation["Generation layer"]
        RANK --> PROMPT[Prompt + context]
        PROMPT --> LLM[LLM]
        LLM --> ANS[Answer + citations]
        ANS --> GM["Metrics:<br/>Faithfulness · Answer Relevance<br/>Context Utilization · Citation Accuracy"]
    end
    RM -.diagnoses.-> RFAIL[Retrieval failure]
    GM -.diagnoses.-> GFAIL[Generation failure]
```
The retrieval layer takes the query and returns a ranked list of chunks. The generation layer takes the chunks plus the original query and returns an answer. Each layer has its own scoring set, its own metrics, and its own bar to clear.
Retrieval metrics: the catalogue
Information retrieval has spent thirty years on this; do not invent new metrics, use the ones that are already validated [4].
Recall@K answers the question "what fraction of the relevant chunks appeared in the top K?" With one ground-truth chunk per query, that reduces to "did the correct chunk appear in the top K?" The metric is set-valued (order inside the top K does not matter), and the dataset needs a ground-truth chunk id per query. The standard configuration is to report Recall@1, Recall@3, Recall@5, and Recall@10. The right K is product-specific: a chat UI that fits two chunks in context cares about Recall@2; a long-context system that pours twenty chunks in cares about Recall@20.
Mean Reciprocal Rank (MRR) answers "if the chunk was in the top K, how high was it?" Rank-sensitive; rewards getting the right chunk to position one. Useful when the generator only really reads the first chunk in practice.
Normalized Discounted Cumulative Gain (NDCG) generalizes to graded relevance (multiple chunks can be partially relevant). Use it when the ground truth labels each candidate chunk with a relevance score, not when ground truth is one-hot.
Hit-Rate is the binary version of Recall@K: did any relevant chunk appear in the top K? Cruder than Recall@K but useful for fast smoke tests in CI.
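Context Precision is the Ragas name for precision at the retrieval stage: "of the chunks I returned, what fraction were actually relevant?" The Ragas implementation is rank-aware, so relevant chunks near the top of the list count for more. It pairs naturally with Recall@K as the precision-recall tradeoff at the retrieval stage [5].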
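The catalogue above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy chunk ids, and the choice of a plain (not rank-weighted) precision are assumptions made for the example. MRR is the mean of `reciprocal_rank` over all queries in the set.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if any relevant chunk id appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k chunks that are relevant (plain, not rank-weighted)."""
    top = retrieved[:k]
    if not top:
        return 0.0
    relevant = set(relevant)
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def ndcg_at_k(retrieved, gains, k):
    """NDCG with graded relevance; gains maps chunk id -> relevance score."""
    dcg = sum(
        gains.get(chunk_id, 0.0) / math.log2(i + 2)
        for i, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example with made-up chunk ids.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = ["c2", "c4", "c8"]
gains = {"c2": 3.0, "c4": 1.0}

print("recall@5   ", recall_at_k(retrieved, relevant, 5))
print("hit-rate@5 ", hit_rate_at_k(retrieved, relevant, 5))
print("RR         ", reciprocal_rank(retrieved, relevant))
print("precision@5", precision_at_k(retrieved, relevant, 5))
print("NDCG@5     ", ndcg_at_k(retrieved, gains, 5))
```

Keeping each metric as a pure function over lists of ids makes it easy to run the same scorer in CI and in offline analysis.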
Generation metrics: the catalogue
Generation metrics assume the chunks in front of the LLM are the right ones (or at least the only chunks it had access to). The interesting question is what the LLM did with them.
Faithfulness is the headline. Every claim in the answer should be supported by the retrieved context. A binary judge per atomic fact is the right shape; details are in generation-faithfulness.
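The "binary judge per atomic fact" shape can be sketched as a ratio over per-claim verdicts. In the sketch below the judge is injected as a callable; `toy_judge` is a deliberately crude word-overlap stand-in for an LLM judge, and both names, along with the choice to score an empty claim list as 0.0, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def faithfulness(
    claims: Sequence[str],
    contexts: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool],
) -> float:
    """Share of atomic claims that the judge marks as supported by the contexts."""
    if not claims:
        # Convention chosen for this sketch: nothing checkable scores 0.0.
        return 0.0
    verdicts = [bool(judge(claim, contexts)) for claim in claims]
    return sum(verdicts) / len(verdicts)


def toy_judge(claim: str, contexts: Sequence[str]) -> bool:
    """Stand-in for an LLM judge: every word of the claim must occur in one context."""
    words = set(claim.lower().split())
    return any(words <= set(ctx.lower().split()) for ctx in contexts)


contexts = ["bake at 350 degrees f for 8 to 10 minutes"]
claims = ["bake at 350 degrees f", "bake for 20 minutes"]
score = faithfulness(claims, contexts, toy_judge)
print(score)
```

Because the judge is a parameter, the same scoring function works unchanged when the toy judge is swapped for a model-backed one.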
Answer Relevance is whether the answer addresses the user's question. An answer can be perfectly faithful to the context (every fact supported) and still useless if it dodges the question.
Context Utilization is whether the answer actually used the most relevant parts of the context, or filled in from parametric memory and added the citation as decoration. The Ragas implementation uses an LLM judge over the answer and the retrieved chunks [5].
Citation Accuracy is whether the spans the answer cites actually contain the claims attributed to them. This is a string-match or semantic-match check on the cited spans, not an LLM judge.
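A string-match version of the Citation Accuracy check might look like the following. It assumes citations arrive as `(quoted_span, chunk_id)` pairs and that a lenient, case-insensitive match is acceptable; the function names and the data shapes are hypothetical and would need adapting to however your system records citations.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and reduce every run of non-alphanumerics to a single space."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def citation_accuracy(citations, chunks_by_id) -> float:
    """Fraction of (span, chunk_id) citations whose span occurs in the cited chunk."""
    if not citations:
        return 0.0
    hits = 0
    for span, chunk_id in citations:
        needle = _normalize(span)
        haystack = _normalize(chunks_by_id.get(chunk_id, ""))
        # Pad with spaces so the match respects token boundaries.
        if needle and f" {needle} " in f" {haystack} ":
            hits += 1
    return hits / len(citations)


chunks_by_id = {
    "c1": "Gingerbread Castle Cookies: bake at 350 degrees F for 8 to 10 minutes.",
    "c2": "Royal icing recipe: combine 2 cups powdered sugar.",
}
citations = [
    ("bake at 350 degrees F", "c1"),   # span is in the cited chunk
    ("8 to 10 minutes", "c2"),         # span exists, but in a different chunk
    ("2 cups powdered sugar", "c9"),   # cited chunk id does not exist
]
accuracy = citation_accuracy(citations, chunks_by_id)
print(accuracy)
```

A semantic-match variant would replace the substring test with an embedding similarity threshold while keeping the same outer loop.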
A diagnostic flow
When an end-to-end answer fails, per-layer checks attribute the failure to one layer or the other.
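```python
# diagnose_rag.py
from dataclasses import dataclass
from typing import Literal


@dataclass
class Diagnosis:
    layer: Literal["RETRIEVAL", "GENERATION", "BOTH", "PASS"]
    detail: str


def diagnose(example, system, is_supported, atomic_facts, k: int = 5) -> Diagnosis:
    """Attribute one eval example to the layer that failed.

    is_supported(claim, evidence) -> bool and atomic_facts(text) -> list[str]
    are judge helpers that this snippet does not define; pass in your own.
    The judge receives either the retrieved chunks or the answer's claims
    as evidence.
    """
    query = example["input"]
    gt_chunk_id = example["expected_output"]["ground_truth_chunk_id"]
    gt_claims = example["expected_output"]["atomic_claims"]

    # Retrieval layer: did the ground-truth chunk make the top k?
    chunks = system.retrieve(query, k=k)
    retrieved_ids = [c.id for c in chunks]
    if gt_chunk_id not in retrieved_ids:
        return Diagnosis("RETRIEVAL", f"ground-truth chunk {gt_chunk_id} not in top-{k} {retrieved_ids}")

    # Generation layer: the right chunk was available, so judge the answer.
    answer = system.generate(query, chunks)
    answer_claims = atomic_facts(answer)
    # Gold claims that the answer fails to state.
    missing = [c for c in gt_claims if not is_supported(c, answer_claims)]
    # Answer claims that the retrieved chunks do not back up.
    fabricated = [c for c in answer_claims if not is_supported(c, chunks)]

    # Retrieval failures return early, so BOTH here means both generation
    # failure modes (omission and fabrication) hit the same example.
    if fabricated and missing:
        return Diagnosis("BOTH", f"missing supported claims {missing}; fabricated {fabricated}")
    if fabricated:
        return Diagnosis("GENERATION", f"fabricated unsupported claims: {fabricated}")
    if missing:
        return Diagnosis("GENERATION", f"answer missing supported claims: {missing}")
    return Diagnosis("PASS", "retrieval and generation both OK")
```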
The function returns one of four labels. The two-layer split turns a vague "the answer was wrong" into a precise "the retriever did not find the right chunk" or "the generator had the right chunk and ignored it." A regression run that emits these labels lets you graph retrieval failures separately from generation failures across releases. When the retrieval-failure curve moves, you tune the retriever. When the generation-failure curve moves, you tune the prompt or the judge. Conflating them throws away the information [6].
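Graphing the labels across releases starts with turning each regression run into per-label rates. The sketch below assumes each run is simply a list of the four label strings; the helper names, the release names, and the 0.05 movement threshold are illustrative choices, not values from any particular tool.

```python
from collections import Counter

LABELS = ("RETRIEVAL", "GENERATION", "BOTH", "PASS")


def label_rates(labels):
    """Share of examples carrying each diagnosis label in one run."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def moved(previous, current, min_delta=0.05):
    """Labels whose rate changed by more than min_delta between two runs."""
    return {
        label: current[label] - previous[label]
        for label in LABELS
        if abs(current[label] - previous[label]) > min_delta
    }


# Made-up label lists for two releases.
runs = {
    "v1": ["PASS"] * 7 + ["RETRIEVAL"] * 2 + ["GENERATION"] * 1,
    "v2": ["PASS"] * 6 + ["RETRIEVAL"] * 2 + ["GENERATION"] * 2,
}
rates = {release: label_rates(labels) for release, labels in runs.items()}
delta = moved(rates["v1"], rates["v2"])
print(delta)
```

In this toy data the retrieval rate is flat while the generation rate rises, which is the pattern that points at the prompt instead of the retriever.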
A runnable two-layer scorer with Ragas
Ragas is the framework that operationalizes the split. The library exposes retrieval metrics, generation metrics, and combined scorecards in one interface [7].
```python
# score_two_layer.py
# These metrics call an LLM (and embeddings for answer_relevancy), so model
# credentials must be configured before this runs.
# The column names below follow the newer Ragas sample schema; older releases
# expect question / answer / contexts / ground_truth. Check your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

samples = [
    {
        "user_input": "What is the bake temperature for the gingerbread castle cookies?",
        "reference": "350 degrees F for 8 to 10 minutes.",
        "response": "Bake them at 350F for about 8 to 10 minutes.",
        "retrieved_contexts": [
            "Gingerbread Castle Cookies: bake at 350 degrees F for 8 to 10 minutes...",
            "Royal icing recipe: combine 2 cups powdered sugar...",
        ],
    },
    # ... more samples
]

dataset = Dataset.from_list(samples)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_recall,     # retrieval layer
        context_precision,  # retrieval layer
        faithfulness,       # generation layer
        answer_relevancy,   # generation layer
    ],
)

# One row per sample, one column per metric.
print(result.to_pandas()[["context_recall", "context_precision",
                          "faithfulness", "answer_relevancy"]])
```
The four metrics produce four columns. The retrieval columns answer "did we surface the right thing." The generation columns answer "did we use it correctly." Reading them side by side is the diagnostic. If context_recall is high but faithfulness is low, the prompt is the problem. If faithfulness is high but context_recall is low, the retriever is the problem. If both are low, the chunking strategy probably hides the answer across multiple chunks and you have an upstream architecture decision to make.
TIP
Pin your eval result to four numbers, not one. The end-to-end pass rate hides which layer regressed; the per-layer numbers tell you what to fix. Display all four on every PR.
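The side-by-side reading of the retrieval and generation columns can be written down as a small decision rule. This is a sketch only: the function name is hypothetical, and the single 0.7 cutoff for "high" versus "low" is an assumption chosen for the example, not a recommended threshold.

```python
def read_row(row, bar=0.7):
    """Map one row of per-layer scores to the layer that most likely needs work."""
    recall_ok = row["context_recall"] >= bar
    faithful_ok = row["faithfulness"] >= bar
    if recall_ok and faithful_ok:
        return "ok"
    if recall_ok:
        # The right context was retrieved but the answer strayed from it.
        return "prompt"
    if faithful_ok:
        # The answer stuck to its context, but the context was incomplete.
        return "retriever"
    # Both low: the answer may be split across chunks.
    return "chunking"


# Made-up score rows.
rows = [
    {"context_recall": 0.95, "faithfulness": 0.40},
    {"context_recall": 0.30, "faithfulness": 0.92},
    {"context_recall": 0.20, "faithfulness": 0.35},
    {"context_recall": 0.90, "faithfulness": 0.95},
]
verdicts = [read_row(r) for r in rows]
print(verdicts)
```

With a pandas DataFrame of per-sample scores, the same function can be applied row-wise with `df.apply(read_row, axis=1)`.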
The joint case: when the layers interact
The clean split has one important crack. Some queries cannot be answered by any single chunk, and the right behavior is for the generator to synthesize across multiple chunks. In that regime, retrieval recall is high (the chunks are there), generation faithfulness is high per chunk, and yet the final answer is wrong because the model failed to combine. The two-layer score reads as a pass on each layer and a fail on the end-to-end answer.
Two countermoves help. First, add an explicit "multi-chunk synthesis" slice to the eval set, tagged so you can read its end-to-end pass rate separately. Second, add an answer_correctness judge over the final answer that does not look at retrieval at all; it compares the answer to the gold reference. When the two-layer scores pass and answer_correctness fails, you have a synthesis problem and the fix is usually a prompt change (add a chain-of-thought step that lists evidence, then concludes).
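Both countermoves reduce to small bookkeeping functions. The sketch below assumes each eval result carries a list of `tags` and a boolean `passed`, and that per-example scores are available under the keys shown; the tag name, the key names, and the two cutoffs are hypothetical.

```python
def slice_pass_rates(results):
    """End-to-end pass rate per tag, so a tagged slice can be read on its own."""
    totals, passes = {}, {}
    for result in results:
        for tag in result["tags"]:
            totals[tag] = totals.get(tag, 0) + 1
            passes[tag] = passes.get(tag, 0) + (1 if result["passed"] else 0)
    return {tag: passes[tag] / totals[tag] for tag in totals}


def looks_like_synthesis_failure(scores, layer_bar=0.8, correctness_bar=0.5):
    """True when both layers pass but the final answer is judged incorrect."""
    layers_pass = (
        scores["context_recall"] >= layer_bar
        and scores["faithfulness"] >= layer_bar
    )
    return layers_pass and scores["answer_correctness"] < correctness_bar


# Made-up results: the multi-chunk slice fails more often than the rest.
results = [
    {"tags": ["single-chunk"], "passed": True},
    {"tags": ["single-chunk"], "passed": True},
    {"tags": ["multi-chunk-synthesis"], "passed": False},
    {"tags": ["multi-chunk-synthesis"], "passed": True},
]
rates = slice_pass_rates(results)
print(rates)

flag = looks_like_synthesis_failure(
    {"context_recall": 0.95, "faithfulness": 0.92, "answer_correctness": 0.2}
)
print(flag)
```

Keeping the slice tag on the example itself means the same eval run yields both the overall number and the synthesis-only number.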
A practical bar to clear
The bar a serious RAG pipeline should clear on its first calibrated dataset is roughly the following. Retrieval Recall@5 above 0.85 on the labeled set, with MRR above 0.6 so the right chunk is actually near the top. Context Precision above 0.5 so the prompt is not drowning in noise. Generation Faithfulness above 0.90 on supported answers, and a separate Answer Relevance above 0.85. The end-to-end pass rate that these thresholds imply will land somewhere in the 0.7 to 0.85 range depending on how tightly your gold set is labeled and how strict your faithfulness verifier is.
Numbers below those thresholds point to specific work. Recall@5 below 0.85: revisit chunking and the embedding model. Context Precision below 0.5: rerank or filter. Faithfulness below 0.9: tighten the system prompt with explicit citation requirements, and consider a verifier-based decoding step. Answer Relevance below 0.85: the prompt is letting the model answer a different question than the user asked.
The numbers are not magic. They are the thresholds where, in practice, the per-layer signal becomes strong enough that further tuning of a single layer has diminishing returns and you should move attention to the next bottleneck. Treat them as a starting calibration; tune them down for domains where the gold set is necessarily noisier (subjective queries, multi-document reasoning) and up for domains where the right answer is unambiguous (factoid lookup, structured extraction).
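The starting calibration can be kept as data next to the eval code, so the thresholds are easy to tune per domain. In this sketch the metric keys and the helper name are hypothetical, the thresholds are the ones stated above, and the MRR entry in `NEXT_STEP` is an assumption because the text names no specific fix for it.

```python
BAR = {
    "recall_at_5": 0.85,
    "mrr": 0.60,
    "context_precision": 0.50,
    "faithfulness": 0.90,
    "answer_relevance": 0.85,
}

NEXT_STEP = {
    "recall_at_5": "revisit chunking and the embedding model",
    "mrr": "inspect ranking order (assumption: no fix is named for MRR)",
    "context_precision": "rerank or filter",
    "faithfulness": "tighten the system prompt with explicit citation requirements",
    "answer_relevance": "check that the prompt keeps the model on the user's question",
}


def below_bar(scores, bar=BAR):
    """List (metric, value, threshold, next step) for every reported metric under its bar."""
    findings = []
    for metric, threshold in bar.items():
        value = scores.get(metric)
        # Metrics that were not reported are skipped, not failed.
        if value is not None and value < threshold:
            findings.append((metric, value, threshold, NEXT_STEP[metric]))
    return findings


# Made-up scores for one run.
scores = {
    "recall_at_5": 0.88,
    "mrr": 0.64,
    "context_precision": 0.42,
    "faithfulness": 0.93,
    "answer_relevance": 0.80,
}
for metric, value, threshold, step in below_bar(scores):
    print(f"{metric}: {value:.2f} < {threshold:.2f} -> {step}")
```

Passing a different `bar` dictionary is one way to loosen the thresholds for noisier domains or tighten them for factoid lookup.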
What good looks like
A team that has internalized this split runs every RAG release through two scorecards. Retrieval scorecard: Recall@K, MRR, Context Precision. Generation scorecard: Faithfulness, Answer Relevance, Context Utilization. The end-to-end number is a derived metric, useful for the executive summary, not for diagnosis. PR gates fire on per-layer regressions. Postmortems start with "which scorecard moved?" The team can answer the question "what should we tune first?" in less than a minute. That is the payoff of evaluating in two layers.
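A PR gate that fires on per-layer regressions could be as small as the following sketch. It assumes baseline and candidate scores are plain dictionaries keyed by metric name; the metric keys, the function name, and the 0.02 tolerance are hypothetical choices for illustration.

```python
SCORECARDS = {
    "retrieval": ("recall_at_k", "mrr", "context_precision"),
    "generation": ("faithfulness", "answer_relevance", "context_utilization"),
}


def scorecards_that_moved(baseline, candidate, tolerance=0.02):
    """Per scorecard, the metrics that dropped by more than the tolerance."""
    moved = {}
    for card, metrics in SCORECARDS.items():
        drops = {
            metric: round(candidate[metric] - baseline[metric], 4)
            for metric in metrics
            if metric in baseline
            and metric in candidate
            and baseline[metric] - candidate[metric] > tolerance
        }
        if drops:
            moved[card] = drops
    return moved


# Made-up scores for the main branch and a candidate PR.
baseline = {"recall_at_k": 0.88, "mrr": 0.64, "context_precision": 0.55,
            "faithfulness": 0.93, "answer_relevance": 0.87, "context_utilization": 0.80}
candidate = {"recall_at_k": 0.87, "mrr": 0.65, "context_precision": 0.55,
             "faithfulness": 0.86, "answer_relevance": 0.87, "context_utilization": 0.80}

regressed = scorecards_that_moved(baseline, candidate)
print(regressed)
```

An empty result means neither scorecard regressed beyond the tolerance; a non-empty one answers "which scorecard moved?" directly.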