Eval RAG in two layers

Separate retrieval and generation or you cannot fix what breaks.

When a RAG answer is wrong, two failures look identical from the outside. The retriever did not surface the relevant chunk and the generator made something up. Or the retriever surfaced the right chunk and the generator ignored it. End-to-end scoring cannot distinguish those cases, which means end-to-end scoring cannot tell you what to fix. The two-layer split is the diagnostic that recovers the signal.

Why one number is the wrong number

A pass-rate of 0.74 on end-to-end RAG answers is an aggregate over at least three independent failure modes (retrieval miss, generation ignoring context, generation inventing facts). Move the retriever's top-K from three to five and the pass rate goes up by two points. Was that because the retriever found more relevant chunks, or because the generator got more shots at one of them, or because adding noise pushed the generator to hedge instead of fabricate? You cannot tell. You ship the change anyway. Next quarter you discover the generator got more verbose and stopped citing sources, and you cannot tell whether that started with the K change or with the model upgrade in week six.

The fix is to score retrieval and generation as separate experiments with separate metrics, and to combine them into an end-to-end number only once each layer independently passes its own bar ^[1]. The RAG-as-a-pipeline literature (Ragas, ARES, Self-RAG) converges on this split, and teams that skip it tend to come back and do it later ^[2] ^[3].

The two layers

flowchart LR
  subgraph Retrieval["Retrieval layer"]
    Q[Query] --> EMB[Embed/encode]
    EMB --> IDX[(Index search)]
    IDX --> RANK[Top-K chunks]
    RANK --> RM["Metrics:<br/>Recall@K · MRR · NDCG<br/>Hit-Rate · Context Precision"]
  end
  subgraph Generation["Generation layer"]
    RANK --> PROMPT[Prompt + context]
    PROMPT --> LLM[LLM]
    LLM --> ANS[Answer + citations]
    ANS --> GM["Metrics:<br/>Faithfulness · Answer Relevance<br/>Context Utilization · Citation Accuracy"]
  end
  RM -.diagnoses.-> RFAIL[Retrieval failure]
  GM -.diagnoses.-> GFAIL[Generation failure]

The retrieval layer takes the query and returns a ranked list of chunks. The generation layer takes the chunks plus the original query and returns an answer. Each layer has its own scoring set, its own metrics, and its own bar to clear.

Retrieval metrics: where to go

Information retrieval has spent thirty years on this; use the validated metrics rather than inventing new ones ^[4]. Each answers a different question and has its own page: Recall@K for coverage, Precision@K for signal purity, MRR for how high the first relevant chunk ranks, and NDCG for graded relevance and ordering. The retrieval metrics hub compares them and states which K to report; context precision and recall covers the Ragas LLM-judged variants. When these numbers drop, retriever debugging is the elimination path.
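To make the four retrieval metrics named above concrete, here is a minimal sketch of each using their standard textbook definitions. The chunk ids, the graded `gains` values, and the function names are illustrative assumptions, not part of any particular library.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Coverage: fraction of the relevant ids that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

def precision_at_k(retrieved, relevant, k):
    """Signal purity: fraction of the k result slots filled by relevant ids."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, gains, k):
    """Graded relevance with a log2 position discount, normalized by the ideal order."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: three labeled-relevant chunks, two of them retrieved.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
gains = {"c2": 3, "c4": 1, "c8": 2}  # assumed graded relevance labels

print("Recall@5   ", round(recall_at_k(retrieved, relevant, 5), 3))
print("Precision@5", precision_at_k(retrieved, relevant, 5))
print("RR         ", reciprocal_rank(retrieved, relevant))
print("NDCG@5     ", round(ndcg_at_k(retrieved, gains, 5), 3))
```

Note that Precision@K here divides by K even when fewer than K chunks come back; some implementations divide by the number actually returned, so state which convention you report.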

Generation metrics: where to go

Generation metrics assume the chunks in front of the model are the right ones; the question is what the model did with them. They split into faithfulness (no contradiction of the context), groundedness (per-claim attribution to a source span), correctness (right versus reality), answer relevancy (addresses the question), and completeness (covers every required part). The generation metrics hub lays out the full set and how to read them together; faithfulness has the atomic-fact detail ^[1].
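The claim-level view of faithfulness can be sketched as a ratio: the share of the answer's atomic claims that the retrieved context supports. The sketch below assumes the claims have already been extracted, and it uses a deliberately naive substring check (`toy_verifier`, a hypothetical stand-in) where a real pipeline would call an LLM judge or an entailment model.

```python
from typing import Callable, Sequence

Verifier = Callable[[str, Sequence[str]], bool]

def faithfulness_score(claims: Sequence[str],
                       contexts: Sequence[str],
                       is_supported: Verifier) -> float:
    """Fraction of atomic claims that the contexts support."""
    if not claims:
        # An answer with no claims contradicts nothing; relevancy and
        # completeness metrics are the ones that should flag it.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)

def toy_verifier(claim: str, contexts: Sequence[str]) -> bool:
    """Stand-in only: case-insensitive substring match, not real entailment."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)

contexts = ["Gingerbread Castle Cookies: bake at 350 degrees F for 8 to 10 minutes"]
claims = [
    "bake at 350 degrees F",   # stated in the context
    "bake for 20 minutes",     # not stated anywhere in the context
]

print(faithfulness_score(claims, contexts, toy_verifier))
```

Keeping the verifier as a parameter means the scoring arithmetic stays fixed while the judge is swapped or recalibrated.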

A diagnostic flow

When an end-to-end answer fails, per-layer checks attribute the failure to retrieval or to generation.

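# diagnose_rag.py
from dataclasses import dataclass
from typing import Literal

# Assumed helpers, supplied by your eval harness (typically LLM- or NLI-backed):
#   atomic_facts(text) -> list[str]         split an answer into atomic claims
#   is_supported(claim, evidence) -> bool   True if the evidence entails the claim;
#                                           evidence is a list of chunks or strings

@dataclass
class Diagnosis:
    # "BOTH" means both generation failure modes (omission and fabrication),
    # not both layers: a retrieval miss returns before generation is scored.
    layer: Literal["RETRIEVAL", "GENERATION", "BOTH", "PASS"]
    detail: str

def diagnose(example, system, k: int = 5) -> Diagnosis:
    query = example["input"]
    gt_chunk_id = example["expected_output"]["ground_truth_chunk_id"]
    gt_claims = example["expected_output"]["atomic_claims"]

    # Layer 1: did the retriever surface the labeled chunk?
    chunks = system.retrieve(query, k=k)
    retrieved_ids = [c.id for c in chunks]
    if gt_chunk_id not in retrieved_ids:
        return Diagnosis("RETRIEVAL", f"ground-truth chunk {gt_chunk_id} not in top-{k} {retrieved_ids}")

    # Layer 2: given the right chunk, what did the generator do with it?
    answer = system.generate(query, chunks)
    # Gold claims the answer fails to state (the generator ignored the context).
    missing = [c for c in gt_claims if not is_supported(c, [answer])]
    # Answer claims with no backing in the retrieved chunks.
    fabricated = [c for c in atomic_facts(answer) if not is_supported(c, chunks)]

    if fabricated and missing:
        return Diagnosis("BOTH", f"missing supported claims {missing}; fabricated {fabricated}")
    if fabricated:
        return Diagnosis("GENERATION", f"fabricated unsupported claims: {fabricated}")
    if missing:
        return Diagnosis("GENERATION", f"answer missing supported claims: {missing}")
    return Diagnosis("PASS", "retrieval and generation both OK")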

The function returns one of four labels: RETRIEVAL, GENERATION, BOTH (the generator both dropped supported claims and fabricated others), or PASS. The two-layer split turns a vague "the answer was wrong" into a precise "the retriever did not find the right chunk" or "the generator had the right chunk and ignored it." A regression run that emits these labels lets you graph retrieval failures separately from generation failures across releases. When the retrieval-failure curve moves, you tune the retriever. When the generation-failure curve moves, you tune the prompt or the judge. Conflating them throws away the information ^[5].
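As a sketch of the per-release graphing described above, the labels can be tallied into rates. The two releases below are made-up data chosen to share the same pass rate while failing in different layers; `failure_rates` is a hypothetical helper name.

```python
from collections import Counter

LABELS = ("RETRIEVAL", "GENERATION", "BOTH", "PASS")

def failure_rates(labels):
    """Share of examples carrying each diagnosis label in one eval run."""
    total = len(labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(labels)
    return {label: counts.get(label, 0) / total for label in LABELS}

# Hypothetical label streams from two releases, ten examples each.
release_a = ["PASS"] * 7 + ["RETRIEVAL"] * 2 + ["GENERATION"] * 1
release_b = ["PASS"] * 7 + ["RETRIEVAL"] * 1 + ["GENERATION"] * 2

rates_a = failure_rates(release_a)
rates_b = failure_rates(release_b)

for label in LABELS:
    print(f"{label:<11} a={rates_a[label]:.2f}  b={rates_b[label]:.2f}")
```

Both runs pass 70% of examples, so an end-to-end dashboard shows no change, while the per-label rates show failures shifting from retrieval to generation.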

A runnable two-layer scorer with Ragas

Ragas is one framework that operationalizes the split. The library exposes retrieval metrics, generation metrics, and combined scorecards in one interface ^[6].

# score_two_layer.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

samples = [
    {
        "user_input": "What is the bake temperature for the gingerbread castle cookies?",
        "reference": "350 degrees F for 8 to 10 minutes.",
        "response": "Bake them at 350F for about 8 to 10 minutes.",
        "retrieved_contexts": [
            "Gingerbread Castle Cookies: bake at 350 degrees F for 8 to 10 minutes...",
            "Royal icing recipe: combine 2 cups powdered sugar...",
        ],
    },
    # ... more samples
]

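# These metrics are LLM-judged, so evaluate() needs a configured judge model
# (by default Ragas calls OpenAI, which requires OPENAI_API_KEY to be set).
dataset = Dataset.from_list(samples)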
result = evaluate(
    dataset=dataset,
    metrics=[
        context_recall,       # retrieval layer
        context_precision,    # retrieval layer
        faithfulness,         # generation layer
        answer_relevancy,     # generation layer
    ],
)
print(result.to_pandas()[["context_recall", "context_precision",
                          "faithfulness", "answer_relevancy"]])

The four metrics produce four columns. The retrieval columns answer "did we surface the right thing?" The generation columns answer "did we use it correctly?" Reading them side by side is the diagnostic. If context_recall is high but faithfulness is low, the generation layer is the problem, and the prompt is the first place to look. If faithfulness is high but context_recall is low, the retriever is the problem. If both are low, the chunking strategy probably splits the answer across multiple chunks and you have an upstream architecture decision to make.
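The side-by-side reading can be written down as a small decision function. This is a sketch only: the cutoffs of 0.85 and 0.90 are assumed values for what counts as "high", and `read_scorecard` is a hypothetical helper rather than anything Ragas provides.

```python
def read_scorecard(row, recall_bar=0.85, faithfulness_bar=0.90):
    """Map one row of per-layer scores to the layer that needs attention."""
    recall_ok = row["context_recall"] >= recall_bar
    faithful_ok = row["faithfulness"] >= faithfulness_bar

    if recall_ok and faithful_ok:
        return "pass: both layers clear their bar"
    if recall_ok:
        return "generation: right context retrieved but not used faithfully"
    if faithful_ok:
        return "retrieval: the needed context was not surfaced"
    return "both low: check whether chunking splits the answer across chunks"

# Hypothetical rows shaped like records from result.to_pandas().
rows = [
    {"context_recall": 0.95, "faithfulness": 0.60},
    {"context_recall": 0.40, "faithfulness": 0.97},
    {"context_recall": 0.30, "faithfulness": 0.50},
    {"context_recall": 0.92, "faithfulness": 0.96},
]

for row in rows:
    print(row, "->", read_scorecard(row))
```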

TIP

Pin your eval result to four numbers, not one. The end-to-end pass rate hides which layer regressed; the per-layer numbers tell you what to fix. Display all four on every PR.

The joint case: when the layers interact

The clean split has one important crack. Some queries cannot be answered by any single chunk, and the right behavior is for the generator to synthesize across multiple chunks. In that regime, retrieval recall is high (the chunks are there), generation faithfulness is high per chunk, and yet the final answer is wrong because the model failed to combine them. Each layer reads as a pass while the end-to-end answer fails.

Two countermoves help. First, add an explicit "multi-chunk synthesis" slice to the eval set, tagged so you can read its end-to-end pass rate separately. Second, add an answer_correctness judge over the final answer that does not look at retrieval at all; it compares the answer to the gold reference. When the two-layer scores pass and answer_correctness fails, you have a synthesis problem and the fix is usually a prompt change (add a chain-of-thought step that lists evidence, then concludes).
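Both countermoves can be sketched over a list of per-example scores. The field names (`slice`, `answer_correctness`, and so on), the slice tags, and the cutoff values are assumptions made for illustration; substitute whatever your eval harness emits.

```python
from collections import defaultdict

def pass_rate_by_slice(rows, correctness_bar=0.5):
    """End-to-end pass rate per tagged slice of the eval set."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in rows:
        total[row["slice"]] += 1
        if row["answer_correctness"] >= correctness_bar:
            passed[row["slice"]] += 1
    return {name: passed[name] / total[name] for name in total}

def synthesis_failures(rows, layer_bar=0.85, correctness_bar=0.5):
    """Ids where both layers pass but the final answer is still wrong."""
    return [
        row["id"] for row in rows
        if row["context_recall"] >= layer_bar
        and row["faithfulness"] >= layer_bar
        and row["answer_correctness"] < correctness_bar
    ]

# Hypothetical per-example scores.
rows = [
    {"id": "q1", "slice": "single_chunk",
     "context_recall": 1.0, "faithfulness": 0.95, "answer_correctness": 0.9},
    {"id": "q2", "slice": "multi_chunk_synthesis",
     "context_recall": 1.0, "faithfulness": 0.92, "answer_correctness": 0.3},
    {"id": "q3", "slice": "multi_chunk_synthesis",
     "context_recall": 0.4, "faithfulness": 0.90, "answer_correctness": 0.2},
    {"id": "q4", "slice": "multi_chunk_synthesis",
     "context_recall": 1.0, "faithfulness": 0.95, "answer_correctness": 0.8},
]

print(pass_rate_by_slice(rows))
print(synthesis_failures(rows))
```

In this made-up data, q3 is an ordinary retrieval miss, while q2 passes both layers and still fails on correctness, which is the signature of a synthesis problem.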

A practical bar to clear

The bar a serious RAG pipeline should clear on its first calibrated dataset is roughly the following. Retrieval: Recall@5 above 0.85 on the labeled set, with MRR above 0.6 so the right chunk is actually near the top, and Context Precision above 0.5 so the prompt is not drowning in noise. Generation: Faithfulness above 0.90 on supported answers, and a separate Answer Relevance above 0.85. The end-to-end pass rate that these five numbers imply will land somewhere in the 0.7 to 0.85 range depending on how tightly your gold set is labeled and how strict your faithfulness verifier is.

Numbers below those thresholds point to specific work. Recall@5 below 0.85: revisit chunking and the embedding model. Context Precision below 0.5: rerank or filter. Faithfulness below 0.90: tighten the system prompt with explicit citation requirements, and consider a verifier-based decoding step. Answer Relevance below 0.85: the prompt is letting the model answer a different question than the user asked.

The numbers are not magic. They are the thresholds where, in practice, the per-layer signal becomes strong enough that further tuning of a single layer has diminishing returns and you should move attention to the next bottleneck. Treat them as a starting calibration; tune them down for domains where the gold set is necessarily noisier (subjective queries, multi-document reasoning) and up for domains where the right answer is unambiguous (factoid lookup, structured extraction).
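One way to keep these thresholds adjustable per domain is to hold them in a single table next to the follow-up each one implies. The sketch below encodes the starting values from this section; the metric keys and the `work_items` helper are hypothetical names, and the scores are made up.

```python
# Starting calibration from this section: metric -> (bar, what a miss points to).
BARS = {
    "recall_at_5":       (0.85, "revisit chunking and the embedding model"),
    "mrr":               (0.60, "the right chunk is not near the top of the ranking"),
    "context_precision": (0.50, "rerank or filter"),
    "faithfulness":      (0.90, "tighten the system prompt with citation requirements"),
    "answer_relevance":  (0.85, "the prompt lets the model answer a different question"),
}

def work_items(scores, bars=BARS):
    """List every metric that falls below its bar, with the implied follow-up."""
    return [
        f"{name} {scores[name]:.2f} < {bar:.2f}: {action}"
        for name, (bar, action) in bars.items()
        if scores[name] < bar
    ]

# Hypothetical scores from one eval run.
scores = {
    "recall_at_5": 0.88,
    "mrr": 0.52,
    "context_precision": 0.61,
    "faithfulness": 0.86,
    "answer_relevance": 0.90,
}

for item in work_items(scores):
    print(item)
```

Passing a different `bars` table is how a noisier or stricter domain would tune the values down or up without touching the check itself.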

What good looks like

A team that has internalized this split runs every RAG release through two scorecards. Retrieval scorecard: Recall@K, MRR, Context Precision. Generation scorecard: Faithfulness, Answer Relevance, Context Utilization. The end-to-end number is a derived metric, useful for the executive summary, not for diagnosis. PR gates fire on per-layer regressions. Postmortems start with "which scorecard moved?" The team can answer the question "what should we tune first?" in less than a minute. That is the payoff of evaluating in two layers.
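A PR gate on per-layer regressions might look like the sketch below. It assumes each scorecard is a plain dict of metric scores and that a drop of more than 0.01 counts as a regression; both the data shape and the tolerance are illustrative choices, and the scores are invented.

```python
def regressions(baseline, candidate, tolerance=0.01):
    """Metrics whose score dropped by more than `tolerance` versus baseline."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }

def gate(baseline, candidate, tolerance=0.01):
    """Per-layer regressions; an empty dict means the gate passes."""
    failures = {}
    for layer in ("retrieval", "generation"):
        dropped = regressions(baseline[layer], candidate[layer], tolerance)
        if dropped:
            failures[layer] = dropped
    return failures

# Hypothetical scorecards for the main branch and a pull request.
baseline = {
    "retrieval": {"recall_at_k": 0.88, "mrr": 0.64, "context_precision": 0.55},
    "generation": {"faithfulness": 0.93, "answer_relevance": 0.88,
                   "context_utilization": 0.71},
}
candidate = {
    "retrieval": {"recall_at_k": 0.90, "mrr": 0.635, "context_precision": 0.56},
    "generation": {"faithfulness": 0.89, "answer_relevance": 0.88,
                   "context_utilization": 0.72},
}

failures = gate(baseline, candidate)
for layer, dropped in failures.items():
    for name, (old, new) in dropped.items():
        print(f"{layer} regression: {name} {old:.2f} -> {new:.2f}")
```

Because the output names the layer first, it answers "which scorecard moved?" directly.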