RAG Evals

Two-layer evaluation: retrieval metrics plus generation faithfulness.

A RAG system fails when retrieval misses the relevant chunk, when generation ignores the relevant chunk it was given, or when generation invents facts the chunks do not support. Each of those failures has a different fix. A retrieval miss demands a different chunking strategy, embedding model, or reranker. A generation that ignores good context demands a prompt change. A faithfulness failure demands a tighter judge and possibly atomic-fact decomposition. You cannot prescribe the right fix until you can attribute the failure to a layer.

That is the spine of every page in this section. Evaluate retrieval as its own thing with information-retrieval metrics; evaluate generation as its own thing with faithfulness and correctness metrics that operate on the supplied context; only then combine them into end-to-end scores. The Ragas paper and the production cookbooks built around it formalize the same split, and the practical literature has converged on this two-layer pattern. For the condensed treatment of the metric families, see reference-based metrics and embedding similarity.
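The attribution step described above can be sketched as a small triage function. This is a minimal illustration, not a prescribed implementation: the `EvalRecord` fields, the `FailureLayer` names, and the order of the checks are all assumptions made for the example, and the `faithful` and `correct` verdicts are assumed to come from whatever judges you already run.

```python
from dataclasses import dataclass
from enum import Enum


class FailureLayer(Enum):
    NONE = "none"
    RETRIEVAL = "retrieval"                  # relevant chunk never reached the generator
    GENERATION_IGNORED = "generation_ignored"  # relevant chunk supplied, answer still wrong
    UNFAITHFUL = "unfaithful"                # answer states facts the chunks do not support


@dataclass
class EvalRecord:
    # All fields are hypothetical names chosen for this sketch.
    retrieved_ids: list[str]  # chunk ids returned by the retriever, in rank order
    relevant_ids: set[str]    # gold-labeled relevant chunk ids for the query
    faithful: bool            # verdict from a faithfulness judge (assumed given)
    correct: bool             # verdict from a correctness check (assumed given)


def attribute_failure(rec: EvalRecord) -> FailureLayer:
    """Assign one failing example to a single layer.

    The check order is a design choice: retrieval is tested first because a
    generator cannot be blamed for context it never received.
    """
    if rec.correct and rec.faithful:
        return FailureLayer.NONE
    if not rec.relevant_ids.intersection(rec.retrieved_ids):
        return FailureLayer.RETRIEVAL
    if not rec.faithful:
        return FailureLayer.UNFAITHFUL
    # Relevant context was supplied and nothing was invented, yet the answer is wrong.
    return FailureLayer.GENERATION_IGNORED
```

Counting the resulting labels over an eval set gives a rough picture of which layer to work on first.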

Retrieval

Eval RAG in two layers. The diagnostic split that turns "the answer was wrong" into "the retriever missed" or "the generator ignored the context."
Retrieval metrics: which to use and when. The hub: what each metric answers, the labels it needs, and how to build a scorecard.
Recall@K. Coverage: did the relevant chunks make it into the top K? Hit Rate is the crude binary version.
Precision@K. Signal purity, and how it differs from Ragas Context Precision.
MRR. How high the first relevant chunk ranks.
NDCG. Graded relevance and ordering, for rerankers.
Retriever debugging. The elimination path when a retrieval metric drops.
Context precision and recall. The Ragas LLM-judged retrieval metrics.
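For orientation, the label-based metrics listed above can be written in a few lines each. The sketch below uses the common textbook definitions and assumes each query has a ranked list of unique chunk ids plus a set of relevant ids (or a dict of graded gains for NDCG). The function names are chosen for this example. The LLM-judged Ragas metrics are not covered here.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant chunk is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant chunks."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk; MRR is the mean of this over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering.

    `gains` maps chunk id to a graded relevance score; unlabeled chunks count as 0.
    """
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision@K here divides by K even when fewer than K chunks are returned. Some implementations divide by the number of chunks actually returned, so check which convention your tooling uses before comparing numbers.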

Generation

Generation metrics. The hub: faithfulness, groundedness, correctness, answer relevancy, completeness, and how to read them together.
Faithfulness and atomic facts. Decompose the answer into claims and judge each against the context.
Groundedness vs faithfulness. Per-claim attribution, and why vendors define it differently.
Answer correctness, relevancy, and completeness. The three dimensions faithfulness misses.
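The claim-level idea behind faithfulness can be sketched as "fraction of answer claims the context supports". Everything below is an illustrative assumption rather than the Ragas implementation: sentence splitting stands in for LLM-based claim decomposition, `toy_judge` is a deliberately naive token-containment check standing in for an LLM judge, and the function names are hypothetical.

```python
import re
from typing import Callable


def split_into_claims(answer: str) -> list[str]:
    # Stand-in for atomic-fact decomposition: one claim per sentence.
    # Real pipelines typically prompt an LLM to produce the claims.
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def faithfulness_score(
    answer: str,
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of claims in the answer that the judge marks as supported by the context."""
    claims = split_into_claims(answer)
    if not claims:
        return 0.0  # convention chosen for this sketch; an empty answer makes no claims
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    # Placeholder judge: every token of the claim must occur somewhere in the context.
    # It ignores word order and paraphrase, so swap in an LLM or NLI model in practice.
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    context_tokens = set(re.findall(r"\w+", context.lower()))
    return bool(claim_tokens) and claim_tokens <= context_tokens
```

A usage example under the same assumptions:

```python
context = "The Eiffel Tower is in Paris. It opened in 1889."
answer = "The Eiffel Tower is in Paris. It opened in 1900."
score = faithfulness_score(answer, context, toy_judge)  # one of two claims is supported
```

Keeping the judge as a plain callable lets the scoring logic stay fixed while the judge is tightened or replaced.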

Data and coverage

Synthetic query generation. Bootstrap a labeled eval set from your own documents.
Long-context RAG and RULER. Why advertised context lengths exceed effective ones.
Non-English RAG eval. What to measure when the corpus and users are not English.