The point of evaluating RAG in two layers is that retrieval and generation fail for different reasons, and a single score cannot tell you which one broke [1]. This recipe builds the smallest useful two-layer eval: retrieval metrics (context precision and context recall) on a known answer set, plus generation metrics (faithfulness to retrieved context and answer relevancy), using Ragas as the metric library [2].
Ragas is a Python library that implements the standard RAG metrics on top of an LLM judge. It is not a tracing platform; it computes scores you can pipe into your observability hub. We use it because its metric implementations are the ones the literature treats as the reference and because the API is short [3].
The dataset shape
You need four columns per row:
question. The user query.
contexts. The chunks your retriever returned for the query.
answer. The response your RAG pipeline produced.
reference. The gold answer (for context-recall comparison).
If you do not have a gold-answer column, you can still run faithfulness and answer relevancy. The retrieval metrics are the ones that depend on it: context recall needs the reference, and context precision as used in this recipe is also judged against it.
For a thirty-question dev set with a small corpus, a CSV is fine. At scale, switch to JSONL with one row per line.
question,contexts,answer,reference
"What is the company's parental-leave policy?","Eligible employees receive 16 weeks of paid leave...||Paid leave can be taken in two blocks within 12 months...","Eligible employees get 16 weeks of paid parental leave, which can be split into two blocks taken within 12 months of birth or adoption.","Eligible employees receive 16 weeks paid parental leave, which can be taken in up to two blocks within 12 months of birth or adoption."
We use || to delimit multiple retrieved chunks in a single CSV cell. The loader splits on it.
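If the set outgrows a CSV, the same four columns map directly onto JSONL. The following is a minimal sketch that assumes the `||` delimiter described above; `csv_to_jsonl` and `read_jsonl` are hypothetical helper names for illustration, not part of Ragas.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, delimiter: str = "||") -> str:
    """Convert the four-column eval CSV into JSONL, one row per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Store contexts as a real list so no delimiter is needed downstream.
        row["contexts"] = [c.strip() for c in row["contexts"].split(delimiter)]
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def read_jsonl(jsonl_text: str) -> list[dict]:
    """Parse JSONL back into a list of row dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


sample = (
    "question,contexts,answer,reference\n"
    '"Q1?","chunk a||chunk b","An answer.","A gold answer."\n'
)
rows = read_jsonl(csv_to_jsonl(sample))
print(rows[0]["contexts"])
```

Because each JSONL row already carries `contexts` as a list, a loader for this format would not need the split step.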
Step 1: install and load
pip install ragas openai pandas datasets langchain langchain-openai
export OPENAI_API_KEY=...
"""Two-layer RAG eval using Ragas.
Run:
    pip install ragas openai pandas datasets langchain langchain-openai
    export OPENAI_API_KEY=...
    python ragas_eval.py
"""
from __future__ import annotations

from pathlib import Path

import pandas as pd
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)


def load(path: str) -> Dataset:
    df = pd.read_csv(path)
    # contexts is a list of strings per row in Ragas.
    df["contexts"] = df["contexts"].apply(
        lambda s: [c.strip() for c in s.split("||")]
    )
    # Ragas expects: question, contexts, answer, reference.
    return Dataset.from_pandas(df)


def main() -> None:
    ds = load("rag_eval.csv")
    print(f"Loaded {len(ds)} rows.")

    judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0))
    emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

    result = evaluate(
        ds,
        metrics=[
            context_precision,
            context_recall,
            faithfulness,
            answer_relevancy,
        ],
        llm=judge,
        embeddings=emb,
    )
    print(result)

    out = result.to_pandas()
    out_path = Path("ragas_results.csv")
    out.to_csv(out_path, index=False)
    print(f"\nWrote per-row scores to {out_path}.")

    print("\nMean scores:")
    for m in ("context_precision", "context_recall", "faithfulness", "answer_relevancy"):
        if m in out.columns:
            print(f"  {m:22s} {out[m].mean():.3f}")


if __name__ == "__main__":
    main()
Run it:
python ragas_eval.py
You will see Ragas iterate over each row, call the judge model for each metric, and print a per-metric mean at the end.
What the four metrics mean
Each metric is a different question about a different layer.
Context precision asks: of the chunks the retriever returned, what fraction were actually relevant? High precision means the retriever did not pad the context with noise. The judge decides relevance by comparing each chunk against the reference answer. If precision is low, your retriever is over-pulling [1].
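Once the judge has labelled each chunk as relevant or not, the score itself is plain arithmetic. The sketch below assumes the rank-weighted form (the mean of precision@k taken at each relevant position), which is how the Ragas documentation describes the metric as far as I can tell; check it against your installed version. The verdicts are supplied by hand here in place of judge calls, and `context_precision_at_k` is a hypothetical name.

```python
def context_precision_at_k(verdicts: list[bool]) -> float:
    """Rank-aware precision over per-chunk relevance verdicts, in retrieval order."""
    hits = 0
    weighted = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            # precision@rank, counted only at positions holding a relevant chunk
            weighted += hits / rank
    return weighted / hits if hits else 0.0


print(context_precision_at_k([True, False]))   # relevant chunk ranked first
print(context_precision_at_k([False, True]))   # same chunks, relevant one ranked last
```

Under this assumption the same two chunks score 1.0 or 0.5 depending on order, so a reranker can move the metric even when the retrieved set is unchanged.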
Context recall asks: of the claims in the gold answer, what fraction are supported by the retrieved chunks? High recall means the retriever did not miss any of the needed information. Low recall means your retriever under-pulled (or the chunks do not cover the question). This metric cannot be computed without a reference column [2].
Faithfulness asks: of the claims in the generated answer, what fraction are supported by the retrieved context? Low faithfulness means the model hallucinated beyond what the retrieval gave it. Ragas implements this with claim extraction followed by per-claim verification, the same pattern as FActScore [3, 4].
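The extract-then-verify pattern reduces to a supported-claims fraction. The sketch below is illustrative only: `toy_verifier` is a hypothetical substring check standing in for the LLM judge, and the claims are written by hand rather than extracted by a model.

```python
from typing import Callable


def supported_fraction(
    claims: list[str],
    contexts: list[str],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims the verifier finds supported by the joined contexts."""
    if not claims:
        return 0.0
    evidence = "\n".join(contexts)
    verdicts = [is_supported(claim, evidence) for claim in claims]
    return sum(verdicts) / len(claims)


def toy_verifier(claim: str, evidence: str) -> bool:
    """Stand-in for the LLM judge: case-insensitive substring match only."""
    return claim.lower() in evidence.lower()


contexts = ["Eligible employees receive 16 weeks of paid leave."]
claims = ["16 weeks of paid leave", "leave can be split into four blocks"]
print(supported_fraction(claims, contexts, toy_verifier))
```

Feeding the function claims taken from the gold answer instead of the generated answer gives the context-recall arithmetic described above.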
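Answer relevancy asks: does the generated answer actually address the question? It is the cheapest sanity check; if it drops while faithfulness holds, the answer is grounded in the retrieved chunks but does not address the question, which typically means the generator worked from the wrong document.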
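As the Ragas documentation describes it, answer relevancy has the judge write questions that the answer would fit and then compares them with the original question in embedding space, which is why the script passes an embeddings model. The sketch below assumes that approach and uses hand-made two-dimensional vectors in place of real embeddings; `answer_relevancy_score` is a hypothetical name.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy_score(
    question_vec: list[float], generated_vecs: list[list[float]]
) -> float:
    """Mean similarity between the question and the questions implied by the answer."""
    if not generated_vecs:
        return 0.0
    return sum(cosine(question_vec, g) for g in generated_vecs) / len(generated_vecs)


question = [1.0, 0.0]
on_topic = [[1.0, 0.0], [1.0, 0.0]]    # toy vectors pointing the same way
off_topic = [[0.0, 1.0], [0.0, 1.0]]   # toy vectors orthogonal to the question
print(answer_relevancy_score(question, on_topic))
print(answer_relevancy_score(question, off_topic))
```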
The four split cleanly into two layers:
flowchart LR
subgraph Retrieval
A[context_precision]
B[context_recall]
end
subgraph Generation
C[faithfulness]
D[answer_relevancy]
end
A --> C
B --> C
Step 2: diagnose failures by layer
When you read the per-row CSV, the pattern of low scores tells you what to fix.
| Pattern | Likely cause and fix |
|---|---|
| Low context_precision, normal everything else | Retriever returns noisy chunks. Raise the relevance threshold or rerank. |
| Low context_recall, normal everything else | Retriever misses chunks. Lower the threshold, increase chunk size, or add hybrid search. |
| Normal retrieval, low faithfulness | Generator hallucinates. Add explicit "cite the chunk" prompt instructions and constrain to retrieved context. |
| Normal retrieval, low answer_relevancy | Generator drifts off-topic. Check the prompt template; the model may be over-paraphrasing. |
| Everything low | Either the retrieval index is broken or the question set is impossible for this corpus. |
The mistake teams make is reporting a single "RAG quality" number. Aggregating across layers hides the diagnostic signal that makes the metrics actionable [1].
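The table above can be encoded as a small lookup so each row of the results file gets a per-layer label instead of a single number. This is a sketch: the 0.5 cutoff and the label strings are assumptions chosen for illustration, and `diagnose` is a hypothetical helper.

```python
# Map each metric to the layer and failure mode it points at.
LAYER = {
    "context_precision": "retrieval: noisy chunks",
    "context_recall": "retrieval: missed chunks",
    "faithfulness": "generation: unsupported claims",
    "answer_relevancy": "generation: off-topic answer",
}


def diagnose(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the failure modes whose metric falls below the threshold."""
    # Metrics missing from the row are treated as passing.
    low = [name for name in LAYER if scores.get(name, 1.0) < threshold]
    if len(low) == len(LAYER):
        # Everything low: suspect the index or the question set, not one layer.
        return ["check the index or the question set"]
    return [LAYER[name] for name in low]


row = {
    "context_precision": 0.9,
    "context_recall": 0.2,
    "faithfulness": 0.9,
    "answer_relevancy": 0.9,
}
print(diagnose(row))
```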
Step 3: report per-row, not just means
Ragas returns per-row scores. Read them. The mean is for the dashboard; the per-row scores are where you find the bug.
import pandas as pd

df = pd.read_csv("ragas_results.csv")


def present(*names: str) -> list[str]:
    """Keep only columns that exist; some Ragas versions export
    user_input/response instead of question/answer."""
    return [n for n in names if n in df.columns]


# Faithfulness disasters (likely hallucination).
hallucinations = df[df["faithfulness"] < 0.5].sort_values("faithfulness")
print(f"Found {len(hallucinations)} likely hallucinations.")
hallucinations[
    present("question", "user_input", "answer", "response", "faithfulness")
].to_csv("hallucinations.csv", index=False)

# Retrieval misses (context_recall low).
misses = df[df["context_recall"] < 0.5].sort_values("context_recall")
print(f"Found {len(misses)} retrieval misses.")
misses[
    present("question", "user_input", "reference", "context_recall")
].to_csv("retrieval_misses.csv", index=False)
Read the hallucination CSV by hand. The failures usually cluster into one of three buckets: (a) the model invented a number not in the context, (b) the model combined two chunks into a claim neither chunk supports, or (c) the model paraphrased loosely enough that the literal claim is unsupported even though the gist is right. Each bucket has a different fix.
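Bucket (a) is the easiest to pre-sort before reading by hand. The heuristic below is a rough sketch, not part of Ragas: it flags digits that appear in the answer but in none of the chunks, and it will miss spelled-out numbers such as "two". `unsupported_numbers` is a hypothetical helper.

```python
import re

# Integers and simple decimals or thousands groups, e.g. 16, 3.5, 1,200.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(answer: str, contexts: list[str]) -> list[str]:
    """Numbers that appear in the answer but in none of the retrieved chunks."""
    seen: set[str] = set()
    for chunk in contexts:
        seen.update(NUMBER.findall(chunk))
    return [n for n in NUMBER.findall(answer) if n not in seen]


contexts = [
    "Eligible employees receive 16 weeks of paid leave...",
    "Paid leave can be taken in two blocks within 12 months...",
]
answer = "Eligible employees get 18 weeks of paid leave within 12 months."
print(unsupported_numbers(answer, contexts))
```

Rows where this returns a non-empty list are candidates for bucket (a); the rest still need a manual read to separate (b) from (c).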
Step 4: a sanity gate before you trust the numbers
Run faithfulness on a handful of trivially-correct rows (the answer is copied verbatim from one of the retrieved chunks) to verify the judge is not over-flagging. Then run it on a few trivially-wrong rows (an answer that contradicts the chunks) to verify the judge is not under-flagging.
If either sanity check fails, fix the judge prompt before you trust the headline numbers. The Ragas judge prompt is in their repo if you need to override it [3].
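One way to set up this gate is to build the probe rows programmatically and then compare the judge's faithfulness scores with what each probe should get. This is a sketch: the `expect` field and the 0.9 / 0.1 cutoffs are assumptions for illustration, and the scores would come from a real Ragas run rather than the hard-coded lists shown here.

```python
def make_probe_rows(question: str, chunk: str, contradiction: str) -> list[dict]:
    """Build one trivially-correct and one trivially-wrong row for the judge."""
    return [
        # Answer copied verbatim from the chunk: faithfulness should be high.
        {"question": question, "contexts": [chunk], "answer": chunk, "expect": "high"},
        # Answer contradicts the chunk: faithfulness should be low.
        {"question": question, "contexts": [chunk], "answer": contradiction, "expect": "low"},
    ]


def judge_is_sane(
    probes: list[dict], scores: list[float], high: float = 0.9, low: float = 0.1
) -> bool:
    """True when every probe lands on the side of the cutoff it was built for."""
    for probe, score in zip(probes, scores):
        if probe["expect"] == "high" and score < high:
            return False  # judge over-flags a verbatim answer
        if probe["expect"] == "low" and score > low:
            return False  # judge under-flags a contradiction
    return True


probes = make_probe_rows(
    "How many weeks of paid leave?",
    "Eligible employees receive 16 weeks of paid leave.",
    "Eligible employees receive no paid leave.",
)
print(judge_is_sane(probes, [1.0, 0.0]))
print(judge_is_sane(probes, [0.6, 0.0]))
```

Drop the `expect` key before handing the probe rows to the evaluator, since it is bookkeeping for this check only.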
Step 5: wire it into CI
Pin a fixed dev set. Run Ragas on every PR that touches the retriever, the chunker, or the prompt. Fail the PR if context_recall or faithfulness drops more than five points (0.05 on the 0 to 1 scale) from the main-branch baseline:
import json
import sys

import pandas as pd

# Metrics that gate the PR; the other two are reported but do not fail the build.
GATED = ("context_recall", "faithfulness")
TOLERANCE = 0.05  # five points on the 0-1 scale

with open("baseline_metrics.json") as f:
    BASELINE = json.load(f)

df = pd.read_csv("ragas_results.csv")
metrics = {
    "context_precision": df["context_precision"].mean(),
    "context_recall": df["context_recall"].mean(),
    "faithfulness": df["faithfulness"].mean(),
    "answer_relevancy": df["answer_relevancy"].mean(),
}
for name, got in metrics.items():
    print(f"  {name}: {got:.3f}")

failed = []
for k in GATED:
    v, base = metrics[k], BASELINE[k]
    if v < base - TOLERANCE:
        failed.append((k, v, base))

if failed:
    print("REGRESSION:")
    for name, got, base in failed:
        print(f"  {name}: {got:.3f} < {base:.3f} - {TOLERANCE}")
    sys.exit(1)

print("OK")
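The gate reads `baseline_metrics.json`, so something has to write that file from a main-branch run. The following is a minimal sketch using only the standard library; `baseline_from_csv` is a hypothetical helper, and it assumes missing scores appear as empty cells in the results CSV.

```python
import csv
import io
import json
from statistics import fmean

METRICS = ("context_precision", "context_recall", "faithfulness", "answer_relevancy")


def baseline_from_csv(csv_text: str) -> dict[str, float]:
    """Mean of each metric column that is present, ignoring empty or NaN cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    baseline: dict[str, float] = {}
    for name in METRICS:
        values = [
            float(r[name]) for r in rows if r.get(name) not in (None, "", "nan")
        ]
        if values:
            baseline[name] = round(fmean(values), 4)
    return baseline


sample = "question,faithfulness,context_recall\nq1,1.0,0.5\nq2,0.5,\n"
baseline = baseline_from_csv(sample)
print(json.dumps(baseline, indent=2))
```

In CI this would run on the main branch against the text of `ragas_results.csv`, with the resulting dict dumped to `baseline_metrics.json` and stored where PR builds can read it.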
For the GitHub Actions wrapping, see production CI integration. For trace logging the retrieved chunks at production time, see Langfuse online eval once that recipe is written.
What this skips
This recipe does not cover synthetic query generation for the eval set itself; for that, see synthetic query generation. It does not cover long-context RAG (RULER-style multi-needle); see long-context eval with RULER. It does not cover non-English RAG, which behaves materially differently because both the embedding model and the judge have language-specific biases. The OpenAI cookbook RAG eval notebook is a good companion to this one [5].
CAUTION
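Ragas calls the judge at least once per metric per row, and metrics that extract and then verify claims make more than one call. A 200-row dev set running four metrics is therefore at least 800 judge calls. Pin a small dev set for CI and run the full set nightly. Cost control matters here.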
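The call count and the pinned CI subset can both be computed up front. This is a sketch with hypothetical helper names; the estimate uses the one-call-per-metric-per-row figure and should be read as a floor, and the seed value is arbitrary.

```python
import random


def min_judge_calls(rows: int, metrics: int) -> int:
    """Floor on judge calls: one call per metric per row."""
    return rows * metrics


def pin_ci_subset(row_ids: list[str], size: int, seed: int = 0) -> list[str]:
    """Pick a fixed, reproducible subset so CI always scores the same rows."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(row_ids), min(size, len(row_ids))))


ids = [f"q{i:03d}" for i in range(200)]
ci_ids = pin_ci_subset(ids, 30)
print(min_judge_calls(len(ids), 4), "calls for the nightly run")
print(min_judge_calls(len(ci_ids), 4), "calls for the CI subset")
```

Committing the selected IDs alongside the dev set keeps the CI subset stable even if the sampling code changes later.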
What to do next
The retrieval vs generation chapter covers the conceptual two-layer model in more depth. The faithfulness chapter covers atomic-fact decomposition and why "no hallucination" is the wrong frame. For dataset versioning, see the versioning chapter before you ship.