Information retrieval has spent thirty years defining these metrics. Do not invent new ones. The five that matter for RAG evaluation are Recall@K, Mean Reciprocal Rank, NDCG, hit-rate, and context precision. Each answers a slightly different question. The four label-based metrics share the same inputs, so computing them together on one retrieval run costs only a few extra lines of code, and context precision can be scored on the same query set. Logging all five pays back the first time you have to debug a "retrieval got worse" regression.
Setup
Every retrieval metric needs a labeled set: queries paired with the ids of the chunks genuinely relevant to each query. Binary relevant/not-relevant labels (called one-hot here) are enough for Recall@K, MRR, and hit-rate; graded labels are required for NDCG. The hard part of retrieval evaluation is producing ground truth, not running the metrics. Bootstrap from real user interactions (a click on a chunk is a weak positive), from synthetic queries against known documents (the document is ground truth by construction), or from manual labeling on a hundred queries (sufficient for a first calibration) [1].
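As a rough illustration of the synthetic-query route, the sketch below builds one labeled query per chunk, with the source chunk as the only relevant id. The names `LabeledQuery`, `synthetic_labeled_set`, and `make_query` are hypothetical, and the question generator is a trivial stand-in for whatever model or template you would actually use.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledQuery:
    """One evaluation example: a query plus the chunk ids judged relevant to it."""
    query: str
    relevant: set[str]                                      # binary labels
    grades: dict[str, float] = field(default_factory=dict)  # optional graded labels for NDCG


def synthetic_labeled_set(chunks: dict[str, str], make_query) -> list[LabeledQuery]:
    """Build one labeled query per chunk; the source chunk is relevant by construction.

    `make_query` is a hypothetical callable (for example a wrapper around an LLM
    prompt) that turns chunk text into a question the chunk answers.
    """
    return [
        LabeledQuery(query=make_query(text), relevant={chunk_id})
        for chunk_id, text in chunks.items()
    ]


def first_sentence_as_question(text: str) -> str:
    """Stand-in question generator so the sketch runs without a model."""
    return "What does the documentation say about: " + text.split(".")[0] + "?"


chunks = {
    "doc1#0": "Refunds are processed within five business days. Contact support for delays.",
    "doc2#0": "API keys rotate every ninety days. Rotation can be forced from the console.",
}
labeled = synthetic_labeled_set(chunks, first_sentence_as_question)
```

Labels built this way mark only the source chunk as relevant, so other chunks that happen to answer the same question count as misses; treat the resulting scores as a lower bound.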
Recall@K
Recall@K is the binary set-membership question: of all relevant chunks for this query, what fraction appeared in the top K? Average across queries.
# retrieval_metrics.py
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk; 0.0 if none was retrieved."""
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant chunk appears in the top k, else 0."""
    return int(bool(set(retrieved[:k]) & relevant))

def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering of the grades."""
    # Position i is 0-based, so rank 1 is discounted by log2(2) = 1.
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
Report Recall@K at the K that matches how many chunks you actually place in the prompt. A chat assistant that puts three chunks into a prompt cares about Recall@3; a long-context summarizer consuming twenty cares about Recall@20. Recall@K does not punish noise: a retriever returning the entire corpus scores 1.0 at K equal to corpus size, so the number is only meaningful at a fixed K aligned with production.
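To make the dependence on K concrete, here is a small worked example with made-up chunk ids. It repeats the `recall_at_k` definition so it runs on its own; `mean_recall_at_k` and the `runs` data are illustrative assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@K over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["a", "x", "b", "y", "c"], {"a", "b", "c"}),  # two of three relevant chunks in the top 3
    (["x", "y", "z", "d", "w"], {"d"}),            # the only relevant chunk sits at rank 4
]

recall_3 = mean_recall_at_k(runs, 3)  # (2/3 + 0) / 2
recall_5 = mean_recall_at_k(runs, 5)  # (1 + 1) / 2
```

The same retriever looks broken at K=3 and perfect at K=5, which is why the reported K has to be the one the prompt really uses.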
Mean Reciprocal Rank
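MRR measures how high the first relevant chunk sits. For each query, a first hit at rank 1 scores 1.0, rank 2 scores 0.5, rank 3 scores 0.33, and a miss scores 0; MRR is the mean of those scores across queries. Use it when the generator only reads the top chunk in practice. A query scores 1.0 only when its top chunk is relevant, so the share of queries with a correct top chunk can never exceed MRR. If MRR is 0.4, your top chunk is wrong more than half the time, which is a different problem from "Recall@5 is 0.8."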
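The relationship between MRR and top-chunk accuracy can be checked on a toy run. The five queries below are invented for illustration, and `mrr` is repeated so the block is self-contained.

```python
def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


runs = [
    (["a", "x", "y"], {"a"}),  # first hit at rank 1
    (["x", "a", "y"], {"a"}),  # first hit at rank 2
    (["x", "y", "a"], {"a"}),  # first hit at rank 3
    (["x", "y", "z"], {"a"}),  # miss
    (["x", "y", "z"], {"a"}),  # miss
]

reciprocal_ranks = [mrr(retrieved, relevant) for retrieved, relevant in runs]
mean_rr = sum(reciprocal_ranks) / len(runs)

# Share of queries whose top chunk is relevant; it can never exceed the mean reciprocal rank.
top1_accuracy = sum(rr == 1.0 for rr in reciprocal_ranks) / len(runs)
```

Here the mean reciprocal rank is about 0.37 while only one query in five has a correct top chunk, so a middling MRR can hide a poor first position.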
NDCG
NDCG extends rank-aware scoring to graded relevance and credits multiple relevant chunks: each retrieved chunk contributes its grade discounted by the log of its rank, and the total is normalized by the score of the ideal ordering. Use it when labels carry relevance grades and the generator benefits from getting the strongest evidence near the top. The cost is the labeling: graded labels take three to five times as long as binary (one-hot) ones. Start with Recall@K and MRR; graduate to NDCG when the team can afford graded labels and the generator is sensitive to chunk order.
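A short example shows what NDCG sees that the binary metrics miss. The grades (3.0 for a strong chunk, 1.0 for a weak one) are an assumed labeling scale, not one the source prescribes, and `ndcg_at_k` is repeated so the block runs standalone.

```python
import math


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0


# Assumed scale: 3.0 = directly answers the query, 1.0 = tangentially related.
grades = {"a": 3.0, "b": 1.0}

strong_first = ndcg_at_k(["a", "b", "x"], grades, 3)  # ideal ordering
weak_first = ndcg_at_k(["b", "a", "x"], grades, 3)    # same chunks, weaker one on top
```

Both rankings contain the same relevant chunks in the top 3 and put a relevant chunk first, so Recall@3 and MRR are identical; only NDCG drops (to roughly 0.80) when the weaker chunk leads.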
Hit-Rate
Hit-Rate is the binary "did any relevant chunk appear in top K." Cruder than Recall@K but the right metric for the smoke tier in CI: easy to compute, easy to interpret, and flake-free. A retrieval system dropping below 90 percent hit-rate at production K has a real problem; one dropping from 88 to 84 percent Recall@5 might be within noise.
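One way to judge whether a drop is within noise is a confidence interval over per-query scores. The sketch below uses a percentile bootstrap, which is an assumed choice rather than something the source prescribes; `bootstrap_ci`, the resample count, and the 100-query example are all illustrative.

```python
import random


def bootstrap_ci(per_query_scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed keeps the CI check reproducible
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n  # resample queries with replacement
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# 100 labeled queries, 88 of which had a hit in the top K.
hits = [1] * 88 + [0] * 12
low, high = bootstrap_ci(hits)
```

With only a hundred queries the interval around 0.88 spans several points in each direction, so a move to 0.84 on a set this size does not by itself show a regression.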
Context Precision
Context Precision is the Ragas-specific metric: of the chunks the retriever returned, what fraction were actually relevant? Ragas implements it as an LLM judge over each chunk, conditioned on the query and the gold answer [2]. The intuition is the precision-recall tradeoff at the retrieval stage. A model grounding tightly on each chunk benefits from precision; a model synthesizing across chunks benefits from recall. Knowing both numbers tells you which lever to pull [3].
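The precision-recall tradeoff can be illustrated without an LLM judge by scoring precision against the labeled set. This is a label-based stand-in, not the Ragas implementation; `precision_at_k` and the example ids are assumptions for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant for doc_id in top_k) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["a", "x", "b", "y", "z"]
relevant = {"a", "b"}

# Widening K from 2 to 5 finds the second relevant chunk but adds three noisy ones.
p2, r2 = precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 2)
p5, r5 = precision_at_k(retrieved, relevant, 5), recall_at_k(retrieved, relevant, 5)
```

Recall rises from 0.5 to 1.0 while precision falls from 0.5 to 0.4, which is the tradeoff the two numbers are meant to expose.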
A scorecard
Track Recall@K, MRR, and hit-rate per release, per chunking strategy, per embedding model. A change that moves Recall@5 up but MRR down means more of the relevant chunks are in the set but the first relevant one sits lower in the ranking, which usually means generation gets worse even though the headline retrieval number went up. The diagnostic only exists if you log all three.
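A minimal scorecard might look like the sketch below, which aggregates the three metrics for two configurations and diffs them. The `scorecard` helper, the metric key names, and the single-query runs are hypothetical; the metric functions are repeated so the block is self-contained.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    return int(bool(set(retrieved[:k]) & relevant))


def scorecard(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average each metric over (retrieved_ids, relevant_ids) pairs, one per query."""
    return {
        f"recall@{k}": mean(recall_at_k(r, rel, k) for r, rel in runs),
        "mrr": mean(mrr(r, rel) for r, rel in runs),
        f"hit_rate@{k}": mean(hit_rate(r, rel, k) for r, rel in runs),
    }


# Hypothetical runs for one query under two configurations.
baseline = [(["a", "x", "y", "z", "w"], {"a", "b"})]   # one relevant chunk, ranked first
candidate = [(["x", "y", "a", "b", "w"], {"a", "b"})]  # both relevant chunks, ranked lower

baseline_card = scorecard(baseline, k=5)
candidate_card = scorecard(candidate, k=5)
delta = {name: candidate_card[name] - baseline_card[name] for name in baseline_card}
```

In this toy case `delta` shows Recall@5 up by 0.5 and MRR down by about 0.67 with hit-rate unchanged, the exact pattern that a single headline number would hide.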
TIP
Pair the IR metrics with Ragas Context Precision on the same set. The IR metrics tell you whether the relevant chunks are in the top K. Context Precision tells you whether the rest of the top K is noise.
For PR-time CI, gate on hit-rate at production K. For release gates, gate on Recall@K and MRR with explicit thresholds (Recall@5 above 0.85, MRR above 0.6 are reasonable starts). For quarterly reviews, look at NDCG and Context Precision. Numbers without thresholds are decoration; the threshold is what makes a metric a gate [4].
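The gating tiers could be wired up roughly as follows. The `failed_gates` helper and the metric key names are hypothetical, production K is assumed to be 5, and the thresholds are the starting values suggested in this section, to be tuned against your own baseline.

```python
def failed_gates(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: not reported")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


# Starting thresholds taken from the text; production K is assumed to be 5.
PR_GATE = {"hit_rate@5": 0.90}
RELEASE_GATE = {"recall@5": 0.85, "mrr": 0.60}

# Hypothetical scores from one evaluation run.
scores = {"hit_rate@5": 0.93, "recall@5": 0.84, "mrr": 0.66}

pr_failures = failed_gates(scores, PR_GATE)
release_failures = failed_gates(scores, RELEASE_GATE)
```

In a CI job, a non-empty failure list would typically be printed and turned into a non-zero exit code. With the scores above the PR gate passes and the release gate fails on Recall@5.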