RAG retrieval metrics: which to use and when

Recall@K, Precision@K, MRR, NDCG and Hit-Rate compared: what each answers, the labels it needs, and when to reach for it.

Retrieval evaluation asks one question, "did the right chunks make it into the top K," and answers it with metrics that information retrieval has spent thirty years validating. Use those rather than inventing new ones ^[1]. This page is the map: what each metric answers, the labels it needs, and which one to reach for. Each metric then has its own page with the formula, a worked example, and a copyable implementation.

The metrics at a glance

Metric	Question it answers	Labels needed	Order-sensitive	Reach for it when
Recall@K	Of all relevant chunks, how many are in the top K?	relevant chunk ids	no	missing a relevant chunk is costly (multi-hop, compliance)
Precision@K	Of the top K, how many are relevant?	relevant chunk ids	no	the context budget is tight and noise hurts generation
MRR	How high does the first relevant chunk rank?	relevant chunk ids	yes (first hit)	the generator reads only the top one or two chunks
NDCG	Are the most relevant chunks ranked highest?	graded relevance	yes	you have graded labels and order matters (reranking)
Hit-Rate	Did any relevant chunk appear in the top K?	relevant chunk ids	no	you want a fast, flake-free CI smoke tier
Context precision / recall	Is the retrieved context relevant (precision), and does it cover the gold answer (recall)? Both LLM-judged	gold answer (recall); none (precision)	precision: yes	you want a RAG-native score, reference-based or reference-free
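The four binary-label rows of the table can be sketched in a few lines each. This is a minimal illustration, not the implementation from the per-metric pages: the function names and the chunk ids are made up for the example, each ranked list is assumed to contain no duplicate ids, and Precision@K divides by K even when fewer than K chunks come back, which is one common convention among several.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Of all relevant chunks, the fraction that appears in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Of the top k, the fraction that is relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved.

    MRR is the mean of this value over all queries in the labeled set.
    """
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk is in the top k, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0


# Hypothetical query: three relevant chunks, two of them retrieved.
ranked = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(recall_at_k(ranked, relevant, 5))     # 2 of 3 relevant chunks found
print(precision_at_k(ranked, relevant, 5))  # 2 of 5 retrieved chunks relevant
print(reciprocal_rank(ranked, relevant))    # first hit at rank 2
print(hit_at_k(ranked, relevant, 5))        # at least one hit
```

All four functions take the same inputs, a ranked list of ids and a set of relevant ids, which is why one binary-labeled set is enough to compute them together.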

Setup: the ground truth comes first

Every retrieval metric needs a labeled set: queries paired with the ids of the chunks genuinely relevant to each query. Binary labels are enough for Recall@K, Precision@K, MRR, and Hit-Rate; NDCG needs graded labels ^[2]. Producing that ground truth, not running the metrics, is the hard part. Bootstrap it from real user interactions (a click on a chunk is a weak positive), from synthetic queries against known documents (the document is ground truth by construction), or from manual labeling on a hundred queries (enough for a first calibration) ^[1].
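One way to hold such a labeled set, sketched under assumptions: the `LabeledQuery` class, the 0 to 3 grade scale, and the example ids are all hypothetical. Storing graded labels and projecting them down to binary lets a single set serve both NDCG and the binary metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class LabeledQuery:
    """One ground-truth record: a query plus a relevance grade per chunk id."""

    query: str
    # chunk id -> grade; 0 means "judged and not relevant" (assumed scale: 0-3)
    grades: Dict[str, int] = field(default_factory=dict)

    def relevant_ids(self, min_grade: int = 1) -> Set[str]:
        """Binary projection: ids whose grade is at least min_grade."""
        return {cid for cid, grade in self.grades.items() if grade >= min_grade}


# Hypothetical record, e.g. from manual labeling.
example = LabeledQuery(
    query="how do I rotate an API key?",
    grades={"doc3#2": 3, "doc3#5": 1, "doc9#1": 0},
)

print(sorted(example.relevant_ids()))             # binary labels for Recall@K, MRR, Hit-Rate
print(sorted(example.relevant_ids(min_grade=2)))  # stricter cut-off for "relevant"
```

If only binary labels exist, the same structure works with every relevant chunk graded 1.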

Which metric should you use?

Recall@K and MRR are the default pair: Recall@K for "did we find everything relevant," MRR for "did we find something relevant near the top." Add Precision@K when the prompt budget is tight and every irrelevant chunk displaces a useful one. Graduate to NDCG when you can afford graded labels and chunk order affects the generator, which is the case for reranker evaluation. Keep Hit-Rate for the CI smoke tier: crude, but cheap and flake-free. Pair the IR metrics with Ragas context precision on the same set to catch noise the binary IR metrics miss ^[3] ^[4].
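To see why NDCG is the reranker metric, consider two orderings of the same three chunks. The sketch below uses the linear-gain form of DCG (grade divided by log2 of rank plus one); the exponential-gain form (2^grade minus 1) is also common, so check which one your tooling uses. Function names, grades, and ids are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg_at_k(ranked: Sequence[str], grades: Dict[str, int], k: int) -> float:
    """DCG of the top k, normalized by the DCG of the ideal ordering."""
    gains = [grades.get(chunk_id, 0) for chunk_id in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded labels for one query (higher = more relevant).
grades = {"a": 3, "b": 1, "c": 2}

best_first = ["a", "c", "b"]   # most relevant chunk ranked first
worst_first = ["b", "c", "a"]  # same chunks, reversed order

print(ndcg_at_k(best_first, grades, 3))   # ideal ordering scores 1.0
print(ndcg_at_k(worst_first, grades, 3))  # lower, roughly 0.79
```

Both orderings contain the same three chunks, so Recall@3, Precision@3, and Hit-Rate are identical for them. Only an order-sensitive metric with graded labels separates the two.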

A scorecard and thresholds

Track Recall@K, MRR, and Hit-Rate per release, per chunking strategy, per embedding model. A change that moves Recall@5 up but MRR down means the relevant chunks are in the set but lower in the ranking, which usually makes generation worse even though the headline retrieval number went up. That diagnostic only exists if you log all three.
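A small worked example of that diagnostic, with made-up data: two queries, a baseline run and a candidate run, scored on all three metrics. The `scorecard` function and every id here are hypothetical, and the reciprocal rank is computed over whatever list the retriever returned.

```python
from typing import Dict, List, Set


def scorecard(runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Mean Recall@K, MRR, and Hit-Rate@K over every labeled query."""
    recalls, reciprocal_ranks, hits = [], [], []
    for query_id, relevant in labels.items():
        ranked = runs[query_id]
        found = set(ranked[:k]) & relevant
        recalls.append(len(found) / len(relevant))
        hits.append(1.0 if found else 0.0)
        # Reciprocal rank of the first relevant chunk, 0.0 if there is none.
        reciprocal_ranks.append(
            next((1.0 / rank for rank, cid in enumerate(ranked, 1) if cid in relevant), 0.0)
        )
    n = len(labels)
    return {
        "recall@k": sum(recalls) / n,
        "mrr": sum(reciprocal_ranks) / n,
        "hit_rate@k": sum(hits) / n,
    }


labels = {"q1": {"a", "b"}, "q2": {"c"}}
baseline = {"q1": ["a", "x", "y", "z", "w"], "q2": ["c", "x", "y", "z", "w"]}
candidate = {"q1": ["x", "y", "a", "b", "z"], "q2": ["x", "c", "y", "z", "w"]}

print(scorecard(baseline, labels, k=5))   # recall 0.75, MRR 1.0
print(scorecard(candidate, labels, k=5))  # recall 1.0, MRR about 0.42
```

The candidate finds every relevant chunk, so Recall@5 rises from 0.75 to 1.0, yet the first relevant chunk slips from rank 1 to ranks 3 and 2, so MRR falls from 1.0 to about 0.42. Hit-Rate stays at 1.0 throughout and would not have flagged anything.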

Gate them at the tier that fits. For PR-time CI, gate on Hit-Rate at production K. For release gates, gate on Recall@K and MRR with explicit thresholds (Recall@5 above 0.85 and MRR above 0.6 are reasonable starts). For quarterly reviews, look at NDCG and Context Precision. Numbers without thresholds are decoration; the threshold is what makes a metric a gate ^[5].
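A release gate can be as small as a dictionary of thresholds and one comparison. The sketch below uses the starting thresholds suggested above; the function name, the metric keys, and the example scores are assumptions, and a real pipeline would exit non-zero when the returned list is not empty.

```python
from typing import Dict, List

# Starting thresholds from the text; tune them against your own baseline.
RELEASE_THRESHOLDS = {"recall@5": 0.85, "mrr": 0.6}


def failed_gates(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or not above its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: not reported")
        elif value <= minimum:  # the gate requires strictly above
            failures.append(f"{name}: {value:.3f} is not above {minimum:.3f}")
    return failures


# Hypothetical scores from an evaluation run.
metrics = {"recall@5": 0.91, "mrr": 0.55}

for message in failed_gates(metrics, RELEASE_THRESHOLDS):
    print("GATE FAILED:", message)
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when a number was never computed is not a gate.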

When the numbers drop

When a retrieval metric regresses, which metric moved tells you where to look, and retriever debugging proceeds by elimination. Recall@K down means relevant chunks are no longer reaching the top K, which points at chunking or the embedding model. Recall@K steady but MRR down means the chunks are still retrieved but ranked too low, which points at similarity scoring or a missing reranker. Both fine but Context Precision low means the top K is padded with noise.
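The elimination order can be written down as a small decision function. This is a sketch only: the `diagnose` name, the metric keys, the 0.02 tolerance for what counts as a drop, and the 0.5 floor for Context Precision are all hypothetical values to replace with your own.

```python
from typing import Dict


def diagnose(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,                # assumed: smaller changes count as noise
    context_precision_floor: float = 0.5,   # assumed: below this counts as "low"
) -> str:
    """Map which metric moved to the part of the retriever to inspect first."""
    recall_dropped = baseline["recall@k"] - current["recall@k"] > tolerance
    mrr_dropped = baseline["mrr"] - current["mrr"] > tolerance

    if recall_dropped:
        return "relevant chunks are missing from the top K: check chunking or the embedding model"
    if mrr_dropped:
        return "relevant chunks are ranked too low: check similarity scoring or add a reranker"
    if current.get("context_precision", 1.0) < context_precision_floor:
        return "top K is padded with noise: tighten K or filter results"
    return "no retrieval regression detected"


baseline = {"recall@k": 0.90, "mrr": 0.70, "context_precision": 0.80}
current = {"recall@k": 0.90, "mrr": 0.55, "context_precision": 0.80}

print(diagnose(baseline, current))
```

Recall is checked first because a chunk that never reaches the top K cannot be fixed by reranking.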

TIP

Report every metric at the K your pipeline actually passes to the model, not an arbitrary large K. A retriever that returns the whole corpus scores Recall 1.0 at K equal to corpus size and tells you nothing.
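The effect is easy to reproduce. In this sketch a deliberately bad retriever ranks the two relevant chunks last in a ten-chunk corpus; the ids and the production K of 3 are made up for the illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Of all relevant chunks, the fraction that appears in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


corpus = [f"c{i}" for i in range(10)]  # ten chunks, c0 .. c9
relevant = {"c8", "c9"}
ranked = corpus                        # worst case: relevant chunks ranked last

production_k = 3                       # assumed: what the pipeline passes to the model
print(recall_at_k(ranked, relevant, production_k))  # 0.0, the honest number
print(recall_at_k(ranked, relevant, len(corpus)))   # 1.0, and meaningless
```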