Recall@K: measuring retrieval coverage in RAG

Recall@K is the fraction of all chunks relevant to a query that appear in the top K retrieved. It measures coverage: did the retriever put the evidence in front of the generator? It says nothing about where in the top K each chunk landed.

Definition

Recall@K is set membership, not ranking. Take the set of chunks a labeler marked relevant for a query, and the top K chunks the retriever returned. Recall@K is the size of their intersection divided by the size of the relevant set ^[1]. Average across queries for a run-level number. This is the classic IR definition of recall truncated at rank K. It is the single most load-bearing retrieval number in a RAG pipeline, because a chunk that never enters the top K can never be cited by the generator ^[2].

Binary (one-hot) relevance labels are enough. You need the ids of the relevant chunks per query and nothing more; graded relevance only pays off in rank-aware metrics such as NDCG. Producing that ground truth is the hard part, not computing the metric.
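The run-level average described above can be sketched as follows. This is a minimal illustration, not the hub's code: `mean_recall_at_k`, the `runs` and `gold` dictionaries, and the query ids are hypothetical names chosen for the example, and a query with no retrieval results is assumed to count as an empty list.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: dict[str, list[str]], gold: dict[str, set[str]], k: int
) -> float:
    """Macro-average Recall@K over every labeled query (hypothetical helper)."""
    # Assumption: a labeled query missing from `runs` is scored as an empty retrieval.
    return mean(recall_at_k(runs.get(qid, []), rel, k) for qid, rel in gold.items())


# Toy labels: chunk ids marked relevant per query.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
# Toy retriever output: ranked chunk ids per query.
runs = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}

print(mean_recall_at_k(runs, gold, k=2))  # q1 scores 0.5, q2 scores 0.0, mean 0.25
```

Averaging per query (a macro average) weights every query equally regardless of how many relevant chunks it has, which is the usual convention for a run-level number.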

The formula

$\text{Recall@K} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}_{1..K}\}|}{|\{\text{relevant}\}|}$

The denominator is fixed by the labels, not by K. Adding more retrieved chunks can only raise the numerator or leave it flat, so Recall@K is monotonic non-decreasing in K. At K equal to the corpus size, a retriever that ranks every chunk scores Recall@K of 1.0 by construction. That property is the source of the most common misreport of this metric; see the caution below.
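The monotonicity claim is easy to check numerically. The sketch below sweeps K over a toy ranking; the chunk ids (`x1`, `d1`, and so on) are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Toy ranking: three of the four relevant chunks appear, interleaved with noise.
retrieved = ["x1", "d1", "x2", "d2", "x3", "d3"]
relevant = {"d1", "d2", "d3", "d4"}

# Recall@K for K = 1..6.
curve = [recall_at_k(retrieved, relevant, k) for k in range(1, len(retrieved) + 1)]
print(curve)  # [0.0, 0.25, 0.25, 0.5, 0.5, 0.75]

# Each step either holds or rises; it never falls.
assert all(a <= b for a, b in zip(curve, curve[1:]))
```

The curve tops out at 0.75 here because `d4` is never returned. It would reach 1.0 only once K is large enough for the retriever to have returned every relevant chunk.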

A worked example

A query has 4 truly relevant chunks in the corpus: $\{d_1, d_2, d_3, d_4\}$. The retriever returns 10 chunks. Three of the relevant four appear somewhere in that top 10; $d_4$ does not.

$\text{Recall@10} = \frac{|\{d_1, d_2, d_3\}|}{|\{d_1, d_2, d_3, d_4\}|} = \frac{3}{4} = 0.75$

The answer is 0.75 whether the three hits sit at ranks 1, 2, 3 or at ranks 8, 9, 10. Order does not enter the formula. If you care that the hits are buried near rank 10, that is an MRR or NDCG concern, not a recall one. Recall@10 = 0.75 means one supporting chunk in four is missing from the prompt entirely.
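The worked example can be reproduced in a few lines. The filler ids `x1` to `x7` are placeholders for the seven non-relevant chunks in the top 10; they are an assumption of this sketch.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}
fillers = [f"x{i}" for i in range(1, 8)]  # seven non-relevant chunks

hits_first = ["d1", "d2", "d3"] + fillers  # hits at ranks 1, 2, 3
hits_last = fillers + ["d1", "d2", "d3"]   # hits at ranks 8, 9, 10

print(recall_at_k(hits_first, relevant, 10))  # 0.75
print(recall_at_k(hits_last, relevant, 10))   # 0.75, rank position is ignored
```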

The Python

Reuse the function from the retrieval metrics hub verbatim. No new code for this metric.

# retrieval_metrics.py

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

The set(retrieved[:k]) truncation is the whole metric. Everything about rank order is discarded on that line, which is correct: Recall@K is deliberately order-blind. Guard the empty-relevant case; a query with no gold chunks should not contribute a divide-by-zero, and returning 0.0 keeps it from silently inflating the average.

When does Recall@K matter most?

Recall@K is the metric to gate on when a single miss is expensive and rank is secondary.

Multi-hop questions. A question that needs several supporting documents to answer, such as the two-hop chains in HotpotQA, fails if any one supporting chunk is missing, regardless of how well the others ranked ^[3]. Recall@K over the full supporting set is the metric that tracks this; MRR, which only credits the first hit, does not.
Compliance and legal retrieval. Missing a relevant clause is a correctness failure, not a ranking annoyance. High recall at a defensible K is the requirement.
Any pipeline with a downstream reranker. The reranker fixes order; it cannot recover a chunk the first-stage retriever never surfaced. Optimize the first stage for recall at a generous K, then let the reranker optimize precision and order over that candidate pool ^[2].
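The multi-hop point above can be made concrete by scoring one ranking with both metrics. This is a sketch: `reciprocal_rank` is a plain per-query reciprocal rank written for this example (MRR is its mean over queries), and the chunk ids are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k, or 0.0 if there is none."""
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# A two-hop question needs both supporting chunks; only the first was retrieved.
relevant = {"hop1", "hop2"}
retrieved = ["hop1", "n1", "n2", "n3", "n4"]

print(reciprocal_rank(retrieved, relevant, 5))  # 1.0, looks perfect
print(recall_at_k(retrieved, relevant, 5))      # 0.5, the second hop is missing
```

A perfect reciprocal rank alongside a recall of 0.5 is the signature of a multi-hop miss: the first hit ranked well, but the generator still lacks half the evidence.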

Report Recall@K at the K that matches how many chunks the prompt actually consumes. A three-chunk chat prompt cares about Recall@3; a twenty-chunk long-context summarizer cares about Recall@20. Pair it with Ragas context recall on the same labeled set for an LLM-graded view of the same coverage question ^[4] ^[5].

Recall@K versus Hit-Rate

Hit-Rate is the binary collapse of Recall@K: did any relevant chunk appear in the top K? It answers yes or no, ignoring how many of the relevant set you found. Reuse the hub function:

def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    return int(bool(set(retrieved[:k]) & relevant))

Hit-Rate is the right crude metric for a CI smoke tier: cheap to compute, trivial to read, and flake-free. As a rule of thumb, a run that drops below 90 percent hit-rate at production K has a real regression, assuming the baseline sits comfortably above that line. Recall@K is the finer instrument you graduate to when "we found at least one" is not a high enough bar, which is exactly the multi-hop and compliance cases above. On the earlier example, Hit-Rate@10 is 1 (three hits, so at least one) while Recall@10 is 0.75. The gap between them is the information Hit-Rate throws away.

TIP

Gate PR-time CI on Hit-Rate at production K and release gates on Recall@K with an explicit threshold. Hit-Rate is stable enough not to flake the fast tier; Recall@K carries the coverage signal the release actually depends on.
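One way to wire the two tiers together is sketched below. Everything beyond `recall_at_k` and `hit_rate` is an assumption of this example: the `evaluate` helper, the threshold constants and their values, and the toy data are illustrative, not a prescribed configuration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    return int(bool(set(retrieved[:k]) & relevant))


def evaluate(
    runs: dict[str, list[str]], gold: dict[str, set[str]], k: int
) -> tuple[float, float]:
    """Return (mean Hit-Rate@k, mean Recall@k) over all labeled queries."""
    if not gold:
        raise ValueError("no labeled queries to evaluate")
    n = len(gold)
    hr = sum(hit_rate(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    rc = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    return hr, rc


# Hypothetical thresholds; pick values that match your own baseline.
PR_MIN_HIT_RATE = 0.90
RELEASE_MIN_RECALL = 0.90

gold = {"q1": {"d1", "d2", "d3", "d4"}, "q2": {"e1"}}
runs = {"q1": ["d1", "d2", "d3", "x1"], "q2": ["e1", "x2"]}

hr, rc = evaluate(runs, gold, k=10)
pr_ok = hr >= PR_MIN_HIT_RATE          # True: every query found at least one chunk
release_ok = rc >= RELEASE_MIN_RECALL  # False: mean recall is (0.75 + 1.0) / 2 = 0.875
print(hr, rc, pr_ok, release_ok)
```

In this toy run the fast tier passes while the release gate fails, which is the division of labor the tip describes.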

Recall@K vs Hit-Rate vs Precision@K

| Metric | Question answered | Labels needed | Order-sensitive? | When to use |
| --- | --- | --- | --- | --- |
| Recall@K | What fraction of relevant chunks are in the top K? | one-hot relevant set | no | multi-hop, compliance, pre-reranker recall |
| Hit-Rate | Did any relevant chunk appear in the top K? | one-hot relevant set | no | CI smoke tier, cheap regression alarm |
| Precision@K | What fraction of the top K are relevant? | one-hot relevant set | no (set-based) | prompt-noise budget, tight-grounding generators |

All three are set-based and order-blind. The moment rank matters, move to MRR or NDCG. Recall@K and Precision@K trade off against each other: raising K can only hold or lift recall, and it almost always lowers precision.
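The trade-off shows up on even a six-chunk ranking. In this sketch `precision_at_k` divides by K, the common convention, and assumes the retrieved list has no duplicate ids; the function and the data are illustrative rather than taken from the hub.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k that is relevant (assumes unique chunk ids)."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


relevant = {"d1", "d2"}
retrieved = ["d1", "x1", "d2", "x2", "x3", "x4"]

# Map each K to its (precision, recall) pair.
sweep = {
    k: (precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k))
    for k in (1, 3, 6)
}
for k, (p, r) in sweep.items():
    print(f"K={k}: precision={p:.2f} recall={r:.2f}")
# K=1: precision=1.00 recall=0.50
# K=3: precision=0.67 recall=1.00
# K=6: precision=0.33 recall=1.00
```

Past K=3 every additional chunk is noise: recall is already saturated, so the only thing a larger K changes is how diluted the prompt becomes.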

The failure mode to avoid

CAUTION

A retriever that returns the entire corpus scores Recall@K of 1.0 at K equal to corpus size. Recall@K is only meaningful at a fixed, production-aligned K: the K your prompt actually consumes. A headline recall number without its K is not a metric; it is decoration ^[1].

The second half of the same trap: raising K to lift recall floods the prompt with irrelevant chunks. Those chunks are the noise that degrades generation, distracts the model, and inflates cost and latency. A retriever tuned for Recall@50 can produce worse answers than one tuned for Recall@5 despite the higher recall number, because the generator drowns. Recall@K is a first-stage diagnostic, not the objective. Optimize it at the K you serve, then verify downstream that generation held. For the workflow of separating a first-stage recall miss from a rank problem from a generation problem, see retriever debugging. To wire all of this into an LLM-graded suite, follow the RAG eval with Ragas recipe.
