Recall@K is the fraction of all chunks relevant to a query that appear in the top K retrieved. It measures coverage: did the retriever put the evidence in front of the generator? It says nothing about where in the top K each chunk landed.
Definition
Recall@K is set membership, not ranking. Take the set of chunks a labeler marked relevant for a query. Take the top K the retriever returned. Recall@K is the size of the intersection over the size of the relevant set [1]. Average across queries for a run-level number. This is the classic IR definition of recall truncated at rank K; it is the single most load-bearing retrieval number in a RAG pipeline, because a chunk that never enters the top K can never be cited by the generator [2].
One-hot labels are enough. You need the ids of the relevant chunks per query and nothing more; graded relevance is only required for NDCG. Producing that ground truth is the hard part, not computing the metric.
$$\text{Recall@K} = \frac{\left|\{\text{relevant}\} \cap \{\text{retrieved}_{1..K}\}\right|}{\left|\{\text{relevant}\}\right|}$$
The denominator is fixed by the labels, not by K. Adding more retrieved chunks can only raise the numerator or leave it flat, so Recall@K is monotonic non-decreasing in K. At K equal to the corpus size, Recall@K is 1.0 by construction. That property is the source of the most common misreport of this metric; see the caution below.
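The monotonicity claim is easy to check by sweeping K over one full ranking. The sketch below is illustrative only: the six-chunk corpus, the ids, and the ranking are hypothetical, and the function body mirrors the hub function shown later on this page.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical six-chunk corpus, fully ranked by a retriever for one query.
ranking = ["c3", "c1", "c6", "c2", "c5", "c4"]
relevant = {"c1", "c2", "c4"}

# Recall@K for K = 1 .. corpus size.
curve = [recall_at_k(ranking, relevant, k) for k in range(1, len(ranking) + 1)]

for k, value in enumerate(curve, start=1):
    print(f"Recall@{k} = {value:.2f}")
```

The curve never drops as K grows, and it reaches 1.0 once K equals the corpus size, which is why a recall figure reported without its K carries no information.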
A worked example
A query has 4 truly relevant chunks in the corpus: {d1,d2,d3,d4}. The retriever returns 10 chunks. Three of the relevant four appear somewhere in that top 10; d4 does not.
$$\text{Recall@10} = \frac{\left|\{d_1, d_2, d_3\}\right|}{\left|\{d_1, d_2, d_3, d_4\}\right|} = \frac{3}{4} = 0.75$$
The answer is 0.75 whether the three hits sit at ranks 1, 2, 3 or at ranks 8, 9, 10. Order does not enter the formula. If you care that the hits are buried near rank 10, that is an MRR or NDCG concern, not a recall one. Recall@10 = 0.75 means one supporting chunk in four is missing from the prompt entirely.
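To make the order-blindness concrete, here is a minimal sketch that scores the worked example twice: once with the three hits at ranks 1 to 3, once with them at ranks 8 to 10. The `x1`..`x7` ids are hypothetical placeholders for irrelevant chunks, and the function body mirrors the hub function.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical ids for the seven irrelevant chunks in the top 10.
noise = [f"x{i}" for i in range(1, 8)]

hits_first = ["d1", "d2", "d3"] + noise  # relevant chunks at ranks 1-3
hits_last = noise + ["d1", "d2", "d3"]   # relevant chunks at ranks 8-10

recall_first = recall_at_k(hits_first, relevant, 10)
recall_last = recall_at_k(hits_last, relevant, 10)
print(recall_first, recall_last)
```

Both rankings score 0.75, because `d4` is missing from each and position inside the top 10 never enters the calculation.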
The Python
Reuse the function from the retrieval metrics hub verbatim. No new code for this metric.
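```python
# retrieval_metrics.py
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k retrieved."""
    # A query with no gold chunks scores 0.0 instead of dividing by zero.
    if not relevant:
        return 0.0
    # Truncate to the top k, drop order, and count the overlap.
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```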
The set(retrieved[:k]) truncation is the whole metric. Everything about rank order is discarded on that line, which is correct: Recall@K is deliberately order-blind. Guard the empty-relevant case; a query with no gold chunks should not contribute a divide-by-zero, and returning 0.0 keeps it from silently inflating the average.
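For the run-level number, one plausible approach is a plain macro-average over labeled queries. The helper below is a sketch, not part of the hub module: `mean_recall_at_k`, the `runs` and `gold` dictionaries, and the query ids are all hypothetical names chosen for illustration.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: dict[str, list[str]], gold: dict[str, set[str]], k: int
) -> float:
    """Macro-average Recall@K over every query in `gold` (hypothetical helper)."""
    if not gold:
        return 0.0
    # A query missing from `runs` is scored as if nothing was retrieved.
    return mean(recall_at_k(runs.get(qid, []), rel, k) for qid, rel in gold.items())


# Hypothetical labels and retrieval output; q3 has no gold chunks.
gold = {"q1": {"a", "b"}, "q2": {"c"}, "q3": set()}
runs = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"], "q3": ["x"]}

print(mean_recall_at_k(runs, gold, 2))
print(mean_recall_at_k(runs, gold, 3))
```

Note that the unlabeled query `q3` counts as 0.0 and pulls the mean down. If unlabeled queries should not count against the retriever, filtering them out before averaging is a reasonable alternative; either way, state which convention the reported number uses.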
When does Recall@K matter most?
Recall@K is the metric to gate on when a single miss is expensive and rank is secondary.
- Multi-hop questions. A question that needs several supporting documents to answer, such as the two-hop chains in HotpotQA, fails if any one supporting chunk is missing, regardless of how well the others ranked [3]. Recall@K over the full supporting set is the metric that tracks this; MRR, which only credits the first hit, does not.
- Compliance and legal retrieval. Missing a relevant clause is a correctness failure, not a ranking annoyance. High recall at a defensible K is the requirement.
- Any pipeline with a downstream reranker. The reranker fixes order; it cannot recover a chunk the first-stage retriever never surfaced. Optimize the first stage for recall at a generous K, then let the reranker optimize precision and order over that candidate pool [2].
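The multi-hop point can be illustrated with a toy two-hop query. In this sketch the chunk ids are hypothetical, and `reciprocal_rank` is a simple per-query stand-in for MRR written for this example rather than taken from the hub.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical two-hop question: both supporting chunks are required.
supporting = {"hop1", "hop2"}
retrieved = ["hop1", "n1", "n2", "n3", "n4"]  # hop2 never surfaces

rr = reciprocal_rank(retrieved, supporting)
recall = recall_at_k(retrieved, supporting, 5)
answerable = recall == 1.0  # every supporting chunk made it into the top 5

print(rr, recall, answerable)
```

The reciprocal rank is perfect because the first hit is at rank 1, yet only half of the supporting set was retrieved, so the question cannot be answered from the prompt. Recall@K over the full supporting set exposes the miss that the first-hit metric hides.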
Report Recall@K at the K that matches how many chunks the prompt actually consumes. A three-chunk chat prompt cares about Recall@3; a twenty-chunk long-context summarizer cares about Recall@20. Pair it with Ragas context recall on the same labeled set for an LLM-graded view of the same coverage question [4] [5].
Recall@K versus Hit-Rate
Hit-Rate is the binary collapse of Recall@K: did any relevant chunk appear in the top K? It answers yes or no, ignoring how many of the relevant set you found. Reuse the hub function:
```python
def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant chunk id appears in the top k retrieved, else 0."""
    return int(bool(set(retrieved[:k]) & relevant))
```
Hit-Rate is the right crude metric for a CI smoke tier: cheap to compute, trivial to read, and flake-free. A run dropping below 90 percent hit-rate at production K has a real regression. Recall@K is the finer instrument you graduate to when "we found at least one" is not a high enough bar, which is exactly the multi-hop and compliance cases above. On the earlier example, Hit-Rate@10 is 1 (three hits, so at least one) while Recall@10 is 0.75. The gap between them is the information Hit-Rate throws away.
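A short sketch of that gap, using the worked example plus a weaker, hypothetical retriever that finds only one of the four relevant chunks. The `x`-prefixed ids are placeholders for irrelevant chunks; both function bodies mirror the hub functions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant chunk id appears in the top k retrieved, else 0."""
    return int(bool(set(retrieved[:k]) & relevant))


relevant = {"d1", "d2", "d3", "d4"}
noise = [f"x{i}" for i in range(1, 10)]  # hypothetical irrelevant chunk ids

three_hits = ["d1", "d2", "d3"] + noise[:7]  # the worked example
one_hit = ["d1"] + noise[:9]                 # a weaker, hypothetical retriever

for name, run in [("three_hits", three_hits), ("one_hit", one_hit)]:
    print(name, hit_rate(run, relevant, 10), recall_at_k(run, relevant, 10))
```

Both runs score Hit-Rate@10 of 1, while Recall@10 separates them at 0.75 and 0.25. Hit-Rate cannot tell these two retrievers apart.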
TIP
Gate PR-time CI on Hit-Rate at production K and release gates on Recall@K with an explicit threshold. Hit-Rate is stable enough not to flake the fast tier; Recall@K carries the coverage signal the release actually depends on.
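One way the two-tier gate could look in code is sketched below. Everything beyond the two metric functions is an assumption: `PRODUCTION_K`, the 0.80 recall floor, the `evaluate` helper, and the gate function names are hypothetical, and the 0.90 hit-rate floor simply reuses the 90 percent figure mentioned above.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> int:
    return int(bool(set(retrieved[:k]) & relevant))


PRODUCTION_K = 3       # assumption: the prompt consumes three chunks
HIT_RATE_FLOOR = 0.90  # the 90 percent smoke-tier figure from the text
RECALL_FLOOR = 0.80    # hypothetical release threshold; set your own


def evaluate(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int):
    """Return (mean hit-rate, mean recall) over all labeled queries."""
    hits = mean(hit_rate(runs.get(q, []), rel, k) for q, rel in gold.items())
    recall = mean(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items())
    return hits, recall


def pr_gate_passes(runs, gold) -> bool:
    return evaluate(runs, gold, PRODUCTION_K)[0] >= HIT_RATE_FLOOR


def release_gate_passes(runs, gold) -> bool:
    return evaluate(runs, gold, PRODUCTION_K)[1] >= RECALL_FLOOR


# Hypothetical labeled set: every query gets a hit, but q1 misses chunk "b".
gold = {"q1": {"a", "b"}, "q2": {"c"}}
runs = {"q1": ["a", "x", "y"], "q2": ["c", "z", "w"]}

print(pr_gate_passes(runs, gold), release_gate_passes(runs, gold))
```

On this toy data the PR gate passes (hit-rate 1.0) while the release gate fails (mean recall 0.75), which is the division of labor the tip describes.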
Recall@K vs Hit-Rate vs Precision@K
| Metric | Question answered | Labels needed | Order-sensitive? | When to use |
|---|---|---|---|---|
| Recall@K | What fraction of relevant chunks are in the top K? | one-hot relevant set | no | multi-hop, compliance, pre-reranker recall |
| Hit-Rate | Did any relevant chunk appear in the top K? | one-hot relevant set | no | CI smoke tier, cheap regression alarm |
| Precision@K | What fraction of the top K are relevant? | one-hot relevant set | no (set-based) | prompt-noise budget, tight-grounding generators |
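All three are set-based and order-blind. The moment rank matters, move to MRR or NDCG. Recall@K and Precision@K trade off against each other: raising K can never lower recall, and it almost always lowers precision, because each extra slot is more likely to hold an irrelevant chunk than a relevant one.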
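The recall and precision trade-off can be seen by sweeping K over one ranking. This is a sketch under stated assumptions: `precision_at_k` is not shown in the hub excerpt on this page, so the version here is a hypothetical implementation that divides by the number of chunks actually returned (some definitions divide by K instead) and assumes retrieved ids are unique. The corpus and ranking are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k retrieved ids that are relevant (assumes unique ids)."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & relevant) / len(top_k)


# Hypothetical ranking over a six-chunk corpus with three relevant chunks.
ranking = ["c1", "c3", "c2", "c6", "c5", "c4"]
relevant = {"c1", "c2", "c4"}

rows = [
    (k, recall_at_k(ranking, relevant, k), precision_at_k(ranking, relevant, k))
    for k in (1, 3, 6)
]
for k, recall, precision in rows:
    print(f"K={k}  recall={recall:.2f}  precision={precision:.2f}")
```

As K goes from 1 to 6, recall climbs from about 0.33 to 1.0 while precision falls from 1.0 to 0.5.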
The failure mode to avoid
CAUTION
A retriever that returns the entire corpus scores Recall 1.0 at K equal to corpus size. Recall@K is only meaningful at a fixed, production-aligned K, the K your prompt actually consumes. A headline Recall number without its K is not a metric, it is decoration [1].
The second half of the same trap: raising K to lift recall floods the prompt with irrelevant chunks. Those chunks are the noise that degrades generation, distracts the model, and inflates cost and latency. A retriever tuned for Recall@50 can produce worse answers than one tuned for Recall@5 despite the higher recall number, because the generator drowns. Recall@K is a first-stage diagnostic, not the objective. Optimize it at the K you serve, then verify downstream that generation held. For the workflow of separating a first-stage recall miss from a rank problem from a generation problem, see retriever debugging. To wire all of this into an LLM-graded suite, follow the RAG eval with Ragas recipe.