Mean Reciprocal Rank is the average, across queries, of 1 divided by the rank of the first relevant chunk. It rewards getting a relevant result near the top of the ranking and ignores everything after the first hit.
What MRR measures
MRR scores one thing per query: how deep you had to look before the first relevant chunk appeared. The first position scores 1.0, the second 0.5, the third 0.33, the tenth 0.1; a query with no relevant chunk in the list scores 0. Averaging those reciprocals over the query set gives a single number in [0, 1] where higher is better. It is a rank-position metric, not a set-coverage metric: a second relevant chunk at rank 2 changes nothing once rank 1 is already a hit [1].
The metric comes from the TREC-8 Question Answering track, where systems returned a ranked list of candidate answers and were scored on the reciprocal rank of the first correct one [2]. The RAG usage is the same idea with chunks instead of answer strings, and BEIR reports MRR alongside NDCG and Recall as a standard retrieval metric [3].
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
$Q$ is the set of queries. $\mathrm{rank}_i$ is the 1-indexed position of the first relevant chunk for query $i$. When no relevant chunk appears in the returned list, the reciprocal $1/\mathrm{rank}_i$ is defined as 0 for that query. Only the first hit enters the sum; the rest of the ranking is invisible to MRR.
Worked example
Three queries. For each, the rank of the first relevant chunk in the returned list:
| Query | Rank of first relevant chunk | Reciprocal rank |
|---|---|---|
| q1 | 1 | 1/1=1.0 |
| q2 | 4 | 1/4=0.25 |
| q3 | 2 | 1/2=0.5 |
$$\mathrm{MRR} = \frac{1.0 + 0.25 + 0.5}{3} = \frac{1.75}{3} \approx 0.583$$
The single deep hit at rank 4 (q2) drags the mean down hard: moving that hit from rank 4 to rank 1 would lift MRR from 0.583 to 0.833. That sensitivity to the top few positions is the whole point of the metric.
The Python
Reuse mrr from the retrieval metrics hub without modification. It walks the ranked list once and returns the reciprocal of the first hit, or 0 if there is none.
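```python
# retrieval_metrics.py
def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, or 0.0 if there is none."""
    # Ranks are 1-indexed, so the first position scores 1.0.
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0
```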
The corpus MRR is the mean of mrr(...) over the query set. Reproducing the worked example:
```python
runs = [
    (["c1", "c9", "c3"], {"c1"}),        # first relevant at rank 1
    (["c2", "c8", "c7", "c4"], {"c4"}),  # first relevant at rank 4
    (["c5", "c6", "c0"], {"c6"}),        # first relevant at rank 2
]
scores = [mrr(r, rel) for r, rel in runs]  # [1.0, 0.25, 0.5]
corpus_mrr = sum(scores) / len(scores)     # 0.583...
```
When is MRR the right metric?
MRR fits when only the first hit matters to downstream generation. Three cases.
- Single-answer lookup and FAQ bots: the query has one correct chunk, and finding it near the top is the entire job.
- Generators that reliably read only the top one or two chunks: if the prompt template pastes retrieved[:2], a relevant chunk at rank 5 is functionally a miss, and MRR captures that where a generous Recall@10 would not.
- Comparing retrievers on time-to-first-relevant: two retrievers with identical Recall@10 can differ sharply on where the first hit lands, and MRR is the one number that separates them [4].
Skip MRR when the generator synthesizes across several chunks or the query is multi-hop. There, coverage and ranking of all relevant chunks matter, so reach for Recall@K and NDCG instead.
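For the case where the generator only reads the first few chunks, one option is to truncate the ranking before scoring, so that a hit below the cutoff counts as a strict miss. The sketch below is an illustration under that assumption, not part of the retrieval metrics hub: the name `reciprocal_rank_at_k`, the cutoff `k`, and the chunk IDs are all hypothetical.

```python
def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, else 0.0."""
    # Only the first k positions are visible, mirroring a prompt that pastes retrieved[:k].
    for i, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0


# Hypothetical ranking: the only relevant chunk sits at rank 5.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c1"}

full = reciprocal_rank_at_k(retrieved, relevant, k=len(retrieved))  # 1/5 = 0.2
top2 = reciprocal_rank_at_k(retrieved, relevant, k=2)               # 0.0, hit is below the cutoff
```

Plain MRR already penalizes the rank-5 hit with a score of 0.2; the truncated variant goes further and scores it as 0, which matches what a top-2 prompt template actually sees.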
MRR versus Recall@K versus NDCG
Each metric answers a different question about the same ranked list. Log all three; the divergence between them is the diagnostic.
| Metric | Question it answers | Label type | Credits chunks after the first hit | Uses graded relevance |
|---|---|---|---|---|
| Recall@K | Of all relevant chunks, what fraction is in the top K? | binary | yes | no |
| MRR | How high is the first relevant chunk? | binary | no | no |
| NDCG | Are the strongest chunks ranked near the top? | graded | yes | yes |
A run where Recall@5 rises but MRR falls means the relevant chunks are still in the top 5 but sitting lower in the order, which usually degrades a generator that leans on the first chunk. That contradiction is only visible if you track both [5].
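The divergence described above can be reproduced on a toy query. This is a minimal sketch with made-up chunk IDs: `recall_at_k` is a hypothetical helper written for this example (it assumes chunk IDs in a ranking are unique), and `reciprocal_rank` restates the per-query reciprocal-rank logic so the block runs on its own.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal of the 1-indexed rank of the first relevant chunk, else 0.0."""
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k (IDs assumed unique)."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


relevant = {"a", "b"}
before = ["a", "x", "y", "z", "w"]  # one of two relevant chunks in the top 5, at rank 1
after = ["x", "y", "a", "b", "w"]   # both relevant chunks in the top 5, at ranks 3 and 4

recall_before = recall_at_k(before, relevant, k=5)  # 1 of 2 relevant chunks
recall_after = recall_at_k(after, relevant, k=5)    # 2 of 2 relevant chunks
rr_before = reciprocal_rank(before, relevant)       # first hit at rank 1
rr_after = reciprocal_rank(after, relevant)         # first hit at rank 3
```

Recall@5 improves from 0.5 to 1.0 while the reciprocal rank falls from 1.0 to about 0.33, which is the pattern that hurts a generator leaning on the first chunk.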
CAUTION
MRR ignores every relevant chunk after the first, so it misleads on multi-hop questions that require evidence from several chunks. Read the absolute value carefully too: an MRR of 0.4 means the top result is wrong at least 60% of the time, because the share of queries with a rank-1 hit can never exceed the MRR. The reciprocal, 1/0.4 = 2.5, is the harmonic mean of the first-hit ranks, so the ordinary average rank of the first hit is 2.5 or deeper. That is a different failure from a low Recall@5, where the relevant chunks may simply be absent from the top 5 entirely. Diagnose them separately.
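To see how an MRR of 0.4 relates to the ranks behind it, the sketch below uses four hypothetical first-hit ranks chosen to produce that value. The ranks are illustrative, not measured data, and the sketch assumes every query has a hit. Note that 1/MRR is the harmonic mean of the ranks, which can sit well above the arithmetic mean rank's lower bound.

```python
# Hypothetical first-hit ranks for four queries (illustrative, not measured data).
first_hit_ranks = [1, 5, 5, 5]

n = len(first_hit_ranks)
mrr_value = sum(1.0 / r for r in first_hit_ranks) / n     # about 0.4
top1_hit_rate = sum(r == 1 for r in first_hit_ranks) / n  # 0.25: top result wrong 75% of the time
arithmetic_mean_rank = sum(first_hit_ranks) / n           # 4.0
harmonic_mean_rank = 1.0 / mrr_value                      # about 2.5

# The rank-1 hit rate is bounded above by MRR, and the arithmetic mean rank
# is bounded below by the harmonic mean rank.
assert top1_hit_rate <= mrr_value
assert arithmetic_mean_rank >= harmonic_mean_rank
```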
How to gate on it
For release gates, pair MRR with Recall@K and set explicit thresholds. MRR above 0.6 is a reasonable start for a single-answer retriever; a lookup-style FAQ system should target higher. A drop in MRR with flat Recall@K points at the reranker or the scoring function, not at recall. When the first hit lands too deep, inspect the ranking with retriever debugging and validate the end-to-end pattern against the RAG eval with Ragas recipe. A metric without a threshold is decoration; the threshold is what turns MRR into a gate.
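A gate of this kind might be sketched as follows. The 0.6 MRR floor is the starting point suggested above; the Recall@K floor of 0.8, the `GateThresholds` class, and the `check_gate` function are assumptions made for illustration and should be replaced with values and names that fit your system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    min_mrr: float = 0.6          # starting point for a single-answer retriever
    min_recall_at_k: float = 0.8  # assumed placeholder, not a recommendation from the text


def check_gate(
    mrr_value: float,
    recall_at_k: float,
    thresholds: GateThresholds = GateThresholds(),
) -> list[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    if mrr_value < thresholds.min_mrr:
        failures.append(f"MRR {mrr_value:.3f} is below {thresholds.min_mrr:.3f}")
    if recall_at_k < thresholds.min_recall_at_k:
        failures.append(
            f"Recall@K {recall_at_k:.3f} is below {thresholds.min_recall_at_k:.3f}"
        )
    return failures


# Example: MRR from the worked example with a hypothetical Recall@K of 0.9.
failures = check_gate(mrr_value=0.583, recall_at_k=0.9)
gate_passed = not failures  # False here: MRR is under the 0.6 floor
```

Returning the list of failures rather than a single boolean keeps the two signals separate, so a failed release shows whether ranking or recall is the problem.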