
Mean Reciprocal Rank (MRR): time-to-first-relevant

Mean Reciprocal Rank is the average, across queries, of 1 divided by the rank of the first relevant chunk. It rewards getting a relevant result near the top of the ranking and ignores everything after the first hit.

What MRR measures

MRR scores one thing per query: how deep you had to look before the first relevant chunk appeared. The first position scores 1.0, the second 0.5, the third 0.33, the tenth 0.1; a query with no relevant chunk in the list scores 0. Averaging those reciprocals over the query set gives a single number in [0, 1] where higher is better. It is a rank-position metric, not a set-coverage metric: a second relevant chunk at rank 2 changes nothing once rank 1 is already a hit ^[1].

The metric comes from the TREC-8 Question Answering track, where systems returned a ranked list of candidate answers and were scored on the reciprocal rank of the first correct one ^[2]. The RAG usage is the same idea with chunks instead of answer strings, and BEIR reports MRR alongside NDCG and Recall as a standard retrieval metric ^[3].

The formula

$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

$Q$ is the set of queries. $\text{rank}_i$ is the 1-indexed position of the first relevant chunk for query $i$. When no relevant chunk appears in the returned list, the reciprocal $1/\text{rank}_i$ is defined as 0 for that query. Only the first hit enters the sum; the rest of the ranking is invisible to MRR.

Worked example

Three queries. For each, the rank of the first relevant chunk in the returned list:

Query	Rank of first relevant chunk	Reciprocal rank
q1	1	$1/1 = 1.0$
q2	4	$1/4 = 0.25$
q3	2	$1/2 = 0.5$

$$\text{MRR} = \frac{1.0 + 0.25 + 0.5}{3} = \frac{1.75}{3} \approx 0.583$$

The single deep hit at rank 4 (q2) drags the mean down hard: moving that hit from rank 4 to rank 1 would lift MRR from 0.583 to 0.833. Note that q2 is not a miss; a true miss would score 0 rather than 0.25. That sensitivity to the top few positions is the whole point of the metric.
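The sensitivity claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: `mean_reciprocal_rank` is a hypothetical helper that takes first-hit ranks directly (with `None` for a query that has no relevant chunk), and the rank-10 scenario is an added assumption used to show how little the metric moves once a hit is already deep.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Average 1/rank over queries; None marks a query with no relevant chunk."""
    reciprocals = [0.0 if r is None else 1.0 / r for r in first_hit_ranks]
    return sum(reciprocals) / len(reciprocals)

baseline = mean_reciprocal_rank([1, 4, 2])     # worked example: (1.0 + 0.25 + 0.5) / 3
q2_at_top = mean_reciprocal_rank([1, 1, 2])    # q2 moved from rank 4 to rank 1
q2_deeper = mean_reciprocal_rank([1, 10, 2])   # q2 moved from rank 4 to rank 10

print(round(baseline, 3))    # 0.583
print(round(q2_at_top, 3))   # 0.833
print(round(q2_deeper, 3))   # 0.533
```

Moving q2 up three positions gains 0.25, while pushing it six positions deeper costs only 0.05. The reciprocal compresses differences far down the list and magnifies them at the top.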

The Python

Reuse mrr from the retrieval metrics hub without modification. It walks the ranked list once and returns the reciprocal of the first hit, or 0 if there is none.

# retrieval_metrics.py
def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

The corpus MRR is the mean of mrr(...) over the query set. Reproducing the worked example:

runs = [
    (["c1", "c9", "c3"],       {"c1"}),  # first relevant at rank 1
    (["c2", "c8", "c7", "c4"], {"c4"}),  # first relevant at rank 4
    (["c5", "c6", "c0"],       {"c6"}),  # first relevant at rank 2
]
scores = [mrr(r, rel) for r, rel in runs]   # [1.0, 0.25, 0.5]
corpus_mrr = sum(scores) / len(scores)      # 0.583...
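One detail the worked example does not exercise is a query with no relevant chunk in the returned list. The sketch below repeats `mrr` so it runs on its own and wraps the averaging in a hypothetical `corpus_mrr` helper (not part of the hub module). The fourth query is an invented no-hit case, added to show that zero-scoring queries stay in the denominator.

```python
def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, or 0.0 if none is returned."""
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def corpus_mrr(runs) -> float:
    """Mean of per-query reciprocal ranks; no-hit queries count as 0, not as absent."""
    if not runs:
        raise ValueError("corpus_mrr needs at least one query")
    return sum(mrr(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

runs = [
    (["c1", "c9", "c3"],       {"c1"}),  # first relevant at rank 1
    (["c2", "c8", "c7", "c4"], {"c4"}),  # first relevant at rank 4
    (["c5", "c6", "c0"],       {"c6"}),  # first relevant at rank 2
    (["c2", "c3"],             {"c7"}),  # hypothetical query: no relevant chunk returned
]

correct = corpus_mrr(runs)  # 1.75 / 4 = 0.4375

# Common mistake: averaging only over queries that had a hit.
hit_scores = [s for s in (mrr(r, rel) for r, rel in runs) if s > 0]
inflated = sum(hit_scores) / len(hit_scores)  # 1.75 / 3, about 0.583
```

Dropping the no-hit query reports 0.583 instead of 0.4375, which hides exactly the failures the metric is supposed to surface.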

When is MRR the right metric?

MRR fits when only the first hit matters to downstream generation. Three cases.

Single-answer lookup and FAQ bots: the query has one correct chunk, and finding it near the top is the entire job.

Generators that reliably read only the top one or two chunks: if the prompt template pastes retrieved[:2], a relevant chunk at rank 5 is functionally a miss, and MRR captures that where a generous Recall@10 would not.

Comparing retrievers on time-to-first-relevant: two retrievers with identical Recall@10 can differ sharply on where the first hit lands, and MRR is the one number that separates them ^[4].
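The second and third cases can be made concrete with a cutoff variant of the metric. The sketch below is an assumption layered on the article's definition, not part of the hub module: `reciprocal_rank_at_k` is a hypothetical helper that treats any hit beyond position `k` as a miss (often written MRR@k), and the two retriever outputs are invented for illustration.

```python
def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, else 0.0."""
    for i, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

relevant = {"gold"}
retriever_a = ["gold", "n1", "n2", "n3", "n4"]  # hit at rank 1
retriever_b = ["n1", "n2", "n3", "n4", "gold"]  # hit at rank 5

# Identical coverage in the top 10.
recall_a = recall_at_k(retriever_a, relevant, 10)      # 1.0
recall_b = recall_at_k(retriever_b, relevant, 10)      # 1.0

# Very different time-to-first-relevant.
rr_a = reciprocal_rank_at_k(retriever_a, relevant, 10)  # 1.0
rr_b = reciprocal_rank_at_k(retriever_b, relevant, 10)  # 0.2

# With a prompt that only pastes retrieved[:2], rank 5 is a miss.
rr_b_top2 = reciprocal_rank_at_k(retriever_b, relevant, 2)  # 0.0
```

Matching `k` to the number of chunks the prompt actually consumes keeps the score aligned with what the generator can see.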

Skip MRR when the generator synthesizes across several chunks or the query is multi-hop. There, coverage and ranking of all relevant chunks matter, so reach for Recall@K and NDCG instead.

MRR versus Recall@K versus NDCG

Each metric answers a different question about the same ranked list. Log all three; the divergence between them is the diagnostic.

Metric	Question it answers	Label type	Credits chunks after the first hit	Uses graded relevance
Recall@K	Of all relevant chunks, what fraction is in the top K?	binary	yes	no
MRR	How high is the first relevant chunk?	binary	no	no
NDCG	Are the strongest chunks ranked near the top?	graded	yes	yes

A run where Recall@5 rises but MRR falls means more of the relevant chunks now reach the top 5, but the first of them sits lower in the order, which usually degrades a generator that leans on the first chunk. That contradiction is only visible if you track both ^[5].
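A single query is enough to reproduce that divergence. In the sketch below the chunk IDs and both rankings are invented, and `reciprocal_rank` and `recall_at_k` are minimal stand-ins written for this example rather than the hub's implementations.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, or 0.0 if there is none."""
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

relevant = {"r1", "r2", "r3"}
before = ["r1", "n1", "n2", "n3", "n4"]  # one relevant chunk, but at rank 1
after = ["n1", "n2", "r1", "r2", "r3"]   # all three relevant chunks, first at rank 3

recall_before = recall_at_k(before, relevant, 5)  # 1/3
recall_after = recall_at_k(after, relevant, 5)    # 1.0
rr_before = reciprocal_rank(before, relevant)     # 1.0
rr_after = reciprocal_rank(after, relevant)       # 1/3

assert recall_after > recall_before  # coverage improved
assert rr_after < rr_before          # time-to-first-relevant got worse
```

Logged alone, either metric would tell a misleading story about this change: Recall@5 reports a clear win, and the reciprocal rank reports a clear regression.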

CAUTION

MRR ignores every relevant chunk after the first, so it misleads on multi-hop questions that require evidence from several chunks. Read the absolute value carefully too: an MRR of 0.4 means the top result is wrong at least 60% of the time, because the fraction of queries with a rank-1 hit can never exceed the MRR. The reciprocal, 1/0.4 = 2.5, is a harmonic mean of first-hit ranks, not an arithmetic one, so it should not be read as the typical rank of the first hit. That is a different failure from a low Recall@5, where the relevant chunks may simply be absent from the top 5 entirely. Diagnose them separately.
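To see why a single MRR value says little about the shape of the rank distribution, the sketch below builds two invented query sets that both score exactly 0.4. The helper names are hypothetical, and `None` stands for a query with no relevant chunk in the returned list.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Average 1/rank over queries; None (no relevant chunk) contributes 0."""
    return sum(0.0 if r is None else 1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

def top1_hit_rate(first_hit_ranks):
    """Fraction of queries whose first relevant chunk is at rank 1."""
    return sum(1 for r in first_hit_ranks if r == 1) / len(first_hit_ranks)

# Ten queries each; both sets are made up for illustration.
all_or_nothing = [1, 1, 1, 1] + [None] * 6  # 4 perfect hits, 6 complete misses
always_second = [2] * 8 + [None] * 2        # 8 hits at rank 2, 2 complete misses

mrr_a = mean_reciprocal_rank(all_or_nothing)  # 4 * 1.0 / 10 = 0.4
mrr_b = mean_reciprocal_rank(always_second)   # 8 * 0.5 / 10 = 0.4

top1_a = top1_hit_rate(all_or_nothing)        # 0.4
top1_b = top1_hit_rate(always_second)         # 0.0
```

Both sets report MRR 0.4, yet one retriever is right at rank 1 for 40% of queries and the other never is. In both cases the rank-1 hit rate stays at or below the MRR, and neither set has a typical first hit at rank 2.5. Inspecting the per-query rank histogram alongside the mean separates these cases.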

How to gate on it

For release gates, pair MRR with Recall@K and set explicit thresholds. MRR above 0.6 is a reasonable start for a single-answer retriever; a lookup-style FAQ system should target higher. A drop in MRR with flat Recall@K points at the reranker or the scoring function, not at recall. When the first hit lands too deep, inspect the ranking with retriever debugging and validate the end-to-end pattern against the RAG eval with Ragas recipe. A metric without a threshold is decoration; the threshold is what turns MRR into a gate.
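A gate of this kind can be a few lines in CI. The sketch below is one possible shape, not a prescribed implementation: `check_retrieval_gate` and the metric keys are hypothetical names, the 0.6 MRR floor is the starting point suggested above, and the 0.8 Recall@5 floor is a placeholder assumption to be replaced with a value chosen for your own system.

```python
def check_retrieval_gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per failed threshold; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing from eval run")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below threshold {minimum:.3f}")
    return failures

# MRR floor from the text; the Recall@5 floor is an assumed placeholder.
THRESHOLDS = {"mrr": 0.6, "recall@5": 0.8}

passing_run = {"mrr": 0.71, "recall@5": 0.86}
regressed_run = {"mrr": 0.52, "recall@5": 0.86}  # flat recall, lower MRR

passing_failures = check_retrieval_gate(passing_run, THRESHOLDS)
regressed_failures = check_retrieval_gate(regressed_run, THRESHOLDS)

print(passing_failures)    # []
print(regressed_failures)  # ['mrr: 0.520 is below threshold 0.600']
```

Returning messages instead of a bare boolean lets the CI log say which metric tripped. The second run matches the pattern described above: MRR drops while Recall@5 holds, which points at ranking rather than recall.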
