NDCG: graded-relevance ranking for RAG retrieval

NDCG rewards putting the most relevant chunks highest, normalized to [0,1] so scores compare across queries.

NDCG (Normalized Discounted Cumulative Gain) is a ranking metric for graded relevance that rewards placing the most relevant chunks highest, normalized to [0,1] so scores are comparable across queries. It is the metric to reach for when relevance is a scale, not a yes/no, and when the order chunks arrive in changes what the generator produces.

What NDCG measures

NDCG answers a question the binary metrics cannot: given that some chunks are more relevant than others, did the retriever put the strongest evidence at the top? Recall@K treats every relevant chunk as equal and ignores order. MRR looks only at the first hit. NDCG credits every relevant chunk, weights each by a graded relevance label, and discounts the credit the further down the ranking the chunk sits ^[1]. The discount is the whole point. A relevant chunk at rank 1 helps the generator more than the same chunk at rank 8, and the metric encodes that.

The formula

Two pieces. Discounted Cumulative Gain accumulates graded relevance with a positional discount:

$\text{DCG@K} = \sum_{i=1}^{K} \frac{\text{rel}_i}{\log_2(i+1)}$

where $\text{rel}_i$ is the graded relevance of the chunk at rank $i$. The $\log_2(i+1)$ term is the discount: it grows slowly, so the penalty for a low rank is smooth rather than a cliff ^[1]. Normalize by the ideal DCG, the DCG of the query's labels sorted into their best possible order:

$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$

IDCG@K is the DCG of the perfect ranking ^[2], built from the K highest grades among all labeled chunks for the query, retrieved or not. Dividing by it maps every query onto [0,1]: 1.0 means the retriever produced the ideal order, and the normalization is what makes a hard query and an easy query averageable ^[1].
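To see why the normalization matters for averaging, here is a minimal sketch with two hypothetical queries that are both ranked perfectly. The helper names `dcg` and `ndcg` and the grade lists are illustrative assumptions, and the sketch assumes every labeled chunk appears in the ranked list, so the ideal order is just the same grades sorted.

```python
import math

def dcg(grades_in_rank_order: list[float]) -> float:
    # Ranks are 1-based, so the discount at each position is log2(rank + 1).
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades_in_rank_order, start=1))

def ndcg(grades_in_rank_order: list[float]) -> float:
    # Assumption: all labeled chunks are in the list, so the ideal ranking
    # is the same grades sorted in descending order.
    ideal = dcg(sorted(grades_in_rank_order, reverse=True))
    return dcg(grades_in_rank_order) / ideal if ideal else 0.0

# Two hypothetical queries, both already in their best possible order.
rich_query = [3, 3, 2, 1]    # several strongly relevant chunks
sparse_query = [1, 0, 0, 0]  # a single weakly relevant chunk

raw = [dcg(rich_query), dcg(sparse_query)]
normalized = [ndcg(rich_query), ndcg(sparse_query)]
mean_ndcg = sum(normalized) / len(normalized)

print(f"raw DCG: {raw[0]:.3f} vs {raw[1]:.3f}")
print(f"NDCG: {normalized[0]:.3f} vs {normalized[1]:.3f}, mean {mean_ndcg:.3f}")
```

Raw DCG differs by a factor of about six between the two queries even though both rankings are perfect. After dividing by IDCG both score 1.0, so the mean is meaningful.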

The discount factors

The discount depends only on rank, so it is worth internalizing. Because the argument is $i+1$ rather than $i$, rank 1 gets $\log_2(2) = 1$ and full credit, while rank 2 is already discounted: it keeps 63 percent of its gain.

Rank $i$	$\log_2(i+1)$	Weight $1/\log_2(i+1)$
1	1.000	1.000
2	1.585	0.631
3	2.000	0.500
4	2.322	0.431
5	2.585	0.387
6	2.807	0.356
7	3.000	0.333
8	3.170	0.315
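The table can be regenerated for any depth with a few lines. This is a small sketch; `discount_weight` is a hypothetical helper name, not part of any library.

```python
import math

def discount_weight(rank: int) -> float:
    # Weight applied to the gain at a 1-based rank: 1 / log2(rank + 1).
    return 1.0 / math.log2(rank + 1)

# Print rank, log2(rank + 1), and the resulting weight for ranks 1 through 8.
for rank in range(1, 9):
    print(f"{rank}\t{math.log2(rank + 1):.3f}\t{discount_weight(rank):.3f}")
```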

A worked example

Take a retrieval of 5 chunks with graded relevance labels [3, 2, 3, 0, 1] at ranks 1 through 5. Compute DCG@5 term by term:

$\text{DCG@5} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{3}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6}$

Substituting the discounts:

$3 / 1.000 = 3.000$
$2 / 1.585 = 1.262$
$3 / 2.000 = 1.500$
$0 / 2.322 = 0.000$
$1 / 2.585 = 0.387$

Sum: $\text{DCG@5} = 6.149$.

Now the ideal ranking. Assuming these five chunks are the only graded chunks for the query, sort the same labels descending: [3, 3, 2, 1, 0]. Compute IDCG@5:

$3 / 1.000 = 3.000$
$3 / 1.585 = 1.893$
$2 / 2.000 = 1.000$
$1 / 2.322 = 0.431$
$0 / 2.585 = 0.000$

Sum: $\text{IDCG@5} \approx 6.323$ (the unrounded sum is 6.32347; adding the rounded terms above gives 6.324).

Divide:

$\text{NDCG@5} = \frac{6.149}{6.323} = 0.972$

The retriever scores 0.972. The gap from 1.0 comes from two out-of-order pairs. The grade-2 chunk at rank 2 sits where a grade-3 chunk belongs and the grade-3 chunk at rank 3 sits where the grade-2 chunk belongs, which costs about 0.131 of DCG. The grade-1 chunk at rank 5 sits behind an irrelevant chunk at rank 4, which costs about 0.044 more. Each misplacement costs the metric a small, principled amount, and the one nearer the top costs more.
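To check where the lost score comes from, the per-rank terms of the actual and ideal rankings can be compared directly. This is an illustrative sketch; `terms` is a hypothetical helper, and the ideal ranking assumes these five chunks are the only labeled ones.

```python
import math

ranked = [3, 2, 3, 0, 1]             # grades in retrieved order
ideal = sorted(ranked, reverse=True)  # [3, 3, 2, 1, 0]

def terms(grades: list[int]) -> list[float]:
    # Discounted gain contributed by each 1-based rank.
    return [g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1)]

actual_terms = terms(ranked)
ideal_terms = terms(ideal)
# Positive gap: the ideal ranking earns more at this rank than the actual one.
gaps = [i - a for a, i in zip(actual_terms, ideal_terms)]

dcg, idcg = sum(actual_terms), sum(ideal_terms)
swap_2_3 = gaps[1] + gaps[2]  # net cost of the rank-2 / rank-3 swap
swap_4_5 = gaps[3] + gaps[4]  # net cost of the rank-4 / rank-5 swap

print(f"DCG {dcg:.4f}, IDCG {idcg:.4f}, NDCG {dcg / idcg:.3f}")
print(f"lost at ranks 2-3: {swap_2_3:.3f}, lost at ranks 4-5: {swap_4_5:.3f}")
```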

The code

Reuse ndcg_at_k() from the retrieval metrics hub. It takes graded labels as a dict from chunk id to grade and computes DCG and IDCG in a single call. Note the implementation indexes from 0, so the discount is log2(i + 2), which is $\log_2(\text{rank}+1)$ once you account for the offset ^[2].

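# retrieval_metrics.py
import math

def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    """NDCG@K for one query; `grades` maps chunk id to graded relevance."""
    # DCG: unlabeled chunks count as grade 0; enumerate() is 0-based, hence i + 2.
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    # IDCG: the K highest grades among all labeled chunks, in descending order.
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    # No relevant labels means IDCG is 0; return 0.0 rather than divide by zero.
    return dcg / ideal if ideal else 0.0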

Feeding the worked example through it returns 0.972, matching the arithmetic above.
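A usage sketch, repeating the function so the snippet runs on its own. The chunk ids `c1` through `c6` are hypothetical. The second call adds a labeled chunk that the retriever missed, to show that IDCG is built from all labels for the query, so a missed relevant chunk lowers the score even though the retrieved order is unchanged.

```python
import math

def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0

# The worked example: grades [3, 2, 3, 0, 1] at ranks 1 through 5.
retrieved = ["c1", "c2", "c3", "c4", "c5"]
grades = {"c1": 3, "c2": 2, "c3": 3, "c4": 0, "c5": 1}
score = ndcg_at_k(retrieved, grades, k=5)

# Same retrieval, but the labels now include a grade-3 chunk that was missed.
grades_with_miss = {**grades, "c6": 3}
score_with_miss = ndcg_at_k(retrieved, grades_with_miss, k=5)

print(f"NDCG@5 with all labeled chunks retrieved: {score:.3f}")
print(f"NDCG@5 with one grade-3 chunk missed: {score_with_miss:.3f}")
```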

When is NDCG worth the labeling cost?

Graded relevance labels cost three to five times as much to produce as binary labels: a rater deciding "relevant or not" is faster and more consistent than one deciding "grade 0, 1, 2, or 3" ^[3]. Spend that budget in two situations. First, reranker evaluation. A reranker's entire job is ordering, so a metric blind to order (Recall@K) cannot see its effect, and MRR only sees the top slot. NDCG is the standard reranker metric in the IR literature for exactly this reason ^[1]. Second, when chunk order strongly affects the generator: long-context models weight earlier chunks more, and a generator that anchors on the first chunk needs the strongest evidence there. If order does not change the output, the labeling premium buys nothing over Recall@K.
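The reranker case can be sketched with one hypothetical query. The chunk ids, grades, and the rule that any grade above 0 counts as relevant for the binary metrics are all assumptions made for illustration. The reranker only reorders the same four chunks, so Recall@4 and the reciprocal rank do not move while NDCG@4 does.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant chunk, or 0.0 if none is retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0

grades = {"a": 1, "b": 3, "c": 0, "d": 2}
# Assumption: binary relevance is derived as grade > 0.
relevant = {cid for cid, g in grades.items() if g > 0}

before = ["a", "b", "c", "d"]  # first-stage order: grades 1, 3, 0, 2
after = ["b", "d", "a", "c"]   # reranked order:    grades 3, 2, 1, 0

for name, run in (("before", before), ("after", after)):
    print(name,
          f"recall@4={recall_at_k(run, relevant, 4):.3f}",
          f"rr={reciprocal_rank(run, relevant):.3f}",
          f"ndcg@4={ndcg_at_k(run, grades, 4):.3f}")
```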

TIP

Start with Recall@K and MRR on binary labels. They cover most retrieval regressions and cost nothing extra to produce. Graduate to NDCG when you can afford graded labels and chunk order measurably moves the generation. BEIR uses NDCG@10 as its headline metric, and several of its datasets ship graded qrels ^[3].

The misstep: NDCG on binary labels

With only binary labels and a small result set, NDCG loses most of its value and adds complexity for no signal. There are two failure modes. With binary grades every relevant chunk has the same gain, so DCG reduces to a discounted count of hits and the graded-relevance advantage disappears. With a tiny K, say K=3, there are few positions to reorder, so IDCG normalization has little room to distinguish rankings and NDCG behaves much like MRR. In both cases you pay for a more complex metric and a harder-to-read number without learning anything Recall@K or MRR would not have told you ^[2]. The Ragas retrieval metrics make the same point from the other direction: context precision and recall use LLM-judged relevance rather than hand-graded labels, trading label cost for judge cost ^[4] ^[5].
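A small sketch of the K=3 binary case, assuming a single relevant chunk per query. The helper names are hypothetical. With one hit, NDCG collapses to $1/\log_2(\text{rank}+1)$ and the reciprocal rank is $1/\text{rank}$, so both are functions of the same rank and order the outcomes identically.

```python
import math

def ndcg_binary(hits: list[int]) -> float:
    # hits[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
    ideal = sum(h / math.log2(rank + 1)
                for rank, h in enumerate(sorted(hits, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

def reciprocal_rank(hits: list[int]) -> float:
    for rank, h in enumerate(hits, start=1):
        if h:
            return 1.0 / rank
    return 0.0

# The three possible positions of a single relevant chunk at K=3.
patterns = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rows = [(p, ndcg_binary(p), reciprocal_rank(p)) for p in patterns]

for pattern, ndcg, rr in rows:
    print(pattern, f"ndcg@3={ndcg:.3f}", f"rr={rr:.3f}")
```

The numbers differ, but the two metrics rank the three outcomes the same way, so NDCG adds no information here that the reciprocal rank does not already carry.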

NDCG against the other retrieval metrics

Metric	Labels	Sees order	Sees multiple hits	Best for
Recall@K	binary	no	yes	did the relevant chunk make the top K
Precision@K	binary	no	yes	how noisy is the top K
MRR	binary	first hit only	no	is the top chunk right
NDCG@K	graded	yes	yes	reranker quality, order-sensitive generation

Run the binary metrics on every retrieval run. Add NDCG when you have graded qrels and a reranker or an order-sensitive generator to evaluate. For the end-to-end setup that pairs these with Ragas judges, see the cookbook recipe RAG eval with Ragas.
