NDCG (Normalized Discounted Cumulative Gain) is a ranking metric for graded relevance that rewards placing the most relevant chunks highest, normalized to [0,1] so scores are comparable across queries. It is the metric to reach for when relevance is a scale, not a yes/no, and when the order chunks arrive in changes what the generator produces.
What NDCG measures
NDCG answers a question the binary metrics cannot: given that some chunks are more relevant than others, did the retriever put the strongest evidence at the top? Recall@K treats every relevant chunk as equal and ignores order. MRR looks only at the first hit. NDCG credits every relevant chunk, weights each by a graded relevance label, and discounts the credit the further down the ranking the chunk sits [1]. The discount is the whole point. A relevant chunk at rank 1 helps the generator more than the same chunk at rank 8, and the metric encodes that.
The metric has two pieces. Discounted Cumulative Gain accumulates graded relevance with a positional discount:
$$\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{\mathrm{rel}_i}{\log_2(i+1)}$$
where rel_i is the graded relevance of the chunk at rank i. The log2(i+1) term is the discount: it grows slowly, so the penalty for a low rank is smooth rather than a cliff [1]. Normalize by the ideal DCG, the DCG of the same labels sorted into their best possible order:
$$\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$
IDCG@K is the DCG of the perfect ranking [2]. Dividing by it maps every query onto [0,1]: 1.0 means the retriever produced the ideal order, and the normalization is what makes a hard query and an easy query averageable [1].
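To make the normalization argument concrete, here is a minimal sketch with two hypothetical queries that are both ranked perfectly. The helper names `dcg` and `ndcg` are illustrative, not part of any library, and the sketch assumes every labeled chunk appears in the list, so the ideal ranking is just the same grades sorted descending.

```python
import math


def dcg(grades_in_rank_order: list[float]) -> float:
    """DCG of a ranking given as grades listed from rank 1 downward."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades_in_rank_order, start=1))


def ndcg(grades_in_rank_order: list[float]) -> float:
    """DCG divided by the DCG of the same grades sorted descending."""
    ideal = dcg(sorted(grades_in_rank_order, reverse=True))
    return dcg(grades_in_rank_order) / ideal if ideal else 0.0


# Hypothetical queries, both already in their ideal order.
easy_query = [3, 3, 2]  # several strong chunks exist for this query
hard_query = [1, 0, 0]  # only one weakly relevant chunk exists

# Raw DCG differs a lot between the two queries...
print(round(dcg(easy_query), 3), round(dcg(hard_query), 3))
# ...but NDCG is 1.0 for both, so the scores can be averaged.
print(ndcg(easy_query), ndcg(hard_query))
```

Raw DCG would let the query with more high-grade chunks dominate an average; dividing by IDCG removes that dependence on how much relevant material happens to exist.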
The discount factors
The discount depends only on rank, so it is worth internalizing. Rank 1 is not discounted at all, because log2(1+1)=1 and the gain is divided by 1. The discount starts at rank 2, which keeps 63 percent of its gain.
| Rank i | log2(i+1) | Weight 1/log2(i+1) |
|---|---|---|
| 1 | 1.000 | 1.000 |
| 2 | 1.585 | 0.631 |
| 3 | 2.000 | 0.500 |
| 4 | 2.322 | 0.431 |
| 5 | 2.585 | 0.387 |
| 6 | 2.807 | 0.356 |
| 7 | 3.000 | 0.333 |
| 8 | 3.170 | 0.315 |
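The table can be regenerated in a few lines. This is a small sketch; `discount_weight` is an illustrative helper name, not a function from the metrics hub.

```python
import math


def discount_weight(rank: int) -> float:
    """Weight applied to the gain at a 1-based rank: 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1)


# Print rank, divisor, and weight for ranks 1 through 8.
for rank in range(1, 9):
    divisor = math.log2(rank + 1)
    print(f"{rank} | {divisor:.3f} | {discount_weight(rank):.3f}")
```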
A worked example
Take a retrieval of 5 chunks with graded relevance labels [3, 2, 3, 0, 1] at ranks 1 through 5. Compute DCG@5 term by term:
$$\mathrm{DCG@5} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{3}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6}$$
Substituting the discounts:
- 3/1.000=3.000
- 2/1.585=1.262
- 3/2.000=1.500
- 0/2.322=0.000
- 1/2.585=0.387
Sum: DCG@5=6.149.
Now the ideal ranking. Sort the same labels descending: [3, 3, 2, 1, 0]. Compute IDCG@5:
- 3/1.000=3.000
- 3/1.585=1.893
- 2/2.000=1.000
- 1/2.322=0.431
- 0/2.585=0.000
Sum: IDCG@5=6.324 (adding the rounded terms; 6.323 at full precision).
Divide:
$$\mathrm{NDCG@5} = \frac{6.149}{6.324} = 0.972$$
The retriever scores 0.972. The gap from 1.0 comes entirely from the grade-2 chunk sitting at rank 2, where a grade-3 chunk belongs, and the second grade-3 chunk sitting at rank 3, where the grade-2 chunk belongs. Splitting the two grade-3 chunks across ranks 1 and 3 instead of 1 and 2 costs the metric a small, principled amount.
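The arithmetic above can be checked term by term. This is a sketch with an illustrative helper, `dcg_terms`, that works directly on the list of grades. At full precision IDCG@5 comes out near 6.323 rather than 6.324, because the hand calculation adds terms that were already rounded; NDCG@5 rounds to 0.972 either way.

```python
import math

retrieved_grades = [3, 2, 3, 0, 1]  # grades at ranks 1 through 5


def dcg_terms(grades: list[float]) -> list[float]:
    """Per-rank contributions grade / log2(rank + 1), ranks starting at 1."""
    return [g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1)]


dcg_at_5 = sum(dcg_terms(retrieved_grades))
idcg_at_5 = sum(dcg_terms(sorted(retrieved_grades, reverse=True)))
ndcg_at_5 = dcg_at_5 / idcg_at_5

print([round(t, 3) for t in dcg_terms(retrieved_grades)])
print(round(dcg_at_5, 3), round(idcg_at_5, 3), round(ndcg_at_5, 3))
```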
The code
Reuse ndcg_at_k() from the retrieval metrics hub. It takes graded labels as a dict from chunk id to grade and computes DCG and IDCG in one pass. Note the implementation indexes from 0, so the discount is log2(i + 2), which is log2(rank+1) once you account for the offset [2].
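```python
# retrieval_metrics.py
import math


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    # DCG: i is 0-based, so log2(i + 2) equals log2(rank + 1); unlabeled chunks gain 0.
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    # IDCG: the k highest grades among all labeled chunks, in descending order.
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    # A query with no positive grades has IDCG 0; return 0.0 instead of dividing by zero.
    return dcg / ideal if ideal else 0.0
```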
Feeding the worked example through it returns 0.972, matching the arithmetic above.
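A usage sketch, with the function restated so the snippet runs on its own. The chunk ids (`a` through `e`, and the unlabeled `x`) are hypothetical. The second call shows a consequence of computing the ideal from all labeled chunks: if a grade-3 chunk is missing from the results, IDCG still counts it, so the score drops.

```python
import math


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0


# Hypothetical chunk ids carrying the grades from the worked example.
grades = {"a": 3, "b": 2, "c": 3, "d": 0, "e": 1}

worked_example = ndcg_at_k(["a", "b", "c", "d", "e"], grades, k=5)
print(round(worked_example, 3))

# Same ranking, but the grade-3 chunk "c" is replaced by an unlabeled chunk "x".
# "x" contributes zero gain, while the ideal ranking still includes "c".
missing_strong_chunk = ndcg_at_k(["a", "b", "x", "d", "e"], grades, k=5)
print(round(missing_strong_chunk, 3))
```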
When is NDCG worth the labeling cost?
Graded relevance labels cost three to five times as much to produce as binary labels: a rater deciding "relevant or not" is faster and more consistent than one deciding "grade 0, 1, 2, or 3" [3]. Spend that budget in two situations. First, reranker evaluation. A reranker's entire job is ordering, so a metric blind to order (Recall@K) cannot see its effect, and MRR only sees the top slot. NDCG is the standard reranker metric in the IR literature for exactly this reason [1]. Second, when chunk order strongly affects the generator: long-context models weight earlier chunks more, and a generator that anchors on the first chunk needs the strongest evidence there. If order does not change the output, the labeling premium buys nothing over Recall@K.
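The reranker case can be illustrated with a toy comparison. This is a sketch on made-up data: the chunk ids, the grades, and the `recall_at_k` helper are all assumptions for illustration. The reranker only reorders the same five chunks, so Recall@5 cannot move, while NDCG@5 does.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top k."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0


# Hypothetical graded labels; x1 and x2 are unlabeled distractors.
grades = {"c1": 3, "c2": 1, "c3": 2}
relevant = {d for d, g in grades.items() if g > 0}

before_rerank = ["c2", "x1", "c3", "c1", "x2"]  # strongest chunk buried at rank 4
after_rerank = ["c1", "c3", "c2", "x1", "x2"]   # same chunks, best-first order

recall_before = recall_at_k(before_rerank, relevant, 5)
recall_after = recall_at_k(after_rerank, relevant, 5)
ndcg_before = ndcg_at_k(before_rerank, grades, 5)
ndcg_after = ndcg_at_k(after_rerank, grades, 5)

print(recall_before, recall_after)                  # identical: order is invisible
print(round(ndcg_before, 3), round(ndcg_after, 3))  # NDCG rises after reranking
```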
TIP
Start with Recall@K and MRR on binary labels. They cover most retrieval regressions and cost nothing extra to produce. Graduate to NDCG when you can afford graded labels and chunk order measurably moves the generation. BEIR ships graded qrels for its harder datasets precisely so NDCG@10 can be the leaderboard metric [3].
The misstep: NDCG on binary labels
With only binary labels and a small result set, NDCG loses most of its value and adds complexity for no signal. There are two failure modes. With binary grades every relevant chunk has the same gain, so DCG reduces to a discounted count of hits and the graded-relevance advantage disappears. With a tiny K, say K=3, there are few positions to reorder, so IDCG normalization has little room to distinguish rankings and NDCG ends up ordering systems much as MRR would. In both cases you pay for a more complex metric and a harder-to-read number without learning anything Recall@K or MRR would not have told you [2]. The Ragas retrieval metrics make the same point from the other direction: context precision and recall use LLM-judged relevance rather than hand-graded labels, trading label cost for judge cost [4] [5].
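A small sketch of the degenerate case, assuming a single relevant chunk with a binary label and K=3. The ids and the `reciprocal_rank` helper are illustrative. NDCG collapses to 1/log2(rank+1) and the reciprocal rank is 1/rank: different numbers, but both depend only on where the one hit lands, so they order systems the same way.

```python
import math


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    dcg = sum(grades.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(grades.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k, or 0.0 if none."""
    for rank, d in enumerate(retrieved[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


K = 3
grades = {"hit": 1.0}  # hypothetical: one relevant chunk, binary label

rows = []
for position in range(K):
    retrieved = ["miss_a", "miss_b"]
    retrieved.insert(position, "hit")  # place the hit at rank position + 1
    rows.append((position + 1,
                 round(ndcg_at_k(retrieved, grades, K), 3),
                 round(reciprocal_rank(retrieved, {"hit"}, K), 3)))

for rank, ndcg, rr in rows:
    print(rank, ndcg, rr)
```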
NDCG against the other retrieval metrics
| Metric | Labels | Sees order | Sees multiple hits | Best for |
|---|---|---|---|---|
| Recall@K | binary | no | yes | did the relevant chunk make the top K |
| Precision@K | binary | no | yes | how noisy is the top K |
| MRR | binary | first hit only | no | is the top chunk right |
| NDCG@K | graded | yes | yes | reranker quality, order-sensitive generation |
Run the binary metrics on every retrieval run. Add NDCG when you have graded qrels and a reranker or an order-sensitive generator to evaluate. For the end-to-end setup that pairs these with Ragas judges, see the cookbook recipe RAG eval with Ragas.