The number on the spec sheet (200K, 1M, 2M tokens) is the advertised context. The number your application can actually rely on is smaller, often a lot smaller, and depends on the task. The benchmark literature since 2023 has converged on this distinction, with RULER, BABILong, ∞Bench, NoCha, and NoLiMa each probing different failure modes. This page is the map of which benchmark answers which question for a RAG system.
The honest framing
Long-context evaluation is not one task. It is at least four: single-needle retrieval, multi-needle retrieval, multi-hop reasoning across distant chunks, and aggregation or counting across the entire context. Each has a different effective-context profile [1]. Needle-in-a-haystack (NIAH) became the default early benchmark and is now saturated for frontier models on single-needle retrieval [2]. A model that passes NIAH at 200K is not necessarily a model that can reason over 200K. The literature has moved on, and so should your eval.
RULER
RULER (NVIDIA, Hsieh et al., 2024) is the benchmark to run first. It generalizes NIAH into thirteen tasks across retrieval, multi-hop tracing, aggregation, and QA. The headline finding: effective context length, measured as the longest input at which a model's score still clears a fixed threshold (85.6 percent, the score Llama-2-7B reaches at 4K), is far below the advertised context length for most production models [1]. Run RULER on the chunking strategy and prompt you actually deploy.
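A minimal sketch of how an effective-length figure can be read off a sweep of scores, assuming you already have one aggregate score per context length. The function name, the sample scores, and the threshold value are illustrative assumptions, not RULER's own code; pass whatever bar your application needs.

```python
from typing import Optional


def effective_context_length(
    scores_by_length: dict[int, float], threshold: float
) -> Optional[int]:
    """Return the longest evaluated length whose score, and every shorter
    length's score, meets the threshold. Returns None if even the shortest
    length fails."""
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break  # first failure ends the usable range
        effective = length
    return effective


# Hypothetical sweep: these numbers are made up for illustration.
scores = {
    4_000: 0.95,
    8_000: 0.93,
    16_000: 0.90,
    32_000: 0.86,
    64_000: 0.79,
    128_000: 0.65,
}
print(effective_context_length(scores, threshold=0.85))  # 32000
```

Stopping at the first failure is a deliberately conservative choice: a model that dips below the bar at 64K and recovers at 128K is still reported as 32K.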
BABILong
BABILong (Kuratov et al., 2024) embeds bAbI-style reasoning tasks inside long PG-19 passages, scaling to millions of tokens. The probes require identifying the few sentences that matter from a much longer passage and reasoning over them. The failure mode exposed: a model can find the relevant sentence and still reason incorrectly because surrounding context distracts the attention pattern [3]. BABILong is the right benchmark when your application stuffs context and then reasons.
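The construction can be approximated on your own data. The sketch below is not BABILong's generator; it is a hypothetical helper (`embed_facts` and its arguments are assumptions) that scatters a few task-relevant sentences through filler sentences while preserving their order, so you can test find-then-reason behavior at the context length you deploy.

```python
import random


def embed_facts(filler: list[str], facts: list[str], seed: int = 0) -> list[str]:
    """Insert each fact at a distinct random gap between filler sentences,
    keeping both the facts and the filler in their original order."""
    if len(facts) > len(filler) + 1:
        raise ValueError("need at least len(facts) - 1 filler sentences")
    rng = random.Random(seed)
    # Gap i sits just before filler[i]; gap len(filler) is the very end.
    slots = sorted(rng.sample(range(len(filler) + 1), len(facts)))
    out: list[str] = []
    next_fact = 0
    for i in range(len(filler) + 1):
        if next_fact < len(facts) and slots[next_fact] == i:
            out.append(facts[next_fact])
            next_fact += 1
        if i < len(filler):
            out.append(filler[i])
    return out


haystack = embed_facts(
    filler=["Filler one.", "Filler two.", "Filler three.", "Filler four."],
    facts=["Mary went to the kitchen.", "Mary picked up the apple."],
    seed=1,
)
print(" ".join(haystack))
```

Fixing the seed keeps the haystack reproducible across model runs, so score differences come from the model rather than from a reshuffled context.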
∞Bench
∞Bench (Zhang et al., 2024) packages twelve tasks at average length above 100K across English and Chinese, mixing retrieval, code understanding, math, and novel QA. The realism is the contribution: alongside its synthetic probes, it includes tasks built on real documents such as novels and code, not just synthetic noise [4]. Use ∞Bench when you want one number that approximates "how well does the model handle realistic long inputs."
NoCha
NoCha (Karpinska et al., 2024) tests true-or-false claims about full-length novels (often 100K-plus tokens), requiring whole-book understanding rather than isolated retrieval. GPT-4o, the best performer in the original release, reached only 55.8 percent accuracy under pair-level scoring, where both claims in a true/false pair must be labeled correctly, on a task that human readers handle reliably [5]. This is the right benchmark when the application asks holistic questions about long documents (legal contracts end-to-end, books, multi-chapter compliance).
NoLiMa
NoLiMa (Modarressi et al., 2025) strips lexical overlap between query and needle. Where NIAH and earlier benchmarks let the model use direct token matches, NoLiMa forces a one-hop semantic association. Of the 13 models tested, all of which claim to support at least 128K tokens, 11 dropped below half of their short-context baseline at just 32K [6]. This is the benchmark that exposes whether your application leans on lexical match or on actual semantic understanding. The fix is usually better embeddings or hybrid lexical-plus-dense retrieval, not a model swap.
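To check how much your own eval set leans on lexical match, one rough approach is to measure token overlap between each query and its gold chunk and keep only the low-overlap pairs. This is a sketch, not NoLiMa's methodology: the field names `query` and `gold_chunk_text`, the 0.2 cutoff, and the simple regex tokenizer (no stemming, no stopword removal) are all assumptions.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; deliberately crude."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(query: str, chunk: str) -> float:
    """Fraction of distinct query tokens that also appear in the chunk."""
    q = token_set(query)
    if not q:
        return 0.0
    return len(q & token_set(chunk)) / len(q)


def low_overlap_subset(eval_set: list[dict], max_overlap: float = 0.2) -> list[dict]:
    """Keep examples whose query shares few tokens with the gold chunk."""
    return [
        ex
        for ex in eval_set
        if lexical_overlap(ex["query"], ex["gold_chunk_text"]) <= max_overlap
    ]


examples = [
    {
        "query": "Which character has been to Dresden?",
        "gold_chunk_text": "Yuki lives next to the Semper Opera House.",
    },
    {
        "query": "Who lives next to the Semper Opera House?",
        "gold_chunk_text": "Yuki lives next to the Semper Opera House.",
    },
]
print(len(low_overlap_subset(examples)))  # 1
```

Only the first example survives the filter: answering it requires knowing that the Semper Opera House is in Dresden, which no token match supplies.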
A test plan
The practical battery: run RULER at the deployed context length; sample a hundred production queries with labeled gold chunks; score Recall@K and MRR at production K; construct a position-stratified subset (gold at depth 0-10%, 40-60%, 90-100%) to check the lost-in-the-middle dip [7]; build a paraphrased-queries subset that strips lexical overlap with the gold chunk.
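# position_strata.py
def stratify_by_depth(eval_set: list[dict], context_tokens: int) -> dict[str, list[dict]]:
    """Bucket examples by the relative depth of the gold chunk in the context."""
    bands = {"early": [], "middle": [], "late": []}
    for ex in eval_set:
        # Relative position of the gold chunk: 0.0 is the start, 1.0 the end.
        depth = ex["gold_chunk_offset"] / context_tokens
        if depth <= 0.10:
            bands["early"].append(ex)
        elif 0.40 <= depth <= 0.60:
            bands["middle"].append(ex)
        elif depth >= 0.90:
            bands["late"].append(ex)
        # Examples that fall between the bands are left out on purpose,
        # so the three strata stay clearly separated.
    return bands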
A clean stratification turns lost-in-the-middle into a number. An early-band score of 0.92 against a middle-band score of 0.74, for example, reveals a position bias that no headline metric will surface.
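Per-band numbers like these can be produced by averaging any per-example metric over each band. The sketch below uses Recall@K and reciprocal rank, assuming each example carries a ranked list `retrieved_ids` and a single `gold_chunk_id`; both field names, and the `mean_by_band` helper, are hypothetical.

```python
from typing import Callable, Optional


def recall_at_k(ex: dict, k: int) -> float:
    """1.0 if the single gold chunk appears in the top k results, else 0.0."""
    return 1.0 if ex["gold_chunk_id"] in ex["retrieved_ids"][:k] else 0.0


def reciprocal_rank(ex: dict) -> float:
    """1 / rank of the gold chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ex["retrieved_ids"], start=1):
        if chunk_id == ex["gold_chunk_id"]:
            return 1.0 / rank
    return 0.0


def mean_by_band(
    bands: dict[str, list[dict]], score_fn: Callable[[dict], float]
) -> dict[str, Optional[float]]:
    """Average score_fn over each band; empty bands map to None."""
    return {
        name: (sum(score_fn(ex) for ex in exs) / len(exs)) if exs else None
        for name, exs in bands.items()
    }


# Tiny hand-made bands, shaped like the output of stratify_by_depth.
bands = {
    "early": [
        {"retrieved_ids": ["a", "b", "c"], "gold_chunk_id": "a"},
        {"retrieved_ids": ["x", "b"], "gold_chunk_id": "b"},
    ],
    "middle": [{"retrieved_ids": ["x", "y", "z"], "gold_chunk_id": "q"}],
    "late": [],
}
print(mean_by_band(bands, lambda ex: recall_at_k(ex, k=1)))
print(mean_by_band(bands, reciprocal_rank))  # MRR per band
```

Returning None for an empty band, rather than 0.0, keeps "no data" distinguishable from "always wrong". The same aggregator works with an answer-correctness flag if you are scoring generation rather than retrieval.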
Stuff vs retrieve
The benchmarks settle this for most cases: retrieve. The effective window over which a model preserves multi-hop accuracy is much shorter than the advertised window. Stuff context only when the application requires whole-document reasoning that retrieval cannot decompose [8]. Cost compounds the answer: a stuffed 200K prompt carries 50 times the input tokens of a retrieved 4K prompt, so it costs roughly 50-100x as much at frontier rates, with the upper end applying where a provider charges a premium for long prompts.
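The cost comparison is simple arithmetic. The sketch below assumes input cost scales linearly with prompt tokens; `long_context_multiplier` is a hypothetical knob for providers that charge a higher per-token rate on long prompts, and output-token cost is ignored.

```python
def prompt_cost_ratio(
    stuffed_tokens: int,
    retrieved_tokens: int,
    long_context_multiplier: float = 1.0,
) -> float:
    """Input-cost ratio of a stuffed prompt to a retrieved prompt.

    The per-token price cancels out, so only the token counts and any
    long-prompt surcharge matter.
    """
    return (stuffed_tokens / retrieved_tokens) * long_context_multiplier


print(prompt_cost_ratio(200_000, 4_000))  # 50.0
print(prompt_cost_ratio(200_000, 4_000, long_context_multiplier=2.0))  # 100.0
```

Multiply the ratio by your daily query volume to see what stuffing would cost relative to the retrieval pipeline.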
TIP
Run RULER at the actual context length you deploy, not at the model's headline max. The advertised 1M context is informational; the 32K, 64K, or 128K you push in production is the number that needs the benchmark.