What "good" looks like
Vanilla needle-in-a-haystack is saturated. Every frontier model passes it; the score tells you almost nothing about whether the model can actually use long context for real work[1]. The primary metric is RULER scores at your actual production context length, paired with answer faithfulness over the retrieved passages[2]. RULER extends NIAH with multi-needle retrieval, multi-hop tracing, value aggregation, and frequent-word extraction; the result is a credible probe of long-context behavior that NIAH does not catch. The Hsieh et al. paper reports that only half of 17 models maintained satisfactory performance at 32K despite all of them passing NIAH; the gap is the eval. Pair RULER with RAGAS faithfulness, context precision, and context recall on a domain set[3].
The "good" target is RULER scores that do not degrade past your production context length and a faithfulness score above 0.9 on a domain-representative set. Numbers below those are real product risk, not benchmark noise.
Secondary metrics
Position bias. RULER reports per-depth scores; the "lost in the middle" curve is the diagnostic[4]. Faithfulness, defined as the absence of fabricated quotes or claims not in the context. Latency and cost per query; long context is expensive and slow. Position-stratified recall: split your eval by where the answer appears in the context (first 10%, middle 80%, last 10%) and report each separately.
Offline eval design
Run RULER at your production context length, not at the model's max. If you ship 32K, evaluate at 32K. Build a 50-query domain test set with span-level gold answers (start and end character offsets in the source) so substring presence is verifiable. For the generation step, use RAGAS faithfulness, context precision, context recall, and response relevancy[5]. The faithfulness metric specifically scores whether claims in the answer are entailed by the retrieved context; it is the single most useful number in this category.
Stratify the test set by question type: extractive (answer is a direct span), abstractive (answer requires synthesis), multi-hop (answer requires two or more spans), and aggregation (answer requires counting or summing). Different failure modes show up in different cells.
Online eval design
Log retrieved chunks and cited spans for every query. Require span citation in the output and verify substring presence; a citation that does not appear in the retrieved context is a fabrication. Alert on citation-miss rate above 2% and on faithfulness drops on a continuously sampled subset. Stratify online metrics by query length and corpus size; both shift the cost-quality frontier.
Top benchmarks
- RULER, https://arxiv.org/abs/2404.06654. 13 tasks beyond NIAH; the current standard for long-context probing.
- Lost in the Middle, https://arxiv.org/abs/2307.03172. The position-bias canon; not a benchmark you run so much as a diagnostic frame.
- LongBench v2, used selectively when you need multi-document reasoning beyond what RULER probes; 503 multi-choice questions, 8K to 2M words.
The NVIDIA RULER repo for the canonical implementation. Ragas (over 14k stars) for the generation-side metrics; the faithfulness metric is the single most-used scorer in this category[5]. Langfuse or LangSmith for production trace logging with chunk-level capture. Inspect AI for sandboxed, reproducible long-context evals across releases.
Common pitfalls
Reporting only vanilla NIAH. Saturated, tells you almost nothing about real long-context behavior[1]. The RULER paper exists to fix exactly this gap[2]. Conflating retrieval recall with answer faithfulness. A query can retrieve the right chunks and still produce a fabricated answer; both must be measured. Skipping position-stratified analysis. The lost-in-the-middle curve is real and product-meaningful; aggregate scores hide it[4]. Evaluating at the model's max context instead of your production context. The relevant number is the one at the length your users actually hit.
Canonical reading
RULER is the canonical paper because it operationalizes the gap between NIAH and useful long-context behavior with 13 task categories[2]. The blunt line from the abstract is the one to remember: "almost all models exhibit large performance drops as context length increases" despite near-perfect NIAH scores. Read it with Lost in the Middle[4] for the position-bias framing, and the RAGAS paper for the faithfulness scoring methodology[3].
Day-1 starter
- Build a 50-query test set over your real corpus with span-level gold answers (start and end offsets). This is the artifact that makes everything else credible.
- Run NIAH-multi (3 needles), multi-hop, and aggregation per RULER at your production context length, not the model's max. Report per-depth scores so the position bias is visible.
- Wire RAGAS faithfulness and context-precision to a CI job. Fail the build on a faithfulness drop greater than 5 points relative to the prior release.