Long-context RAG

Answer over very long contexts: codebases, contract bundles, transcripts. RULER plus answer faithfulness, not vanilla needle-in-a-haystack.

What "good" looks like

A long-context RAG eval measures two things: whether the model actually uses the long context (RULER at your production length), and whether the answer is entailed by the retrieved passages (faithfulness). Vanilla needle-in-a-haystack no longer measures either.

The ship targets on a domain set:

RULER scores that do not degrade past your production context length^[1].
Faithfulness above 0.9 on a domain-representative set^[2].

Numbers below those are real product risk, not benchmark noise.

Why not vanilla NIAH. It is saturated. Every frontier model passes it, and the score tells you almost nothing about whether the model can actually use long context for real work^[3].

What RULER adds. It extends NIAH with multi-needle retrieval, multi-hop tracing, value aggregation, and frequent-word extraction. The result is a credible probe of long-context behavior that NIAH does not catch^[1].

The gap is the eval. The Hsieh et al. paper reports that only half of 17 models maintained satisfactory performance at 32K despite all of them passing NIAH. Pair RULER with RAGAS faithfulness, context precision, and context recall on a domain set^[2].

flowchart TD
    Q["Query over long context"] --> RET["Retrieve passages"]
    RET --> RULER["RULER at production context length"]
    RET --> FAITH["RAGAS faithfulness, context precision, context recall"]
    RULER --> DEPTH["Per-depth scores: first 10%, middle 80%, last 10%"]
    DEPTH --> GATE{"RULER stable past your length and faithfulness above 0.9?"}
    FAITH --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix the weaker axis"]

Secondary metrics

Metric	What it catches	How to score
Position bias	The "lost in the middle" curve^[4]	RULER per-depth scores as the diagnostic
Faithfulness	Fabricated quotes or claims not in the context	Claims in the answer entailed by the retrieved context
Latency and cost per query	Long context is expensive and slow	Measured per query
Position-stratified recall	Where the answer sits changes retrieval quality	Split by first 10%, middle 80%, last 10% and report each separately

Offline eval design

Evaluate at your production length, not the model's max. Run RULER at your production context length. If you ship 32K, evaluate at 32K.

Build a span-verifiable domain set. Build a 50-query domain test set with span-level gold answers (start and end character offsets in the source) so substring presence is verifiable.

Score the generation step with RAGAS. Use RAGAS faithfulness, context precision, context recall, and response relevancy^[5]. The faithfulness metric specifically scores whether claims in the answer are entailed by the retrieved context; it is the single most useful number in this category.

Stratify by question type. Different failure modes show up in different cells:

Extractive: answer is a direct span.
Abstractive: answer requires synthesis.
Multi-hop: answer requires two or more spans.
Aggregation: answer requires counting or summing.

Online eval design

Log the retrieval evidence. Log retrieved chunks and cited spans for every query. Require span citation in the output and verify substring presence; a citation that does not appear in the retrieved context is a fabrication.

Alert on drift. Alert on citation-miss rate above 2% and on faithfulness drops on a continuously sampled subset.

Stratify online too. Stratify online metrics by query length and corpus size; both shift the cost-quality frontier.

Top benchmarks

RULER, https://arxiv.org/abs/2404.06654. 13 tasks beyond NIAH; the current standard for long-context probing.
Lost in the Middle, https://arxiv.org/abs/2307.03172. The position-bias canon; not a benchmark you run so much as a diagnostic frame.
LongBench v2, used selectively when you need multi-document reasoning beyond what RULER probes; 503 multi-choice questions, 8K to 2M words.

Tools especially suited

RULER (NVIDIA repo): the canonical implementation.
Ragas (over 14k stars): the generation-side metrics; faithfulness is the single most-used scorer in this category^[5].
Langfuse or LangSmith: production trace logging with chunk-level capture.
Inspect AI: sandboxed, reproducible long-context evals across releases.

Common pitfalls

Pitfall	Why it bites	Fix
Reporting only vanilla NIAH	Saturated, tells you almost nothing about real long-context behavior^[3]	Run RULER, which exists to fix exactly this gap^[1]
Conflating retrieval recall with answer faithfulness	A query can retrieve the right chunks and still produce a fabricated answer	Measure both
Skipping position-stratified analysis	Aggregate scores hide the real, product-meaningful lost-in-the-middle curve^[4]	Report per-depth scores
Evaluating at the model's max context	The relevant number is the one at the length your users actually hit	Evaluate at your production context length

Canonical reading

RULER is the canonical paper because it operationalizes the gap between NIAH and useful long-context behavior with 13 task categories^[1]. The blunt line from the abstract is the one to remember: "almost all models exhibit large performance drops as context length increases" despite near-perfect NIAH scores.

Read it with Lost in the Middle^[4] for the position-bias framing, and the RAGAS paper for the faithfulness scoring methodology^[2].

Day-1 starter

Build a 50-query test set over your real corpus with span-level gold answers (start and end offsets). This is the artifact that makes everything else credible.
Run NIAH-multi (3 needles), multi-hop, and aggregation per RULER at your production context length, not the model's max. Report per-depth scores so the position bias is visible.
Wire RAGAS faithfulness and context-precision to a CI job. Fail the build on a faithfulness drop greater than 5 points relative to the prior release.

Cross-cutting links

Long-context RAG and RULER: the chapter-length treatment of this playbook's offline design.
Faithfulness and atomic facts: the generation-side metric in depth.
Golden dataset methodology: how to build and maintain the 50-query span-gold set.
Ragas: the faithfulness and context precision/recall scorers.
Long-context eval with RULER: the runnable day-1 starter at your real context lengths.
RAG eval with Ragas: wiring the faithfulness CI job.