RULER is the right benchmark for long-context evaluation once vanilla Needle-in-a-Haystack (NIAH) saturates. It packages thirteen tasks that go beyond single-needle retrieval (multi-needle, multi-hop reasoning, multi-key value, variable tracking, aggregation) at context lengths up to 1M tokens [1]. Running it on your own model matters because headline NIAH scores are misleading: almost all models post strong NIAH results but degrade sharply on RULER's harder tasks as context length grows [1].
Planned outline
- Pick context lengths. Start at the length you actually ship at (for example 16k, 32k, or 128k). Add the next length up to see where accuracy falls off a cliff.
- Subset the tasks. The full RULER suite is expensive. Pick: NIAH-multi-3, multi-hop, and aggregation as the smoke set.
- Run via the official repo. NVIDIA's RULER ships configs for the open models and a thin OpenAI/Anthropic adapter [2].
- Plot the position-stratified curve. RULER buckets results by depth-percentage of the needle. Lost-in-the-middle shows up as a U-shape on this curve [3].
- Compare against your domain corpus. Replace the synthetic noise with real chunks from your corpus to see whether degradation on your domain matches the synthetic results.
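The position-stratified step above can be sketched without any plotting library. This is a minimal illustration, assuming you already have per-sample records of the form (needle depth in percent, whether the answer was correct); that record shape, the function names `bucket_accuracy` and `middle_dip`, and the sample data are all hypothetical and not taken from the RULER harness.

```python
from collections import defaultdict


def bucket_accuracy(records, n_buckets=5):
    """Group (depth_pct, correct) records into equal-width depth buckets.

    records: iterable of (depth_pct, correct), with depth_pct in [0, 100].
    Returns a list of (bucket_start, bucket_end, accuracy_or_None).
    """
    width = 100 / n_buckets
    hits = defaultdict(int)
    totals = defaultdict(int)
    for depth_pct, correct in records:
        # Clamp so that depth_pct == 100 lands in the last bucket.
        idx = min(int(depth_pct // width), n_buckets - 1)
        totals[idx] += 1
        hits[idx] += int(bool(correct))
    curve = []
    for i in range(n_buckets):
        acc = hits[i] / totals[i] if totals[i] else None
        curve.append((i * width, (i + 1) * width, acc))
    return curve


def middle_dip(curve):
    """Mean accuracy of the two edge buckets minus mean of the interior ones.

    A clearly positive value is the U-shape associated with lost-in-the-middle.
    """
    accs = [acc for _, _, acc in curve if acc is not None]
    if len(accs) < 3:
        return None
    edges = (accs[0] + accs[-1]) / 2
    interior = sum(accs[1:-1]) / len(accs[1:-1])
    return edges - interior


# Made-up records for illustration only, not real benchmark results.
records = [
    (0, True), (10, True), (30, True), (30, False), (50, False),
    (50, False), (70, True), (70, False), (90, True), (100, True),
]
curve = bucket_accuracy(records)
for start, end, acc in curve:
    print(f"{start:5.1f}-{end:5.1f}%: {acc}")
print("middle dip:", middle_dip(curve))
```

The bucket midpoints and accuracies in `curve` can be passed straight to a line plot, one line per context length, which makes the U-shape easy to compare across lengths.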
What this will cover
A 5-task by 4-length-bucket smoke run takes a few hours on API models. The walkthrough will provide the YAML, the run script, and the matplotlib code for the depth-stratified plot. It will also document where the official harness deviates from the paper (it does in a couple of spots).
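Before launching the smoke run, it can help to enumerate the grid and get a rough prompt-token budget. The sketch below assumes placeholder task names, a 256k "next jump" above 128k, and 100 samples per cell; none of these values come from the RULER repo, and the estimate ignores output tokens and any prompt overhead the harness adds.

```python
from itertools import product

# Placeholder names; substitute the task identifiers your harness config uses.
TASKS = ["task_1", "task_2", "task_3", "task_4", "task_5"]
# Assumed length buckets in tokens: three shipping lengths plus one jump up.
LENGTHS = [16_000, 32_000, 128_000, 256_000]
# Assumed sample count per (task, length) cell.
SAMPLES_PER_CELL = 100


def smoke_grid(tasks, lengths):
    """Return every (task, context_length) cell of the smoke run."""
    return list(product(tasks, lengths))


def estimate_prompt_tokens(tasks, lengths, samples_per_cell):
    """Upper-bound prompt tokens, assuming each sample fills its full length."""
    return sum(length * samples_per_cell for _, length in smoke_grid(tasks, lengths))


grid = smoke_grid(TASKS, LENGTHS)
total = estimate_prompt_tokens(TASKS, LENGTHS, SAMPLES_PER_CELL)
print(f"{len(grid)} cells, about {total:,} prompt tokens")
```

Multiplying the total by your provider's per-token input price gives a budget ceiling; the longest bucket dominates the cost, so trimming samples there has the largest effect.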
What this will skip in v1
BABILong (separate benchmark, multi-hop reasoning under noise). NoLiMa (2025 contamination-resistant variant). Multilingual long-context. Production telemetry for context-length drift.
For background, the long-context RAG task type covers the conceptual framing. The original NIAH harness is still useful for sanity checks [4].
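For a quick sanity check in the NIAH style, a single-needle prompt can be built by hand. This is a simplified sketch and not the original harness: it places the needle by sentence count instead of token count, and the function names, prompt template, and needle text are all hypothetical.

```python
def build_niah_prompt(filler_sentences, needle, depth_pct, question):
    """Insert `needle` at roughly `depth_pct` percent of the way through the filler."""
    if not 0 <= depth_pct <= 100:
        raise ValueError("depth_pct must be between 0 and 100")
    # Position is measured in sentences here, which only approximates token depth.
    pos = round(len(filler_sentences) * depth_pct / 100)
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def contains_answer(response, expected):
    """Lenient scoring: case-insensitive substring match on the expected answer."""
    return expected.lower() in response.lower()


filler = [f"Filler sentence {i}." for i in range(10)]
needle = "The magic number is 7481."
prompt = build_niah_prompt(filler, needle, 50, "What is the magic number?")
print(prompt)
```

Sweeping `depth_pct` over a handful of values at each context length gives the same depth axis used for the position-stratified curve, so the sanity check and the RULER results can share one plot.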