Long-context eval with RULER

Trim RULER's 13 synthetic tasks to a six-task smoke suite, run it at lengths from 4K to 128K against your endpoint, and compute effective context length the way the paper does.

Vanilla needle-in-a-haystack is a solved test: current models retrieve one planted sentence from filler at almost any length their API accepts ¹. RULER is what you run when NIAH stops discriminating. It generates thirteen synthetic tasks in four categories (needle variants, multi-hop variable tracing, aggregation, distractor QA) at whatever sequence lengths you configure, and its core finding has aged well: of the 17 models the paper evaluated, all claiming 32K context or more, only half maintained satisfactory performance at 32K ². The number this recipe produces is the effective context length: the longest length at which your model's average stays above 85.6, the bar the paper anchors to Llama2-7B's score at 4K. Read that number against the one on the model card, then size your product's context budget to the measured one.

The 90 minutes are hands-on time: about thirty to install and configure, ten to validate the 4K anchor, twenty to read the report. The long-length inference in between is wall-clock hours on an API model; launch it and walk away.

What you build

Three artifacts, two of them edits inside the official harness:

A six-task smoke config (one task per category, plus the harder needle variants) replacing the 13-task default.
A model stanza for your endpoint, API or local.
ruler_report.py, a standalone script that collapses the harness's per-length summary.csv files into one claimed-vs-effective table.

Step 1: pick lengths and tasks

Lengths first. Use three: a 4K anchor (every healthy config is near-perfect here, so it catches harness bugs), the length you actually ship at, and the next jump up to find the cliff. The harness defaults to six lengths from 4K to 128K; the full matrix is 13 tasks x 6 lengths x 500 samples, and the 128K column alone is roughly 13 x 500 x 128K, or 832M input tokens. The smoke matrix below (6 tasks, 100 samples, three lengths: 4K, 32K, and 128K) lands near 101M input tokens. That arithmetic is the whole reason to trim.
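The budget arithmetic above can be reproduced with a few lines of Python. This is a back-of-envelope sketch, not part of the harness: the helper name input_tokens is made up for illustration, and it assumes every sample is padded to its full target length and ignores generated tokens and retries.

```python
def input_tokens(num_tasks: int, num_samples: int, lengths: list[int]) -> int:
    """Upper bound on prompt tokens, assuming each sample fills its target length."""
    return num_tasks * num_samples * sum(lengths)


# Full run, 128K column only, using the same "128k" rounding as the text.
FULL_128K_COLUMN = input_tokens(13, 500, [128_000])

# Smoke matrix: 6 tasks, 100 samples, the three lengths configured in Step 3.
SMOKE = input_tokens(6, 100, [4_096, 32_768, 131_072])

print(f"full run, 128K column: {FULL_128K_COLUMN / 1e6:.0f}M input tokens")
print(f"smoke matrix         : {SMOKE / 1e6:.0f}M input tokens")
```

Multiply the result by your provider's input price per token to get a cost ceiling before you launch.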

The six tasks, with the reason each earns a slot:

niah_single_2: the vanilla control. If this is not near-perfect at 4K, fix your config, not your model ².
niah_multikey_2, niah_multiquery: retrieval with distractor keys and with several simultaneous queries; the hard siblings of the task everyone passes.
vt (variable tracking): multi-hop tracing through chained variable assignments; the category NIAH never touches.
fwe (frequent words extraction): aggregation; the model has to read the whole context, not search it.
qa_2: HotpotQA questions buried in distractor paragraphs; the closest synthetic stand-in for long-context RAG.

Step 2: install the harness

git clone https://github.com/NVIDIA/RULER && cd RULER
docker pull cphsieh/ruler:0.2.0        # the pinned environment the authors test
cd scripts/data/synthetic/json
python download_paulgraham_essay.py    # the essay haystack
bash download_qa_dataset.sh            # SQuAD + HotpotQA for qa_1 / qa_2

Running without Docker is workable (install docker/requirements.txt into a venv), but the image spares you an unnecessary fight with flash-attn and TensorRT version pins ³.

Step 3: register your model

Model configs live in scripts/config_models.sh as a case block; sequence lengths sit at the top of the same file. For an API model, copy the shipped GPT-4 stanza:

SEQ_LENGTHS=(
    131072
    32768
    4096
)

# inside the MODEL_SELECT case block:
    my-model)
        MODEL_PATH="gpt-4o-2024-08-06"   # the name your endpoint expects
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="o200k_base"      # gpt-4o's tiktoken encoding; the shipped GPT-4 stanza uses cl100k_base
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="${OPENAI_API_KEY}"
        ;;

For a local checkpoint, the framework field switches to a serving backend and the template type must match the model's chat format:

    my-local-model)
        MODEL_PATH="${MODEL_DIR}/llama3.1-8b-Instruct"
        MODEL_TEMPLATE_TYPE="meta-llama3"
        MODEL_FRAMEWORK="vllm"
        ;;

Two details that burn people. First, MODEL_TEMPLATE_TYPE comes from scripts/data/template.py; a wrong template does not error, it just scores near zero, and you will blame the model. Second, generation is greedy by default (TEMPERATURE="0.0"), which is what you want: this is a capability measurement, not a sampling study.

Step 4: trim the task matrix

NUM_SAMPLES=100   # default 500; restore for the full run

synthetic=(
    "niah_single_2"
    "niah_multikey_2"
    "niah_multiquery"
    "vt"
    "fwe"
    "qa_2"
)

One honesty note before you quote results: the paper's 85.6 threshold is defined on the average over all 13 tasks. A six-task average is a screening read. When the smoke run puts a length near the bar, rerun the full 13 tasks at that length before you publish an effective-length claim anywhere that matters.

Step 5: run, then sanity-check the anchor

cd scripts && bash run.sh my-model synthetic   # run from the RULER repo root

The harness writes benchmark_root/my-model/synthetic/<length>/ per length (set ROOT_DIR in run.sh to move it), each with data/ and pred/, and drops a summary.csv with three rows (Tasks, Score, Nulls) into pred/ as each length finishes ³.

Read the 4K directory before letting the 128K run spend your budget. Two gates: niah_single_2 should be near-perfect, since nearly every model aces vanilla single-needle retrieval at short lengths ²; and the Nulls row should be zeros. Nulls count empty predictions, and a nonzero value at 4K means a template mismatch, a truncation bug, or refusals; that is a harness problem to fix, not a model result to report.
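The two gates can be checked mechanically before the long run spends anything. The sketch below is not part of the harness, and it rests on assumptions: the 95.0 cutoff for "near-perfect" is a placeholder, the function names are invented for illustration, and a Nulls cell is assumed to be either a bare count or an "empty/total" string such as 0/100.

```python
from __future__ import annotations

import csv
from pathlib import Path

MIN_CONTROL_SCORE = 95.0  # hypothetical cutoff for "near-perfect"; not from the paper


def null_count(cell: str) -> int:
    """Parse a Nulls cell: a bare count ('0') or an assumed 'empty/total' form ('0/100')."""
    return int(cell.split("/")[0])


def check_anchor(summary: Path, control: str = "niah_single_2") -> list[str]:
    """Return the problems found in one summary.csv; an empty list means both gates pass."""
    with summary.open(newline="") as f:
        rows = {r[0]: r[1:] for r in csv.reader(f) if r}
    scores = dict(zip(rows["Tasks"], rows["Score"]))
    nulls = dict(zip(rows["Tasks"], rows["Nulls"]))

    problems = []
    # Gate 1: the vanilla control task should be near-perfect at the anchor length.
    if control not in scores:
        problems.append(f"{control} missing from summary")
    elif float(scores[control]) < MIN_CONTROL_SCORE:
        problems.append(f"{control} scored {scores[control]}, below {MIN_CONTROL_SCORE}")
    # Gate 2: no task should have empty predictions.
    for task, cell in nulls.items():
        if cell and null_count(cell) > 0:
            problems.append(f"{task} has empty predictions: {cell}")
    return problems
```

Call it as check_anchor(Path("benchmark_root/my-model/synthetic/4096/pred/summary.csv")) and stop the run if the returned list is not empty.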

Step 6: the claimed-vs-effective report

"""Aggregate RULER summary.csv files into a claimed-vs-effective report.

Run after the harness:
    python ruler_report.py benchmark_root/my-model 131072
where the second argument is the context length the model card claims.
"""
from __future__ import annotations

import csv
import sys
from pathlib import Path

THRESHOLD = 85.6  # Llama2-7B at 4K, the paper's effective-length bar


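def read_summary(path: Path) -> dict[str, float]:
    """Map task name to score; summary.csv rows are keyed by their first cell."""
    with path.open(newline="") as f:
        rows = {r[0]: r[1:] for r in csv.reader(f) if r}  # skip blank rows
    return {t: float(s) for t, s in zip(rows["Tasks"], rows["Score"]) if s}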


def main() -> None:
    if len(sys.argv) != 3:
        sys.exit("usage: python ruler_report.py <model_root> <claimed_length>")
    root = Path(sys.argv[1])
    claimed = int(sys.argv[2])
    by_length: dict[int, dict[str, float]] = {}
    for summary in sorted(root.glob("synthetic/*/pred/summary.csv")):
        length = int(summary.parts[-3])
        by_length[length] = read_summary(summary)

    if not by_length:
        sys.exit(f"no summary.csv under {root}/synthetic/*/pred/")

    tasks = sorted({t for scores in by_length.values() for t in scores})
    lengths = sorted(by_length)

    print(f"{'task':<18}" + "".join(f"{length:>9}" for length in lengths))
    for t in tasks:
        cells = "".join(f"{by_length[length].get(t, float('nan')):>9.1f}"
                        for length in lengths)
        print(f"{t:<18}{cells}")

    avgs = {length: sum(s.values()) / len(s) for length, s in by_length.items()}
    print(f"{'avg':<18}" + "".join(f"{avgs[length]:>9.1f}" for length in lengths))

    effective = max(
        (length for length in lengths if avgs[length] >= THRESHOLD), default=0
    )
    print(f"\nclaimed length  : {claimed}")
    print(f"effective length: {effective}  (largest length with avg >= {THRESHOLD})")
    for length in lengths:
        if avgs[length] < THRESHOLD:
            print(f"cliff at {length}: avg {avgs[length]:.1f} drops below the bar")
            break


if __name__ == "__main__":
    main()

Output shape from one smoke run (an open 8B chat model served through vLLM; your numbers will differ):

task                   4096    32768   131072
fwe                    91.2     82.4     29.6
niah_multikey_2        98.0     92.0     61.0
niah_multiquery        97.5     93.5     67.5
niah_single_2          99.6     98.8     94.2
qa_2                   75.0     64.0     41.0
vt                     96.0     88.0     38.0
avg                    92.9     86.5     55.2

claimed length  : 131072
effective length: 32768  (largest length with avg >= 85.6)
cliff at 131072: avg 55.2 drops below the bar
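ruler_report.py takes the largest length whose average clears the bar, even if a shorter length dipped below it, so on a non-monotonic curve the effective line and the cliff line can disagree. If you read "stays above" strictly, a contiguous variant is a small change. The function below is a sketch of that stricter reading, not the paper's or the leaderboard's definition, and the second data set is made up to show the difference.

```python
THRESHOLD = 85.6  # same bar as ruler_report.py


def effective_length_contiguous(avgs: dict[int, float], bar: float = THRESHOLD) -> int:
    """Longest length reached before the first average that falls below the bar."""
    effective = 0
    for length in sorted(avgs):
        if avgs[length] < bar:
            break  # everything past the first miss is ignored
        effective = length
    return effective


smoke = {4096: 92.9, 32768: 86.5, 131072: 55.2}  # averages from the sample output
bumpy = {4096: 92.9, 32768: 84.0, 131072: 86.0}  # invented non-monotonic case

print(effective_length_contiguous(smoke))  # 32768, same as the max-based report
print(effective_length_contiguous(bumpy))  # 4096, where the max-based report says 131072
```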

Step 7: read the table, then act on the cliff

The single-needle row is the trap. At 128K this model still posts 94.2 on niah_single_2 while vt has collapsed to 38.0 and fwe to 29.6. A team that only ran NIAH would call this a 128K model. This per-category spread is the paper's whole argument ².
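To make the spread concrete, the sketch below ranks tasks by how many points they lose between the 4K anchor and 128K. The scores are copied from the sample smoke run; the dictionary layout is only an illustration, not a harness format.

```python
# task -> (score at 4K, score at 128K), copied from the sample smoke run.
scores = {
    "fwe": (91.2, 29.6),
    "niah_multikey_2": (98.0, 61.0),
    "niah_multiquery": (97.5, 67.5),
    "niah_single_2": (99.6, 94.2),
    "qa_2": (75.0, 41.0),
    "vt": (96.0, 38.0),
}

# Points lost from the anchor to the longest length, rounded for display.
drops = {task: round(short - long, 1) for task, (short, long) in scores.items()}

# Largest drop first: aggregation and tracing lead, single-needle retrieval trails.
for task, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<18}{drop:>6.1f}")
```

In this run fwe and vt each lose roughly sixty points while niah_single_2 loses about five, which is the gap a NIAH-only evaluation would never see.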

Effective below claimed is the norm, not a scandal. On the maintained leaderboard, GPT-4-1106-preview measures effective at 64K against a 128K claim, Yi-34B at 32K against 200K, and GLM4-9B at 64K against a 1M claim (fetched as of 2026-06-10) ³. Treat the model card's context window as the input limit, and RULER's number as the budget you may actually spend.

When the cliff sits below the length your product needs, you have four moves, in the order to try them:

Cap the context budget at the effective length: lower retrieval k, truncate documents, summarize history. This is free and usually recovers more quality than it costs.
Pack by position. Models retrieve best from the edges of the window and worst from the middle, so put load-bearing content first or last ⁴.
Split and reduce: chunked map-reduce over the corpus instead of one giant stuffing call.
Change the checkpoint or the serving config. RoPE-scaling settings, quantization, and cache tricks all move this number, which is why you re-run this suite, as the regression test, after any of these changes.
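The first two moves can be sketched together: cap retrieved chunks to the measured budget, then order the survivors so the strongest sit at the edges of the window. Everything here is illustrative. The function names are invented, the input is assumed to be already ranked most-relevant first, and whitespace_tokens is a crude stand-in for your model's real tokenizer.

```python
from __future__ import annotations

from typing import Callable


def cap_to_budget(ranked: list[str], budget: int,
                  count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the highest-ranked chunks that fit within `budget` tokens, in rank order."""
    kept, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # this chunk does not fit; a smaller, lower-ranked one still may
        kept.append(chunk)
        used += cost
    return kept


def pack_edges(ranked: list[str]) -> list[str]:
    """Alternate chunks between front and back so the weakest land mid-window."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def whitespace_tokens(text: str) -> int:
    """Crude token count; swap in the model's tokenizer for real budgets."""
    return len(text.split())


ranked = ["a a a", "b b", "c c c c", "d", "e e"]  # most relevant first
kept = cap_to_budget(ranked, budget=8, count_tokens=whitespace_tokens)
print(kept)              # ['a a a', 'b b', 'd', 'e e']
print(pack_edges(kept))  # ['a a a', 'd', 'e e', 'b b']
```

The best chunk ends up first and the second-best last, with lower-ranked chunks in the middle where retrieval is weakest.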

One more lever: the haystack itself. The smoke run uses Paul Graham essays as filler; scripts/synthetic.yaml exposes type_haystack (repeat, essay, needle) and the needle key/value/quantity knobs per task ³. Swapping in chunks of your own corpus as haystack text tells you whether semantically related distractors degrade retrieval faster than neutral essays do; for RAG products they usually do, and that gap does not show up in anyone's leaderboard.

What this does not do

It does not test reasoning over facts dispersed in genuinely unbounded noise; BABILong is the separate harness for that ⁵. It does not close the surface-match loophole: RULER needles still share wording with their queries, and NoLiMa exists precisely to rebuild the test with no lexical overlap between question and needle ⁶. And it says nothing about multilingual contexts or production drift at long lengths. For the conceptual framing around when long context beats retrieval, see the long-context RAG task type.

TIP

Record the effective length next to the model id in your eval config and re-run this suite on every model or serving-stack bump. It is a property of the checkpoint plus the serving configuration, not of the model family, and it moves between point releases.

What to do next

If your long-context use is conversational rather than document-shaped, the session-state checks in multi-turn conversation eval are the production cousin of vt. To put error bars on the per-length averages before you gate a launch on them, take the per-sample outputs into adding error bars.

1. Greg Kamradt, Needle In A Haystack repository. The original single-needle test this suite supersedes.
2. Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv 2404.06654. 13 tasks, 17 models, the half-fail-at-32K finding, and the 85.6 effective-length bar.
3. NVIDIA, RULER GitHub repository. The harness, the task configs, and the maintained claimed-vs-effective leaderboard.
4. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts." arXiv 2307.03172. The U-shaped position curve behind the pack-by-position advice.
5. Kuratov et al., "BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack." arXiv 2406.10149.
6. Modarressi et al., "NoLiMa: Long-Context Evaluation Beyond Literal Matching." arXiv 2502.05167.