Versioning your eval set

A content-hashed eval set pinned to every result is a contract with your past self.

If you cannot tell which version of the dataset produced last quarter's leaderboard, you do not have an eval program. You have a slideshow. The single piece of discipline that separates teams who can debug a regression from teams who cannot is content-hashing the dataset and pinning every eval result back to that hash. Everything else on this page is plumbing around that idea.

Why versioning is the boring backbone

The eval set changes for three legitimate reasons. New failure modes get added when error analysis surfaces them. Old examples get retired when the underlying behavior changes (product moved a button, policy changed, model now refuses what it used to answer). Mistakes in labels get corrected. Each of those reasons is necessary. Each of them also breaks the comparison between a result from yesterday and a result from today, unless you can point at the exact set of examples each result ran on ^[1].

The mistake is to mutate the canonical file and lose the hash. Three weeks later you compare model A at 0.82 against model B at 0.79 and conclude A is better. But the two ran against subtly different sets, the gap is dataset drift, and you ship the worse model. With a content hash pinned to every result, the mismatch is visible before anyone draws the comparison.

Content hash, not file hash

A content hash is a deterministic function of the dataset's payload, computed in a way that ignores cosmetic differences (whitespace, key order, file path). The same examples in a different order should produce the same hash. The same examples with one label changed should produce a different hash. JSONL plus sorted keys plus sorted lines is enough for most cases.

# hash_dataset.py
import hashlib
import json
from pathlib import Path

def canonical_line(record: dict) -> str:
    # sort_keys ensures field order does not affect the hash.
    return json.dumps(record, sort_keys=True, ensure_ascii=False)

def content_hash(path: Path) -> str:
    records = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    # sort lines so example reorder does not change the hash.
    canonical = "\n".join(sorted(canonical_line(r) for r in records))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

if __name__ == "__main__":
    h = content_hash(Path("evals/data/v1/golden.jsonl"))
    print(f"golden.jsonl@{h}")

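Twelve hex characters of SHA-256 is 48 bits. That is plenty to tell apart the versions a dataset accumulates over its lifetime, because what matters is the number of versions being compared, not the number of examples in each. Keep the full digest if you need cryptographic collision resistance. The git commit SHA is not the right hash here: a commit that only updates a README changes the commit SHA but not the dataset, and two branches that fork from the same commit and relabel the dataset differently share an ancestor SHA while holding different data.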
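The two properties claimed for a content hash (reordering leaves it unchanged, a label change does not) are easy to check directly. The sketch below uses a hypothetical in-memory helper, hash_records, that applies the same canonicalization as content_hash to a list of dicts; the example records are made up for illustration.

```python
import hashlib
import json

def hash_records(records: list[dict]) -> str:
    """Hypothetical in-memory variant of content_hash, for illustration only."""
    lines = sorted(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()[:12]

# Made-up records; the field names are assumptions, not a required schema.
a = {"id": "ex-1", "input": "book a table", "label": "book"}
b = {"id": "ex-2", "input": "maybe tomorrow?", "label": "clarify"}

base = hash_records([a, b])
reordered = hash_records([b, a])  # same examples, different order
rekeyed = hash_records([{"label": "book", "input": "book a table", "id": "ex-1"}, b])
relabeled = hash_records([{**a, "label": "clarify"}, b])  # one label changed

print(base == reordered, base == rekeyed, base == relabeled)
```

The first two comparisons hold because lines and keys are sorted before hashing; the third fails because the payload itself differs.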

The versioning policy

Numbered versions, immutable. The shape is v1/golden.jsonl, v2/golden.jsonl, with the integer in the path. Once a version ships, it does not change. A typo fix is a new version. A label correction is a new version. The previous file stays in the repo, untouched; the new file lives next to it. Old eval results keep pointing at the old version's hash. Production stays on the version it was pinned to. Migration to a new version is an explicit, dated decision.

The naive alternative ("we just mutate the file and bump a version field inside") fails the moment two engineers run different forks. The numbered-file policy makes the bug visible: their PRs touch different files. Code review catches it.

Within a version, examples can be appended and labels amended only while the version is still a draft. Once a version has been labeled and run, changing or adding an example produces a new version. The discipline sounds extreme; cutting a new version takes thirty seconds and pays back the first time you need to reproduce a six-month-old result.
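Cutting a new version can be a one-call operation. The following is a minimal sketch, assuming the evals/data/vN/golden.jsonl layout used on this page; latest_version and start_new_version are hypothetical helper names, not part of any library.

```python
import re
import shutil
from pathlib import Path

VERSION_DIR = re.compile(r"^v(\d+)$")

def latest_version(root: Path) -> int:
    """Return the highest N for which root/vN exists, or 0 if there is none."""
    numbers = [
        int(m.group(1))
        for p in root.iterdir()
        if p.is_dir() and (m := VERSION_DIR.match(p.name))
    ]
    return max(numbers, default=0)

def start_new_version(root: Path, filename: str = "golden.jsonl") -> Path:
    """Create root/v(N+1) and seed it with a copy of the latest version's file.

    The previous version is only read, never written.
    """
    current = latest_version(root)
    new_dir = root / f"v{current + 1}"
    new_dir.mkdir(parents=True, exist_ok=False)  # fails loudly if the directory already exists
    target = new_dir / filename
    if current > 0:
        shutil.copyfile(root / f"v{current}" / filename, target)
    else:
        target.touch()  # first version starts empty
    return target
```

Edits then go into the returned path. Because the old directory is never opened for writing, two engineers who each cut a version end up with a visible conflict on the same new directory rather than a silent overwrite.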

Pinning eval results to a version

Every eval result emits a record that includes the dataset hash. That record is what your leaderboard, dashboard, and PR comment all read from. You should be able to point at any number on any chart and ask "which version of the dataset?" and get an answer in one click.

# run_eval.py
import json
from datetime import datetime, timezone
from pathlib import Path
from hash_dataset import content_hash

def run(dataset_path: Path, system_id: str, judge_id: str) -> dict:
    hash_ = content_hash(dataset_path)
    # Read the file the same way content_hash does (UTF-8, split on newlines only),
    # so the scored records are exactly the hashed records.
    with dataset_path.open("r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        raise ValueError(f"{dataset_path} contains no examples")
    results = [score_one(r, system_id, judge_id) for r in records]
    pass_rate = sum(r["pass"] for r in results) / len(results)
    return {
        "dataset_path": str(dataset_path),
        "dataset_hash": hash_,
        "dataset_size": len(records),
        "system_id": system_id,        # e.g. "prompt-v7+claude-opus-4-7"
        "judge_id": judge_id,          # e.g. "judge-prompt-v3+claude-opus-4-7"
        "pass_rate": pass_rate,
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "per_example": results,
    }

def score_one(record: dict, system_id: str, judge_id: str) -> dict:
    # placeholder: invoke the system, invoke the judge, return {"id": ..., "pass": bool, "rationale": ...}
    raise NotImplementedError

The result row should be small enough to log into a database and large enough to reconstruct what happened. At minimum: the dataset path, the dataset hash, the dataset size, the system identifier, the judge identifier, the headline metric, and the per-example breakdown. The system identifier should itself be a content hash over the prompt, model, retriever config, and tool definitions (the readable labels in the code comments stand in for one); the judge identifier is the same for the judge stack. With those three hashes you have full provenance.
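A system or judge identifier of that kind can be produced the same way as the dataset hash: canonicalize, then hash. This is a sketch under assumptions; stack_hash is a hypothetical helper, and the component names and values are placeholders rather than a prescribed schema.

```python
import hashlib
import json

def stack_hash(components: dict) -> str:
    """Hash a JSON-serializable description of a system or judge stack."""
    canonical = json.dumps(components, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Placeholder values throughout; hash the full prompt text, not its file name.
system = {
    "prompt": "You are a booking assistant.",
    "model": "example-model-id",  # hypothetical identifier
    "retriever": {"top_k": 5, "index": "bookings"},
    "tools": [{"name": "lookup_booking", "parameters": {"booking_id": "string"}}],
}
system_id = stack_hash(system)
print(f"system@{system_id}")
```

Key order does not affect the result, but list order does, so keep tool definitions in a stable order if you want the identifier to stay put across runs.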

TIP

A result without a dataset hash is unfalsifiable. A result without a system hash is unattributable. A result without a judge hash is unreproducible. Reject any leaderboard that does not display all three.
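The same rule can be enforced in code wherever two results are compared. A minimal sketch, assuming result rows shaped like the dict returned by run(); compare_pass_rates and IncomparableResults are hypothetical names.

```python
class IncomparableResults(ValueError):
    """Raised when two results were not produced under the same conditions."""

def compare_pass_rates(a: dict, b: dict) -> float:
    """Return a's pass rate minus b's, refusing if dataset or judge differ."""
    for key in ("dataset_hash", "judge_id"):
        if a[key] != b[key]:
            raise IncomparableResults(f"{key} differs: {a[key]!r} vs {b[key]!r}")
    return a["pass_rate"] - b["pass_rate"]
```

The system identifier is deliberately not checked, since comparing two different systems is the point. The dataset and the judge are the parts that have to be held fixed for the difference to mean anything.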

Lineage: where each example came from

Lineage is the per-example version of the same story. Each example carries a source field that records where it came from and a provenance history of how it got into the current version.

example = {
    "id": "booking-lookup-001",
    "input": "...",
    "expected_output": {...},
    "labels": {...},
    "source": {
        "kind": "prod-trace",
        "trace_id": "8c7a2e6f-...",
        "captured_at": "2026-04-15T14:22:31Z",
        "labeled_by": "@ombharatiya",
        "labeled_at": "2026-04-16",
    },
    "provenance": [
        {"version": "v1", "action": "added", "by": "@ombharatiya", "at": "2026-04-16"},
        {"version": "v2", "action": "label-corrected", "by": "@om",
         "at": "2026-05-02", "note": "switched expected action from book to clarify"},
    ],
}

The lineage record is what lets you answer "why does this example exist?" three months later. It is also what lets you retire examples cleanly: a retired entry with a reason, instead of a silent deletion.
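Retirement can be recorded as one more provenance entry. The sketch below assumes a "retired" action value, which does not appear in the example record and is introduced here for illustration; retire and is_retired are hypothetical helpers.

```python
from copy import deepcopy

def retire(example: dict, version: str, by: str, at: str, reason: str) -> dict:
    """Return a copy of the example with a retirement entry appended to its provenance."""
    updated = deepcopy(example)  # the record from the previous version is left untouched
    updated.setdefault("provenance", []).append(
        {"version": version, "action": "retired", "by": by, "at": at, "note": reason}
    )
    return updated

def is_retired(example: dict) -> bool:
    """True when the most recent provenance entry is a retirement."""
    history = example.get("provenance", [])
    return bool(history) and history[-1]["action"] == "retired"
```

An eval runner following this convention would skip records for which is_retired returns True, while the entry and its reason stay in the file for anyone asking where the example went.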

Migration notes

The first migration is always painful and the rest are routine. When you move from v1 to v2, write a migration note that records the diff: how many examples added, how many retired, how many labels corrected, and a single-paragraph rationale per category. Store the note alongside the dataset. The note is what future-you reads when last quarter's regression turns into this quarter's incident.

evals/data/
  v1/
    golden.jsonl
    MIGRATION.md       <- from-scratch construction notes
  v2/
    golden.jsonl
    MIGRATION.md       <- diff vs v1, retirement reasons
  v3/
    golden.jsonl
    MIGRATION.md
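The counts in a migration note can be derived rather than tallied by hand. This is a rough sketch, assuming every record has a unique "id", that label changes show up in the "expected_output" or "labels" fields, and that a retired example is either absent from the new file or carries a final provenance entry with a hypothetical "retired" action.

```python
import json

LABEL_FIELDS = ("expected_output", "labels")  # fields treated as "the label" (assumption)

def load_by_id(jsonl_text: str) -> dict:
    """Index JSONL records by their "id" field."""
    rows = (json.loads(line) for line in jsonl_text.split("\n") if line.strip())
    return {row["id"]: row for row in rows}

def _is_retired(row: dict) -> bool:
    history = row.get("provenance", [])
    return bool(history) and history[-1].get("action") == "retired"

def migration_summary(old: dict, new: dict) -> dict:
    """List the ids added, retired, and label-corrected between two versions."""
    live_old = {i for i, r in old.items() if not _is_retired(r)}
    live_new = {i for i, r in new.items() if not _is_retired(r)}
    corrected = (
        i for i in live_old & live_new
        if any(old[i].get(f) != new[i].get(f) for f in LABEL_FIELDS)
    )
    return {
        "added": sorted(live_new - live_old),
        "retired": sorted(live_old - live_new),
        "label_corrected": sorted(corrected),
    }
```

The output gives the numbers and the ids; the one-paragraph rationale per category still has to be written by a person.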

Production gets pinned to a specific version in config (DATASET_VERSION=v2). Switching production to v3 is a deliberate cutover, not an implicit consequence of a merge. The flywheel that Shankar describes, where production traces and labeled examples feed each other, only works if the loop is closed by an explicit version bump ^[2].
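One way to keep the cutover deliberate is to make the pin mandatory. The sketch below assumes the version arrives as a DATASET_VERSION environment variable and that datasets live under evals/data; pinned_dataset_path is a hypothetical helper.

```python
import os
from collections.abc import Mapping
from pathlib import Path

def pinned_dataset_path(
    root: Path = Path("evals/data"),
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Resolve the pinned version to a file, refusing to fall back to 'latest'."""
    version = env.get("DATASET_VERSION")
    if not version:
        raise RuntimeError("DATASET_VERSION is not set; refusing to guess a dataset version")
    path = root / version / "golden.jsonl"
    if not path.is_file():
        raise FileNotFoundError(f"pinned dataset does not exist: {path}")
    return path
```

There is no default on purpose: merging v3 into the repo changes nothing for production until someone edits the config.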

Separate datasets vs versions of one

Newcomers often ask whether their adversarial set should be a different version of the main set or a separate dataset entirely. The answer is almost always "separate dataset, same versioning discipline." A regression suite for the booking flow is one dataset, your safety red-team probes are another, your long-context stress test is a third. Each has its own JSONL file, its own version sequence, its own hash, its own pinning. Eval results display the per-dataset scores side by side, and the gate for each dataset can be different.

The temptation to merge them into one giant set is real and you should resist it. Merging hides per-slice signal, makes a change in the dataset hash too coarse to interpret, and forces every change to one slice to invalidate the whole thing. Keep them small, keep them named, keep them hashed.
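Per-dataset gates fall out naturally once each dataset reports its own result row. A minimal sketch; the dataset names, the thresholds, and the "dataset_name" field are all assumptions made for illustration.

```python
# Hypothetical gates: minimum pass rate per dataset.
GATES = {"booking-regression": 0.95, "safety-redteam": 1.0, "long-context": 0.80}

def failing_gates(results: list[dict], gates: dict[str, float]) -> list[str]:
    """Return one message per dataset whose pass rate is below its gate."""
    failures = []
    for r in results:
        threshold = gates[r["dataset_name"]]
        if r["pass_rate"] < threshold:
            failures.append(
                f'{r["dataset_name"]}@{r["dataset_hash"]}: '
                f'{r["pass_rate"]:.2f} < {threshold:.2f}'
            )
    return failures
```

Each message names the dataset and its hash, so a failed gate points at one slice and one version instead of a blended number.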

Tooling support

Vendors will offer to manage versions for you. Langfuse Datasets, Phoenix Datasets, Braintrust Datasets, and LangSmith Datasets each have a notion of dataset versions; some have content hashing built in, some do not ^[3]. Use them as a UI for browsing and annotation. Do not let them become the source of truth: the source of truth is the JSONL in git, with the content-hashing function you wrote in about a dozen lines of Python. The vendor sync is a downstream consumer.

The reason for that asymmetry is portability. Switching judge platforms, observability vendors, or eval frameworks should be a routine operation, not a multi-quarter migration. As long as the dataset and the hash live in your repo, you can rebuild any vendor's view from scratch in a day.

What good looks like

The eval set lives at evals/data/vN/golden.jsonl. A pre-commit hook recomputes the content hash and writes it into vN/HASH. Every CI run that scores the system emits a result row with the dataset hash, the system hash, and the judge hash. The leaderboard displays all three. The PR comment that fails a gate includes the hash and a link to the specific examples that flipped. When the model is upgraded six months later, the upgrade runs against the same hash, and the comparison is honest. That is the contract with your past self.
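The hook itself can be small. The sketch below is one possible reading of it, not a prescribed implementation: it writes vN/HASH the first time a version is seen and, as an added assumption, reports any version whose content no longer matches its recorded hash. content_hash repeats the canonicalization from hash_dataset.py so the block is self-contained; check_versions is a hypothetical name.

```python
import hashlib
import json
from pathlib import Path

def content_hash(path: Path) -> str:
    # Same canonicalization as hash_dataset.py, repeated to keep the sketch self-contained.
    with path.open("r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    canonical = "\n".join(
        sorted(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def check_versions(root: Path) -> list[str]:
    """Write HASH for new versions; report versions whose data no longer matches HASH."""
    problems = []
    for dataset in sorted(root.glob("v*/golden.jsonl")):
        hash_file = dataset.parent / "HASH"
        current = content_hash(dataset)
        if not hash_file.exists():
            hash_file.write_text(current + "\n", encoding="utf-8")
        elif hash_file.read_text(encoding="utf-8").strip() != current:
            problems.append(f"{dataset}: content changed after HASH was written")
    return problems
```

A hook script would call check_versions(Path("evals/data")) and exit non-zero when the returned list is not empty. Reordering lines or keys passes, because the hash ignores both; changing a label in a shipped version does not.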