Datasets & Golden Sets

Construction, versioning, lineage, and dimensional coverage.

The eval dataset is the part of the system that decides what ships. Models change, prompts change, retrievers change; the set decides whether the change was an improvement or a regression. Treat the dataset as a production artifact, not as a notebook scratchpad. Version it, hash it, pin every result back to a specific revision, and read every example yourself at least once.
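As a rough illustration of "version it, hash it, pin every result", here is a minimal sketch in Python. The function names (`dataset_revision`, `pin_result`), the record fields, and the 12-character truncation are assumptions made for this example, not part of any particular tool. It also assumes that example order is irrelevant; if order matters for your set, drop the sort.

```python
import hashlib
import json


def dataset_revision(examples):
    """Return a short content hash for a list of JSON-serializable examples.

    Hypothetical helper: canonicalizes each example so that key order and
    whitespace do not affect the hash, then sorts the serialized examples
    so that row order does not affect it either (an assumption).
    """
    canonical = sorted(
        json.dumps(ex, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for ex in examples
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # delimiter so adjacent examples cannot run together
    return digest.hexdigest()[:12]  # truncated for readability; keep the full digest if preferred


def pin_result(metrics, examples, model_id):
    """Attach the dataset revision to an eval result (illustrative record shape)."""
    return {
        "dataset_revision": dataset_revision(examples),
        "model_id": model_id,
        "metrics": metrics,
    }


if __name__ == "__main__":
    golden = [
        {"id": "ex-001", "input": "reset my password", "expected": "account_recovery"},
        {"id": "ex-002", "input": "cancel my plan", "expected": "billing"},
    ]
    print(pin_result({"accuracy": 0.5}, golden, model_id="candidate-a"))
```

With a scheme like this, two scores are only comparable when their `dataset_revision` values match; editing, adding, or removing a single example yields a new revision.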

Most teams underinvest here and then wonder why their eval scores are noisy and uncorrelated with user complaints. The pages in this section walk through construction (how to bootstrap from real traces and synthetic prompts), versioning and lineage (how to make a dataset reproducible), and the supporting topics around regression suites, dimensional coverage, and storage platforms. For the condensed operational treatment, see golden dataset methodology and contamination detection.

Chapters:

Constructing your eval dataset. The first hundred labeled examples, and how to grow past five hundred without losing signal.
Synthetic vs human-authored examples. Where each shape of data buys signal, where it leaks, and how to mix the two safely.
Versioning your eval set. Content-hash the set and pin every result to a revision; a contract with your past self.
Building regression sets. The set that fails loudly when you reintroduce a bug you already fixed.
Dimensional coverage: features, scenarios, personas. A three-axis grid showing which cells of the failure surface your set actually probes.
Where to keep your eval datasets. Langfuse, Phoenix, Braintrust, LangSmith, or plain JSONL in git, and which parts of that choice actually matter.
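The three-axis grid from the dimensional coverage chapter can be sketched in a few lines of Python. This is an illustration under stated assumptions: each example is assumed to carry `feature`, `scenario`, and `persona` tags (hypothetical field names), and the axis values are supplied by hand. Examples whose tags fall outside the listed axis values are simply not counted.

```python
from collections import Counter
from itertools import product


def coverage_grid(examples, features, scenarios, personas):
    """Count examples in each (feature, scenario, persona) cell of the grid."""
    counts = Counter(
        (ex["feature"], ex["scenario"], ex["persona"]) for ex in examples
    )
    # Enumerate every cell of the grid, including those with no examples.
    return {
        cell: counts.get(cell, 0)
        for cell in product(features, scenarios, personas)
    }


def empty_cells(grid):
    """Return the cells that no example currently probes."""
    return sorted(cell for cell, n in grid.items() if n == 0)


if __name__ == "__main__":
    examples = [
        {"feature": "search", "scenario": "happy_path", "persona": "novice"},
        {"feature": "search", "scenario": "happy_path", "persona": "novice"},
        {"feature": "export", "scenario": "error", "persona": "expert"},
    ]
    grid = coverage_grid(
        examples,
        features=["search", "export"],
        scenarios=["happy_path", "error"],
        personas=["novice", "expert"],
    )
    for cell in empty_cells(grid):
        print("uncovered:", cell)
```

The list of empty cells is a prompt for review rather than a to-do list: some combinations may be impossible or irrelevant for a given product, and only the ones that correspond to real usage need examples.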