Datasets and Golden Sets
The eval dataset is the part of the system that decides what ships. Models change, prompts change, retrievers change; the set decides whether the change was an improvement or a regression. Treat the dataset as a production artifact, not as a notebook scratchpad. Version it, hash it, pin every result back to a specific revision, and read every example yourself at least once.
Most teams underinvest here and then wonder why their eval scores are noisy and uncorrelated with user complaints. The pages in this section walk through construction (how to bootstrap from real traces and synthetic prompts), versioning and lineage (how to make a dataset reproducible), and the supporting topics around regression suites, dimensional coverage, and storage platforms.
Chapters: