A golden dataset is the contract between your past self and your future self. It pins what the system was asked to do, what it answered, what a human said was correct, and which version of the rubric scored it. Every other eval pattern depends on having one.
Features times scenarios times personas
The construction axis is three-dimensional. Features are product capabilities (search, summarize, refund handling). Scenarios are the user's situation (first-time visitor, error recovery, edge-case input). Personas are who the user is (novice, expert, hostile, non-English speaker) 1. The Cartesian product across the three generates the input grid; each cell is one or more eval examples.
For a customer-support assistant: features (refund, status, escalation), scenarios (new customer, returning customer, complaint after refund), personas (calm, frustrated, non-native English, abusive). 3 times 3 times 4 equals 36 cells; one per cell is a minimum starter, three per cell pushes the dataset above 100.
The axis labels are not academic. They are the same labels the team uses when triaging failures ("the system fails on refund-request from frustrated returning customers"), so the fix is targeted.
Default construction pattern: synthesize the inputs against the grid, run them through the real system in the real environment, label the outputs. Synthesizing inputs is fast; labeling outputs is the expensive step.
Why not use production outputs as the gold answer: they are what you are evaluating. A "gold" that is last week's output rewards stasis. The gold comes from a human expert or a verifiable program.
Complementary pattern: seed from production traces. Shankar's flywheel timestamps every example, stores the production input plus the (later, human-corrected) gold answer, and labels the production output 2.
Size and stop rules
| Purpose | Minimum | Practical | Stop rule |
|---|
| Sanity check | 30 | 50 | no obvious failure remains |
| Judge calibration | 100 | 200 | TPR and TNR both above 0.90 on dev |
| Regression set | 50 | 200 | covers every known failure mode |
| Release-gate eval | 200 | 500 | smallest delta of interest detectable |
| Error analysis pool | 100 | 300 | 20 traces with no new failure category |
| Production A/B headline | 500 | 2000+ | smallest delta of interest detectable |
Error-analysis sampling stop rule: theoretical saturation, keep adding until twenty traces in a row produce no new failure category 3. Release-gate stop rule: statistical, Wilson interval width below the smallest delta of business interest. Most prompt or model changes deliver four to seven points; Wilson half-width below five points typically requires N above 200. See statistical methods.
Hamel's "30 to 100 to 500" progression matches: thirty is the seed, one hundred is the calibration target, five hundred is where CIs stop trembling for release-gate decisions 1.
Versioning
A golden dataset is a contract. Contracts have versions.
Store the dataset in git, in JSONL or CSV, with stable IDs. Tag schema versions (dataset_version: "v1.2.0"). Pin every eval result row to the dataset version it scored. Content-hash on every commit; the hash goes into the manifest. An eval run without the hash in its output cannot be reproduced. Treat dataset updates as releases with migration notes 2.
{"id": "refund-001", "feature": "refund", "scenario": "returning", "persona": "frustrated", "input": "...", "gold_label": "PASS", "rubric_version": "v1.2", "dataset_version": "v1.2.0", "created": "2025-12-01"}
A judge calibrated against v1.2 has TPR and TNR scoped to v1.2; v1.3 requires a recalibration if more than five percent of rows changed 4.
Separate sets for separate jobs
The most common mistake is one eval set doing five jobs. The cure is multiple sets with separate update cadences. The calibration set: 100 to 300 human-labeled rows; train/dev/test stratified on PASS/FAIL; updates only when the rubric updates. The regression set: curated rows that capture historical bugs; rows added every time a fix lands; smaller (50 to 200); Hamel's L1 unit tests 5; CI runs this on every PR. The release-gate eval: the headline metric for go or no-go; larger (200 to 500); changes only on dataset version bumps. The error-analysis pool: sampled production traces for open coding; refreshed continuously. Mixing the four is the failure mode.
Synthetic plus human
Pure-synthetic datasets degrade as the generator drifts away from real users. Pure-human is prohibitively expensive at scale. The hybrid wins. SPADE auto-generates assertions from prompt-revision history, deploys across two thousand plus LangSmith pipelines, and ships a 21% drop in false failures relative to a pure-synthetic baseline 6. For dataset construction: synthesize 50 inputs against the grid, expert-reviewed; add ten to twenty real production traces every week.
Dimensional coverage as a metric
Once the grid exists, coverage becomes a first-class metric. Report per-cell pass rate, not just the aggregate. A 95% aggregate masking a 60% rate on (refund, frustrated, non-English) is the failure the aggregate was supposed to surface.
The dashboard is a heatmap. Rows are features, columns are scenarios faceted by persona. Each cell is the pass rate plus N. Cells with N below five are grayed out as undersampled. Cells more than two interval widths below the aggregate are flagged red.
What good looks like
A production-grade golden dataset program ships a git-tracked JSONL or CSV with stable IDs, schema version, and content hash; coverage matrices on every eval run; four separate sets for the four jobs; a weekly add of ten to twenty new production traces with human labels; a migration note on every dataset version bump 7. The dataset construction chapter walks the initial build; the versioning and lineage chapter walks the operational discipline.