Golden dataset methodology

Features times scenarios times personas, synthetic plus production, stop rules, versioning, the case for content-hashed eval sets.

A golden dataset is the contract between your past self and your future self. It pins what the system was asked to do, what it answered, what a human said was correct, and which version of the rubric scored it. Every other eval pattern depends on having one. If you have zero labels today, Evals without a gold dataset covers how the first hundred rows come into existence; this page covers what to do with them.

Features times scenarios times personas

The construction axes start with Hamel's unit-test framing: break the scope of the product into features and scenarios ¹. The evals FAQ generalizes this to query dimensions, and its worked examples (customer mood, prior context) add the third axis ². The grid this site standardizes on is three-dimensional. Features are product capabilities (search, summarize, refund handling). Scenarios are the user's situation (first-time visitor, error recovery, edge-case input). Personas are who the user is (novice, expert, hostile, non-English speaker). The Cartesian product across the three generates the input grid; each cell is one or more eval examples.

For a customer-support assistant: features (refund, status, escalation), scenarios (new customer, returning customer, complaint after refund), personas (calm, frustrated, non-native English, abusive). 3 times 3 times 4 equals 36 cells; one per cell is a minimum starter, three per cell pushes the dataset above 100.
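A minimal sketch of the grid construction, using the axis values from the customer-support example. The function name `build_grid`, the `slot` field, and the placeholder `input` field are illustrative assumptions, not part of any cited method.

```python
from itertools import product

# Axis values copied from the customer-support example above.
FEATURES = ["refund", "status", "escalation"]
SCENARIOS = ["new customer", "returning customer", "complaint after refund"]
PERSONAS = ["calm", "frustrated", "non-native English", "abusive"]


def build_grid(features, scenarios, personas, per_cell=1):
    """Return `per_cell` placeholder rows for every (feature, scenario, persona) cell."""
    rows = []
    for feature, scenario, persona in product(features, scenarios, personas):
        for slot in range(1, per_cell + 1):
            rows.append({
                "feature": feature,
                "scenario": scenario,
                "persona": persona,
                "slot": slot,    # hypothetical field: which example within the cell
                "input": None,   # to be written by hand or synthesized later
            })
    return rows


starter = build_grid(FEATURES, SCENARIOS, PERSONAS, per_cell=1)
fuller = build_grid(FEATURES, SCENARIOS, PERSONAS, per_cell=3)
print(len(starter), len(fuller))  # 36 108
```

Keeping the axis values on each row means the same labels can later drive failure triage and per-cell reporting.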

The axis labels are not academic. They are the same labels the team uses when triaging failures ("the system fails on refund-request from frustrated returning customers"), so the fix is targeted.

Synthetic inputs, real outputs

Default construction pattern: synthesize the inputs against the grid, run them through the real system in the real environment, label the outputs. Synthesizing inputs is fast; labeling outputs is the expensive step. The FAQ's mechanical advice: write the first 20 dimension tuples by hand before letting an LLM scale the generation, so you catch unrealistic combinations early ².

Why not use production outputs as the gold answer: they are what you are evaluating. A "gold" that is last week's output rewards stasis. The gold comes from a human expert or a verifiable program.

Complementary pattern: seed from production traces. In Shankar's flywheel, every example is timestamped and stored with the production input, the labeled production output, and the gold answer that a human corrects later ³.

Size and stop rules

Purpose	Minimum N	Practical N	Stop rule
Sanity check	30	50	no obvious failure remains
Judge calibration	100	200	TPR and TNR both above 0.90 on dev
Regression set	50	200	covers every known failure mode
Release-gate eval	200	500	smallest delta of interest detectable
Error analysis pool	100	300	20 traces with no new failure category
Production A/B headline	500	2000+	smallest delta of interest detectable
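The judge-calibration stop rule in the table compares judge verdicts with human labels on the dev split. A minimal sketch, assuming PASS is treated as the positive class and that "above 0.90" is a strict inequality; the function names are hypothetical.

```python
def tpr_tnr(human, judge, positive="PASS"):
    """True-positive and true-negative rate of judge verdicts against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    pairs = list(zip(human, judge))
    pos = sum(h == positive for h, _ in pairs)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative human label")
    tp = sum(h == positive and j == positive for h, j in pairs)
    tn = sum(h != positive and j != positive for h, j in pairs)
    return tp / pos, tn / neg


def calibration_done(human, judge, threshold=0.90):
    """Stop rule from the table: both rates must exceed the threshold."""
    tpr, tnr = tpr_tnr(human, judge)
    return tpr > threshold and tnr > threshold


# Toy dev split: the judge misses one of ten human PASS rows.
human = ["PASS"] * 10 + ["FAIL"] * 10
judge = ["PASS"] * 9 + ["FAIL"] * 11
print(tpr_tnr(human, judge), calibration_done(human, judge))
```

Reporting the two rates separately matters because a judge that passes everything would still score a perfect TPR.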

Error-analysis sampling stop rule: theoretical saturation. Review at least 100 traces, and stop when roughly 20 in a row produce no new failure category ². Release-gate stop rule: statistical. The half-width of the Wilson interval has to come in under the smallest delta of business interest, and most prompt or model changes deliver four to seven points, so a half-width below five points typically requires N above 200 ⁴ (at 95% confidence that holds for pass rates near 85%; near 50% the requirement is closer to 400). See statistical methods for the machinery.
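Both stop rules can be sketched in a few lines. The half-width function uses the standard Wilson score interval with z = 1.96 (95% confidence); the function names and the boolean-flag representation of "this trace introduced a new failure category" are assumptions made for illustration.

```python
import math


def wilson_half_width(p, n, z=1.96):
    """Half-width of the Wilson score interval for pass rate p on n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))


def release_gate_n_ok(p, n, smallest_delta):
    """Statistical stop rule: the half-width must be below the delta of interest."""
    return wilson_half_width(p, n) < smallest_delta


def saturation_reached(new_category_flags, min_reviewed=100, quiet_run=20):
    """Saturation stop rule for error analysis.

    new_category_flags[i] is True when trace i introduced a failure
    category that had not been seen in any earlier trace.
    """
    if len(new_category_flags) < min_reviewed:
        return False
    return not any(new_category_flags[-quiet_run:])


# Half-width at N = 200 for a high and a middling pass rate.
print(round(wilson_half_width(0.85, 200), 4), round(wilson_half_width(0.50, 200), 4))
```

The half-width depends on the pass rate as well as on N, which is why a set that is large enough at an 85% pass rate can be too small once the pass rate drops toward 50%.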

Versioning

A golden dataset is a contract. Contracts have versions.

Store the dataset in git, in JSONL or CSV, with stable IDs. Tag schema versions (dataset_version: "v1.2.0"). Pin every eval result row to the dataset version it scored. Content-hash on every commit; the hash goes into the manifest. An eval run without the hash in its output cannot be reproduced. Treat dataset updates as releases with migration notes ³.

{"id": "refund-001", "feature": "refund", "scenario": "returning", "persona": "frustrated", "input": "...", "gold_label": "PASS", "rubric_version": "v1.2", "dataset_version": "v1.2.0", "created": "2025-12-01"}
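One way to produce the content hash, sketched under assumptions: rows are canonicalized as key-sorted compact JSON, ordered by `id`, and hashed with SHA-256. The manifest layout and the function names are hypothetical; only the `id`, `gold_label`, and `dataset_version` fields come from the record above.

```python
import hashlib
import json


def canonical_row(row):
    """Serialize a row so that key order and whitespace cannot change the hash."""
    return json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def dataset_hash(rows):
    """SHA-256 over all rows, sorted by stable ID so file order does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r["id"]):
        digest.update(canonical_row(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_manifest(rows, dataset_version):
    """Hypothetical manifest layout; every eval run would record the sha256 field."""
    return {
        "dataset_version": dataset_version,
        "row_count": len(rows),
        "sha256": dataset_hash(rows),
    }


rows = [
    {"id": "refund-002", "gold_label": "FAIL", "input": "..."},
    {"id": "refund-001", "gold_label": "PASS", "input": "..."},
]
print(build_manifest(rows, "v1.2.0"))
```

Hashing canonical content rather than the raw file means reordering rows or reformatting the JSONL leaves the hash unchanged, while any edit to a label or input changes it.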

A judge calibrated against v1.2 has TPR and TNR scoped to v1.2. Criteria drift is the mechanism: grading outputs reshapes the graders' criteria, so labels are not stable across rubric revisions ⁵. Our default rule: recalibrate whenever a version bump changes more than five percent of labeled rows.
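The five percent rule can be checked mechanically at each version bump. This sketch takes one reading of the rule: it counts label flips among rows whose IDs appear in both versions and ignores added or removed rows. The function names are hypothetical, and `gold_label` follows the record shown earlier.

```python
def changed_label_fraction(old_rows, new_rows, label_key="gold_label"):
    """Fraction of rows present in both versions whose gold label changed."""
    old = {r["id"]: r[label_key] for r in old_rows}
    new = {r["id"]: r[label_key] for r in new_rows}
    shared = old.keys() & new.keys()
    if not shared:
        return 0.0
    changed = sum(old[row_id] != new[row_id] for row_id in shared)
    return changed / len(shared)


def needs_recalibration(old_rows, new_rows, threshold=0.05):
    """True when more than `threshold` of the shared rows changed label."""
    return changed_label_fraction(old_rows, new_rows) > threshold


# Toy example: 100 rows, six labels flipped by a rubric revision.
v1 = [{"id": f"row-{i:03d}", "gold_label": "PASS"} for i in range(100)]
v2 = [dict(r) for r in v1]
for r in v2[:6]:
    r["gold_label"] = "FAIL"
print(changed_label_fraction(v1, v2), needs_recalibration(v1, v2))
```

A team that also wants added or removed rows to count toward the threshold would need a different denominator; the source rule does not say which reading is intended.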

Separate sets for separate jobs

The most common mistake is one eval set doing five jobs. The cure is multiple sets with separate update cadences.

The calibration set: 100 to 300 human-labeled rows, split into train/dev/test and stratified on PASS/FAIL; it updates only when the rubric updates.

The regression set: curated rows that capture historical bugs, with rows added every time a fix lands; it is smaller (50 to 200), corresponds to Hamel's Level 1 unit tests ¹, and runs in CI on every PR.

The release-gate eval: the headline metric for go or no-go; it is larger (200 to 500) and changes only on dataset version bumps.

The error-analysis pool: sampled production traces for open coding, refreshed continuously.

Mixing the four is the failure mode.
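For the calibration set, a stratified split keeps the PASS/FAIL ratio roughly equal across train, dev, and test. A minimal sketch using only the standard library; the 20/40/40 fractions, the fixed seed, and the function name are assumptions, since the source does not specify split sizes.

```python
import random
from collections import defaultdict


def stratified_split(rows, fractions=(0.2, 0.4), label_key="gold_label", seed=0):
    """Split rows into train/dev/test, preserving the label mix in each split.

    fractions = (train, dev); whatever remains goes to test.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_dev = round(len(group) * fractions[1])
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]
    return splits


# Toy calibration set: 70 PASS rows and 30 FAIL rows.
calib = [{"id": i, "gold_label": "PASS" if i < 70 else "FAIL"} for i in range(100)]
parts = stratified_split(calib)
print({name: len(part) for name, part in parts.items()})
```

Splitting within each label group, rather than shuffling the whole set once, guards against a dev split that happens to contain too few FAIL rows to estimate TNR.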

Synthetic plus human

Pure-synthetic datasets degrade as the generator drifts away from real users. Pure-human is prohibitively expensive at scale. The hybrid wins. SPADE generates assertions from prompt version history, has been used across more than 2,000 LangSmith pipelines, and reduces assertion count by 14% and false failures by 21% against simpler baselines ⁶. For dataset construction: synthesize 50 inputs against the grid and have an expert review them, then add ten to twenty real production traces every week.

Dimensional coverage as a metric

Once the grid exists, coverage becomes a first-class metric. Report per-cell pass rate, not just the aggregate. A 95% aggregate masking a 60% rate on the (refund, returning customer, frustrated) cell is the failure the aggregate was supposed to surface.

The dashboard is a heatmap. Rows are features, columns are scenarios faceted by persona. Each cell is the pass rate plus N. Cells with N below five are grayed out as undersampled. Cells more than two interval widths below the aggregate are flagged red.
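The data behind such a heatmap can be sketched as follows. The text leaves "interval width" unspecified, so this sketch assumes it means the full width of the aggregate's 95% Wilson interval; that choice, the result-row fields, and the function names are all assumptions.

```python
import math
from collections import defaultdict


def wilson_width(passes, n, z=1.96):
    """Full width of the Wilson score interval for passes out of n."""
    p = passes / n
    denom = 1 + z * z / n
    return 2 * (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))


def coverage_report(results, min_n=5):
    """Per-cell pass rate, N, and status for rows with feature/scenario/persona/passed."""
    results = list(results)
    if not results:
        raise ValueError("no results to report")
    cells = defaultdict(lambda: [0, 0])  # cell key -> [passes, n]
    for r in results:
        key = (r["feature"], r["scenario"], r["persona"])
        cells[key][0] += int(bool(r["passed"]))
        cells[key][1] += 1

    total_pass = sum(passes for passes, _ in cells.values())
    aggregate = total_pass / len(results)
    # Assumption: "two interval widths" refers to the aggregate's interval.
    margin = 2 * wilson_width(total_pass, len(results))

    report = {}
    for key, (passes, n) in cells.items():
        rate = passes / n
        if n < min_n:
            status = "undersampled"  # grayed out on the dashboard
        elif aggregate - rate > margin:
            status = "red"
        else:
            status = "ok"
        report[key] = {"pass_rate": rate, "n": n, "status": status}
    return aggregate, report


def _rows(feature, scenario, persona, passes, fails):
    base = {"feature": feature, "scenario": scenario, "persona": persona}
    return [dict(base, passed=True)] * passes + [dict(base, passed=False)] * fails


demo = (
    _rows("refund", "returning customer", "frustrated", 4, 6)
    + _rows("status", "new customer", "calm", 88, 2)
    + _rows("escalation", "new customer", "abusive", 3, 0)
)
aggregate, report = coverage_report(demo)
for cell, stats in report.items():
    print(cell, stats)
```

A plotting layer would consume `report` directly, with rows as features and columns as scenarios faceted by persona.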

What good looks like

A production-grade golden dataset program ships a git-tracked JSONL or CSV with stable IDs, schema version, and content hash; coverage matrices on every eval run; four separate sets for the four jobs; a weekly add of ten to twenty new production traces with human labels; a migration note on every dataset version bump ³. The dataset construction chapter walks the initial build; the versioning and lineage chapter walks the operational discipline.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
³ Shreya Shankar, "The AI Engineering Flywheel."
⁴ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
⁵ Shankar et al., "Who Validates the Validators?" UIST 2024.
⁶ Shankar et al., SPADE, 2024.