A team without an eval dataset is a team running A/B tests on vibes. The fix is not to buy a vendor with a UI. The fix is to write down thirty examples this afternoon, label them yourself, and run them through your system twice. The page that follows is the practical recipe for that work, from the first thirty examples through the five-hundred-plus regression set you eventually need.
What a dataset is for
The eval dataset is the contract between the model and the product. Every change to a prompt, a retriever, a chunking strategy, a model version, or a tool definition has to be defensible against the same set of inputs and expected behaviors. Without that contract you cannot tell whether a "tweak to the system prompt" recovered a regression or introduced three new ones [1]. The dataset is also the substrate for error analysis: open coding produces labels, labels produce features, features become slices, slices become the questions your dataset answers.
The point is not to mirror production. The point is to cover the failure surface. A dataset with five hundred easy queries and zero hard ones tells you nothing about what your product actually does. A dataset with thirty examples picked from a real bug report you wrote up last Tuesday will move the product more than ten thousand auto-generated prompts.
The three-axis framework
Hamel's recipe is to think of inputs along three axes: features, scenarios, and personas [2]. A feature is a capability the product offers (booking lookup, document Q&A, multi-turn refinement). A scenario is a state of the world the feature has to handle (no results, ambiguous query, conflicting documents, irate user). A persona is who is asking (first-time visitor, power user, admin, attacker). Cover the cells of that grid and you have a dataset that probes the product instead of confirming it works on the demo path.
# axes.py
features = [
    "booking-lookup",
    "policy-question",
    "multi-turn-refinement",
    "refund-request",
]

scenarios = [
    "happy-path",
    "no-results",
    "ambiguous-query",
    "conflicting-docs",
    "irate-user",
    "out-of-scope",
]

personas = [
    "first-time-visitor",
    "returning-customer",
    "admin",
    "adversarial",
]

cells = [(f, s, p) for f in features for s in scenarios for p in personas]
# 4 x 6 x 4 = 96 cells. Aim for 1 to 3 examples per cell.
print(f"plan {len(cells)} cells, ~{len(cells) * 2} examples")
A 96-cell grid with two examples per cell lands at around two hundred examples. That is the right order of magnitude for the second-generation dataset, not the first. Start with the cells that map to actual user complaints. Backfill empty cells later.
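To see which cells are still empty, count examples per cell. The sketch below assumes each record carries a `labels` dict with `feature`, `scenario`, and `persona` keys; the function names `cell_coverage` and `empty_cells` are illustrative, not part of any library.

```python
from collections import Counter
from itertools import product


def cell_coverage(records, features, scenarios, personas):
    """Count examples per (feature, scenario, persona) cell, empty cells included."""
    counts = Counter(
        (r["labels"]["feature"], r["labels"]["scenario"], r["labels"]["persona"])
        for r in records
    )
    # Walk the full grid so cells with zero examples still show up.
    return {cell: counts.get(cell, 0) for cell in product(features, scenarios, personas)}


def empty_cells(coverage):
    """Cells with no examples yet, sorted for stable output."""
    return sorted(cell for cell, n in coverage.items() if n == 0)
```

Run it after every batch of new examples. The list of empty cells is the backfill queue, ordered however your complaint log says it should be.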
Examples come from three sources, and a good dataset uses all three.
The first source is real production traces, sampled with intent. Uniform random sampling gives you the distribution your users actually generate; oversampling on rare-but-critical slices (refund requests, multi-step queries, queries that triggered a fallback) gives you signal on the failures that matter [3]. If you have not yet shipped, your friends and pilot users are the closest substitute. The bias from "I know what my product does" is real, but it is smaller than the bias from "I am inventing a query I think a user might type."
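One way to sample with intent is a uniform base sample plus extra draws from the slices you care about. This is a minimal sketch, assuming traces are dicts with a unique `id` and that you supply a `slice_of` function that names each trace's slice; both are hypothetical stand-ins for whatever your trace store provides.

```python
import random
from collections import defaultdict


def sample_traces(traces, slice_of, base_n, boosts, seed=0):
    """Uniform sample of base_n traces, plus extra draws from boosted slices.

    boosts maps a slice name to the number of additional traces to pull from it.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = {t["id"]: t for t in rng.sample(traces, min(base_n, len(traces)))}

    by_slice = defaultdict(list)
    for t in traces:
        by_slice[slice_of(t)].append(t)

    for name, extra in boosts.items():
        # Only draw traces the uniform pass did not already pick.
        pool = [t for t in by_slice.get(name, []) if t["id"] not in chosen]
        for t in rng.sample(pool, min(extra, len(pool))):
            chosen[t["id"]] = t
    return list(chosen.values())
```

The boosted sample no longer reflects the production distribution, so record how each example was drawn (the `source` field is a natural place) and do not read the overall pass rate as a production pass rate.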
The second source is synthetic inputs run through the real system. Have an LLM rewrite a real query ten ways, or generate queries that depend on a specific fact in a specific document. Then feed those inputs through production and label the outputs. The synthetic part is the input, not the output. You are buying breadth; you still have to do the labeling work [2].
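As a sketch of that split between synthetic input and human label, the function below takes two injected callables. Both `paraphrase` (your LLM rewrite step) and `run_system` (a call into the real product) are hypothetical placeholders, not real APIs.

```python
from typing import Callable


def synth_candidates(
    seed_query: str,
    paraphrase: Callable[[str, int], list[str]],
    run_system: Callable[[str], str],
    n: int = 10,
) -> list[dict]:
    """Expand one real query into n candidates and capture what the system does."""
    candidates = []
    for query in paraphrase(seed_query, n):
        candidates.append({
            "input": query,
            "observed_output": run_system(query),
            "expected_output": None,  # left empty on purpose: a human fills this in
            "source": "synth",
            "seed_query": seed_query,
        })
    return candidates
```

A candidate becomes a dataset example only after someone has written its `expected_output` by hand.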
The third source is hand-crafted adversarial cases. Take a known bug, freeze it as an example, label the expected behavior. Take a competitor's product, ask it the queries you struggle with, and write down where you should do better. Take a policy your legal team gave you, write the inputs that would violate it, label the refusal as the expected output. These are the examples that catch regressions you would not have noticed otherwise.
NOTE
Synthetic inputs are fine; synthetic ground truth is not. If you cannot label an example because you cannot decide what the correct answer is, the example is too hard for the dataset and is probably a research question pretending to be a test case.
Starting sizes
The right size is the smallest size that distinguishes the changes you care about. Hamel's heuristic ladder, validated against dozens of shipped products, lands around: thirty examples to get a vibe-check baseline, one hundred examples to calibrate an LLM judge against a human expert, five hundred examples when you start running CI gates that distinguish small effects [2].
The math for the five hundred number is binomial. A Wilson interval on a 90 percent pass rate from one hundred examples is roughly plus or minus six points; from five hundred it tightens to roughly plus or minus two and a half. If your release gate is "ship if we are within one point of baseline," one hundred examples cannot answer the question and you will ship regressions you cannot see; five hundred narrows the gap but still leaves an interval wider than one point. Past five hundred the returns flatten unless you have severe class imbalance (rare failure modes) or clustered structure (multiple questions per topic) that needs explicit handling.
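The interval widths quoted above can be checked in a few lines. This sketch uses the standard Wilson score interval with z = 1.96 for a 95 percent level; `examples_for_half_width` uses the simpler normal approximation and is only a rough planning number.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def half_width(passes: int, n: int) -> float:
    lo, hi = wilson_interval(passes, n)
    return (hi - lo) / 2


def examples_for_half_width(p: float, h: float, z: float = 1.96) -> int:
    """Normal-approximation estimate of the n needed for a half-width of h."""
    return math.ceil(z * z * p * (1 - p) / (h * h))


print(f"n=100: +/- {half_width(90, 100) * 100:.1f} points")   # about 6 points
print(f"n=500: +/- {half_width(450, 500) * 100:.1f} points")  # about 2.6 points
print(f"n for +/- 1 point at 90%: {examples_for_half_width(0.9, 0.01)}")
```

Under these assumptions a one-point half-width at a 90 percent pass rate takes on the order of a few thousand independent examples, which is why tight gates are one of the triggers for scaling discussed later.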
A minimum-viable pipeline
The first version of your dataset lives in a single JSONL file in git. One example per line. That is intentional: it is diff-able, blame-able, and trivially reproducible. The schema is small enough to memorize.
# build_dataset.py
import json
from pathlib import Path

# Every record must carry all of these keys; use None for empty optional values.
SCHEMA_FIELDS = {
    "id",               # stable identifier, never reuse
    "input",            # user input or full conversation
    "context",          # optional retrieved chunks, tool outputs
    "expected_output",  # gold answer or expected behavior
    "labels",           # dict: feature, scenario, persona, tags
    "source",           # "prod-trace:2026-04-15", "synth", "redteam"
    "author",           # who labeled it
    "created_at",       # ISO date
    "notes",            # free text
}


def write(records: list[dict], path: Path) -> None:
    # Validate everything first so a bad record never leaves a half-written file.
    seen_ids = set()
    for r in records:
        missing = SCHEMA_FIELDS - r.keys()
        if missing:
            raise ValueError(f"{r.get('id', '?')}: missing {sorted(missing)}")
        if r["id"] in seen_ids:
            raise ValueError(f"duplicate id {r['id']}")
        seen_ids.add(r["id"])

    # Create the parent directory on first run; open() would fail without it.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for r in records:
            # sort_keys keeps diffs stable across runs.
            f.write(json.dumps(r, sort_keys=True) + "\n")


if __name__ == "__main__":
    records = [
        {
            "id": "booking-lookup-001",
            "input": "do you have anything for next friday after 7?",
            "context": None,
            "expected_output": {
                "action": "ask_clarification",
                "fields": ["party_size", "venue"],
            },
            "labels": {
                "feature": "booking-lookup",
                "scenario": "ambiguous-query",
                "persona": "returning-customer",
                "tags": ["temporal-relative", "missing-required"],
            },
            "source": "prod-trace:2026-04-15",
            "author": "@ombharatiya",
            "created_at": "2026-04-16",
            "notes": "Trace #38219. Bot guessed party_size=2 and booked.",
        },
    ]
    write(records, Path("evals/data/v0/golden.jsonl"))
Two properties are non-negotiable. First, the id is stable forever: when you change an example you do not reuse its id. Second, the labels dict is rich enough to slice on. The dimensional axes you defined upstream show up here as label keys.
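Slicing on those label keys might look like the sketch below. It assumes you already have `(record, passed)` pairs from an eval run; how `passed` gets decided (exact match, assertion, judge) is outside this example, and `pass_rate_by` is an illustrative name.

```python
from collections import defaultdict


def pass_rate_by(results, key):
    """Group (record, passed) pairs by one label key, e.g. "scenario".

    Returns {label_value: (passes, total, rate)}.
    """
    totals = defaultdict(lambda: [0, 0])  # label value -> [passes, total]
    for record, passed in results:
        bucket = totals[record["labels"].get(key, "unlabeled")]
        bucket[0] += int(passed)
        bucket[1] += 1
    return {k: (p, n, p / n) for k, (p, n) in sorted(totals.items())}
```

A call such as `pass_rate_by(results, "scenario")` turns one overall number into one number per scenario, which is usually where a regression first becomes visible.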
Reading every example
Before the dataset is real, you have to read every example yourself. Not skim, read. Hamel's rule is that the person who owns the product should be able to recite the top five failure categories from memory after their first pass [4]. If you cannot, the dataset is not yet a window into the system; it is a wall of text.
The practical workflow is a notebook that streams one example at a time, runs it through the current system, shows the input, the output, the expected output, and a free-text field for notes. Spend a morning. You will rewrite a quarter of your labels, delete a fifth of your examples, and add fifty new ones drawn from real edge cases you noticed while reading. That morning is the highest-return time you will spend on the project.
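A bare-bones version of that review loop fits in a notebook cell. In this sketch `run_system` is a hypothetical callable wrapping your current system, and the records are assumed to follow the JSONL schema from the pipeline section.

```python
from typing import Callable, Iterator


def review_cards(records: list[dict], run_system: Callable[[str], str]) -> Iterator[dict]:
    """Yield one card per example, running the system lazily as you read."""
    for r in records:
        yield {
            "id": r["id"],
            "input": r["input"],
            "expected": r["expected_output"],
            "actual": run_system(r["input"]),
            "notes": "",  # free text, filled in by the reviewer
        }


def render(card: dict) -> str:
    """Plain-text view of one card for a notebook cell or terminal."""
    return "\n".join([
        f"[{card['id']}]",
        f"input:    {card['input']}",
        f"expected: {card['expected']}",
        f"actual:   {card['actual']}",
        f"notes:    {card['notes']}",
    ])
```

Because `review_cards` is a generator, calling `next()` on it shows one example at a time, which keeps the pace at reading speed rather than scrolling speed.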
When to scale beyond five hundred
There are three triggers. The first is a CI gate that needs to distinguish a change of less than two points; the binomial math forces more N. The second is severe class imbalance, where a rare failure mode (one in fifty examples) needs a usable confidence interval on its own slice; that pushes per-slice N into the hundreds. The third is structural clustering: when your dataset has many questions per topic or many turns per conversation, naive standard errors are too small by a factor of three or more, and you need either explicit clustered standard errors or roughly nine times the example count to recover the same precision [2].
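The link between "a factor of three" and "nine times the examples" comes from the design effect. The sketch below uses the textbook equal-cluster-size formula, deff = 1 + (m - 1) * rho, where m is examples per cluster and rho is the intra-cluster correlation; applying it to your dataset is an assumption, and rho has to be estimated from your own results.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def se_inflation(cluster_size: float, icc: float) -> float:
    """How much larger the true standard error is than the naive one."""
    return math.sqrt(design_effect(cluster_size, icc))


def effective_n(n: int, cluster_size: float, icc: float) -> float:
    """Number of independent examples the clustered dataset is worth."""
    return n / design_effect(cluster_size, icc)


# Illustrative numbers only: 17 questions per topic, correlation 0.5.
print(design_effect(17, 0.5))     # variance inflated 9x
print(se_inflation(17, 0.5))      # standard error inflated 3x
print(effective_n(900, 17, 0.5))  # 900 clustered examples act like 100
```

Standard errors scale with the square root of variance, so a threefold error inflation is a ninefold variance inflation, and recovering the original precision by brute force takes nine times the examples.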
If none of those triggers are present and your dataset is sitting at five hundred examples, the next dollar is better spent on regression slices, judge calibration, or production trace sampling.
Hybrid synthetic plus human
The SPADE method automates assertion generation from prompt-revision history; deployed across two thousand-plus LangSmith pipelines it cut false-failure rates by roughly twenty-one percent [5]. The lesson generalizes. Synthetic generation is best at producing breadth, taxonomy, and assertion candidates; humans are best at picking which of those assertions are actually load-bearing for the product. Pair the two: an LLM proposes labels and edge cases, a human accepts or rejects each in a tight review loop. Anything fully automated will drift away from product reality within a release cycle.
What good looks like
A team that has done this work has a JSONL file in git, under five hundred lines, that any engineer can read in an hour. Each line carries a stable id, a real-looking input, a labeled expected behavior, and dimensional tags that match how the team thinks about the product. The file has been read end-to-end by the product owner. The file has a hash on every commit, and every eval result references the hash. When the model gets upgraded, the team runs against this file and ships only if the regression budget holds. That dataset is the moat. Build it first, build it small, and read every line.
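Tying an eval result to the exact dataset bytes takes only a content hash. A minimal sketch, assuming the dataset is a single file and that results are stored as dicts; the field names in `stamp_result` are illustrative.

```python
import hashlib
from pathlib import Path


def dataset_hash(path: Path) -> str:
    """SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stamp_result(path: Path, passed: int, total: int) -> dict:
    """Attach the dataset hash to an eval result so the two cannot drift apart."""
    return {
        "dataset": str(path),
        "dataset_sha256": dataset_hash(path),
        "passed": passed,
        "total": total,
    }
```

Two results are comparable only when their `dataset_sha256` values match; if they differ, the dataset changed between runs and the delta is not a clean measure of the model change.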