The cheapest way to mislead yourself with an eval set is to seed it with inputs that look like the demo. The demo path is the one the team practiced; the failures live elsewhere. Dimensional sampling is the discipline of constructing inputs that probe the corners on purpose, by enumerating axes that matter for the product and generating combinations across them [1].
The three-axis frame
Hamel's recipe is features by scenarios by personas. A feature is a capability the product offers (booking lookup, document Q&A, refund flow). A scenario is a state of the world the feature has to handle (happy path, no results, ambiguous query, conflicting documents, irate user). A persona is who is asking (first-time visitor, power user, admin, attacker). Cover the cells of that grid and the eval set probes the product instead of confirming the demo [2].
The three axes are not sacred. Some products need four (add channel: voice, SMS, chat, email). Some collapse to two. Pick the axes whose interactions you suspect will surface failures, not the labels in the original recipe.
Skip dimensional sampling, pull a thousand production traces uniformly at random, and two things go wrong.
First, the production distribution is heavily skewed: eighty percent of inputs fall into a small number of templates. A random sample is dominated by the head, and the head is mostly working. The failures live in the sparse tail. A targeted sample built from the three-axis grid hits the tail on purpose [3].
Second, some scenarios are rare in production but high-severity. "User is hostile and the model refuses incorrectly" may show up in two out of a thousand traces, but a wrong response is a policy violation. Dimensional sampling forces the team to write that case in even when production has not surfaced enough of it.
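The skew argument can be checked with a small simulation. The sketch below assumes a made-up traffic mix in which five hypothetical head templates carry 80% of traffic and forty-five tail templates share the rest; the template names, counts, and weights are illustrative, not measured.

```python
import random
from collections import Counter

# Hypothetical traffic mix: 5 head templates carry 80% of traffic,
# 45 tail templates share the remaining 20%. Numbers are illustrative only.
HEAD = [f"head-{i}" for i in range(5)]
TAIL = [f"tail-{i}" for i in range(45)]
TEMPLATES = HEAD + TAIL
WEIGHTS = [0.80 / len(HEAD)] * len(HEAD) + [0.20 / len(TAIL)] * len(TAIL)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Uniform random pull of 1,000 simulated production traces.
uniform_sample = rng.choices(TEMPLATES, weights=WEIGHTS, k=1000)
uniform_counts = Counter(uniform_sample)
head_share = sum(uniform_counts[t] for t in HEAD) / len(uniform_sample)

# Targeted sample: a fixed number of inputs per template, head or tail.
PER_TEMPLATE = 2
targeted_sample = [t for t in TEMPLATES for _ in range(PER_TEMPLATE)]
targeted_counts = Counter(targeted_sample)
tail_share = sum(targeted_counts[t] for t in TAIL) / len(targeted_sample)

print(f"uniform sample: {head_share:.0%} of traces come from head templates")
print(f"targeted sample: {tail_share:.0%} of inputs come from tail templates")
```

Under these assumptions the uniform pull spends most of its budget on the head, while the per-template sample spends 90% of a much smaller budget on the tail.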
A working example
The grid for a recipe assistant might look like this.
import itertools

# Each key is an axis; each list holds the values that axis can take.
DIMENSIONS = {
    "feature": [
        "recipe-lookup",
        "substitution-question",
        "dietary-conversion",
        "multi-turn-refinement",
    ],
    "scenario": [
        "happy-path",
        "ambiguous-query",
        "conflicting-constraints",
        "out-of-scope-request",
        "user-corrects-bot",
    ],
    "persona": [
        "first-time-visitor",
        "power-user",
        "dietary-restricted",
        "hostile-tester",
    ],
}

# 4 x 5 x 4 = 80 cells. Sample 1 to 2 per cell for ~100 inputs.
# One dict per cell, in feature / scenario / persona order.
seed_set = [
    {"feature": f, "scenario": s, "persona": p}
    for f, s, p in itertools.product(*DIMENSIONS.values())
]
The grid does not need to be fully populated. Pick the cells whose interaction is most likely to break the product (hostile-tester plus out-of-scope-request plus substitution-question) and skip the implausible combinations (hostile-tester plus happy-path plus recipe-lookup). Commit to which cells the team covered and which they did not [2].
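One way to make the covered-versus-skipped commitment explicit is to encode the skip rule as a function and keep both lists. This is a minimal sketch: the `is_implausible` rule is a hypothetical example of a team decision, not part of the original recipe.

```python
import itertools

DIMENSIONS = {
    "feature": ["recipe-lookup", "substitution-question",
                "dietary-conversion", "multi-turn-refinement"],
    "scenario": ["happy-path", "ambiguous-query", "conflicting-constraints",
                 "out-of-scope-request", "user-corrects-bot"],
    "persona": ["first-time-visitor", "power-user",
                "dietary-restricted", "hostile-tester"],
}


def is_implausible(cell):
    """Hypothetical skip rule: a hostile tester is never on the happy path."""
    return cell["persona"] == "hostile-tester" and cell["scenario"] == "happy-path"


# Enumerate every cell as a dict keyed by axis name.
all_cells = [
    dict(zip(DIMENSIONS, values))
    for values in itertools.product(*DIMENSIONS.values())
]

# Keep both lists so the coverage decision is recorded, not implied.
covered = [c for c in all_cells if not is_implausible(c)]
skipped = [c for c in all_cells if is_implausible(c)]

print(f"{len(all_cells)} cells total, {len(covered)} covered, {len(skipped)} skipped")
```

With this rule, 4 of the 80 cells are skipped (one per feature). Checking `skipped` into the repository next to the seed set keeps the gap visible to whoever reads the eval results later.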
Convert cells to natural language
The grid is a tuple of labels; the eval input is a sentence. Use an LLM to do the conversion, then have a human edit for realism. The LLM produces fluent variation; it is mediocre at writing the way a real annoyed customer writes. The human pass tightens phrasing and catches generic outputs.
A single conversion prompt suffices: "Convert this tuple into one realistic user message. Vary tone. Do not name the dimensions. Be specific." Run with high temperature, sample one or two outputs per cell, hand-edit. The full hundred-input seed set takes a senior engineer plus a domain expert about half a day on the first pass.
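The conversion step might be wired up roughly as follows. The prompt text is the one quoted above; the `Tuple:` line appended to it, the `generate` callable (a stand-in for whatever LLM client the team uses), and the `human_edited` flag are assumptions made for illustration.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Convert this tuple into one realistic user message. "
    "Vary tone. Do not name the dimensions. Be specific.\n\n"
    # Assumed rendering of the cell; adjust to taste.
    "Tuple: feature={feature}, scenario={scenario}, persona={persona}"
)


def build_prompt(cell: dict) -> str:
    """Fill the conversion prompt with one grid cell."""
    return PROMPT_TEMPLATE.format(**cell)


def draft_inputs(cells, generate: Callable[[str], str], samples_per_cell: int = 2):
    """Call `generate` samples_per_cell times per cell and collect the drafts.

    `generate` is a hypothetical hook: any function that takes a prompt
    string and returns model text (ideally sampled at high temperature).
    """
    drafts = []
    for cell in cells:
        prompt = build_prompt(cell)
        for _ in range(samples_per_cell):
            drafts.append(
                {**cell, "draft": generate(prompt), "human_edited": False}
            )
    return drafts


def fake_generate(prompt: str) -> str:
    """Placeholder generator so the sketch runs without a model."""
    return "PLACEHOLDER: " + prompt.splitlines()[-1]


example_cells = [
    {"feature": "substitution-question",
     "scenario": "conflicting-constraints",
     "persona": "dietary-restricted"},
]
for row in draft_inputs(example_cells, fake_generate):
    print(row["draft"])
```

Carrying the cell labels alongside each draft means the human editor can see which combination a sentence is supposed to represent, and the `human_edited` flag marks which drafts have had their pass.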
When to grow the grid
As error analysis surfaces failure modes the grid did not anticipate, add new axes. If the bot fails systematically on Spanish-language inputs, language is an axis. If failures concentrate around midnight requests, time-of-day is an axis. The grid evolves with the product [3].
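Adding an axis multiplies the cell count, which is worth seeing before committing to it. The sketch below adds a language axis to the recipe grid; the helper names `count_cells` and `add_axis` and the two language values are hypothetical.

```python
import math

DIMENSIONS = {
    "feature": ["recipe-lookup", "substitution-question",
                "dietary-conversion", "multi-turn-refinement"],
    "scenario": ["happy-path", "ambiguous-query", "conflicting-constraints",
                 "out-of-scope-request", "user-corrects-bot"],
    "persona": ["first-time-visitor", "power-user",
                "dietary-restricted", "hostile-tester"],
}


def count_cells(dimensions):
    """Number of cells is the product of the axis sizes."""
    return math.prod(len(values) for values in dimensions.values())


def add_axis(dimensions, name, values):
    """Return a new grid with one extra axis; the original is left untouched."""
    if name in dimensions:
        raise ValueError(f"axis {name!r} already exists")
    return {**dimensions, name: list(values)}


# Error analysis found systematic failures on Spanish inputs, so language
# becomes an axis. The values here are illustrative.
grown = add_axis(DIMENSIONS, "language", ["english", "spanish"])

print(count_cells(DIMENSIONS), "->", count_cells(grown))
```

A two-value axis doubles the grid from 80 to 160 cells, so the earlier advice applies with more force: populate the cells most likely to break and record the ones left out.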
The signal that the grid is too small is that error analysis stops surfacing new failure modes. If three consecutive open-coding passes find the same modes, the grid has saturated the part of the input space it can reach. Expand by adding an axis the team has been ignoring, not by adding more examples per existing cell.
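The three-pass rule is easy to track mechanically. This sketch assumes each open-coding pass is recorded as a set of failure-mode labels; the function name `is_saturated` and the labels in the example are made up.

```python
def is_saturated(passes, window=3):
    """Return True when the last `window` passes found identical mode sets.

    `passes` is a list of collections of failure-mode labels, oldest first.
    """
    if len(passes) < window:
        return False  # not enough history to judge
    recent = [frozenset(p) for p in passes[-window:]]
    return all(r == recent[0] for r in recent)


# Hypothetical open-coding history, oldest pass first.
history = [
    {"wrong-substitution"},
    {"wrong-substitution", "ignores-allergy"},
    {"wrong-substitution", "ignores-allergy"},
    {"ignores-allergy", "wrong-substitution"},
]

if is_saturated(history):
    print("No new modes in the last three passes: consider adding an axis.")
```

Comparing sets rather than lists keeps the check independent of the order in which a reviewer happened to write the labels down.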
What dimensional sampling does not do
It does not replace open coding on production traces. The grid covers the inputs the team can anticipate; production covers the ones the team did not [4]. The right setup is a grid-built seed set for offline regression plus a rolling sample of production traces for ongoing error analysis.
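For the production side, one possible way to keep a fixed-size sample from a trace stream of unknown length is reservoir sampling (Algorithm R). That technique is swapped in here for illustration and is not prescribed by the text above; the trace IDs are placeholders, and a rolling sample could be obtained by running it once per time window.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Placeholder trace IDs standing in for one window of production traffic.
todays_traces = (f"trace-{n}" for n in range(10_000))
sample = reservoir_sample(todays_traces, k=50, rng=random.Random(0))
print(len(sample), "traces selected for open coding")
```

The sample feeds ongoing error analysis, while the grid-built seed set stays fixed for offline regression.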
It does not pick the test set for a model evaluation. The grid is for the stage-1 and stage-2 evals the team builds and grades against. Picking the model itself against a frontier benchmark is a different problem; see the datasets section.