A dataset of five hundred examples drawn from the same six conversations is not a dataset of five hundred examples. It is a dataset of six, with stylistic variation. The trap is hard to see from the inside because the size looks reasonable and the labels look careful. The fix is a three-axis grid that makes coverage gaps visible.
The three axes
Hamel and Shreya's recipe is to think of any input as a point in a three-dimensional space: features, scenarios, and personas [1]. The cross-product is the failure surface; cells with no examples are blind spots.
A feature is the noun in the product spec (booking lookup, refund processing, contract review). The right granularity is what a product owner would call a feature when speaking to a customer, not the internal microservice decomposition.
A scenario is a condition that changes the right behavior (happy path, ambiguous input, no results, conflicting evidence, irate user). A taxonomy with six entries on the first pass and twelve on the second is healthy [2].
A persona is a behavioral profile (first-time visitor, returning power user, admin, internal QA, adversary). A persona is not a demographic; it is a usage shape. Two users with the same demographics but different goals are two personas.
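The cross-product can be enumerated directly to list the blind spots. The following is a minimal sketch, assuming each example is a dict with `feature`, `scenario`, and `persona` keys; the vocabularies and the `empty_cells` helper are illustrative names, not part of the cited recipe.

```python
from collections import Counter
from itertools import product

# Illustrative vocabularies; a real product would define its own.
FEATURES = ["booking-lookup", "refund-request", "policy-question"]
SCENARIOS = ["happy-path", "ambiguous", "no-results", "conflicting"]
PERSONAS = ["first-time", "returning", "admin", "adversarial"]


def empty_cells(examples):
    """Return every (feature, scenario, persona) cell that has no examples."""
    counts = Counter(
        (ex["feature"], ex["scenario"], ex["persona"]) for ex in examples
    )
    return [
        cell
        for cell in product(FEATURES, SCENARIOS, PERSONAS)
        if counts[cell] == 0
    ]


# Made-up examples for demonstration only.
examples = [
    {"feature": "booking-lookup", "scenario": "happy-path", "persona": "returning"},
    {"feature": "booking-lookup", "scenario": "happy-path", "persona": "returning"},
    {"feature": "refund-request", "scenario": "conflicting", "persona": "first-time"},
]
blind_spots = empty_cells(examples)
print(f"{len(blind_spots)} of {len(FEATURES) * len(SCENARIOS) * len(PERSONAS)} cells are empty")
```

Not every empty cell is a real gap: some combinations may be out of scope for the product, and deciding which ones is part of the exercise.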
Why the grid matters
End-to-end pass rates conflate cells. A model that scores 90 percent on booking-lookup may score 99 percent on happy-path cells and 50 percent on ambiguous-query cells, and the aggregate hides the second number. A regression that breaks one cell outright moves the headline metric by about one point when that cell holds six examples out of five hundred (6/500 = 1.2 percent), which is easy to dismiss as noise. The cell-level view makes the regression visible [3].
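The arithmetic is easy to reproduce. The sketch below assumes eval results arrive as `(cell, passed)` pairs; the `pass_rates` helper and the data are made up for illustration, with 494 examples in one cell and 6 in another.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall pass rate, {cell: pass rate}) from (cell, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [passes, count]
    for cell, passed in results:
        totals[cell][0] += int(passed)
        totals[cell][1] += 1
    all_passes = sum(p for p, _ in totals.values())
    all_count = sum(n for _, n in totals.values())
    per_cell = {cell: p / n for cell, (p, n) in totals.items()}
    return all_passes / all_count, per_cell


happy = ("booking-lookup", "happy-path", "returning")
ambiguous = ("booking-lookup", "ambiguous", "returning")

# Synthetic "after regression" run: the small cell fails completely.
after = [(happy, True)] * 494 + [(ambiguous, False)] * 6

overall, per_cell = pass_rates(after)
print(f"overall: {overall:.1%}")
for cell, rate in per_cell.items():
    print(f"{' / '.join(cell)}: {rate:.0%}")
```

The headline number stays at 98.8 percent while the ambiguous cell sits at zero, which is the pattern the per-cell breakdown is meant to expose.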
The grid also forces a conversation about what the product is supposed to do. When you sit down to fill cells, you discover the team disagrees about expected behavior in the no-results scenario, or whether the admin persona should see different responses from the first-time visitor. That conversation is the dataset construction work.
A coverage budget
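Cells are not equally important. Weight each cell by three factors: production volume, severity of failure, and how confident the team is that current examples cover the cell (lower confidence means more examples are needed). The code that follows applies the first two; the third is not modeled in it.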
# coverage.py
from dataclasses import dataclass


@dataclass
class Cell:
    feature: str
    scenario: str
    persona: str
    prod_volume: float  # share of traffic, 0 to 1
    severity: float     # 1 (low) to 5 (catastrophic)
    examples: int       # examples currently in the cell


def target_n(cell: Cell, budget: int) -> int:
    # Scale the budget by volume and severity; never target fewer than two.
    return max(2, int(round(budget * cell.prod_volume * cell.severity)))


def gap(cell: Cell, budget: int) -> int:
    # Examples still needed to reach the target (zero if already met).
    return max(0, target_n(cell, budget) - cell.examples)


cells = [
    Cell("booking-lookup", "happy-path", "returning", 0.20, 1, 12),
    Cell("booking-lookup", "ambiguous", "returning", 0.08, 3, 4),
    Cell("refund-request", "conflicting", "irate-user", 0.02, 5, 1),
    Cell("policy-question", "jailbreak", "adversarial", 0.01, 5, 0),
]

# Largest gap first: this ordering is the labeling backlog.
for g, c in sorted(((gap(c, 200), c) for c in cells), key=lambda x: -x[0]):
    print(f"+{g:3d} {c.feature} / {c.scenario} / {c.persona}")
The output is a prioritized backlog ranked by gap-to-target. The next labeling session goes after the top of that list, not whichever cell happens to be the easiest to write examples for.
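The listing above does not model the third factor, team confidence in existing coverage. One possible way to fold it in, offered as an assumption rather than the source's formula, is to discount the examples already in a cell by a confidence score between 0 and 1. The `confidence` field, the `ScoredCell` class, and the `discounted_gap` helper are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredCell:
    name: str
    prod_volume: float  # share of traffic, 0 to 1
    severity: float     # 1 (low) to 5 (catastrophic)
    examples: int
    confidence: float   # hypothetical: 0 (no trust) to 1 (full trust)


def discounted_gap(cell: ScoredCell, budget: int) -> int:
    # Same target rule as the listing above: volume times severity, floor of two.
    target = max(2, round(budget * cell.prod_volume * cell.severity))
    # Count only the share of existing examples the team trusts to probe the cell.
    trusted = math.floor(cell.examples * cell.confidence)
    return max(0, target - trusted)


name = "booking-lookup / ambiguous / returning"
trusted_cell = ScoredCell(name, 0.08, 3, 4, confidence=1.0)
doubtful_cell = ScoredCell(name, 0.08, 3, 4, confidence=0.5)

print(discounted_gap(trusted_cell, 200))   # full trust: all four examples count
print(discounted_gap(doubtful_cell, 200))  # half trust: only two examples count
```

Discounting existing examples, rather than inflating the target, keeps the target tied to volume and severity alone and lets a review of the examples change only the confidence score.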
Reading the grid
A grid worth reading has three properties. Every cell with positive production volume has at least two examples, because a binomial confidence interval computed from a single example is too wide to be useful. High-severity cells have ten or more each. Every persona is represented across the scenarios that matter for it, not just the happy path.
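To see how wide the interval is at small counts, the sketch below computes a Wilson score interval, one common choice of binomial confidence interval. The source does not say which interval it has in mind, so the choice of Wilson and the 95 percent level (`z=1.96`) are assumptions.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of `passes` out of `n` trials."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# All examples passing, at three cell sizes.
for n in (1, 2, 10):
    low, high = wilson_interval(n, n)
    print(f"n={n:2d}: {low:.2f} to {high:.2f}")
```

With one passing example the interval runs from roughly 0.21 to 1.0. It is still wide at two examples and tightens to roughly 0.72 to 1.0 at ten, which is consistent with treating two as a floor and asking for ten or more where failures are severe.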
The pathological grid has three hundred examples in booking-lookup × happy-path × returning-customer and zero in refund-request × conflicting × irate-user. Eval scores look healthy because the dominant cell is easy and the harder cells are absent. Production users discover the gap first; the dataset never warned.
NOTE
Cells with zero examples are easier to see than cells with two examples that are not actually probing the cell. Read the examples in each cell. If two of three "ambiguous-query" examples are minor typos, the cell is unfilled in spirit even if filled on the spreadsheet.
Tagging
The grid only works if every example carries its coordinates. The labels dict holds three controlled-vocabulary keys (feature, scenario, persona) and a free-text tags field for finer-grained slicing. Tags accumulate; the top-level vocabulary does not. When a tag appears in more than five percent of examples, consider promoting it to a scenario or feature [4].
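The tagging rules can be checked mechanically. This is a minimal sketch, assuming `labels` is a plain dict and `tags` is a list of strings; the vocabularies and the `validate_labels` and `promotion_candidates` helpers are hypothetical names chosen for illustration.

```python
from collections import Counter

# Illustrative controlled vocabularies for the three axes.
VOCAB = {
    "feature": {"booking-lookup", "refund-request", "policy-question"},
    "scenario": {"happy-path", "ambiguous", "no-results", "conflicting"},
    "persona": {"first-time", "returning", "admin", "adversarial"},
}


def validate_labels(labels):
    """Return a list of problems; an empty list means the coordinates are valid."""
    problems = []
    for key, allowed in VOCAB.items():
        value = labels.get(key)
        if value not in allowed:
            problems.append(f"{key}={value!r} is not in the controlled vocabulary")
    return problems


def promotion_candidates(all_labels, threshold=0.05):
    """Tags that appear in more than `threshold` of examples."""
    counts = Counter(tag for labels in all_labels for tag in set(labels.get("tags", [])))
    total = len(all_labels)
    return sorted(tag for tag, c in counts.items() if c / total > threshold)


# Made-up dataset: 100 examples, 6 tagged "typo", 5 tagged "multi-turn".
base = {"feature": "booking-lookup", "scenario": "ambiguous", "persona": "returning"}
dataset = (
    [dict(base, tags=["typo"]) for _ in range(6)]
    + [dict(base, tags=["multi-turn"]) for _ in range(5)]
    + [dict(base, tags=[]) for _ in range(89)]
)

print(validate_labels({"feature": "booking-lookup", "scenario": "angry", "persona": "admin"}))
print(promotion_candidates(dataset))
```

Running the validator at write time keeps the top-level vocabulary closed, while the free-text tags stay open and the frequency check flags which ones deserve a promotion discussion.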
For most products three axes are enough. Add a fourth (language, modality, channel, turn depth) only when the grid hides systematic failures along that dimension. If you cannot point at a real bug a fourth axis would have predicted, the axis is overhead.