The most actionable empirical regularity in LLM error analysis: three failure modes usually account for somewhere between 60% and 80% of your bug volume [1]. If you find those three modes and fix them, the user-visible quality lifts in a measurable way. If you spend the same effort sprinkled across thirty marginal categories, you ship a smaller improvement and burn the team out.
The rule does not say the long tail is fake. It says the head of the distribution is where the payoff is, and the head is small.
The NurtureBoss case
Hamel's NurtureBoss writeup is the canonical example. NurtureBoss is a real-estate AI assistant. The team sat down with one hundred logged conversations, read each in detail, wrote a note per trace, and clustered the notes into failure modes [1].
The result fit on a small table. Three categories accounted for 60% of all problems. The largest, by a wide margin, was conversational flow failures: the assistant told the user it would do something and then never did it. The second was misinterpreting handoff requests. The third, also large, was a tight cluster of date-format errors: the model produced dates the downstream calendar could not parse.
The interesting part is what the team did next. They prioritized by frequency times severity. The date-format errors were both frequent and severe (a bad date cancels a tour). The team fixed them in one targeted prompt change plus a small post-processing pass. The frequency of that category dropped from roughly 66% of date-related interactions to roughly 5% in the next cycle of error analysis [1].
The point is not that prompt engineering is magic. The point is that the eval program let the team find the lever that mattered and pull it without guessing.
Why the rule shows up
Three reasons LLM failures tend to cluster.
First, the model is a single artifact. It fails in correlated ways across inputs. A bad system prompt that does not specify the date format produces a date-format error on every date-bearing query, not on a random 50% of them. A missing tool description produces tool-call errors that look superficially varied but share a root cause.
Second, the input distribution is heavily skewed. Most user requests fall into a small number of templates; the tail is real but small. Because volume is concentrated, a failure tied to one popular template fires again and again, so a failure mode that triggers on 10% of inputs accounts for a much larger share of the bug volume than that figure suggests [2]. For example, if a quarter of all traces have any problem at all, that one mode is 40% of the bugs.
Third, the people writing notes during open coding cluster their own observations. If two annotators read fifty traces, the categories they converge on are the ones that stand out clearly from the surrounding behavior; the marginal cases get rolled into the head categories or filed under "other." This is a feature, not a bug, of qualitative coding [3].
How to find your top three
The recipe is simple and durable.
- Pull one hundred traces from the relevant production segment. Hamel's guidance is one hundred for the first pass, then top up weekly [4]. With fewer than fifty you do not have enough variety; past two hundred, new categories stop appearing and the extra reading is wasted time.
- Open the trace viewer. Read each trace in fifteen to forty-five seconds. Write a one-line note per trace if anything is wrong, blank if nothing is.
- Cluster the notes into four to seven failure categories. Use an LLM to suggest categories from your notes, then rewrite them by hand for specificity [5].
- Label each trace with the category it fits. Count.
- Sort by frequency times severity. Pick the top three.
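The first step of the recipe, pulling a batch of traces, can be sketched as follows. This is a minimal sketch under stated assumptions: `sample_traces` is a hypothetical helper, the trace IDs are made up, and uniform random sampling is one reasonable choice rather than anything the source prescribes.

```python
import random


def sample_traces(trace_ids: list[str], n: int = 100, seed: int = 0) -> list[str]:
    """Return up to n trace IDs drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so a second reviewer sees the same batch
    k = min(n, len(trace_ids))  # small segments just return everything, shuffled
    return rng.sample(trace_ids, k)


# Hypothetical IDs standing in for one production segment.
segment = [f"trace-{i:04d}" for i in range(2500)]
first_pass = sample_traces(segment, n=100, seed=42)

# Weekly top-up: draw only from traces that have not been reviewed yet.
seen = set(first_pass)
remaining = [t for t in segment if t not in seen]
top_up = sample_traces(remaining, n=25, seed=43)
```

Fixing the seed keeps the batch reproducible, which helps when a second person calibrates against the same traces.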
The whole cycle takes a single person between three and five hours. That includes calibration with one other person to keep the labels honest. If it takes longer than five hours, the trace viewer is the bottleneck and you should fix that before doing more passes; see Build your trace viewer in an afternoon.

```python
"""Use an LLM to suggest failure-mode labels from open-coding notes."""
import json
from collections import Counter

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def suggest_categories(notes: list[str], n_target: int = 5) -> list[str]:
    """Given a list of one-line notes from open coding, return n_target
    candidate failure-mode labels. Run it twice (sampling is not
    deterministic, so the runs differ) and take the intersection; the
    labels that survive are the durable ones."""
    notes_block = "\n".join(f"- {n}" for n in notes)
    prompt = f"""You are reading {len(notes)} analyst notes from an error-analysis
session on an LLM product. Each note is a one-line description of a
failure observed in a single trace.

NOTES:
{notes_block}

Cluster these notes into exactly {n_target} failure-mode labels.
Each label must:
- Be 2 to 4 words.
- Name a specific failure pattern (not "quality issue").
- Be distinguishable from the other labels by anyone reading them cold.
Return the labels as a JSON list of strings. No prose, no preamble."""
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    # json.loads raises if the model wraps the list in prose or a code fence.
    return json.loads(msg.content[0].text)


def label_traces(
    notes: list[str], categories: list[str], stop_share: float = 0.33
) -> Counter:
    """For each note, pick the best-fitting category. Return a Counter of
    category -> count. Stop early if a category passes 33% of total volume:
    that one is your top failure mode and should be fixed before relabeling."""
    counts: Counter[str] = Counter()
    for note in notes:
        cat = best_fit(note, categories)
        counts[cat] += 1
        if counts[cat] > stop_share * len(notes):
            break  # counts are partial from here on; relabel after the fix
    return counts


def best_fit(note: str, categories: list[str]) -> str:
    """Single-pass classifier. Cache and batch in real code."""
    prompt = f"""NOTE: {note}
LABELS: {categories}
Return the single best label. Reply with just the label."""
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()
```
The LLM is doing two things you would otherwise do by hand: proposing labels and applying them. Both steps benefit from one human pass at the end. The labels that an LLM proposes are usually too generic ("quality issues") and need a rewrite. The labels it applies are usually right on the obvious cases and wrong on the borderline ones, which is exactly where you, the human, add value [3].
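The advice in the `suggest_categories` docstring, to run the suggestion step twice and keep only the labels that appear both times, can be sketched as below. This is an illustration rather than part of the original code: `durable_labels` and `normalize` are hypothetical helpers, and the suggester is passed in as a function so the example runs without an API call (in practice it would be `suggest_categories`).

```python
from typing import Callable


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(label.lower().split())


def durable_labels(
    notes: list[str],
    suggest: Callable[[list[str]], list[str]],
    runs: int = 2,
) -> list[str]:
    """Call `suggest` `runs` times and keep the labels proposed every time.
    The result follows the order of the first run."""
    first = [normalize(label) for label in suggest(notes)]
    surviving = set(first)
    for _ in range(runs - 1):
        surviving &= {normalize(label) for label in suggest(notes)}
    ordered: list[str] = []
    for label in first:
        if label in surviving and label not in ordered:
            ordered.append(label)
    return ordered


# Stub suggester returning two canned runs, standing in for the LLM call.
_canned = iter([
    ["Date format errors", "Dropped promises", "Vague tone"],
    ["dropped promises", "Date  format errors", "Handoff missed"],
])


def fake_suggest(notes: list[str]) -> list[str]:
    return next(_canned)


stable = durable_labels(["note 1", "note 2"], fake_suggest)
# stable == ["date format errors", "dropped promises"]
```

Exact-match intersection is deliberately strict: two runs that phrase the same idea differently will not match, and merging those near-synonyms is part of the human rewrite pass.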
Prioritization: frequency times severity
Frequency alone is misleading. A failure mode that hits 30% of traces but produces no user-visible harm (an extra polite line of text) ranks below a 5% mode that violates a policy or sends a wrong number.
The simplest prioritization that works is a two-column score. For each category, write down the count and a severity tier on a three-level scale (low, medium, high). Multiply count by a severity weight (low=1, medium=3, high=10). Sort by the product. Top three get the next sprint.
| Category | Count | Severity | Weighted | Rank |
|---|---|---|---|---|
| Date-format errors | 12 | High | 120 | 1 |
| Handoff misinterpretation | 6 | High | 60 | 2 |
| Conversational-flow promise drops | 18 | Medium | 54 | 3 |
| Politeness padding | 24 | Low | 24 | 4 |
| Tool-call argument typos | 8 | Medium | 24 | 4 |
The severity tier is a judgment call from a domain expert, not a model output. The whole point of error analysis is that the human is doing the prioritization on the basis of what the product actually owes the user.
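The scoring itself is a few lines of arithmetic. The sketch below applies the weights from the text (low=1, medium=3, high=10) to the counts and severity tiers in the table; `rank_categories` is a hypothetical helper name, and the severity tiers are still supplied by a human.

```python
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 10}


def rank_categories(rows: list[tuple[str, int, str]]) -> list[tuple[str, int]]:
    """Return (category, weighted score) pairs, highest score first.
    Ties keep their input order because Python's sort is stable."""
    scored = [(name, count * SEVERITY_WEIGHT[severity]) for name, count, severity in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# (category, count, severity tier assigned by a domain expert)
rows = [
    ("Date-format errors", 12, "high"),
    ("Conversational-flow promise drops", 18, "medium"),
    ("Handoff misinterpretation", 6, "high"),
    ("Politeness padding", 24, "low"),
    ("Tool-call argument typos", 8, "medium"),
]

ranked = rank_categories(rows)
top_three = [name for name, _ in ranked[:3]]
# Scores: 120, 60, 54, 24, 24
```

Note that the most frequent category in the raw counts (politeness padding, 24 traces) lands at the bottom once severity is applied.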
TIP
If two categories tie on weighted score, pick the one with the cheaper fix first. Momentum beats elegance in the first cycle.
Why three, not ten
The cap on the head matters. You are not trying to surface every category at once. You are trying to fix the top three before the next cycle.
Three categories at a time is the right cap for a few reasons. The first reason is that fixes interact. A prompt change that fixes the date-format errors will move metrics on adjacent categories (because the date errors were eating into the handoff success rate). You want to remeasure before you queue the next fix [6].
The second reason is that the head shifts after each cycle. The category that was second largest last week becomes largest this week (because last week's largest got fixed). Open coding is an iterative process; the labels themselves move. Locking in a ten-category taxonomy on cycle one usually means rewriting it on cycle three.
The third reason is that team attention is the real constraint. A team that is tracking ten categories is, in practice, tracking three carefully and seven sloppily. Three carefully tracked categories beat ten sloppy ones every time.
Weekly cadence
The cadence that compounds is weekly. Each week, the team pulls fresh traces, repeats the open-coding pass on a smaller batch (twenty to thirty new traces is enough to top up the existing labels), and re-sorts the top three. Hamel's guidance is explicit: ten to twenty traces per week is the minimum to keep the labels fresh as the input distribution drifts [4].
If the team finds itself doing the full hundred-trace pass every week, the dataset has drifted enough that the prior labels do not fit. That is a signal: rebaseline. Pull a fresh hundred, recluster, and rerun the labeling. This happens roughly once a quarter in a stable product, more often in a product that is growing fast.
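One hypothetical way to make "the prior labels do not fit" measurable is to track the share of weekly top-up traces that land in none of the existing categories. The sketch below assumes each reviewed trace with a problem carries a single label; the 30% threshold and the function names are illustrative assumptions, not figures from the source.

```python
def unfit_share(labels: list[str], known: set[str]) -> float:
    """Fraction of labeled traces whose label is not one of the known categories."""
    if not labels:
        return 0.0
    unfit = sum(1 for label in labels if label not in known)
    return unfit / len(labels)


def needs_rebaseline(labels: list[str], known: set[str], threshold: float = 0.3) -> bool:
    """True when too many top-up traces fall outside the existing taxonomy.
    The 0.3 default is an assumed cutoff; tune it to your product."""
    return unfit_share(labels, known) > threshold


known_categories = {"date-format errors", "promise drops", "handoff misinterpretation"}

# Hypothetical labels from one weekly top-up of twenty failing traces.
week_labels = (
    ["date-format errors"] * 5
    + ["promise drops"] * 4
    + ["handoff misinterpretation"] * 3
    + ["other"] * 8
)

share = unfit_share(week_labels, known_categories)
rebaseline = needs_rebaseline(week_labels, known_categories)
```

A rising share over several weeks is a more trustworthy signal than a single noisy week.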
NOTE
The cadence has to be on the calendar with a named owner. "We do error analysis sometimes" is the failure mode that kills the loop.
What this rule is not
The 60-80% rule is empirical, not theoretical. It is the regularity Hamel observed across many teams; it is not a law of nature. Two caveats matter.
It does not say the long tail is unimportant. Safety failures live in the tail and have to be addressed independently; a 1% failure rate on a policy violation is not acceptable just because it is small. The tail goes into your regression set and your guardrails; the head goes into your evals.
It does not say the categories are stable across products. A consumer chatbot, a SQL agent, and a coding assistant have entirely different failure-mode distributions. The rule is about the shape of the distribution within a product, not about a portable taxonomy across products. Generic taxonomies ("hallucination, refusal, toxicity") are almost always too coarse; they fail the test of being specific enough that someone else could label errors using them [1].
The discipline transfers; the categories do not.
What to do after the top three
After the first cycle you have three fixes in flight and a labeled dataset of one hundred traces. The next steps are mechanical.
Promote the labeled dataset into a small offline eval. Write one binary judge per category. Calibrate each judge against the human labels you already have until its true-positive rate (TPR) and true-negative rate (TNR) are both above 0.9. Wire the judges into the stage-2 eval. From this point on, the team can detect when the same failure modes drift back into the product without redoing the hundred-trace pass each time [5].
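The calibration check reduces to a confusion-matrix calculation against the human labels. The sketch below assumes one convention that the text does not spell out: `True` means "this trace exhibits the failure mode", for both the human label and the judge verdict. The function names are hypothetical.

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (true-positive rate, true-negative rate) of the judge,
    treating the human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("human and judge labels must align one-to-one")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one failing and one passing human label")
    return tp / (tp + fn), tn / (tn + fp)


def is_calibrated(human: list[bool], judge: list[bool], floor: float = 0.9) -> bool:
    """True only when both rates are strictly above the floor."""
    tpr, tnr = tpr_tnr(human, judge)
    return tpr > floor and tnr > floor


# Hypothetical labels: ten failing traces, ten passing ones.
human = [True] * 10 + [False] * 10
# The judge catches every failure but flags one passing trace by mistake.
judge = [True] * 10 + [False] * 9 + [True]

tpr, tnr = tpr_tnr(human, judge)
calibrated = is_calibrated(human, judge)
```

Reporting the two rates separately matters because failures are usually the minority class: a judge that always answers "pass" scores high raw accuracy while having a TPR of zero.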
Add the top fixes to the regression set. Every category you fix becomes a small set of frozen examples the system has to pass forever. This is Hamel's Level 1: the unit tests that catch the easy regressions cheaply [7].
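A frozen regression set can be as plain as a list of input and expected-output pairs with a loop over them. The sketch below is illustrative only: the messages, the ISO `YYYY-MM-DD` target format, and the `stub_system` function are all assumptions standing in for your own product and whatever format your downstream calendar accepts.

```python
from typing import Callable

# Frozen cases: (user message, date string the downstream system must receive).
# Both the messages and the ISO format are hypothetical.
FROZEN_CASES = [
    ("Can I tour the unit on March 3rd, 2025?", "2025-03-03"),
    ("Let's do December 1, 2025 for the tour", "2025-12-01"),
]


def failing_cases(
    system: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Run every frozen case and return (message, expected, got) for each miss."""
    failures = []
    for message, expected in cases:
        got = system(message)
        if got != expected:
            failures.append((message, expected, got))
    return failures


def stub_system(message: str) -> str:
    """Stand-in for the real pipeline; it regresses on the second case."""
    canned = {
        FROZEN_CASES[0][0]: "2025-03-03",
        FROZEN_CASES[1][0]: "Dec 1",
    }
    return canned[message]


failures = failing_cases(stub_system, FROZEN_CASES)
```

Exact-match comparison works here because the checked field is structured; free-text outputs would need a looser check, such as one of the calibrated judges.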
Plan the next cycle. The cadence is weekly for the small top-up, monthly to quarterly for the rebaseline depending on how fast the product is changing, and quarterly for the structured taxonomy review. If you do not have any of those on the calendar, the program is not running.