The temptation when starting an eval program is to lift a taxonomy from a paper or a vendor doc. Hallucination, refusal, toxicity, off-topic, formatting. Five categories, ready to use. The trouble is that those labels are too coarse to drive any specific fix. The failure-mode taxonomy that is useful for your product is the one you derive by reading your own traces [1].
Bottom-up beats top-down
A bottom-up taxonomy starts with open-coding notes and clusters them into the smallest set of distinguishable categories that covers the observed failures. A top-down taxonomy starts with a vendor's category list and tries to fit traces into it. The first method produces categories the team can act on; the second produces a column of "other" that grows every cycle [2].
"Hallucination" is a top-down label; nobody can write a fix because the category contains five mechanisms (entity confusion, fabricated number, extrapolation past evidence, refusal-then-make-up, citation invention). "Bot stated the rental price without checking inventory" is a bottom-up label; the fix is to add an inventory tool call before any price statement. The bottom-up label points at code. The top-down label points at a Slack thread.
What good labels look like
Five properties separate a useful label from a useless one [3].
Specific: two team members would tag the same trace the same way. "Quality issue" fails. "Price stated without inventory check" passes.
Actionable: the fix it points at is one a team can prioritize. "Bot was rude" is not actionable. "Bot used capitals in a friendly channel" suggests a prompt edit.
Distinguishable from its siblings: a coder picks exactly one of the top three for any trace without arguing. If two labels overlap on more than a quarter of traces, merge or split.
Counted: paired with frequency numbers. Without counts, prioritization is opinion.
Revisable: the team rewrites the label when the next pass surfaces a sharper version.
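The "specific" and "distinguishable" properties can be checked with numbers once two people have tagged the same traces. The sketch below is a minimal illustration, assuming each coder's labels are held in a list aligned by trace; the function names and the "wrong store hours" label are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a: list[str], coder_b: list[str]) -> float:
    """Share of traces on which two coders chose the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def confused_pairs(coder_a: list[str], coder_b: list[str]) -> Counter:
    """Count which pairs of labels the coders disagree on, ignoring order."""
    pairs: Counter = Counter()
    for a, b in zip(coder_a, coder_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Hypothetical labels from two coders on the same four traces.
coder_a = ["price stated without inventory check", "markdown used in SMS channel",
           "price stated without inventory check", "wrong store hours"]
coder_b = ["price stated without inventory check", "markdown used in SMS channel",
           "wrong store hours", "wrong store hours"]

agreement = percent_agreement(coder_a, coder_b)
pairs = confused_pairs(coder_a, coder_b)
```

A pair of labels that keeps appearing in `confused_pairs` is a sign the two are not yet distinguishable and that one of them needs a sharper definition.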
The three-stage coding process
Open coding produces notes. Axial coding clusters the notes into themes. Structured coding applies the themes back to the traces to produce a labeled dataset [3]. For an AI product the process collapses into a single half-day session: read one hundred traces, write a one-line note on each failure, cluster the notes into four to seven candidate failure modes, then apply the labels back to the traces and count.
"""Cluster open-coding notes into candidate failure-mode labels."""
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()


def propose_labels(notes: list[str], n_target: int = 5) -> list[str]:
    # Render the analyst notes as a bulleted list inside the prompt.
    joined = "\n".join(f"- {n}" for n in notes)
    prompt = f"""Analyst notes from an error-analysis session on an LLM
product. Each note describes a failure in one trace.
NOTES:
{joined}
Cluster into exactly {n_target} labels. Each label must:
- Be 2 to 5 words.
- Name a specific failure pattern, not a generic category.
- Point at a fix a team could prioritize.
Return one label per line."""
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    # Keep one label per non-empty line of the first text block.
    return [ln.strip() for ln in msg.content[0].text.splitlines() if ln.strip()]
The labels the LLM proposes will be too generic on the first pass. The human rewrite is the high-value step: convert "quality issues" into "price stated without inventory check", convert "formatting problems" into "markdown used in SMS channel" [1].
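Once the rewritten labels are applied back to the traces, the counting step is a few lines. This is a sketch only, assuming the labeled dataset is a dict from a hypothetical trace id to its single label; `rank_failure_modes` is an illustrative name, not part of any library.

```python
from collections import Counter


def rank_failure_modes(labeled: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (label, count, share of labeled traces), most frequent first."""
    counts = Counter(labeled.values())
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Hypothetical labeled traces.
labeled = {
    "t1": "price stated without inventory check",
    "t2": "markdown used in SMS channel",
    "t3": "price stated without inventory check",
    "t4": "price stated without inventory check",
}

ranking = rank_failure_modes(labeled)
```

The ranked list is what turns the taxonomy into a priority order: the label at the top is the fix to schedule first.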
When to merge or split
Failure modes drift over the life of a product. The team finds new variants of existing modes, and discovers that two old modes were the same thing seen from two angles. Revise the taxonomy at least quarterly, and record the taxonomy version each result was scored under [4].
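One simple way to track versions is to stamp every scored trace with the taxonomy version and never mix counts across versions. The record shape and the version strings below are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTrace:
    trace_id: str
    label: str
    taxonomy_version: str  # hypothetical scheme, e.g. "v1", "v2"


def counts_by_version(results: list[ScoredTrace]) -> dict[str, Counter]:
    """Group label counts by the taxonomy version they were scored under."""
    grouped: dict[str, Counter] = defaultdict(Counter)
    for r in results:
        grouped[r.taxonomy_version][r.label] += 1
    return dict(grouped)


# Hypothetical results scored under two taxonomy versions.
results = [
    ScoredTrace("t1", "price stated without inventory check", "v1"),
    ScoredTrace("t2", "price stated without inventory check", "v1"),
    ScoredTrace("t3", "markdown used in SMS channel", "v2"),
]

by_version = counts_by_version(results)
```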
Merge two labels when more than twenty-five percent of traces could plausibly be either. Split one label when the fix would be different depending on which sub-pattern triggered. Either operation invalidates prior counts on those labels and triggers a relabel.
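The merge rule can be sketched as a check over traces where a coder was allowed to list every plausible label. Two assumptions are made here that the rule itself does not spell out: the denominator is the set of traces where at least one of the two labels is plausible, and the annotation format is a set of labels per trace. The "wrong store hours" label is hypothetical.

```python
def ambiguous_fraction(plausible: list[set[str]], label_a: str, label_b: str) -> float:
    """Among traces where either label is plausible, the share where both are."""
    either = [s for s in plausible if label_a in s or label_b in s]
    if not either:
        return 0.0
    both = sum(1 for s in either if label_a in s and label_b in s)
    return both / len(either)


def should_merge(plausible: list[set[str]], label_a: str, label_b: str,
                 threshold: float = 0.25) -> bool:
    """Apply the 'more than twenty-five percent' rule under the assumptions above."""
    return ambiguous_fraction(plausible, label_a, label_b) > threshold


A = "price stated without inventory check"
B = "price quoted for out-of-stock item"
# Hypothetical annotations: each set holds every label a coder found plausible.
plausible = [{A, B}, {A}, {B}, {A, B}, {"wrong store hours"}]

fraction = ambiguous_fraction(plausible, A, B)
merge = should_merge(plausible, A, B)
```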
Why "hallucination" almost always misleads
The five generic LLM-failure categories (hallucination, refusal, toxicity, off-topic, formatting) are a useful starting checklist but useless as labels in a working taxonomy. Hallucination is the worst offender: a trace tagged "hallucination" could be entity confusion, numeric fabrication, evidence-free extrapolation, refusal-then-make-up, or citation invention. Each has a different fix [1].
A taxonomy that uses generic labels signals that the team has not done the work. The taxonomy after a half-day pass will be longer (six to nine categories), more specific (each label points at a mechanism), and immediately more useful.
What ships back into the eval program
The output is a labeled dataset and a set of evals with one binary judge per category. Each category becomes a stage-2 LLM judge with a binary rubric calibrated against the human labels. The taxonomy is what lets the team write each judge with confidence that it measures a distinct thing [3]. A judge written against "hallucination" cannot decide a borderline case; a judge written against "price stated without inventory check" decides every case in seconds.
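Calibrating a binary judge against human labels can be as simple as comparing the two verdicts on the same traces. The sketch below uses true-positive and true-negative rates as the agreement measure; that choice of metric and the function name are assumptions made here, not something the process above prescribes.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """True-positive and true-negative rate of a binary judge against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge verdicts must align by trace")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    return {
        # NaN marks a rate that is undefined because that class has no examples.
        "tpr": tp / positives if positives else float("nan"),
        "tnr": tn / negatives if negatives else float("nan"),
    }


# Hypothetical verdicts for one category: True means the failure mode is present.
human = [True, True, False, False, True]
judge = [True, False, False, False, True]

rates = judge_agreement(human, judge)
```

Reporting the two rates separately shows whether the judge misses real failures or flags clean traces, which a single accuracy number would hide.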