Open coding is the qualitative-research method that pretty much every working LLM team has reinvented. Read raw traces. Write down what is wrong in your own words. Cluster the notes into categories. Reuse the categories as labels for a larger dataset. The method has been around for fifty years in sociology and it works on LLM traces because LLM failures, like the failures classical sociology studied, are too varied to specify a rubric for in advance [1].
The reason to formalize the process is that teams that do it ad hoc converge on different categories every week, and the eval program never stabilizes. The reason to use an LLM in the middle of the process is that one stage (axial coding) is mechanical enough to automate without sacrificing the human judgment that makes the rest of the process work.
Why open coding instead of a fixed rubric
The temptation is to write a rubric first and grade against it: skip open coding, write a rubric of "is the answer helpful, faithful, and safe," and have an LLM judge score every trace on those three axes.
That fails for one reason. The rubric you would write is wrong. You cannot specify the failure modes of a system you have not observed [1]. The first time you read fifty real traces you discover three classes of failure your rubric did not cover (and three rubric axes that never actually fail). Shankar et al. named this criteria drift: the criteria emerge from grading outputs, not the other way around. Writing the rubric first guarantees that you measure the things you anticipated and miss the things that matter.
Open coding inverts the order. You read first, write notes second, build the rubric third, and only then trust an automated judge. The price is one to two days of reading per cycle. The payoff is a rubric that catches the failures that actually happen.
NOTE
Hamel and Shreya put it bluntly: "the rubric is downstream of the data, not upstream." If your team is debating rubric items before they have read traces, the debate is theoretical and will be re-litigated as soon as someone looks at the data.
The three stages
The original methodology has three stages, and they map cleanly onto LLM trace review.
```mermaid
flowchart LR
    A["Stage 1: Open coding\nfree-text notes per trace"] --> B["Stage 2: Axial coding\ncluster notes into categories"]
    B --> C["Stage 3: Structured coding\nlabel a larger dataset with the categories"]
    C --> D["Calibrated judge"]
    C -.-> E["Saturation check"]
    E -.-> A
```
Stage 1 is open. You read traces and write notes in your own words. Stage 2 is axial. You group the notes into categories. Stage 3 is structured. You apply the categories as labels to a larger dataset. Each stage has different ergonomics and different failure modes.
Stage 1: Open coding
The mechanics: open a viewer, pull a fresh batch of traces (one hundred for the first pass, twenty to thirty for top-up cycles), and scan each one in fifteen to forty-five seconds. Write a one-line note if anything is wrong. Leave the slot blank if nothing is.
Three rules keep the pass fast and useful.
First, write in your own words. Do not snap to a category you used last week. If the trace looks like a hallucination but also like a refusal, write "model said something not in the doc and then refused to elaborate," not "hallucination." The whole point of stage 1 is to capture the texture before it is compressed [2].
Second, do not debate. The pace is fifteen to forty-five seconds per trace. If a trace requires three minutes to decide, write down what is ambiguous about it and move on. The ambiguous ones are diagnostic; they tell you the rubric is underspecified.
Third, skip the system prompt unless it changes between traces. Reading the same system prompt one hundred times is a waste of attention.
The output of stage 1 is forty to sixty one-line notes spread across one hundred traces (roughly half the traces have nothing worth noting). The total time is forty-five minutes for an experienced annotator, ninety for a first-timer.
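The bookkeeping around a stage 1 pass can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: `draw_batch` and `summarize_pass` are hypothetical helpers, and the assumption that traces have sortable IDs and that a blank note means "nothing wrong" comes from the description above, not from any particular viewer.

```python
import random


def draw_batch(all_ids, reviewed_ids, size=100, seed=None):
    """Return up to `size` trace IDs that have not been reviewed yet.

    IDs are sorted before sampling so that a fixed seed gives the same batch.
    """
    fresh = sorted(set(all_ids) - set(reviewed_ids))
    rng = random.Random(seed)
    return rng.sample(fresh, k=min(size, len(fresh)))


def summarize_pass(notes_by_trace):
    """Summarize one pass. `notes_by_trace` maps trace ID -> note text.

    A blank (or whitespace-only) note means the annotator saw nothing wrong.
    """
    flagged = [note.strip() for note in notes_by_trace.values() if note.strip()]
    return {
        "reviewed": len(notes_by_trace),
        "flagged": len(flagged),
        "notes": flagged,
    }
```

Excluding already-reviewed IDs is what makes each top-up batch "fresh"; the notes list is what feeds stage 2.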
Stage 2: Axial coding
Stage 2 is where the LLM helps. You have a few dozen notes; you want four to six categories. The LLM is good at proposing categories; the human is good at rewriting them for specificity.

```python
"""LLM-assisted axial coding: turn free-text notes into a small set of
failure-mode labels. The LLM proposes; the human rewrites."""
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def propose_categories(notes: list[str], n_categories: int = 5) -> list[dict]:
    """Ask the model for n candidate categories from the notes.

    Run twice and intersect the results. No seed is set here, so the two
    runs differ only through sampling."""
    prompt = f"""You are analyzing notes from an error-analysis session on an
LLM product. Each note is a one-line description of a failure.

NOTES:
{chr(10).join(f'- {n}' for n in notes)}

Cluster these notes into exactly {n_categories} candidate failure-mode labels.
Each label must:
- Be 2 to 4 words.
- Name a specific failure pattern, not a generic class like "quality issue".
- Be applicable to multiple notes in the input.
- Be distinguishable from the other labels by a reader who has not seen the notes.

For each label, return:
- name: the 2-4 word label
- definition: one sentence rewriting the failure pattern in concrete terms
- example_note: one note from the input that fits the label

Return a JSON list of these objects. No prose, no preamble."""
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the reply is bare JSON, as the prompt requests; json.loads
    # raises if the model adds prose or a code fence around it.
    return json.loads(msg.content[0].text)


def reconcile(seed_runs: list[list[dict]]) -> list[dict]:
    """Two runs produce two label sets. The labels that survive both
    runs are the durable ones. Labels that appear in one but not the other
    are candidates for the next pass or for the long tail."""
    # Compare names ignoring case and surrounding whitespace, so trivial
    # formatting differences do not hide a match.
    names_per_run = [{c["name"].strip().lower() for c in run} for run in seed_runs]
    durable = set.intersection(*names_per_run)
    return [c for c in seed_runs[0] if c["name"].strip().lower() in durable]
```
Two independent passes are a small but useful discipline. The call sets no seed, so the passes differ only through sampling, and a category that survives both is more likely to be a real cluster than an artifact of the model's preferred phrasing. If you want a third signal, run with a different model family; categories that appear across model families are the most durable.
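Intersecting on label names is strict: two runs can describe the same cluster with slightly different wording and fail to match. One possible relaxation, sketched below, matches names on word overlap (Jaccard similarity). The function names and the 0.5 threshold are illustrative assumptions, not part of the original workflow; tune or replace them for your own labels.

```python
def _tokens(name: str) -> frozenset[str]:
    """Lowercase a label and split it into a set of words."""
    return frozenset(name.lower().replace("-", " ").split())


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two label names."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reconcile_fuzzy(run_a: list[dict], run_b: list[dict],
                    threshold: float = 0.5) -> list[dict]:
    """Keep categories from run_a whose name overlaps some name in run_b.

    `threshold` is an arbitrary starting point, not a calibrated value.
    """
    durable = []
    for cat in run_a:
        best = max(
            (jaccard(cat["name"], other["name"]) for other in run_b),
            default=0.0,
        )
        if best >= threshold:
            durable.append(cat)
    return durable
```

Word overlap is crude and will miss synonyms, so the human pass still has to confirm that matched labels mean the same thing.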
The human pass at the end is the part you cannot automate. Read each proposed category. Rewrite the name to be specific to your product ("dietary restriction violated" not "preference ignored"). Rewrite the definition to include the test you would apply: "fail if the recipe contains an ingredient on the user's no-go list." Drop categories that read as generic ("quality issue"); add categories the model missed.
A reasonable axial-coding output is four to seven categories per cycle. Fewer than four and you are probably under-labeling. More than seven and you are over-fitting to the input batch. The cap forces clarity.
Stage 3: Structured coding
Once the categories are written, you can apply them to a larger dataset. The structured stage is where the LLM does the bulk of the work and the human spot-checks.

```python
"""Apply axial-coded categories to a larger dataset. The LLM labels every
trace; the human reviews a sample to catch drift."""
import json
import random

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify(trace: str, categories: list[dict]) -> dict:
    """Label one trace with the best-fitting category, or NONE if no
    category fits. Always return a one-sentence critique so reviewers
    can spot misclassifications without re-reading the full trace."""
    cat_block = "\n".join(
        f'- {c["name"]}: {c["definition"]}' for c in categories
    )
    prompt = f"""Label this assistant trace with the failure-mode that best fits.
If no category fits, return label=NONE.

TRACE:
{trace}

CATEGORIES:
{cat_block}

Return JSON with:
- label: the category name or NONE
- critique: one sentence explaining the choice"""
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the reply is bare JSON; json.loads raises otherwise.
    return json.loads(msg.content[0].text)


def calibration_sample(
    labels: list[dict],
    spot_check_fraction: float = 0.1,
) -> list[dict]:
    """Pull a random spot-check sample, stratified by label. A 10% sample
    is usually enough to detect label drift at 500-trace scale; sample
    more if any category has fewer than 10 instances."""
    # Group the labeled records by their assigned label.
    by_label: dict[str, list[dict]] = {}
    for record in labels:
        by_label.setdefault(record["label"], []).append(record)
    sample = []
    for label, records in by_label.items():
        # At least 5 per label, capped at however many records exist.
        n = max(5, int(len(records) * spot_check_fraction))
        sample.extend(random.sample(records, k=min(n, len(records))))
    return sample
```
The spot-check is the audit. A human reads the sampled traces and overrides the labels that are wrong. If the override rate is below 5%, the labels are good. If it is between 5% and 20%, the rubric is underspecified; tighten the definitions and rerun. If it is above 20%, the categories are wrong; go back to axial coding [3].
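The thresholds above translate directly into a small triage function. The sketch below assumes each spot-checked record carries the LLM's `label` plus a reviewer-supplied `human_label`; that second field name is hypothetical, as are the function names.

```python
def override_rate(sample: list[dict]) -> float:
    """Fraction of spot-checked records where the human changed the label."""
    if not sample:
        raise ValueError("empty spot-check sample")
    overridden = sum(1 for r in sample if r["human_label"] != r["label"])
    return overridden / len(sample)


def next_step(rate: float) -> str:
    """Map an override rate to the action described in the text."""
    if rate < 0.05:
        return "accept labels"
    if rate <= 0.20:
        return "tighten definitions and rerun"
    return "go back to axial coding"
```

If the sample is stratified with a per-label floor, rare categories are over-represented in this rate, so it is worth also looking at the override rate per label before deciding.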
Choosing the principal domain expert
The output of open coding is only as good as the annotator. The label is "principal domain expert" because the rubric they write becomes the contract for what the product owes the user.
The wrong choice is an engineer with no product context. They will label for what the code does, not what the user needs.
The wrong choice is also a busy executive who skims twenty traces in five minutes. The pace matters; the discipline of reading at fifteen-to-forty-five seconds per trace is what catches the patterns that vanish on a skim.
The right choice is the person who would, on a normal day, be answering escalations from the user. A product manager for a consumer app, a customer-success lead for an enterprise tool, an SDET who knows the system end to end. The role title matters less than the access to user reality [2].
The principal domain expert does not work alone. A second annotator does a parallel pass on a smaller batch and the two compare labels. If their Cohen's kappa is below 0.6, the rubric is the bug, not the model.
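For reference, Cohen's kappa compares observed agreement with the agreement expected by chance given each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). A minimal standard-library sketch follows; the function name and the handling of the degenerate single-label case are choices made for this illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same traces in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of traces where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

As a worked case: annotators who agree on three of four traces, with marginals that give a chance agreement of 0.5, score (0.75 - 0.5) / (1 - 0.5) = 0.5, which is below the 0.6 bar.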
Sample size and the saturation rule
The literature term is theoretical saturation: the point at which reading more traces stops surfacing new categories. The empirical heuristic is one hundred traces for the first pass, twenty to thirty for top-ups [4].
The check is mechanical. After the first hundred traces, count categories. After the next twenty-five, count again. If the new batch added zero new categories, you are saturated for now. If it added one or two, top up another twenty-five. If it added five or more, the input distribution has shifted and you should rebaseline with a fresh hundred.
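The check can be written as a set difference. In this sketch the function name and return strings are hypothetical, and one decision is an assumption: the text does not say what to do when a batch adds three or four categories, so the code treats that as another top-up.

```python
def saturation_step(known: set[str], batch: set[str]) -> str:
    """Decide the next action from the categories seen in a top-up batch.

    `known` holds categories from earlier passes; `batch` holds the
    categories observed in the latest batch.
    """
    added = batch - known
    if not added:
        return "saturated"
    if len(added) >= 5:
        return "rebaseline"
    # One or two new categories per the text; three or four is an
    # assumption of this sketch.
    return "top_up"
```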
CAUTION
Saturation is conditional on the input distribution. Saturation today does not mean saturation next month. The rule of thumb is one full hundred-trace pass per quarter, plus weekly top-ups, plus a rebaseline on any product change that shifts the user base.
Sharing the codebook
The output of the workflow is a codebook: the labels, the definitions, the example traces, and the inter-rater agreement metric. The codebook is a versioned artifact that other people use. It belongs in git, not in a Notion doc that drifts.
A minimal codebook entry has five fields: label, definition, two positive examples, two negative examples (traces that look like the failure but are not), and the inter-rater kappa from the last calibration round. Five fields, one Markdown table, checked in with the dataset [5].
The codebook is the input to two downstream things. It is the source of truth for the judge prompt (the definition becomes the judge's rubric). It is also the source of truth for the labeling guidelines that future annotators read before they do a pass. Both downstream uses depend on the codebook being unambiguous. If the labels are vague, the judge will be vague, and the next annotator will drift in a different direction. The discipline pays off twice [6].
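To make the first downstream use concrete, here is a sketch of a judge prompt assembled from codebook definitions. The prompt wording, the PASS/FAIL output format, and the function name are all assumptions for this illustration; the only part taken from the text is that each definition is reused verbatim as the rubric.

```python
def build_judge_prompt(codebook: list[dict], trace: str) -> str:
    """Render codebook definitions as the judge's rubric for one trace.

    Each codebook item is a dict with at least 'label' and 'definition'.
    """
    rubric = "\n".join(
        f'- {entry["label"]}: {entry["definition"]}' for entry in codebook
    )
    return (
        "Grade the trace against each failure mode below.\n\n"
        f"FAILURE MODES:\n{rubric}\n\n"
        f"TRACE:\n{trace}\n\n"
        "For each failure mode, answer PASS or FAIL with a one-sentence reason."
    )
```

Because the prompt is built from the codebook rather than written by hand, a definition tightened during calibration reaches the judge on the next run without a separate edit.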
Common failure modes of the process
Three patterns kill the loop. Watch for them.
The first is rubric-first reflex. The team writes the rubric before reading traces, and open coding becomes a confirmation exercise for the prior beliefs. The fix is to enforce the order: no rubric edits until the read is done.
The second is over-aggregation. The team reads one hundred traces, finds twelve categories, and squeezes them into three. Three is the right cap for prioritization, not for labeling. The discarded nine become a single bucket called "other" that grows over time until nobody knows what is in it. The fix is to use the full set during stage 2 and only collapse to a top three for prioritization, while keeping the long tail labeled in stage 3 [7].
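The fix amounts to ranking without relabeling: every trace keeps its specific label, and only the prioritization view is cut to three. A minimal sketch, assuming stage 3 labels are plain strings with "NONE" for traces that show no failure; the function name is hypothetical.

```python
from collections import Counter


def prioritize(labels: list[str], top_n: int = 3) -> dict:
    """Split labeled failures into a priority list and a named long tail.

    Nothing is merged into "other": the long tail keeps its own labels
    and counts so it can be re-ranked on the next cycle.
    """
    counts = Counter(label for label in labels if label != "NONE")
    ranked = counts.most_common()  # most frequent first
    return {"priority": ranked[:top_n], "long_tail": ranked[top_n:]}
```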
The third is the orphaned codebook. The team does one pass, writes the labels, and never refers back. Two months later the labels are stale, nobody remembers what they meant, and the next pass starts from scratch. The fix is the weekly top-up cadence: small effort each week beats heroic effort once a quarter.
The discipline is the part that compounds. Each cycle the codebook gets sharper, the judge gets more accurate, and the team's reading speed goes up. Five cycles in, you can run a top-up in ninety minutes and the labels will hold for weeks. None of that happens if the loop is run once and abandoned.