Error analysis (open coding)

Open, axial, structured coding for AI traces. The 60-80 percent rule, sample-size saturation, LLM-assisted clustering, the bottom-up failure taxonomy.

Error analysis is the process of reading actual model outputs, naming the failure modes you see, clustering the names into categories, and prioritizing fixes by frequency and severity. It is the activity with the largest payoff per hour in an eval program. Hamel Husain's NurtureBoss case study found that three categories accounted for 60% of all problems; targeted fixes then dropped date-handling errors from 66% to 5% ¹.

The three-stage workflow

The pattern inherits from qualitative social science.

Open coding. Read individual traces and write down what went wrong in your own words. No predefined categories. The phrases are the data: "got the date wrong by a day," "confused vegetarian with vegan," "ignored the system message about formality." Aim for 100 traces in the first pass ².

Axial coding. Cluster the open-coding phrases into named categories. Clusters emerge from the data, not from a prefabricated taxonomy. "Got the date wrong by a day," "missed the timezone," and "off-by-one on the recurrence" cluster into "date handling."

Structured coding. Re-label the original 100 traces against the now-stable categories. Compute per-category frequency and order the categories by it. The top three to five typically account for 60 to 80 percent of all problems ¹.
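The frequency step of structured coding can be sketched in a few lines. This is a minimal illustration, assuming each failing trace carries exactly one category label; the function name `category_frequencies` and the example labels are hypothetical, not from the cited sources.

```python
from collections import Counter


def category_frequencies(labels: list[str]) -> list[tuple[str, int, float]]:
    """Return (category, count, share) rows, most frequent category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by count descending; break ties by name so the order is stable.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n, n / total) for name, n in rows]


# Hypothetical labels for ten failing traces (illustrative data only).
example_labels = (
    ["date handling"] * 5 + ["policy misquoting"] * 3 + ["tone"] * 2
)

for name, n, share in category_frequencies(example_labels):
    print(f"{name:20s} {n:3d} {share:.0%}")
```

Traces with several failure modes would need one label row per (trace, category) pair; the counting logic stays the same.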

The whole loop takes a principal expert one or two days the first time and about four hours the second time. The cadence after that is weekly: pull ten to twenty fresh production traces, label them against the existing categories, watch for new categories, and reprioritize fixes.

Why bottom-up beats top-down

The common reflex is still to sort failures into generic buckets: "hallucination, toxicity, bias, refusal." That taxonomy is too coarse for product-team decisions. "Hallucination" lumps "made up a product feature that does not exist" (catastrophic) with "stated a fact slightly off" (cosmetic).

Bottom-up taxonomies are product-specific. NurtureBoss had a "date handling" bucket that dwarfed every general category. A customer-support team finds "refund policy misquoting." A code agent finds "missed test case." The buckets are operational because the fix is mechanical once named: change the date parsing logic, add the refund policy to the system prompt, add a verifier on missed tests ¹.

Top-down taxonomies are useful in exactly one setting: reporting to a board or a regulator. They do not help the team that ships the fix.

The 60-80 percent rule

Across teams running open coding, three to five named categories cover 60 to 80 percent of all problems, because the distribution of failure modes is heavily skewed ¹. The prioritization implication is direct: fix the top three, rerun the eval, and expect the aggregate metric to move five to fifteen points. Then pick the new top three and repeat.

The inverse failure is spreading fixes across ten "important" categories that each account for one percent of problems, then watching the aggregate move half a percentage point per quarter. Chase frequency, not novelty.
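One way to make the prioritization concrete is to ask how many top categories are needed to reach a target share of all labeled problems. The sketch below assumes one label per failing trace; `categories_to_cover` and the 60 percent default are illustrative choices, not part of the cited sources.

```python
from collections import Counter


def categories_to_cover(labels: list[str], target: float = 0.6) -> list[str]:
    """Return the shortest frequency-ranked list of categories whose
    combined share of all labels reaches `target`."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    chosen: list[str] = []
    # Walk categories from most to least frequent (ties broken by name).
    for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        chosen.append(name)
        covered += n
    return chosen


# Hypothetical label counts for 100 failing traces (illustrative data only).
labels = (
    ["date handling"] * 40
    + ["policy misquoting"] * 25
    + ["tone"] * 15
    + ["formatting"] * 10
    + ["latency"] * 5
    + ["other"] * 5
)
print(categories_to_cover(labels, target=0.6))
```

If the returned list is long (say, more than five categories), the distribution is flatter than the rule predicts, which is worth knowing before committing a quarter to the top three.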

Sample-size saturation

The standard stop rule for the first pass is theoretical saturation. Keep labeling traces. When twenty traces in a row produce no new category, stop ².

The first 30 traces produce most of the categories. The next 30 add one or two. The last 40 add zero or one. The total is roughly 100 traces, and the same set is often shared with the calibration set.
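The stop rule above can be checked mechanically while labeling. This is a minimal sketch, assuming each trace has been given a set of provisional category names in reading order (an empty set for a trace with no failure); `saturation_point` and its `patience` parameter are hypothetical names.

```python
from typing import Optional


def saturation_point(
    categories_per_trace: list[set[str]], patience: int = 20
) -> Optional[int]:
    """Return how many traces had been read when `patience` consecutive
    traces produced no new category, or None if that never happened."""
    seen: set[str] = set()
    quiet = 0  # consecutive traces without a new category
    for i, categories in enumerate(categories_per_trace, start=1):
        new = categories - seen
        if new:
            seen |= new
            quiet = 0
        else:
            quiet += 1
            if quiet >= patience:
                return i
    return None


# Hypothetical run: three new categories, then 25 traces with nothing new.
run = [{"date handling"}, {"policy misquoting"}, {"tone"}] + [{"tone"}] * 25
print(saturation_point(run))
```

A return value of `None` means the sample has not saturated yet and labeling should continue.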

LLM-assisted axial coding

Axial coding is the step LLMs accelerate the most.

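import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def cluster_failure_modes(open_codes: list[str]) -> list[dict]:
    """Ask the model to propose 3 to 7 categories for the open-coded phrases."""
    prompt = f"""You are clustering qualitative failure-mode descriptions
from an LLM eval program into 3 to 7 named categories.

Open-coded descriptions:
{json.dumps(open_codes, indent=2)}

Return only a JSON array, with no prose and no code fences. Each category has:
- name: a short noun phrase ("date handling", "policy misquoting")
- description: one sentence defining the category
- examples: the original descriptions that map to this category
"""
    out = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    # The model may still wrap the JSON in prose or fences; keep only the array.
    text = out.content[0].text
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("model response did not contain a JSON array")
    return json.loads(text[start : end + 1])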

The principal expert reviews the proposed categories before they go into structured coding. The pattern is human-in-the-loop, not automated. SPADE extends the same pattern to synthesizing assertions ³.
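Before the expert reads the proposed categories, it can help to check that the clustering is at least a faithful partition of the input. The sketch below assumes the clusters are a list of dicts with an `examples` list, as requested in the prompt above; `check_clusters` is a hypothetical helper, and it does not replace the expert's review.

```python
from collections import Counter


def check_clusters(open_codes: list[str], clusters: list[dict]) -> dict[str, list[str]]:
    """Compare the model's clusters against the original open codes.

    missing:    open codes the model assigned to no category
    duplicated: open codes assigned to more than one category (or twice)
    invented:   examples that do not appear in the original open codes
    """
    assigned = [ex for c in clusters for ex in c.get("examples", [])]
    counts = Counter(assigned)
    source = set(open_codes)
    return {
        "missing": sorted(source - set(assigned)),
        "duplicated": sorted(ex for ex, n in counts.items() if n > 1),
        "invented": sorted(set(assigned) - source),
    }


# Hypothetical open codes and a hypothetical model response.
codes = ["got the date wrong by a day", "missed the timezone", "misquoted refund window"]
proposed = [
    {"name": "date handling", "description": "Wrong dates or times.",
     "examples": ["got the date wrong by a day", "missed the time zone"]},
]
print(check_clusters(codes, proposed))
```

Any non-empty list is a prompt for the expert to look closer: paraphrased examples show up as one "invented" entry paired with one "missing" entry.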

Principal expert and data viewer

The program needs exactly one principal domain expert. They define the rubric, own the categories, and sign off on category changes. Distributing rubric ownership is the failure mode Shankar et al. document: criteria drift compounds when no human has the final say ⁴. The expert needs deep domain knowledge, familiarity with the product's outputs, and two hours a week for this work.

Open coding is bottlenecked by reading speed. The single highest-payoff tool an AI team can build in an afternoon is a custom trace viewer with: side-by-side input and output, copy-trace-as-JSON, filter on metadata, annotate with a category tag, export labels to JSONL ¹. Vendor UIs do not beat a 100-line Streamlit app at this loop because they optimize for observability, not for the expert's reading pattern.
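The annotate-and-export half of such a viewer needs very little code. The sketch below covers only the label store, using the standard library; the UI layer (Streamlit or otherwise) is left out, and the record fields `trace_id`, `category`, and `note` are assumed names rather than a format from the cited sources.

```python
import json
from pathlib import Path


def append_label(path: Path, trace_id: str, category: str, note: str = "") -> None:
    """Append one annotation as a JSON line; the file is created if missing."""
    record = {"trace_id": trace_id, "category": category, "note": note}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_labels(path: Path) -> dict[str, dict]:
    """Read the JSONL file back; the latest label for a trace wins."""
    labels: dict[str, dict] = {}
    if not path.exists():
        return labels
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                labels[record["trace_id"]] = record
    return labels
```

Appending rather than rewriting keeps a history of relabels, which is useful when categories change between the open and structured passes.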

Weekly cadence and failure modes

After the first pass, the weekly loop has four steps. Sample ten to twenty production traces uniformly, or oversample failures if the base failure rate is too low. Label them against the existing category set and flag new category candidates. Update the priority list. Spot-check the LLM judge on five labeled traces. Shankar's flywheel formalizes the cadence as a database-backed process; the spreadsheet version works for teams of one to three ⁵.
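The sampling step can be sketched as follows. This assumes each trace is a dict with a boolean `failed` flag (for example from user feedback or an automated check); that flag, the function name `weekly_sample`, and the `min_failures` parameter are hypothetical. With `min_failures=0` the sample is uniform; a positive value guarantees that many known failures and draws the remainder uniformly from everything else.

```python
import random
from typing import Optional


def weekly_sample(
    traces: list[dict],
    k: int = 20,
    min_failures: int = 0,
    seed: Optional[int] = None,
) -> list[dict]:
    """Draw k traces without replacement, with at least `min_failures`
    flagged failures when that many exist."""
    rng = random.Random(seed)
    k = min(k, len(traces))
    failures = [t for t in traces if t.get("failed")]
    n_fail = min(min_failures, len(failures), k)
    picked = rng.sample(failures, n_fail)
    # Fill the remaining slots uniformly from traces not already picked.
    picked_ids = {id(t) for t in picked}
    rest = [t for t in traces if id(t) not in picked_ids]
    picked += rng.sample(rest, k - n_fail)
    rng.shuffle(picked)  # avoid showing the forced failures first
    return picked


# Hypothetical week of 200 traces with a 4 percent flagged-failure rate.
week = [{"id": i, "failed": i % 25 == 0} for i in range(200)]
batch = weekly_sample(week, k=20, min_failures=5, seed=0)
print(len(batch), sum(t["failed"] for t in batch))
```

Oversampling biases the observed failure rate upward, so per-category frequencies from an oversampled batch should be compared with each other, not read as production rates.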

Three process failure modes recur: top-down taxonomy creep (the team imports OWASP or NIST categories, and bottom-up labels stop being product-specific), one-shot labeling (error analysis runs once before launch and never again), and solo coding (one engineer labels without expert review). The respective fixes: keep operational labels separate from reporting labels ¹, maintain the weekly cadence, and use the principal expert pattern.

What good looks like

A production error-analysis program ships a bottom-up category list (3 to 7 items, owned by a principal expert, versioned with the rubric), a weekly cadence with ten to twenty new production traces labeled, a regression set that grows from analysis output, and a trace viewer that supports the reading pattern. The 60-80 rule and open coding workflow chapters cover the operational scaffolding.

¹ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
² Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
³ Shankar et al., SPADE, 2024.
⁴ Shankar et al., "Who Validates the Validators?" UIST 2024.
⁵ Shreya Shankar, "The AI Engineering Flywheel."