Classification

Assign one or more labels per input. Evaluate with macro-F1 plus calibration on a stratified set, not with accuracy on imbalanced data.

What "good" looks like

A classification eval scores per-class quality, not a single average, and pairs it with a confidence check. The primary metric is macro-F1 on a stratified, balanced evaluation set, alongside a calibration check (Expected Calibration Error, ECE) on a held-out scoring set.

Why macro-F1 and not accuracy: accuracy is misleading on imbalanced data because the trivial classifier that predicts the majority class gets a high score. Macro-F1 averages per-class F1 evenly, which is the right framing when minority-class performance matters (and it almost always does in production).
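The gap between the two metrics is easy to see on a toy example. The sketch below uses only the standard library; the 95/5 ham/spam split and the helper names are illustrative assumptions, and undefined precision or recall is set to 0 here. In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same headline number.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[t] += 1  # missed a true t
    report = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return report

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    report = per_class_f1(y_true, y_pred)
    return sum(f1 for _, _, f1 in report.values()) / len(report)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (assumed): 95 "ham", 5 "spam", and a classifier that always says "ham".
y_true = ["ham"] * 95 + ["spam"] * 5
y_pred = ["ham"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={mf1:.2f}")
```

The majority-class classifier looks strong on accuracy and poor on macro-F1, because the spam class contributes an F1 of zero to the average.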

The "good" target depends on the use case and the class set. Eugene Yan's framing applies: calibrate your eval bar to actual risk^[1]. A spam classifier with a 1% false-positive rate (FPR) and 95% recall on the spam class can be shippable. A medical screening classifier needs a much stricter bar on the relevant class (for screening, typically recall on the positive class, since a missed case costs more than a false alarm).
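A risk bar like the spam example can be written down as an explicit check. This is a minimal sketch: the function names and toy counts are assumptions, the 1% FPR and 95% recall thresholds come from the spam example, and the 99% recall bar is a hypothetical stand-in for a stricter use case.

```python
def binary_rates(y_true, y_pred, positive):
    """Return (false-positive rate, recall) for the given positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return fpr, recall

def clears_bar(fpr, recall, max_fpr, min_recall):
    """True when both rates meet the bar chosen for this use case."""
    return fpr <= max_fpr and recall >= min_recall

# Toy data (assumed): 1000 ham with 8 flagged as spam, 100 spam with 96 caught.
y_true = ["ham"] * 1000 + ["spam"] * 100
y_pred = ["spam"] * 8 + ["ham"] * 992 + ["spam"] * 96 + ["ham"] * 4

fpr, recall = binary_rates(y_true, y_pred, positive="spam")
spam_ok = clears_bar(fpr, recall, max_fpr=0.01, min_recall=0.95)
strict_ok = clears_bar(fpr, recall, max_fpr=0.01, min_recall=0.99)  # hypothetical stricter bar
print(fpr, recall, spam_ok, strict_ok)
```

The same classifier passes one bar and fails the other, which is why the bar should be set from the use case before anyone reads the results.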

flowchart TD
    IN["Input to classify"] --> CLF["Classifier"]
    CLF --> LABEL["Predicted label + confidence"]
    LABEL --> F1["Per-class F1, averaged to macro-F1"]
    LABEL --> ECE["Calibration: ECE on held-out set"]
    NOISE["Label-noise audit: 10% noise caps accuracy at 90%"] --> F1
    F1 --> GATE{"Macro-F1 clears the risk bar and ECE acceptable?"}
    ECE --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix the weak class, never the aggregate"]

Secondary metrics

Metric	What it catches	How to score
Confusion matrix patterns	The specific cells where errors concentrate	Read the matrix per class pair
Per-segment fairness gaps	Uneven performance across slices	F1 by demographic or other slice
Latency	Slow predictions	P95 per prediction
Cost versus baseline	Overspending on an LLM classifier	Compare to a small-classifier baseline; an LLM classifier that costs 100x what a fine-tuned BERT costs has to clear a higher bar
Refusal rate	An LLM that refuses when asked for a label	Fraction of outputs that return "I don't know" instead of a label
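Two of these secondary metrics take only a few lines to compute. The sketch below is one possible implementation, with hypothetical helper names and toy support-ticket labels; it treats any output outside the allowed label set as a refusal, which is an assumption about how refusals appear in your logs.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true_label, predicted_label) error pairs."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

def refusal_rate(predictions, allowed_labels):
    """Fraction of outputs that are not a valid label (counted as refusals here)."""
    allowed = set(allowed_labels)
    return sum(p not in allowed for p in predictions) / len(predictions)

# Toy data (assumed): three ticket classes, five items each.
labels = ["billing", "technical", "account"]
y_true = ["billing"] * 5 + ["technical"] * 5 + ["account"] * 5
y_pred = (
    ["billing", "billing", "account", "account", "account"]
    + ["technical"] * 4 + ["billing"]
    + ["account"] * 4 + ["I don't know"]
)

worst_pairs = top_confusions(y_true, y_pred, k=2)
refusals = refusal_rate(y_pred, labels)
print(worst_pairs, refusals)
```

The top pair names the cell of the confusion matrix to read first, here billing items predicted as account.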

Offline eval design

Build a stratified labeled set, ideally balanced across classes. If class balance is impossible, weight metrics appropriately. Cross-validate when the eval set is small.
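One way to draw a balanced set from a labeled pool is to sample the same number of items per class. This is a sketch under assumptions: `balanced_sample`, the `(text, label)` tuple layout, and the class sizes are all illustrative, not a prescribed pipeline.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(items, label_of, per_class, seed=0):
    """Sample up to per_class items from each class, reproducibly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[label_of(item)].append(item)
    sample = []
    for label in sorted(by_class):
        pool = by_class[label]
        # A class smaller than per_class contributes everything it has.
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample

# Toy pool (assumed): heavily imbalanced across three classes.
pool = (
    [(f"text {i}", "a") for i in range(900)]
    + [(f"text {i}", "b") for i in range(80)]
    + [(f"text {i}", "c") for i in range(20)]
)

eval_set = balanced_sample(pool, label_of=lambda item: item[1], per_class=50)
counts = Counter(label for _, label in eval_set)
print(counts)
```

Class "c" cannot reach 50 items, so it stays under-represented. That is the case where metrics need explicit weighting or, at minimum, a per-class report.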

Include an out-of-distribution slice that the classifier was not explicitly trained on. OOD performance is often the difference between a brittle and a robust model.

For LLM-based classifiers, test the rubric against human labels first. Shankar et al.'s findings on validator alignment apply directly: the LLM-as-a-judge (LaaJ) rubric or prompt is a downstream artifact of error analysis, not an upstream input^[2].

The minimum useful artifact is a stratified 500-item eval set with per-class precision, recall, and F1 reported. The calibration plot, with its ECE value, belongs in the release report alongside the confusion matrix.
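ECE can be computed directly from the predicted confidences and a correct/incorrect flag per item. The sketch below assumes equal-width bins with ten bins by default, which is a common choice but not the only one; the function name and toy inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - mean_conf)
    return ece

# Overconfident toy model: says 0.9 every time but is right 6 times out of 10.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)

# Calibrated toy model: says 0.8 and is right 8 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)

print(round(overconfident, 3), round(calibrated, 3))
```

The per-bin accuracy and mean confidence computed inside the loop are the same points a calibration plot draws, so the plot and the number can come from one pass over the data.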

Online eval design

Track the confidence distribution per class. A confidence histogram is more informative than a single calibration number.

Watch human-override rate when the classification is surfaced to a user who can disagree.

Compute per-class F1 drift over time from a continuously labeled subset. Alert on:

Per-class F1 drops greater than 5 points.
Confidence-distribution shifts, often the leading indicator of distribution drift before accuracy drops.
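Both alerts can be sketched in a few lines. The 5-point F1 threshold comes from the list above; measuring the confidence shift with total variation distance between histograms is an assumption made for this example, as are the function names and the toy numbers.

```python
def f1_drop_alerts(baseline_f1, current_f1, max_drop_points=5.0):
    """Return {class: drop in points} for classes whose F1 fell by more than the threshold."""
    alerts = {}
    for label, base in baseline_f1.items():
        drop = (base - current_f1.get(label, 0.0)) * 100  # F1 is stored as a fraction
        if drop > max_drop_points:
            alerts[label] = drop
    return alerts

def confidence_histogram(confidences, n_bins=10):
    """Normalized equal-width histogram of confidences in [0, 1]."""
    counts = [0] * n_bins
    for c in confidences:
        counts[min(int(c * n_bins), n_bins - 1)] += 1
    return [x / len(confidences) for x in counts]

def total_variation(p, q):
    """Total variation distance between two histograms: 0 is identical, 1 is disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy numbers (assumed): one class degrades, and half the traffic loses confidence.
baseline = {"billing": 0.90, "technical": 0.88, "account": 0.85}
current = {"billing": 0.89, "technical": 0.80, "account": 0.84}
alerts = f1_drop_alerts(baseline, current)

shift = total_variation(
    confidence_histogram([0.95] * 100),
    confidence_histogram([0.95] * 50 + [0.55] * 50),
)
print(sorted(alerts), shift)
```

The shift threshold that should trigger an alert is not given in the text and would need to be tuned on your own traffic.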

Top benchmarks

GLUE / SuperGLUE for NLP classification breadth; useful for model selection at the macro level.
Your domain labeled set, stratified by class, 500 items minimum: the eval that predicts production quality.
Cleanlab-detected label noise in your training and eval sets. The eval-set label-noise rate sets the ceiling on measurable accuracy: if 10% of eval labels are wrong, a model that is right on every item still scores only about 90%.
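The arithmetic behind the label-noise ceiling can be checked with a toy eval set. This sketch corrupts every tenth label deterministically, which is an assumption made for reproducibility; real label noise is rarely that regular.

```python
def corrupt_every_nth(labels, n, wrong_label_of):
    """Return a copy of labels with every n-th entry replaced by a wrong label."""
    return [wrong_label_of(l) if i % n == 0 else l for i, l in enumerate(labels)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def flip(label):
    """Swap the two toy classes."""
    return "ham" if label == "spam" else "spam"

# Ground truth the eval set was supposed to record (toy data, assumed).
true_labels = ["spam" if i % 4 == 0 else "ham" for i in range(1000)]

# The eval set as actually labeled: 10% of the labels are wrong.
noisy_eval_labels = corrupt_every_nth(true_labels, 10, flip)

# A model that is right on every single item.
perfect_predictions = list(true_labels)

measured = accuracy(noisy_eval_labels, perfect_predictions)
print(measured)
```

The model makes no mistakes, yet the scoreboard shows about 90%, so any target above that cannot be met by improving the model.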

Tools especially suited

scikit-learn: the standard metrics and the calibration plot.
Cleanlab: label-noise detection in the training set and the eval set; eval-set noise sets a ceiling on achievable accuracy.
Argilla: active labeling and managing the labeling queue at scale.
Inspect AI: sandboxed reproducible runs across releases.

Common pitfalls

Pitfall	Why it bites	Fix
Accuracy on imbalanced data	The trivial majority-class classifier wins	Use macro-F1 as the default
No per-class breakdown	Minority-class collapse is invisible; the aggregate looks fine while small classes silently degrade	Report per-class precision, recall, and F1
No calibration check	A classifier with high accuracy and poor calibration is dangerous when its confidence is used downstream	Put ECE in every release report
Skipping label-noise audits	If 10% of eval labels are wrong, even a fully correct model scores only about 90%, so you may chase impossible improvements	Audit label noise before trusting the score^[3]

Canonical reading

Eugene Yan's evals post is the canonical reading because it makes the "calibrate your eval bar to actual risk" argument operational, and it covers calibration alongside accuracy^[1]. Shankar et al.'s validator-alignment paper applies directly to LLM classifiers: the rubric is downstream of error analysis, not upstream^[2].

Day-1 starter

Stratified 500-item eval set with per-class precision, recall, and F1. Macro-F1 as the headline; per-class as the diagnostic.
Calibration plot, with its ECE value, on a held-out set. The plot belongs in every release report.
Confusion matrix in the release report. The matrix is where you find the patterns that the headline metrics hide.

Cross-cutting links

Constructing your eval dataset: stratification and class balance for the 500-item set.
Confidence intervals: Wilson and Clopper-Pearson: honest intervals on per-class F1 at small per-class N.
Golden dataset methodology: versioning the labeled set as the taxonomy evolves.
Giskard: automated scans for per-segment performance gaps.
Your first eval in 30 minutes: run this first; the confusion matrix it prints is the day-1 deliverable.
Adding error bars to your evals: macro-F1 deltas with error bars.