What "good" looks like
The primary metric is macro-F1 on a stratified, balanced evaluation set, paired with a calibration check (Expected Calibration Error, ECE) on a held-out scoring set. Accuracy is misleading on imbalanced data because the trivial classifier that always predicts the majority class still scores high. Macro-F1 weights every class's F1 equally, which is the right framing when minority-class performance matters (and it almost always does in production).
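A minimal sketch of the accuracy-versus-macro-F1 gap, in plain Python. The label names and the 95/5 class split are made up for illustration; in practice scikit-learn's metrics would replace the hand-rolled helpers.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced data: 95 "ham" items and 5 "spam" items.
y_true = ["ham"] * 95 + ["spam"] * 5

# Trivial classifier: always predict the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")   # accuracy = 0.950
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")   # macro-F1 = 0.487
```

The trivial classifier keeps 95% accuracy while its macro-F1 falls below 0.5, because the "spam" class contributes an F1 of zero.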
The "good" target depends on the use case and the class set. Eugene Yan's guidance applies here: calibrate your eval bar to actual risk[1]. A spam classifier with 1% FPR and 95% recall on positives is shippable; a medical screening classifier needs much higher precision in the relevant class.
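One way to make a risk-calibrated bar concrete is a release gate on recall and false-positive rate. The sketch below uses the spam thresholds quoted above as defaults; the function names, the data, and the stricter alternative threshold are illustrative assumptions.

```python
def binary_rates(y_true, y_pred, positive="spam"):
    """Return (recall, false-positive rate) for the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr

def passes_gate(y_true, y_pred, min_recall=0.95, max_fpr=0.01, positive="spam"):
    """True when the classifier meets both bars; thresholds encode the risk."""
    recall, fpr = binary_rates(y_true, y_pred, positive)
    return recall >= min_recall and fpr <= max_fpr

# Hypothetical eval set: 1000 ham, 100 spam.
y_true = ["ham"] * 1000 + ["spam"] * 100
# 8 ham wrongly flagged, 96 spam caught, 4 spam missed.
y_pred = ["ham"] * 992 + ["spam"] * 8 + ["spam"] * 96 + ["ham"] * 4

print(binary_rates(y_true, y_pred))                # (0.96, 0.008)
print(passes_gate(y_true, y_pred))                 # True
print(passes_gate(y_true, y_pred, max_fpr=0.001))  # False under a stricter bar
```

The same predictions pass one gate and fail another; only the thresholds change, which is the point of tying the bar to the cost of each error type.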
Secondary metrics
- Confusion matrix patterns: the specific cells where errors concentrate.
- Per-segment fairness gaps: F1 by demographic or other slice.
- Latency: P95 per prediction.
- Cost versus a small-classifier baseline: an LLM classifier that costs 100x what a fine-tuned BERT costs has to clear a higher bar to justify the spend.
- Refusal rate: how often an LLM classifier that is asked to return a label refuses or returns "I don't know".
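Two of these are cheap to compute from logs. The sketch below assumes a hypothetical three-label set and treats any output that does not normalize to a valid label as a refusal; the sample outputs and latencies are made up.

```python
import statistics

LABELS = {"billing", "bug", "other"}  # hypothetical label set

def refusal_rate(raw_outputs, labels=LABELS):
    """Fraction of raw model outputs that are not a valid label."""
    invalid = sum(1 for out in raw_outputs if out.strip().lower() not in labels)
    return invalid / len(raw_outputs)

def p95(latencies_ms):
    """95th percentile: quantiles(n=20) yields 19 cut points, the last is P95."""
    return statistics.quantiles(latencies_ms, n=20)[-1]

outputs = ["billing", "Bug", "I don't know", "other", "I cannot classify this."]
latencies = [120, 135, 128, 410, 131, 125, 140, 122, 133, 127,
             129, 138, 124, 126, 900, 130, 132, 121, 136, 134]

print(f"refusal rate = {refusal_rate(outputs):.2f}")  # refusal rate = 0.40
print(f"P95 latency  = {p95(latencies):.1f} ms")
print(f"mean latency = {statistics.mean(latencies):.1f} ms")
```

Comparing the P95 with the mean shows why the tail is tracked separately: two slow calls barely move the average but dominate the percentile.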
Offline eval design
Start from a stratified labeled set, ideally balanced across classes; if class balance is impossible, weight the metrics so that small classes still count. Cross-validate when the eval set is small. Include an out-of-distribution (OOD) slice that the classifier was not explicitly trained on; OOD performance is often the difference between a brittle and a robust model. For LLM-based classifiers, test the LLM-as-a-judge (LaaJ) rubric or prompt against human labels first. Shankar et al.'s findings on validator alignment apply directly: the rubric itself is a downstream artifact of error analysis, not an upstream input[2].
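A balanced, stratified sample can be drawn in a few lines. This is a sketch under assumptions: the pool is a list of (text, label) pairs, the class sizes are invented, and a fixed seed keeps the sample reproducible across releases.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, per_class, seed=0):
    """Draw up to per_class items for each label from (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)
    sample = []
    for label in sorted(by_label):
        pool = by_label[label]
        # A class with fewer than per_class items contributes all it has.
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    rng.shuffle(sample)
    return sample

# Hypothetical imbalanced pool: 900 / 80 / 20 items across three classes.
pool = ([(f"text a{i}", "a") for i in range(900)]
        + [(f"text b{i}", "b") for i in range(80)]
        + [(f"text c{i}", "c") for i in range(20)])

eval_set = stratified_sample(pool, per_class=20)
print(Counter(label for _, label in eval_set))  # 20 items per class
```

When a class has fewer items than `per_class`, the sample is no longer balanced, which is the case where class-weighted metrics are needed.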
The minimum useful artifact is a stratified 500-item eval set with per-class precision, recall, and F1 reported. The ECE calibration plot belongs in the release report alongside the confusion matrix.
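For reference, ECE with equal-width confidence bins can be computed as below. This is a minimal sketch: the ten-bin default and the toy inputs are assumptions, and other binning schemes give somewhat different values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sum over bins of (share of items in bin) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(acc - mean_conf)
    return ece

# Overconfident: claims 0.9 confidence but is right only 6 times out of 10.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)

# Well calibrated: claims 0.8 confidence and is right 8 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)

print(f"overconfident ECE = {overconfident:.2f}")  # 0.30
print(f"calibrated ECE    = {calibrated:.2f}")     # 0.00
```

The per-bin accuracy and mean confidence computed inside the loop are the same pairs a calibration plot draws, so the plot and the ECE number can come from one pass over the data.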
Online eval design
- Confidence distribution per class; a confidence histogram is more informative than a single calibration number.
- Human-override rate when the classification is surfaced to a user who can disagree.
- Per-class F1 drift over time, computed from a continuously labeled subset.

Alert on per-class F1 drops greater than 5 points and on confidence-distribution shifts (often the leading indicator of distribution drift before accuracy drops).
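Both alerts can be sketched as simple comparisons against a baseline window. The 5-point F1 threshold comes from the text; using total variation distance between confidence histograms, and the 0.1 shift threshold, are assumptions chosen for illustration rather than a recommended standard.

```python
def f1_drop_alerts(baseline_f1, current_f1, max_drop=0.05):
    """Labels whose F1 fell by more than max_drop (5 points = 0.05)."""
    return sorted(
        label for label, base in baseline_f1.items()
        if base - current_f1.get(label, 0.0) > max_drop
    )

def histogram(confidences, n_bins=10):
    """Normalized equal-width histogram of confidences in [0, 1]."""
    counts = [0] * n_bins
    for c in confidences:
        counts[min(int(c * n_bins), n_bins - 1)] += 1
    return [n / len(confidences) for n in counts]

def total_variation(p, q):
    """Half the L1 distance between two histograms; 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical per-class F1 from the baseline week and the current week.
baseline = {"billing": 0.91, "bug": 0.84, "other": 0.78}
current = {"billing": 0.90, "bug": 0.77, "other": 0.74}
print(f1_drop_alerts(baseline, current))  # ['bug']

# Hypothetical confidence logs: mass moves from high to mid confidence.
baseline_conf = [0.95] * 80 + [0.65] * 20
current_conf = [0.95] * 50 + [0.65] * 50
shift = total_variation(histogram(baseline_conf), histogram(current_conf))
SHIFT_THRESHOLD = 0.1  # assumed value; tune on historical data
print(f"shift = {shift:.2f}, alert = {shift > SHIFT_THRESHOLD}")  # shift = 0.30, alert = True
```

The confidence check needs no labels, which is why it can fire before the labeled subset is large enough to show an F1 drop.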
Top benchmarks
- GLUE / SuperGLUE for NLP classification breadth; useful for model selection at the macro level.
- Your domain labeled set, stratified by class. 500 items minimum. The eval that predicts production quality.
- Cleanlab-detected label noise in your training and eval sets. Label noise caps the accuracy you can measure: if 10% of eval labels are wrong, even a model that is always right scores only about 90% against them.
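The label-noise ceiling can be checked with a small simulation: a model that always predicts the true label is scored against eval labels of which exactly 10% have been flipped. The class names, set size, and seeds are arbitrary.

```python
import random

def flip_labels(labels, noise_rate, classes, seed=0):
    """Return a copy of labels with exactly noise_rate of them changed to a wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

classes = ["a", "b", "c"]
rng = random.Random(1)
true_labels = [rng.choice(classes) for _ in range(1000)]

# The eval set we actually have: 10% of its labels are wrong.
noisy_labels = flip_labels(true_labels, 0.10, classes)

# A perfect model predicts the true label every time.
perfect_predictions = list(true_labels)
measured = sum(p == y for p, y in zip(perfect_predictions, noisy_labels)) / len(noisy_labels)
print(f"measured accuracy of a perfect model = {measured:.2f}")  # 0.90
```

A score above that figure on the noisy set would mean the model is reproducing some of the label errors, not that it is more accurate.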
Tooling
- scikit-learn for the standard metrics and the calibration plot.
- Cleanlab for label-noise detection in the training set and the eval set; eval-set noise sets a ceiling on achievable accuracy.
- Argilla for active labeling and for managing the labeling queue at scale.
- Inspect AI for sandboxed, reproducible runs across releases.
Common pitfalls
- Accuracy on imbalanced data. The trivial classifier wins; macro-F1 is the right default.
- No per-class breakdown. Without per-class numbers, minority-class collapse is invisible; the aggregate looks fine while the small classes silently degrade.
- No calibration check. A classifier with high accuracy and poor calibration is dangerous when its confidence is used downstream; the ECE belongs in every release report.
- Skipping label-noise audits. If 10% of eval labels are wrong, even a perfect model scores only about 90%; without an audit, you may chase improvements that are impossible to achieve[3].
Canonical reading
Eugene Yan's evals post is the canonical reading because it makes the "calibrate your eval bar to actual risk" argument operational, and it covers calibration alongside accuracy[1]. Shankar et al.'s validator-alignment paper applies directly to LLM classifiers: the rubric is downstream of error analysis, not upstream[2].
Day-1 starter
- Stratified 500-item eval set with per-class precision, recall, and F1. Macro-F1 as the headline; per-class as the diagnostic.
- Calibration plot with ECE on a held-out set. The plot belongs in every release report.
- Confusion matrix in the release report. The matrix is where you find the patterns that the headline metrics hide.
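A starter report covering the first and third items can be built from one confusion matrix. The sketch below is plain Python with made-up labels and predictions; scikit-learn's metrics would be the usual choice once it is a dependency.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_report(matrix, labels):
    """Precision, recall, and F1 per class, derived from the confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # rest of the row
        fp = sum(row[i] for row in matrix) - tp  # rest of the column
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

labels = ["billing", "bug", "other"]  # hypothetical classes
y_true = ["billing"] * 4 + ["bug"] * 4 + ["other"] * 2
y_pred = ["billing", "billing", "billing", "bug",
          "bug", "bug", "bug", "other",
          "other", "billing"]

matrix = confusion_matrix(y_true, y_pred, labels)
report = per_class_report(matrix, labels)
macro = sum(r["f1"] for r in report.values()) / len(report)

for label, row in zip(labels, matrix):
    print(f"{label:>8} {row}")
for label, r in report.items():
    print(f"{label:>8} P={r['precision']:.2f} R={r['recall']:.2f} F1={r['f1']:.2f}")
print(f"macro-F1 = {macro:.3f}")  # 0.667
```

Here the headline macro-F1 is 0.667, while the per-class rows show that "other" sits at 0.50 and the matrix shows where its errors go.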