What "good" looks like
The primary automatic metric is COMET (or its higher-fidelity variant XCOMET), supplemented with chrF for cheap regression detection. The primary human metric is MQM (Multidimensional Quality Metrics) or its 2024 successor ESA (Error Span Annotation), which WMT24 adopted as the human protocol[1]. BLEU is acceptable for trend detection because it is fast and stable, but the operational takeaway from the WMT findings is clear: do not ship a decision on BLEU alone in 2026[1]. The Applied LLMs writeup makes the broader case for combining automatic and human protocols on rotating panels[2].
The "good" target is COMET in the 0.85+ range on your domain pairs and MQM error-density below thresholds set per content category. The MQM panel is the source of truth; COMET is the gate that decides which releases are worth panel time.
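The gate-then-panel split can be sketched as two small checks. This is a minimal illustration, not a prescribed implementation: the severity weights (1 / 5 / 25), the per-1000-words normalization, the per-category thresholds, and every function name here are assumptions made for the example. Only the 0.85 COMET floor comes from the text above.

```python
from dataclasses import dataclass

# Assumed severity weights (a common MQM-style convention, not from the text).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}


@dataclass
class MqmError:
    severity: str  # "minor", "major", or "critical"
    category: str  # e.g. "terminology", "accuracy"


def mqm_error_density(errors, source_word_count, per_words=1000):
    """Weighted error penalty per `per_words` source words."""
    if source_word_count <= 0:
        raise ValueError("source_word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return penalty * per_words / source_word_count


def worth_panel_time(comet_score, comet_floor=0.85):
    """COMET as the gate: only releases at or above the floor go to the panel."""
    return comet_score >= comet_floor


def passes_panel(density, content_category, thresholds):
    """MQM as the source of truth: density must be under the category threshold."""
    return density < thresholds[content_category]
```

For example, one minor and one major error across 2000 source words gives a density of 3.0, which would pass a hypothetical marketing threshold of 5.0 but fail a hypothetical legal threshold of 2.0.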
Secondary metrics
Four secondary checks matter: terminology accuracy (domain glossaries enforced as substring or paraphrase matches); formality preservation; locale (en-US versus en-GB, pt-BR versus pt-PT); and gender-neutral handling for languages with grammatical gender. These four often drive customer escalations more than the headline COMET score does, because errors here are visible and category-defining.
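A glossary check of the substring kind can be sketched in a few lines. The function names and the glossary shape (source term mapped to one required target term) are assumptions for illustration. This handles only case-insensitive substring matches; paraphrase matching and inflected forms would need something more than this.

```python
def glossary_misses(source, hypothesis, glossary):
    """Glossary entries whose source term occurs in `source` but whose
    required target term is absent from `hypothesis` (case-insensitive)."""
    src = source.casefold()
    hyp = hypothesis.casefold()
    misses = []
    for src_term, tgt_term in glossary.items():
        if src_term.casefold() in src and tgt_term.casefold() not in hyp:
            misses.append((src_term, tgt_term))
    return misses


def glossary_miss_rate(segments, glossary):
    """Share of applicable glossary terms that were missed.

    `segments` is an iterable of (source, hypothesis) pairs; a term is
    counted at most once per segment.
    """
    applicable = 0
    missed = 0
    for source, hypothesis in segments:
        src = source.casefold()
        applicable += sum(1 for term in glossary if term.casefold() in src)
        missed += len(glossary_misses(source, hypothesis, glossary))
    return missed / applicable if applicable else 0.0
```

Tracking the miss rate per language pair, rather than in aggregate, keeps a regression in one locale from being averaged away.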
Offline eval design
Use WMT test sets for breadth across language pairs. On top of those, build a domain parallel corpus: 200 segments per language pair with gold human translations, ideally drawn from your actual product traffic distribution. Score COMET and XCOMET continuously. For each release, an MQM or ESA panel reviews 200 segments per priority language pair. Audit the COMET-to-MQM correlation on your domain annually; if the two have decoupled, retrain or recalibrate.
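The annual audit amounts to checking that segment-level COMET still tracks the panel's MQM penalties. Below is a minimal sketch using a hand-rolled Pearson correlation. The 0.3 cutoff and the function names are assumptions chosen for illustration, and a rank correlation such as Kendall's tau may be a better fit for your data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


def comet_mqm_decoupled(comet_scores, mqm_penalties, min_correlation=0.3):
    """Return (decoupled, r) for segment-level scores.

    MQM penalties are lower-is-better, so they are negated before
    correlating with COMET (higher-is-better). The 0.3 cutoff is an
    assumed placeholder; set it from your own historical audits.
    """
    r = pearson(comet_scores, [-p for p in mqm_penalties])
    return r < min_correlation, r
```

Run it over the same segments the panel annotated, per language pair, so a decoupling in one pair is not hidden by the others.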
CometKiwi (reference-free COMET) scores quality from the source and hypothesis alone, which makes it useful when a good translation may legitimately differ from the reference.
Online eval design
For systems where humans post-edit the model output, track TER (translation edit rate) between the model output and the human-edited version; this per-segment edit distance is the highest-signal production metric in this category. Alert on COMET drops by language pair (per-pair, not aggregate) and on terminology-glossary miss rates. For systems without post-editing, sample production output for human spot checks and feed flagged segments back to the MQM queue.
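The two production signals can be sketched as follows. This is a simplified stand-in: the edit rate below is a whitespace-tokenized word-level Levenshtein distance normalized by post-edit length, without the block-shift operation that full TER includes. The 0.01 alert margin and the function names are assumptions for illustration.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output, post_edited):
    """Edits needed to turn the MT output into the post-edit, per post-edit word."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        # Assumed convention: an empty post-edit counts as fully rewritten
        # unless the MT output was empty too.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)


def pairs_to_alert(current, baseline, max_drop=0.01):
    """Language pairs whose COMET fell more than `max_drop` below baseline."""
    return sorted(
        pair
        for pair, score in current.items()
        if pair in baseline and baseline[pair] - score > max_drop
    )
```

Comparing each pair against its own baseline is what makes the alert per-pair: a large drop in one low-traffic pair would barely move an aggregate score.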
Top benchmarks
- WMT24 test sets, https://aclanthology.org/2024.wmt-1.1/. The standard for shared-task evaluation; introduces ESA.
- FLORES-200. The standard multilingual evaluation set across 200 languages.
- Your domain parallel corpus. 200 segments per pair, hand-translated; the eval that predicts customer impact.
Tooling
COMET and XCOMET for automatic scoring with strong correlation to human judgment. sacreBLEU for reproducible BLEU and chrF (the reproducibility argument is the whole point of sacreBLEU; the metric itself is secondary). Internal MQM and ESA annotation tools or commercial vendors that support the protocol. Inspect AI for sandboxed reproducible runs across releases.
Common pitfalls
- BLEU-only reporting. BLEU correlates poorly with human judgment on modern systems; the WMT findings paper makes this case directly[1].
- Reference-based scoring when the output may legitimately differ. Use CometKiwi (reference-free) when the reference is plausibly not unique.
- Ignoring locale. Mixing en-US and en-GB in the eval set conflates two different products.
- Treating the MQM panel as optional. Automatic metrics are the gate; the panel is the truth. Without periodic human evaluation, COMET drift goes undetected[2].
Canonical reading
The WMT24 findings paper is the most useful single read because it documents both the shared-task methodology and the move to ESA, and it makes the practical argument about combining automatic and human metrics[1]. The Applied LLMs writeup on rotating panels covers the operational pattern for keeping human evaluation tractable[2]. Hamel's field guide framing about removing friction from looking at data applies directly: post-edit traces are your highest-signal production data[3].
Day-1 starter
- 200 segments per top-3 language pair as gold. Hand-translated, drawn from real production distribution.
- COMET and chrF in CI on every prompt or model change. chrF for fast regression, COMET for the decision.
- Quarterly MQM or ESA panel on 100 segments per priority pair. The panel is the truth; the automatic metrics are the gate.