Translation

Source language to target language. COMET and chrF for automatic scoring, MQM or ESA for human; not BLEU alone in 2026.

What "good" looks like

A translation eval pairs a trained automatic metric that gates releases with a human panel that decides ground truth. You run the expensive panel only on releases the cheap automatic gate says are worth the time.

The ship targets:

COMET in the 0.85+ range on your domain pairs.
MQM error-density below thresholds set per content category.

The primary automatic metric is COMET (or XCOMET, a later variant that also predicts error spans), supplemented with chrF for cheap regression detection. COMET is a trained metric, not a string-overlap score: it embeds the source, the hypothesis, and a reference together and predicts quality directly. Its models are trained on human judgments (Direct Assessments, human-mediated translation edit rate, and MQM annotations), and the framework set new correlation records on the WMT 2019 metrics shared task^[1]. Training on human judgments is why COMET keeps tracking human preference on modern systems where BLEU does not.
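
To make the "string overlap" contrast concrete, here is a minimal sketch of what a chrF-style score computes: character n-gram precision and recall against a single reference, combined into an F-score that favors recall. It is a simplification and not the sacreBLEU implementation; the function names, the whitespace stripping, and the defaults (`max_n=6`, `beta=2.0`) are assumptions for illustration. Use sacreBLEU for any number you report.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (an assumed simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1] against one reference."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this n-gram order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Nothing here looks at the source sentence or at meaning, which is why this family is cheap enough for every commit and too blunt to carry a ship decision.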

The primary human metric is MQM (Multidimensional Quality Metrics) or its 2024 successor ESA (Error Span Annotation), which WMT24 adopted as the human protocol. Professional annotators mark error spans in the output, then assign the segment a score^[2]. This panel is the source of truth; COMET is the gate that decides which releases are worth panel time.
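
The "MQM error-density" ship target can be computed from the panel's marked spans. The sketch below assumes a severity-weighted penalty normalized per 100 words; the weights, the `ErrorSpan` fields, and the normalization unit are illustrative assumptions, not values from the MQM or ESA specifications. Set them per content category, as the targets above require.

```python
from dataclasses import dataclass

# Assumed severity weights for illustration; calibrate per content category.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}


@dataclass
class ErrorSpan:
    segment_id: str
    category: str  # e.g. "accuracy/mistranslation" (hypothetical label)
    severity: str  # must be a key of SEVERITY_WEIGHTS


def error_density(spans: list[ErrorSpan], total_words: int, per: int = 100) -> float:
    """Severity-weighted penalty per `per` words of evaluated text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    penalty = sum(SEVERITY_WEIGHTS[span.severity] for span in spans)
    return penalty * per / total_words
```

For example, one minor and one major error over 300 words gives a penalty of 6, or 2.0 per 100 words under these assumed weights.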

BLEU is acceptable for trend detection because it is fast and stable, but the WMT findings are direct on the operational point: do not base a ship decision on BLEU alone in 2026^[2]. The Applied LLMs writeup makes the broader case for combining automatic and human protocols on rotating panels^[3].

flowchart TD
    SRC["Source segment"] --> MT["Model translation"]
    MT --> CHRF["chrF: fast regression check"]
    MT --> COMET["COMET / XCOMET vs reference"]
    COMET --> GATE{"COMET 0.85+ on domain pairs?"}
    CHRF --> GATE
    GATE -->|no| FIX["Fix the weaker pair, do not ship on BLEU alone"]
    GATE -->|yes| PANEL["MQM or ESA panel: mark error spans"]
    PANEL --> TRUTH{"MQM error-density below category threshold?"}
    TRUTH -->|no| FIX
    TRUTH -->|yes| SHIP["Ship"]
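
The two decision nodes in the flowchart can be sketched as a single function. This is a minimal illustration under stated assumptions: `release_decision`, its argument shapes, and the returned labels are hypothetical, the 0.85 floor is the ship target from this section, and the chrF regression check is left out for brevity.

```python
from typing import Optional

COMET_FLOOR = 0.85  # ship target from this section


def release_decision(
    comet_by_pair: dict[str, float],
    panel: Optional[dict[str, tuple[float, float]]] = None,
) -> tuple[str, list[str]]:
    """Return (action, offenders).

    comet_by_pair: language pair -> system-level COMET on the domain set.
    panel: content category -> (measured error density, category threshold),
           or None if the human panel has not run yet.
    """
    # Gate 1: every pair must clear the COMET floor before panel time is spent.
    weak_pairs = sorted(p for p, score in comet_by_pair.items() if score < COMET_FLOOR)
    if weak_pairs:
        return "fix", weak_pairs
    if panel is None:
        return "panel", []
    # Gate 2: error density must be strictly below each category threshold.
    over = sorted(cat for cat, (density, limit) in panel.items() if density >= limit)
    if over:
        return "fix", over
    return "ship", []
```

Returning the offending pairs or categories, and not just a boolean, keeps the "fix the weaker pair" step actionable.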

Secondary metrics

Metric	What it catches	How to score
Terminology accuracy	Off-glossary term choices	Domain glossaries enforced as substring or paraphrase matches
Formality preservation	Wrong register for the audience	Compare output register against the segment's required formality
Locale correctness	en-US versus en-GB, pt-BR versus pt-PT drift	Check output against the target locale, not just the language
Gender-neutral handling	Wrong gender in languages with grammatical gender	Verify neutral or correct forms where the source is ambiguous

These four often drive customer escalation more than the headline COMET score, because the errors are visible and category-defining.
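
The terminology row is the easiest of the four to automate. Below is a minimal sketch of the substring variant, assuming a glossary that maps each source term to one required target term; the function name and glossary shape are hypothetical. Plain substring matching will miss inflected forms in morphologically rich languages, which is where the paraphrase-match variant or lemmatization comes in.

```python
def glossary_misses(source: str, hypothesis: str, glossary: dict[str, str]) -> list[tuple[str, str]]:
    """Return (source_term, required_target_term) pairs where the source term
    occurs in the source but the required target term is absent from the output."""
    src = source.casefold()
    hyp = hypothesis.casefold()
    return [
        (src_term, tgt_term)
        for src_term, tgt_term in glossary.items()
        if src_term.casefold() in src and tgt_term.casefold() not in hyp
    ]


# Usage: an off-glossary choice ("Quittung" where the glossary requires "Rechnung").
glossary = {"invoice": "Rechnung"}
misses = glossary_misses("Download your invoice", "Laden Sie Ihre Quittung herunter", glossary)
```

Dividing the number of segments with at least one miss by the number of segments where a glossary term applies gives a per-pair miss rate.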

Offline eval design

WMT test sets give breadth across language pairs. WMT24 is the scale reference: 11 language pairs, test sets spanning three to five domains each, plus collected outputs from 8 LLMs and 4 online translation providers^[2]. The domain spread is the design lesson; a single-domain test set measures a narrower product than you ship.

On top of the shared-task sets, build a domain parallel corpus. Use 200 segments per language pair with gold human translations, ideally drawn from your actual product traffic distribution. Score COMET and XCOMET continuously.

Run a human panel per release. An MQM or ESA panel reviews 200 segments per priority language pair. ESA trades MQM's full error typology for span marking plus a scalar judgment, which keeps panels affordable per release^[2]. The COMET-to-MQM correlation on your domain is the relationship you should audit annually; if it has decoupled, retrain or recalibrate.
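
The correlation audit can be as simple as comparing this cycle's segment-level correlation with the one recorded when the gate was calibrated. The sketch below uses Pearson correlation for brevity; the function names and the `max_drop=0.1` tolerance are assumptions, and it expects panel scores oriented so that higher is better (negate MQM penalties first).

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def has_decoupled(comet: list[float], panel: list[float],
                  baseline_r: float, max_drop: float = 0.1) -> bool:
    """True if the COMET-to-panel correlation fell more than `max_drop`
    below the correlation recorded at calibration time."""
    return pearson(comet, panel) < baseline_r - max_drop
```

Run it per language pair on the same segments the panel scored; a correlation pooled across pairs can hide one pair that has drifted.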

Use CometKiwi where no reference exists. CometKiwi (reference-free COMET) scores quality from the source and hypothesis alone, so it suits content where a valid translation may legitimately differ from any single reference. Use it for production sampling, where no reference exists, and keep reference-based COMET for the release gate, where you control the test set.

Online eval design

For post-edit systems, track TER. Where humans post-edit the model output, track TER (translation edit rate) between the model output and the human-edited version, a setup often called HTER; the per-segment edit distance is the highest-signal production metric in this category.
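
A per-segment edit rate against the post-edited text can be sketched with a word-level Levenshtein distance. This is a simplification: TER proper also allows block shifts, which this sketch omits, so it can overstate the edit count when the post-editor only reorders words. The function names and whitespace tokenization are assumptions; sacreBLEU ships a TER implementation for numbers you intend to report.

```python
def word_edit_distance(hyp_tokens: list[str], ref_tokens: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            substitution = prev[j - 1] + (0 if hyp_tok == ref_tok else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits needed to turn the model output into the post-edited text,
    divided by the post-edited length in words (no shift operation)."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited text is empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

Logging this value per segment, with the language pair and content category attached, is what makes the post-edit stream searchable later.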

Alert per language pair, never on the aggregate. Alert on COMET drops by language pair and on terminology-glossary miss rates. A 0.02 drop concentrated in one pair disappears in the average.
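
The arithmetic behind that last sentence: with ten equally weighted pairs, a 0.02 drop in one pair moves the mean by 0.002. A minimal sketch of per-pair alerting follows; the function names and the `max_drop=0.01` alert threshold are assumptions chosen for illustration.

```python
from statistics import mean


def pairs_to_alert(baseline: dict[str, float], current: dict[str, float],
                   max_drop: float = 0.01) -> list[str]:
    """Language pairs whose COMET fell by more than `max_drop` versus baseline."""
    return sorted(
        pair for pair, score in current.items()
        if pair in baseline and baseline[pair] - score > max_drop
    )


def aggregate_drop(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Drop in the unweighted mean COMET over pairs present in both runs."""
    shared = baseline.keys() & current.keys()
    return mean(baseline[p] for p in shared) - mean(current[p] for p in shared)


# Ten pairs at 0.86; one pair regresses by 0.02.
targets = ["de", "fr", "es", "it", "pt", "ja", "ko", "zh", "ar", "hi"]
baseline = {f"en-{t}": 0.86 for t in targets}
current = dict(baseline, **{"en-ja": 0.84})
flagged = pairs_to_alert(baseline, current)
mean_drop = aggregate_drop(baseline, current)
```

Here `flagged` contains only `en-ja`, while `mean_drop` is about 0.002, well under the same threshold, so an aggregate-only alert would stay silent.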

For systems without post-editing, sample production segments for spot-check review and feed flagged segments back to the MQM queue.

Top benchmarks

WMT24 test sets, https://aclanthology.org/2024.wmt-1.1/. The standard for shared-task evaluation; 11 language pairs, multi-domain test sets, and the move to ESA.
FLORES-200. The standard multilingual evaluation set across 200 languages.
Your domain parallel corpus. 200 segments per pair, hand-translated; the eval that predicts customer impact.

Tools especially suited

COMET and XCOMET: automatic scoring with strong correlation to human judgment^[1].
sacreBLEU: reproducible BLEU and chrF (the reproducibility argument is the whole point of sacreBLEU; the metric itself is secondary).
MQM and ESA annotation tools: internal tools or commercial vendors that support the protocol.
Inspect AI: sandboxed reproducible runs across releases.

Common pitfalls

Pitfall	Why it bites	Fix
BLEU-only reporting	Poor correlation with human judgment on modern systems	Gate on COMET; the WMT findings paper makes this case directly^[2]
Reference-based scoring when the output may legitimately differ	Penalizes valid translations that diverge from the reference	Use CometKiwi (reference-free) when the reference is plausibly not unique
Ignoring locale	Mixing en-US and en-GB in the eval set conflates two different products	Split the eval set by target locale
Single-domain test sets	System rankings shift across content types	Go multi-domain, as WMT24 did
Treating the MQM panel as optional	Automatic metrics are the gate; without periodic human evaluation, COMET drift goes undetected	Keep the panel as the truth check^[3]

Canonical reading

The WMT24 findings paper is the most useful single read because it documents both the shared-task methodology and the move to ESA, and it makes the practical argument about combining automatic and human metrics^[2]. The COMET paper is the second read; it explains what the metric actually models and why it correlates where BLEU does not^[1]. Hamel's field guide framing about removing friction from looking at data applies directly: post-edit traces are your highest-signal production data^[4].

Day-1 starter

200 segments per top-3 language pair as gold. Hand-translated, drawn from real production distribution, with the domain mix you actually ship; support tickets and marketing copy fail differently.
COMET and chrF in CI on every prompt or model change. chrF for fast regression, COMET for the decision. Alert per pair, never on the aggregate.
Quarterly MQM or ESA panel on 100 segments per priority pair. The panel is the truth; the automatic metrics are the gate. Recheck the COMET-to-panel correlation each cycle.
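
For the CI step above, a per-pair COMET delta needs some estimate of noise before it blocks or passes a change. One common option is a paired bootstrap over per-segment scores: resample the segment-level deltas with replacement and count how often the candidate still wins. The sketch below is illustrative; the function name, `n_resamples`, and the fixed seed are assumptions, and the scores are whatever per-segment COMET values your pipeline already produces for the same segments under both systems.

```python
import random


def paired_bootstrap_win_rate(baseline_scores: list[float], candidate_scores: list[float],
                              n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which the candidate's total score
    beats the baseline's on the same resampled segments."""
    if len(baseline_scores) != len(candidate_scores) or not baseline_scores:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    wins = 0
    for _ in range(n_resamples):
        resample = rng.choices(deltas, k=len(deltas))
        if sum(resample) > 0:
            wins += 1
    return wins / n_resamples
```

A win rate near 0.5 means the delta is indistinguishable from noise at this sample size; with 200 segments per pair, small deltas will often land there.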

Cross-cutting links

Reference-based metrics: BLEU, chrF, and COMET as a metric family, and where each breaks.
Non-English RAG eval: multilingual eval pitfalls that transfer directly.
Inter-rater agreement: Cohen's kappa and Krippendorff's alpha: keeping the MQM panel honest.
Inspect AI: reproducible COMET runs across releases.
Adding error bars to your evals: deciding whether a COMET delta is real.