Reference-based metrics

BLEU, ROUGE, METEOR, chrF, BERTScore, COMET, and exact match. What each correlates with on modern systems and when it lies.

A reference-based metric compares a candidate output to one or more references by surface or semantic similarity. The category includes exact match, token F1, BLEU, ROUGE, METEOR, chrF, BERTScore, BLEURT, and COMET. None measures quality directly. They measure proximity to a reference, which is sometimes a good proxy and often a misleading one. If you have no references at all yet, whether to build them first is its own decision; "Evals without a gold dataset" covers that fork.

When reference-based works

Three task classes are reference-friendly. Extractive QA (exact match or token F1). Classification (accuracy, F1). Structured output where the schema is the contract (JSON validation, regex, SQL execution). OpenAI's evals cookbook calls logic-based checks against an ideal answer the simplest and most common type of eval, and recommends validating any model-graded eval against humans before scaling it ¹. The practical consequence: when the answer admits a programmatic checker, write the checker.
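A minimal sketch of what such a checker can look like for structured output, using only the Python standard library. The function names and the `required` mapping are illustrative assumptions, not part of any eval framework; a real program might use a full JSON Schema validator instead.

```python
import json
import re

def check_json_output(raw: str, required: dict[str, type]) -> bool:
    """Pass only if raw parses as a JSON object holding the required typed keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # A missing key yields None, which fails the isinstance check.
    return all(isinstance(obj.get(key), typ) for key, typ in required.items())

def check_regex_output(raw: str, pattern: str) -> bool:
    """Pass only if the whole output (ignoring outer whitespace) matches pattern."""
    return re.fullmatch(pattern, raw.strip()) is not None
```

Both checks return a plain pass or fail, so they can be averaged into an accuracy number without any similarity metric in the loop.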

Translation is the fourth, nuanced case. The WMT metrics shared task has moved decisively past BLEU: at WMT24, fine-tuned neural metrics (MetricX-24, XCOMET, COMET-22, BLEURT-20) topped the meta-evaluation, while BLEU, spBLEU, and chrF ranked 23rd, 22nd, and 20th of the 26 evaluated metrics ². The WMT24 general task itself now ranks systems with human error-span annotation, not BLEU ³. For new translation eval programs in 2026, default to COMET (reference-based) or COMETKiwi (reference-free) and keep BLEU as a regression check only.

When reference-based lies

Open-ended generation is where reference-based metrics fail loudly. Two outputs can mean the same thing with no token overlap; two outputs can share heavy token overlap and mean different things. ROUGE and BLEU were built for tasks where the reference is canonical and the surface form is constrained. Modern generative tasks violate both assumptions. Liu et al.'s G-Eval paper documents the breakdown on the SummEval benchmark: average Spearman correlation with human judgment is 0.192 for ROUGE-1, 0.205 for ROUGE-2, 0.165 for ROUGE-L, and 0.225 for BERTScore, while G-Eval with GPT-4 reaches 0.514 on the same data ⁴.
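The two failure modes are easy to reproduce with a toy overlap score. The sketch below is a simplified ROUGE-1-style unigram F1 (no stemming or tokenization rules, so not the official ROUGE implementation), and the three sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F1 between two whitespace-tokenized, lowercased strings."""
    c, r = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

reference = "the drug reduced mortality in the trial"
paraphrase = "fewer patients died when given this medication"  # same meaning
contradiction = "the drug increased mortality in the trial"    # opposite meaning

print(unigram_f1(paraphrase, reference))     # 0.0: no shared tokens
print(unigram_f1(contradiction, reference))  # about 0.86: six of seven tokens shared
```

The paraphrase scores zero and the contradiction scores near the top of the scale, which is the opposite of what a human rater would say.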

The deeper problem is that reference outputs themselves are often bad. Fabbri et al.'s SummEval human study found CNN/DailyMail reference summaries scored poorly on relevance, consistency, and coherence, and were outperformed by the outputs of T5, BART, and Pegasus; Yan's summarization-eval writeup walks through the case ⁵ ⁶. ROUGE on those references rewards systems that mimic the reference's flaws and penalizes systems that improve on them.

The rule. If the criterion is "does it match a reference exactly," use exact match or F1. If the criterion is "is it good," reference-based metrics are at best a noisy floor and need an LLM judge or human label as the primary signal.
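One way to test which side of the rule a metric falls on for a given task is to compute its rank correlation with human ratings on a small labeled sample, the same kind of statistic the SummEval numbers report. The sketch below is a dependency-free Spearman correlation with average ranks for ties; in practice a library routine such as SciPy's `spearmanr` would do the same job.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / (var_x * var_y) ** 0.5
```

Called as `spearman(metric_scores, human_scores)` over the same outputs, a value near the 0.2 range reported above suggests the metric should not be the primary signal for that task.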

The metric zoo

Metric	Type	Use case	Caveat
Exact match	deterministic	classification, extractive QA	brittle to whitespace, case
Token F1	deterministic	extractive QA	rewards padding
BLEU	n-gram precision	translation regression only	23rd of 26 at WMT24 metrics task
ROUGE-L	LCS overlap	legacy summarization	Spearman 0.165 vs humans on SummEval
METEOR	weighted unigram	translation	reference-bound; better than BLEU
chrF	character n-gram	translation	beats BLEU on morphology-rich languages; still lexical
BERTScore	embedding cosine	paraphrase tolerance	poor separation across systems
BLEURT	learned regression	translation, summarization	needs a trained model
COMET	learned regression	translation	the default modern MT metric
COMETKiwi	learned, ref-free	translation without references	enables QE-only setups

Token F1 in practice

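import re
from collections import Counter

def normalize(s: str) -> list[str]:
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", s.lower()).split()

def token_f1(pred: str, refs: list[str]) -> float:
    # Best token-level F1 of the prediction against any single reference.
    p = normalize(pred)
    best = 0.0
    for ref in refs:
        r = normalize(ref)
        # Multiset intersection: a shared token counts up to its minimum frequency.
        common = Counter(p) & Counter(r)
        n = sum(common.values())
        # No overlap, or an empty side, contributes nothing for this reference.
        if not n or not p or not r:
            continue
        prec, rec = n / len(p), n / len(r)
        best = max(best, 2 * prec * rec / (prec + rec))
    return best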

Two caveats. Token F1 is lenient toward padding: a longer output that contains the answer tokens still earns partial credit where exact match would give zero. It also penalizes paraphrase. Follow with an LLM judge for paraphrase tolerance on free-form short answers.
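To put numbers on the two caveats, the sketch below repeats the function so it runs standalone and adds a hypothetical `exact_match` helper for comparison. The example strings are invented.

```python
import re
from collections import Counter

def normalize(s: str) -> list[str]:
    return re.sub(r"[^\w\s]", "", s.lower()).split()

def token_f1(pred: str, refs: list[str]) -> float:
    p, best = normalize(pred), 0.0
    for ref in refs:
        r = normalize(ref)
        n = sum((Counter(p) & Counter(r)).values())
        if n:
            prec, rec = n / len(p), n / len(r)
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

def exact_match(pred: str, refs: list[str]) -> bool:
    # Hypothetical helper: exact match after the same normalization.
    return any(normalize(pred) == normalize(ref) for ref in refs)

padded = "The capital of France is Paris."
print(token_f1("Paris", ["Paris"]))   # 1.0
print(token_f1(padded, ["Paris"]))    # about 0.29 (2/7): partial credit for padding
print(exact_match(padded, ["Paris"])) # False
print(token_f1("myocardial infarction", ["heart attack"]))  # 0.0: correct paraphrase
```

The padded answer loses precision but keeps full recall, so it still scores; the paraphrase is right and scores nothing.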

COMET as the modern MT default

For translation eval in 2026, COMET is the default. It is a learned metric trained on direct-assessment human judgments from WMT shared tasks; it outputs a quality estimate per (source, hypothesis, reference) triple ⁷. In the WMT24 meta-evaluation, COMET-22 scored 0.688 average correlation across the six ranking tasks against BLEU's 0.589 ². COMETKiwi is the reference-free variant for QE-style evaluation when references are unavailable at scoring time; even without a reference it outscored chrF (0.640 vs 0.608) and BLEU in the same table ⁸ ².
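A sketch of scoring triples with the `unbabel-comet` Python package, following the usage pattern in that project's README. The checkpoint name `Unbabel/wmt22-comet-da`, the batch size, and both function names are assumptions made for illustration; the model download needs network access, so the COMET import is deferred until scoring is actually requested.

```python
def to_comet_inputs(sources: list[str], hypotheses: list[str],
                    references: list[str]) -> list[dict[str, str]]:
    """Build the list of src/mt/ref dicts that COMET's predict() consumes."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("sources, hypotheses and references must align 1:1")
    return [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]

def score_with_comet(samples: list[dict[str, str]],
                     model_name: str = "Unbabel/wmt22-comet-da",
                     gpus: int = 0):
    """Return (per-segment scores, system score). Requires `pip install unbabel-comet`."""
    # Imported lazily so the input-building helper works without the package.
    from comet import download_model, load_from_checkpoint
    model = load_from_checkpoint(download_model(model_name))
    output = model.predict(samples, batch_size=8, gpus=gpus)
    return output.scores, output.system_score
```

For a reference-free setup, the same flow would take a COMETKiwi checkpoint and samples without the `ref` key; check the model card for the exact checkpoint name and access terms before relying on it.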

The score scales are not comparable, and neither are the deltas. The WMT24 significance analysis found that on en-de, a BLEU improvement of 5.4 points is required for 80% confidence that human MQM scores actually differ (as high as 11 points on en-es), while a COMET-22 difference of 0.0043 carries the same confidence ². The folk rule that a gain of 0.5 or 2 BLEU is "a real improvement" does not survive contact with the human data.

If a system regresses on BLEU but holds COMET, the regression is a surface change and usually harmless. If it regresses on COMET but holds BLEU, the system is worse and BLEU did not catch it.
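The triage rule can be written down as a small function. The sketch below assumes that "regresses" means a drop at least as large as the en-de thresholds quoted above for 80% confidence; that reading, the constant names, and the return strings are illustrative choices, and other language pairs would need their own thresholds.

```python
# en-de thresholds quoted in the text (80% confidence that human MQM differs).
BLEU_MIN_DELTA = 5.4
COMET22_MIN_DELTA = 0.0043

def triage(bleu_delta: float, comet_delta: float) -> str:
    """Classify a change from (new score - old score) on BLEU and COMET-22."""
    bleu_drop = bleu_delta <= -BLEU_MIN_DELTA
    comet_drop = comet_delta <= -COMET22_MIN_DELTA
    if comet_drop and not bleu_drop:
        return "quality regression that BLEU missed"
    if comet_drop:
        return "quality regression"
    if bleu_drop:
        return "surface change, usually harmless"
    return "no reliable difference"
```

For example, `triage(-6.0, 0.0)` lands in the surface-change bucket, while `triage(-0.5, -0.01)` is flagged as a regression BLEU did not catch.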

Faithfulness via atomic facts

For long-form generation against a source (summarization, RAG, document QA), a hybrid pattern beats any single reference-based metric. Decompose the candidate into atomic facts, check each against the source, report the supported fraction. FActScore is the canonical implementation; RAGAS implements the same pattern for RAG outputs ⁹ ¹⁰. The decomposition handles paraphrase tolerance and partial coverage. The generation faithfulness chapter covers the pattern.
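The shape of the pattern fits in a few lines. The sketch below is not FActScore or RAGAS code: `decompose` and `is_supported` are pluggable callables that would normally wrap LLM calls, and the two toy stand-ins (sentence splitting and substring matching) exist only so the example runs without a model.

```python
import re
from typing import Callable

def supported_fraction(candidate: str, source: str,
                       decompose: Callable[[str], list[str]],
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of the candidate's atomic facts that the source supports."""
    facts = decompose(candidate)
    if not facts:
        return 0.0  # no checkable claims; scoring this as 0.0 is an arbitrary choice
    return sum(is_supported(fact, source) for fact in facts) / len(facts)

def split_sentences(text: str) -> list[str]:
    # Toy decomposer: one "fact" per sentence. A real one splits into atomic claims.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def substring_check(fact: str, source: str) -> bool:
    # Toy verifier: literal containment. A real one uses an entailment model or LLM.
    return fact.lower() in source.lower()

source = "The bridge opened in 1932. It spans 503 metres."
candidate = "The bridge opened in 1932. It spans 900 metres."
print(supported_fraction(candidate, source, split_sentences, substring_check))  # 0.5
```

Swapping the toy verifier for a model-based one is what buys the paraphrase tolerance; the aggregation stays the same.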

What to default to

The decision tree.

Extractive QA or classification: exact match or token F1.
Structured output: schema validation or regex; never a similarity metric.
Translation: COMET. Add COMETKiwi when references are sparse.
Summarization or RAG: atomic-fact decomposition plus an LLM judge.
Open-ended chat or creative: not reference-based. Use pairwise or LLM-as-Judge.
No references yet for a task that deserves them: build the small gold set first; "Evals without a gold dataset" covers the sequencing.

If a metric in the existing eval program does not match a case above and the team cannot articulate which problem it solves, the metric is decorative.

1. OpenAI Cookbook, getting-started evals.
2. Freitag et al., "Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task," WMT 2024.
3. Kocmi et al., "Findings of the WMT24 General Machine Translation Shared Task," WMT 2024.
4. Liu et al., G-Eval, EMNLP 2023.
5. Fabbri et al., "SummEval: Re-evaluating Summarization Evaluation," TACL 2021.
6. Eugene Yan, "Abstractive Summarization Eval."
7. Rei et al., "COMET: A Neural Framework for MT Evaluation," EMNLP 2020.
8. Rei et al., "CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task," WMT 2022.
9. Min et al., FActScore, 2023.
10. Es et al., RAGAS, 2023.