A reference-based metric compares a candidate output to one or more references by surface or semantic similarity. The category includes exact match, token F1, BLEU, ROUGE, METEOR, chrF, BERTScore, BLEURT, and COMET. None measures quality directly. They measure proximity to a reference, which is sometimes a good proxy and often a misleading one.
When reference-based works
Three task classes are reference-friendly. Extractive QA (exact match or token F1). Classification (accuracy, F1). Structured output where the schema is the contract (JSON validation, regex, SQL execution). OpenAI's evals cookbook recommends logic-based validation as the default first eval for exactly this reason: when the answer admits a programmatic checker, write the checker 1.
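As a minimal sketch of what "write the checker" can look like for structured output, the snippet below validates that a model response is a JSON object with the expected keys and types. The field names in `REQUIRED_FIELDS` and the function name `check_json_output` are hypothetical, chosen only for illustration; a real contract would come from the task's own schema.

```python
import json

# Hypothetical contract for illustration: required keys and their expected types.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def check_json_output(raw: str) -> bool:
    """Return True only if `raw` parses as a JSON object matching REQUIRED_FIELDS."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # Every required key must be present with a value of the expected type.
    return all(
        key in obj and isinstance(obj[key], expected_type)
        for key, expected_type in REQUIRED_FIELDS.items()
    )
```

The checker returns a hard pass or fail, so it can run on every sample without a judge model in the loop.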
Translation is a fourth, more nuanced case. WMT shared tasks have moved decisively from BLEU to chrF, BLEURT, COMET, and COMETKiwi as the metrics that best track human judgment; BLEU now sits at the bottom of the metric rankings despite being the field's reflex citation 2 3. For new translation eval programs in 2026, default to COMET (reference-based) or COMETKiwi (reference-free); use BLEU as a regression check only.
When reference-based lies
Open-ended generation is where reference-based metrics fail loudly. Two outputs can mean the same thing with no token overlap; two outputs can share heavy token overlap and mean different things. ROUGE and BLEU were built for tasks where the reference is canonical and the surface form is constrained. Modern generative tasks violate both assumptions. Liu et al.'s G-Eval paper documents the breakdown: ROUGE-style metrics correlate at Spearman around 0.2 to 0.3 with human judgment on summarization, while a calibrated LLM judge reaches 0.514 on the same data 4.
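The two failure modes can be reproduced with a toy overlap score. The sketch below uses a plain unigram F1 (a simplified stand-in, not ROUGE or BLEU themselves), and the three sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F1 on lowercased, whitespace-split tokens."""
    c = candidate.lower().split()
    r = reference.lower().split()
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)


# Invented example sentences.
reference = "the drug reduced mortality in the trial"
paraphrase = "fewer patients died when given this medication"  # same meaning
negation = "the drug increased mortality in the trial"         # opposite meaning

same_meaning_score = unigram_f1(paraphrase, reference)
opposite_meaning_score = unigram_f1(negation, reference)
```

The paraphrase shares no tokens with the reference and scores zero, while the negated sentence shares six of seven tokens and scores high.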
The deeper problem is that reference outputs themselves are often bad. Yan documents the CNN/DailyMail case where reference summaries score lower on human quality ratings than the outputs of T5, BART, and Pegasus 3. ROUGE on those references rewards systems that mimic the reference's flaws and penalizes systems that improve on them.
The rule. If the criterion is "does it match a reference exactly," use exact match or F1. If the criterion is "is it good," reference-based metrics are at best a noisy floor and need an LLM judge or human label as the primary signal.
The metric zoo
| Metric | Type | Use case | Caveat |
|---|---|---|---|
| Exact match | deterministic | classification, extractive QA | brittle to whitespace, case |
| Token F1 | deterministic | extractive QA | rewards padding |
| BLEU | n-gram precision | translation regression only | bottom of WMT 2024 |
| ROUGE-L | LCS overlap | legacy summarization | low human correlation on modern systems |
| METEOR | weighted unigram | translation | reference-bound; better than BLEU |
| chrF | character n-gram | translation | beats BLEU on morphology-rich languages |
| BERTScore | embedding cosine | paraphrase tolerance | poor separation across systems |
| BLEURT | learned regression | translation, summarization | needs a trained model |
| COMET | learned regression | translation | the default modern MT metric |
| COMETKiwi | learned, ref-free | translation without references | enables QE-only setups |
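The "brittle to whitespace, case" caveat on exact match is usually handled with a normalization step before comparison. The sketch below is one possible version; the helper names `normalize_answer` and `exact_match` are illustrative, and the right normalization depends on the task.

```python
import re


def normalize_answer(s: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", s.strip().lower())


def exact_match(pred: str, refs: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    target = normalize_answer(pred)
    return any(target == normalize_answer(ref) for ref in refs)
```

Normalization removes formatting noise only. A prediction such as "Paris, France" against the reference "paris" still fails, which is the intended behavior for a strict metric.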
Token F1 in practice
import re
from collections import Counter


def normalize(s: str) -> list[str]:
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", s.lower()).split()


def token_f1(pred: str, refs: list[str]) -> float:
    # Best token-level F1 of the prediction against any of the references.
    p = normalize(pred)
    best = 0.0
    for ref in refs:
        r = normalize(ref)
        # Multiset intersection: each shared token counts up to its minimum frequency.
        common = Counter(p) & Counter(r)
        n = sum(common.values())
        if not n or not p or not r:
            continue
        prec, rec = n / len(p), n / len(r)
        best = max(best, 2 * prec * rec / (prec + rec))
    return best
Two caveats. Token F1 rewards padding (longer outputs with the answer tokens score well). It also penalizes paraphrase. Follow with an LLM judge for paraphrase tolerance on free-form short answers.
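Both caveats show up in a small example. The sketch below redefines a single-reference token F1 so it runs on its own, then scores a hedged answer that lists several candidate years against a correct answer written out in words. The question and answers are invented for illustration.

```python
import re
from collections import Counter


def tokens(s: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", s.lower()).split()


def f1(pred: str, ref: str) -> float:
    """Token F1 of a prediction against a single reference."""
    p, r = tokens(pred), tokens(ref)
    n = sum((Counter(p) & Counter(r)).values())
    if n == 0:
        return 0.0
    prec, rec = n / len(p), n / len(r)
    return 2 * prec * rec / (prec + rec)


reference = "1867"
hedged = "It was 1867 or maybe 1868 or 1869"  # padded with extra guesses
paraphrase = "eighteen sixty-seven"            # correct, different surface form

hedged_score = f1(hedged, reference)
paraphrase_score = f1(paraphrase, reference)
```

The hedged answer earns partial credit (recall 1.0, precision 1/8) while the correct paraphrase scores zero, so the metric ranks them in the wrong order.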
COMET as the modern MT default
For translation eval in 2026, COMET is the default. It is a learned metric trained on direct-assessment human judgments from WMT shared tasks. It outputs a quality estimate per (source, hypothesis, reference) triple and correlates with human ratings at Pearson above 0.6 on most language pairs, while BLEU sits below 0.4 2. COMETKiwi is the reference-free variant for QE-style evaluation.
If a system regresses on BLEU but holds COMET, the regression is a surface change and usually harmless. If it regresses on COMET but holds BLEU, the system is worse and BLEU did not catch it.
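The triage rule above can be sketched as a small function over score deltas (new run minus baseline). The tolerance values and label strings are hypothetical placeholders; real thresholds depend on the COMET model, the test set size, and run-to-run noise.

```python
def triage(
    bleu_delta: float,
    comet_delta: float,
    bleu_tol: float = 0.5,     # hypothetical tolerance, in BLEU points
    comet_tol: float = 0.005,  # hypothetical tolerance, in COMET score units
) -> str:
    """Classify a change between two runs; deltas are new minus baseline."""
    if comet_delta < -comet_tol:
        # COMET dropped: treat as a real quality regression, whatever BLEU says.
        return "quality-regression"
    if bleu_delta < -bleu_tol:
        # BLEU dropped but COMET held: likely a surface-form change.
        return "surface-change"
    return "no-regression"
```

COMET is checked first so that a stable or improved BLEU can never mask a COMET drop.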
Faithfulness via atomic facts
For long-form generation against a source (summarization, RAG, document QA), a hybrid pattern beats any single reference-based metric. Decompose the candidate into atomic facts, check each against the source, report the supported fraction. FActScore is the canonical implementation; RAGAS implements the same pattern for RAG outputs 5 6. The decomposition handles paraphrase tolerance and partial coverage. The generation faithfulness chapter covers the pattern.
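The scoring step of this pattern reduces to a supported fraction over atomic facts. The sketch below assumes the facts have already been extracted and takes the verifier as a pluggable function. In practice both decomposition and verification are usually done by an LLM or an entailment model; the substring verifier here is a deliberately naive stand-in, and none of these names are the FActScore or RAGAS API.

```python
from typing import Callable


def supported_fraction(
    facts: list[str],
    source: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic facts that the verifier marks as supported by the source."""
    if not facts:
        return 0.0
    return sum(is_supported(fact, source) for fact in facts) / len(facts)


def naive_substring_verifier(fact: str, source: str) -> bool:
    # Toy stand-in for an LLM or NLI check: case-insensitive substring match.
    return fact.lower() in source.lower()


# Invented example data.
source_text = "The bridge opened in 1932. It spans the harbour and carries rail traffic."
atomic_facts = [
    "the bridge opened in 1932",
    "it spans the harbour",
    "it is painted red",  # not stated in the source
]
score = supported_fraction(atomic_facts, source_text, naive_substring_verifier)
```

Keeping the verifier as a parameter lets the same scoring code run with a cheap heuristic in unit tests and a model-backed check in the real eval.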
What to default to
The decision tree.
- Extractive QA or classification: exact match or token F1.
- Structured output: schema validation or regex; never a similarity metric.
- Translation: COMET. Add COMETKiwi when references are sparse.
- Summarization or RAG: atomic-fact decomposition plus an LLM judge.
- Open-ended chat or creative: not reference-based. Use pairwise or LLM-as-Judge.
If a metric in the existing eval program does not match a case above and the team cannot articulate which problem it solves, the metric is decorative.
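One way to keep an eval program honest about these defaults is to encode the decision tree as a lookup that fails loudly on unknown task types. The task keys and the `default_metrics` function below are hypothetical names for illustration; the metric lists mirror the defaults in this section.

```python
# Hypothetical task keys; metric names follow the defaults listed above.
DEFAULT_METRICS: dict[str, list[str]] = {
    "extractive_qa": ["exact_match", "token_f1"],
    "classification": ["exact_match", "token_f1"],
    "structured_output": ["schema_validation", "regex"],
    "translation": ["COMET"],
    "translation_sparse_refs": ["COMET", "COMETKiwi"],
    "summarization": ["atomic_fact_decomposition", "llm_judge"],
    "rag": ["atomic_fact_decomposition", "llm_judge"],
    "open_ended": ["pairwise", "llm_judge"],
}


def default_metrics(task: str) -> list[str]:
    """Return the default metrics for a task type, or raise if the task is unmapped."""
    if task not in DEFAULT_METRICS:
        raise ValueError(f"no default metric for task type {task!r}")
    return DEFAULT_METRICS[task]
```

Raising on an unmapped task forces the team to state which problem a new metric solves before it enters the program.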