What "good" looks like
The primary metric is faithfulness plus coverage, scored either by an NLI-based entailment check or by an LLM-as-a-judge (LaaJ) with the source attached. ROUGE is the wrong metric. Eugene Yan is direct: "ROUGE and METEOR don't work" for evaluating abstractive summaries; the correlation with human judgment is weak and the metric rewards extractive copy[1]. The right composite is precision-on-claims (every claim in the summary is supported by the source) times recall-on-must-includes (every claim the user needs is in the summary). FActScore from Min et al. operationalizes the precision side at the level of fine-grained atomic claims[2].
The "good" target is faithfulness above 0.95 on a domain set (fewer than one unsupported claim in every 20 summary claims) and coverage of the labeled must-include claims above 0.85. The two numbers trade off; a one-sentence summary can be perfectly faithful and useless.
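The composite and the two thresholds can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `SummaryJudgment` container and the function names are hypothetical, and the per-claim verdicts are assumed to come from whatever NLI or LaaJ scorer you already run.

```python
from dataclasses import dataclass


@dataclass
class SummaryJudgment:
    # One verdict per claim in the summary: True if the source supports it.
    claim_supported: list[bool]
    # One verdict per labeled must-include claim: True if the summary contains it.
    must_include_hit: list[bool]


def faithfulness(j: SummaryJudgment) -> float:
    """Precision on claims: supported summary claims / all summary claims."""
    if not j.claim_supported:
        return 1.0  # assumption: a summary with no claims has nothing unsupported
    return sum(j.claim_supported) / len(j.claim_supported)


def coverage(j: SummaryJudgment) -> float:
    """Recall on must-includes: claims hit / all labeled must-include claims."""
    if not j.must_include_hit:
        raise ValueError("coverage needs at least one labeled must-include claim")
    return sum(j.must_include_hit) / len(j.must_include_hit)


def composite(j: SummaryJudgment) -> float:
    """Precision-on-claims times recall-on-must-includes."""
    return faithfulness(j) * coverage(j)


def meets_target(j: SummaryJudgment,
                 min_faithfulness: float = 0.95,
                 min_coverage: float = 0.85) -> bool:
    """Both thresholds must be cleared; neither number is sufficient alone."""
    return faithfulness(j) > min_faithfulness and coverage(j) > min_coverage


# A one-sentence summary: its single claim is supported,
# but it hits only 1 of 5 must-include claims.
terse = SummaryJudgment(claim_supported=[True],
                        must_include_hit=[True, False, False, False, False])
print(faithfulness(terse), coverage(terse), meets_target(terse))
```

The terse example is perfectly faithful yet fails the target, because the low coverage drags the composite down.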
Secondary metrics
- Hallucination rate. The share of claims in the summary that are not entailed by the source.
- Coverage of key points. The proportion of human-labeled must-include claims that appear in the summary.
- Conciseness. Summary length divided by source length, checked against a target range.
- Faithful citation. When quotes are required, every quote in the summary must appear verbatim as a substring of the source.
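The conciseness and citation checks need no model at all. The sketch below is illustrative only: the function names are hypothetical, length is counted in whitespace tokens, the 0.05 to 0.25 target range is a placeholder to replace with your own, and quotes are assumed to be delimited by straight double quotes.

```python
import re


def compression_ratio(summary: str, source: str) -> float:
    """Summary length divided by source length, in whitespace tokens."""
    source_len = len(source.split())
    if source_len == 0:
        raise ValueError("source is empty")
    return len(summary.split()) / source_len


def within_target(ratio: float, low: float = 0.05, high: float = 0.25) -> bool:
    # Placeholder bounds; pick the range that fits your product.
    return low <= ratio <= high


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def unsupported_quotes(summary: str, source: str) -> list[str]:
    """Return quoted spans in the summary that are not substrings of the source."""
    quotes = re.findall(r'"([^"]+)"', summary)
    normalized_source = _normalize(source)
    return [q for q in quotes if _normalize(q) not in normalized_source]


source = 'The CEO said "we expect to close by June" in a statement on Monday.'
summary = 'The CEO said "we expect to close by June" and called it "a historic deal".'
print(unsupported_quotes(summary, source))
```

Any non-empty result from `unsupported_quotes` is a fabricated quote and can be treated as a hard failure rather than a score.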
Offline eval design
Start with SummEval, and use CNN/DM with NLI-based scoring for general news summarization. On top of that, build a domain set of 30 source documents, each with an extracted must-include claim list (a human reads each source and lists the 5 to 10 claims any acceptable summary must contain). Score faithfulness with a LaaJ: attach the source to the prompt and have the rubric ask, per sentence, "is this claim entailed by the source?" Score coverage as the fraction of must-include claims that appear in the summary.
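A rough sketch of that scoring loop follows. Everything here is an assumption made for illustration: `Judge` is a hypothetical callable that wraps whichever model you use and returns its raw text answer, the rubric wording is a placeholder, and the sentence splitter is deliberately naive.

```python
import re
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the model's text answer.
Judge = Callable[[str], str]

FAITHFULNESS_RUBRIC = (
    "SOURCE:\n{source}\n\n"
    "CLAIM:\n{claim}\n\n"
    "Is this claim entailed by the source? Answer YES or NO."
)

COVERAGE_RUBRIC = (
    "SUMMARY:\n{summary}\n\n"
    "CLAIM:\n{claim}\n\n"
    "Does the summary state this claim? Answer YES or NO."
)


def _says_yes(answer: str) -> bool:
    return answer.strip().upper().startswith("YES")


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., ! and ?; swap in a real sentence segmenter if needed."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_summary(source: str, summary: str,
                  must_include: list[str], judge: Judge) -> dict[str, float]:
    """Per-sentence faithfulness (source attached) and must-include coverage."""
    sentences = split_sentences(summary)
    supported = [
        _says_yes(judge(FAITHFULNESS_RUBRIC.format(source=source, claim=s)))
        for s in sentences
    ]
    hit = [
        _says_yes(judge(COVERAGE_RUBRIC.format(summary=summary, claim=c)))
        for c in must_include
    ]
    return {
        # Assumption: an empty summary has no unsupported claims.
        "faithfulness": sum(supported) / len(supported) if supported else 1.0,
        "coverage": sum(hit) / len(hit) if hit else 0.0,
    }
```

Passing the judge in as a callable keeps the loop testable with a stub and lets you swap an NLI model for the LaaJ without touching the scoring code.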
Ragas faithfulness is the most-used scorer in this category, and it generalizes well from the RAG context it was built for[3]. FActScore is the more rigorous atomic-claim decomposition when you have the budget for it[2].
Online eval design
Sample summaries from production. Run a LaaJ faithfulness pass with the source attached. Alert when the hallucination rate drifts by more than 2 percentage points, and when coverage drops on a continuously sampled subset that has been labeled with must-include claims. If you ship summaries to users, the explicit feedback rate (saves, shares, edits) is the highest-signal product metric; pair it with the offline faithfulness number to detect when a fluent summary is fabricating its way to user approval.
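The alerting rule can be sketched as a comparison between a baseline window and the current window. The function and key names are hypothetical, rates are assumed to be fractions in [0, 1], drift is read here as a rise in hallucination rate, and the coverage threshold is a placeholder because only the 2-point hallucination threshold is specified above.

```python
def percentage_point_change(baseline: float, current: float) -> float:
    """Difference between two rates in [0, 1], expressed in percentage points."""
    return (current - baseline) * 100.0


def check_drift(baseline: dict[str, float],
                current: dict[str, float],
                max_hallucination_rise_pp: float = 2.0,
                max_coverage_drop_pp: float = 2.0) -> list[str]:
    """Return one alert message per breached threshold (empty list if healthy)."""
    alerts = []

    rise = percentage_point_change(baseline["hallucination_rate"],
                                   current["hallucination_rate"])
    if rise > max_hallucination_rise_pp:
        alerts.append(f"hallucination rate up {rise:.1f} pp")

    # Coverage threshold is an assumed placeholder, not a recommended value.
    drop = -percentage_point_change(baseline["coverage"], current["coverage"])
    if drop > max_coverage_drop_pp:
        alerts.append(f"coverage down {drop:.1f} pp")

    return alerts


baseline = {"hallucination_rate": 0.03, "coverage": 0.90}
this_week = {"hallucination_rate": 0.06, "coverage": 0.89}
print(check_drift(baseline, this_week))
```

Comparing in percentage points rather than relative change keeps the alert stable when the baseline rate is already small.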
Top benchmarks
- SummEval. Starting point for general summarization; build a domain set in addition.
- CNN/DM with NLI scoring. Useful for breadth on news; the NLI methodology generalizes.
- Your domain set with must-include claims. The eval that predicts whether you ship.
Ragas faithfulness for the precision-on-claims score[3]. FActScore for atomic-claim decomposition when the budget allows[2]. Langfuse for production trace logging with annotation queues. Inspect AI for sandboxed reproducible runs.
Common pitfalls
- Optimizing ROUGE. ROUGE rewards extractive copy; optimizing it reward-hacks the model toward longer, less abstractive summaries. Yan's blunt summary applies: ROUGE does not work[1].
- Faithfulness without coverage. A one-sentence summary can be perfectly faithful and useless; coverage is the counterweight.
- No human gold. Without labeled must-include claims, you cannot compute coverage; without coverage, you cannot detect the "faithful but empty" failure mode.
- Scoring summaries with a judge that does not have the source attached. The judge has to see what the summarizer saw, or it cannot score entailment.
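To see why lexical overlap is a poor proxy for faithfulness, consider a toy unigram-overlap score. This is a simplified F1 in the spirit of ROUGE-1, written from scratch for illustration; it is not the official ROUGE implementation (no stemming, no tokenizer beyond whitespace), and the example sentences are invented.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified unigram-overlap F1 with clipped counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # min count per shared token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported a profit of 5 million dollars in 2023"

# Near-verbatim copy with one flipped fact: wrong, but lexically close.
copied_but_wrong = "the company reported a loss of 5 million dollars in 2023"

# Faithful abstractive paraphrase: right, but lexically distant.
faithful_paraphrase = "in 2023 the firm earned 5 million dollars in profit"

print(rouge1_f1(reference, copied_but_wrong))
print(rouge1_f1(reference, faithful_paraphrase))
```

The copied sentence with the wrong fact scores higher than the faithful paraphrase, which is the behavior an optimizer will learn to exploit.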
Canonical reading
Eugene Yan's "Evaluation and Hallucination Detection for Abstractive Summaries" is the most useful single reading because it makes the ROUGE-does-not-work argument operational and walks through NLI-based scoring as the alternative[1]. His follow-up on abstractive summary evaluation extends the methodology[4]. Read with the FActScore paper for atomic-claim decomposition[2] and the RAGAS paper for the faithfulness metric implementation[5].
Day-1 starter
- 30 source documents with hand-labeled must-include claim lists. The list is the eval; without it, you cannot compute coverage.
- NLI-based or LaaJ faithfulness scorer. Precision over per-sentence claims. Score per release.
- Track coverage recall (the percent of must-include claims hit) alongside faithfulness. The two together force a useful Pareto frontier; either alone rewards the wrong behavior.
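One way to read the two numbers together is to keep only the releases that no other release beats on both axes. The sketch below assumes each release has already been scored; the function name, release names, and scores are made up for illustration.

```python
def pareto_frontier(releases: dict[str, tuple[float, float]]) -> list[str]:
    """Return releases not dominated on (faithfulness, coverage).

    A release is dominated if another release is at least as good on both
    metrics and strictly better on at least one.
    """
    frontier = []
    for name, (faith, cov) in releases.items():
        dominated = any(
            other_faith >= faith and other_cov >= cov
            and (other_faith > faith or other_cov > cov)
            for other, (other_faith, other_cov) in releases.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented scores: (faithfulness, coverage) per release.
scores = {
    "v1": (0.99, 0.40),  # terse and safe
    "v2": (0.96, 0.88),
    "v3": (0.94, 0.85),  # worse than v2 on both axes
    "v4": (0.90, 0.95),  # thorough but less faithful
}
print(pareto_frontier(scores))
```

A release that falls off the frontier regressed on at least one axis without gaining on the other, which is a useful gate before comparing against the absolute targets.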