Summarization

Compress source documents into a faithful summary. Score it on faithfulness plus coverage, not ROUGE.

What "good" looks like

A summarization eval scores two things and rejects a third. Score faithfulness (every claim in the summary is entailed by the source) and coverage (every claim the reader needs is present). Reject ROUGE. The composite that matters is precision on claims times recall on must-include claims: a summary that invents nothing but omits the point is as useless as a complete summary that fabricates.

The ship targets on a domain set:

Faithfulness above 0.95, which is fewer than one unsupported claim per 20 claims scored.
Coverage above 0.85 of the labeled must-include claims.

The two trade off, so report both. A one-sentence summary can be perfectly faithful and still useless, because it covers almost none of the must-include claims.

flowchart TD
    SRC["Source document"] --> SUM["Model summary"]
    GOLD["Must-include claims, human-labeled"] --> R
    SUM --> P["Precision: is every claim entailed by the source?"]
    SUM --> R["Recall: are the must-include claims present?"]
    SRC --> P
    P --> GATE{"Faithfulness above 0.95 and coverage above 0.85?"}
    R --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix the weaker axis, never ROUGE"]
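
The gate in the flowchart can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `EvalResult` type and `ship_decision` function are hypothetical names, and "weaker axis" is read here as the axis with the larger shortfall against its own threshold, which is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the ship targets in this section.
FAITHFULNESS_MIN = 0.95
COVERAGE_MIN = 0.85


@dataclass
class EvalResult:
    faithfulness: float  # fraction of summary claims entailed by the source
    coverage: float      # fraction of must-include claims present


def ship_decision(result: EvalResult) -> str:
    """Return 'ship', or name the axis to fix first."""
    if result.faithfulness > FAITHFULNESS_MIN and result.coverage > COVERAGE_MIN:
        return "ship"
    # Assumption: the "weaker" axis is the one further below its own threshold.
    faith_gap = FAITHFULNESS_MIN - result.faithfulness
    cov_gap = COVERAGE_MIN - result.coverage
    return "fix faithfulness" if faith_gap >= cov_gap else "fix coverage"
```

Keeping both numbers in one result object makes it harder to report one axis without the other.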

Why not ROUGE. Eugene Yan is direct: "ROUGE and METEOR don't work" for abstractive summaries, because their correlation with human judgment is weak and they reward extractive copying^[1]. SummEval supplies the evidence, re-evaluating 14 automatic metrics against expert and crowd annotations over 23 summarization models on CNN/DailyMail; the weak metric-human correlations it documented are why the field moved on^[2]. The replacement is quantified too. G-Eval with GPT-4 reaches a Spearman correlation of 0.514 with human judgment on SummEval, ahead of every prior automatic metric, and still far enough from agreement that the judge needs calibration against your own labels^[3]. FActScore makes the precision side concrete at the atomic-claim level: its headline study found that only 58% of the atomic facts in ChatGPT-written biographies were supported, exactly the failure a holistic "looks good" check misses^[4].

Secondary metrics

Metric	What it catches	How to score
Hallucination rate	Claims not entailed by the source	NLI or LaaJ with the source attached
Coverage of key points	Missing must-include claims	Fraction of human-labeled claims present
Conciseness	Padding or over-compression	Summary length over source length, within a target band
Faithful citation	Misquoted or invented quotes	Required quotes must be substring-present in the source
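
The two mechanical rows of the table (conciseness and faithful citation) need no model call. The sketch below is one way to compute them. The function names are hypothetical, the default band of 0.05 to 0.25 is a placeholder rather than a recommendation, length is measured in whitespace-separated tokens, and whitespace is normalized before the substring check; all of these are assumptions.

```python
def conciseness_ratio(summary: str, source: str) -> float:
    """Summary length over source length, in whitespace-separated tokens."""
    source_len = len(source.split())
    if source_len == 0:
        raise ValueError("source is empty")
    return len(summary.split()) / source_len


def in_band(ratio: float, low: float = 0.05, high: float = 0.25) -> bool:
    """True when the ratio sits inside the target band (placeholder defaults)."""
    return low <= ratio <= high


def missing_quotes(quotes: list[str], source: str) -> list[str]:
    """Return the quotes that are not verbatim substrings of the source.

    Whitespace is collapsed on both sides so line wraps do not cause misses.
    """
    normalized_source = " ".join(source.split())
    return [q for q in quotes if " ".join(q.split()) not in normalized_source]
```

A non-empty result from `missing_quotes` marks a misquoted or invented quote; extracting the quoted spans from the summary is left to the caller.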

Offline eval design

Start on SummEval and CNN/DailyMail with NLI-based scoring. SummEval ships its annotations and unified model outputs, so it is the cheapest place to sanity-check a new scorer against human labels before you trust it^[2].

Then build the domain set that actually predicts whether you ship: 30 source documents, each with a human-extracted list of the 5 to 10 must-include claims any acceptable summary has to contain. Score faithfulness with a LaaJ that has the source in the prompt and a per-sentence rubric ("is this claim entailed by the source?"), and score coverage as the fraction of must-include claims that appear.
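
Both scores reduce to the same loop with the judge pointed in different directions: faithfulness checks summary claims against the source, coverage checks must-include claims against the summary. The sketch below assumes the summary has already been split into claims and takes the judge as an injected callable. The `Entails` signature and `toy_entails` are hypothetical; in practice the callable would wrap an NLI model or a LaaJ call.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (claim, text) -> True if the text entails the claim.
Entails = Callable[[str, str], bool]


def faithfulness(summary_claims: Sequence[str], source: str, entails: Entails) -> float:
    """Precision: fraction of summary claims entailed by the source."""
    if not summary_claims:
        raise ValueError("summary has no claims to score")
    supported = sum(1 for claim in summary_claims if entails(claim, source))
    return supported / len(summary_claims)


def coverage(must_include: Sequence[str], summary: str, entails: Entails) -> float:
    """Recall: fraction of must-include claims that the summary states."""
    if not must_include:
        raise ValueError("no must-include claims labeled")
    present = sum(1 for claim in must_include if entails(claim, summary))
    return present / len(must_include)


def toy_entails(claim: str, text: str) -> bool:
    """Stand-in judge for wiring tests only: case-insensitive substring match.

    This is NOT an entailment check; replace it with an NLI model or LaaJ.
    """
    return claim.lower() in text.lower()
```

Injecting the judge keeps the scoring arithmetic testable without a model, and lets the same functions run with a cheap NLI scorer offline and a LaaJ in production.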

On scorers, Ragas faithfulness is the most widely used and carries over well from the RAG setting^[5]. FActScore is the more rigorous atomic-claim decomposition when the budget allows: its automated estimator reproduces the human-labeled score with an error rate under 2%, and one automated pass over 6,500 generations replaced roughly $26K of human evaluation^[4]. That ratio is the argument for automating precision and spending the human budget on coverage labels.

Online eval design

Sample production summaries and run a faithfulness LaaJ with the source attached. Alert on hallucination-rate drift above 2 percentage points and on coverage drops against continuously sampled must-include claims. If summaries reach users, behavioral feedback (saves, shares, edits) is the highest-signal product metric; pair it with the faithfulness number, because a fluent summary can fabricate its way to user approval.
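
The drift alert can be sketched as a comparison between a baseline window and the current sample. The function names are illustrative, and treating drift as one-sided (only an increase in hallucination rate fires) is an assumption; a two-sided check would compare the absolute difference instead.

```python
def hallucination_rate(verdicts):
    """Fraction of sampled claims judged not entailed (True means entailed)."""
    if not verdicts:
        raise ValueError("no verdicts in sample")
    return sum(1 for v in verdicts if not v) / len(verdicts)


def drift_alert(baseline_rate, current_rate, threshold_pp=2.0):
    """True when the rate has risen by more than threshold_pp percentage points.

    Rates are fractions in [0, 1]; the difference is converted to points.
    """
    return (current_rate - baseline_rate) * 100 > threshold_pp
```

With small samples the rate is noisy, so the sample size per window needs to be large enough that a 2-point move is not within sampling error.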

Top benchmarks

SummEval, https://arxiv.org/abs/2007.12626. Expert and crowd annotations over 23 models; the metric-validation reference and the starting point^[2].
CNN/DailyMail with NLI scoring. Breadth on news; the NLI methodology generalizes.
Your domain set with must-include claims. The eval that predicts whether you ship.

Tools especially suited

Ragas faithfulness: the precision-on-claims score, the default scorer^[5].
FActScore: atomic-claim decomposition when the budget allows^[4].
Langfuse: production trace logging with annotation queues.
Inspect AI: sandboxed, reproducible runs.

Common pitfalls

Pitfall	Why it bites	Fix
Optimizing ROUGE	Rewards extractive copy, pushing summaries longer and less abstractive	Score faithfulness and coverage instead^[1]^[2]
Faithfulness without coverage	A one-sentence summary scores perfectly and says nothing	Report coverage as the counterweight
No human gold	Without labeled must-include claims there is no coverage number, so the "faithful but empty" failure is invisible	Label 5 to 10 must-include claims per source
Judge without the source	It cannot check entailment against text it never saw	Attach the source to every faithfulness call
Trusting the judge uncalibrated	Even G-Eval with GPT-4 reaches only 0.514 Spearman with human judgment on SummEval	Calibrate against a domain expert^[3]

Canonical reading

Eugene Yan's "Evaluation and Hallucination Detection for Abstractive Summaries" is the most useful single read: it makes the ROUGE-does-not-work argument operational and walks through NLI-based scoring as the alternative^[1]. His follow-up on abstractive summary evaluation extends the method^[6]. Read both alongside the FActScore paper for atomic-claim decomposition^[4] and the RAGAS paper for the faithfulness implementation^[7].

Day-1 starter

Label 30 source documents with their must-include claim lists. The list is the eval. Budget half a day of expert time; it is the only labeling this starter needs.
Stand up an NLI or LaaJ faithfulness scorer with the source attached, scoring per-sentence claims. Sanity-check it against SummEval labels before trusting it on your domain.
Report coverage recall (the percent of must-include claims hit) next to faithfulness on every release. Reported together, they expose the trade-off between the two axes; either alone rewards the wrong behavior.

Cross-cutting links

Faithfulness and atomic facts: the entailment-based scoring methodology in depth.
The four-part judge prompt: structuring the source-attached faithfulness judge.
Reference-based metrics: what ROUGE actually measures and where it fails.
Ragas: the faithfulness scorer this playbook defaults to.
RAG eval with Ragas: RAGAS faithfulness wired end to end.
LLM-judge calibration loop: calibrating the faithfulness judge before trusting its drift alerts.