"Does the answer hallucinate?" is the wrong question. A four-sentence answer can have one correct citation, two supported claims, and one fabricated fact. Asking a binary judge to summarize the whole answer as faithful or not produces a Likert-flavored eyeball score with no actionable signal. The fix that actually works in production is to decompose the answer into atomic facts and judge each one against the retrieved context. The two papers that anchor the practice are FActScore and SAFE; the implementation that ships is the Ragas faithfulness metric.
Why the single-judgment frame fails
A judge prompt that reads "rate this answer's faithfulness on a scale of 1 to 5" produces three problems. First, the rater (human or LLM) anchors on the longest or most fluent passage and skims the rest, missing the one wrong sentence. Second, the score is a weighted average over claims of different importance, and you cannot recover which claim drove the number. Third, the score is unstable on re-rating: the same answer rated twice can differ by a point or two on the scale, which is enough to drown small regressions.
The FActScore paper made the case empirically: when the authors decomposed long-form biographies into atomic facts and verified each, the per-fact verification rate (the headline FActScore) correlated tightly with hand-judged factuality, and the breakdown made every failure case attributable to specific claims [1]. The same pattern shows up at scale in SimpleQA's short-form factuality work and in HalluLens' hallucination taxonomy [2] [3]. The atomic-fact decomposition is the unit of analysis.
What an atomic fact is
An atomic fact is a claim that can be verified against the context independently of the rest of the answer. The rule of thumb: a single subject, a single predicate, no compound clauses, no implicit assumptions. "The cookies bake at 350 degrees for 8 to 10 minutes and should be cooled on a rack" is three facts: the bake temperature, the bake time, and the cooling instruction. "Italy's capital is Rome, which has a population of about 2.8 million" is two facts: the capital and the population.
The decomposition is not always automatic. Compound sentences split naturally; noun phrases with a relative clause ("the Renaissance, which began in Florence") split on the relative clause; numeric ranges and dates stay together as one claim. A few minutes of practice on real outputs gives you a reliable hand, and a good LLM prompt with three or four worked examples produces decompositions that humans accept ninety-percent-plus of the time.
# Literal JSON braces are doubled so ATOMIC_FACT_PROMPT.format(answer=...) fills only {answer}.
ATOMIC_FACT_PROMPT = """\
Decompose the following answer into atomic facts. Each fact must:
- Be a single declarative sentence.
- Have one subject and one predicate.
- Be independently verifiable.
- Not depend on any other fact in the list.

Return JSON: {{"facts": ["...", "..."]}}.

Examples:

Answer: "Italy's capital is Rome, with about 2.8 million residents."
Output: {{"facts": ["Italy's capital is Rome.", "Rome has about 2.8 million residents."]}}

Answer: "Bake at 350 degrees F for 8 to 10 minutes."
Output: {{"facts": ["The bake temperature is 350 degrees F.", "The bake time is 8 to 10 minutes."]}}

Answer: {answer}
"""
The model picks reasonable boundaries most of the time. Have a human spot-check the first hundred examples to catch over-splitting (the model breaks "8 to 10 minutes" into two facts) and under-splitting (the model leaves a compound claim intact). After the first round the prompt stabilizes.
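The spot-check goes faster if obviously suspect facts are flagged before a human sees them. The sketch below is one hypothetical way to do that: it parses the decomposer's JSON reply and marks facts that contain a conjunction or semicolon as possible under-splits. The function name `screen_facts` and the heuristic itself are assumptions for illustration, not part of FActScore, SAFE, or Ragas, and the heuristic does nothing for over-splitting.

```python
import json
import re

# Hypothetical heuristic: conjunctions and semicolons often signal a compound claim.
_COMPOUND = re.compile(r"\b(and|but|while|whereas)\b|;", re.IGNORECASE)


def screen_facts(raw_reply: str) -> list[dict]:
    """Parse the decomposer's JSON reply and flag likely under-split facts."""
    facts = json.loads(raw_reply)["facts"]
    screened = []
    for fact in facts:
        text = fact.strip()
        if not text:
            continue  # skip empty strings, if any
        screened.append({
            "fact": text,
            "needs_review": bool(_COMPOUND.search(text)),
        })
    return screened


reply = '{"facts": ["Rome is the capital of Italy.", "Bake at 350 degrees F and cool on a rack."]}'
screened = screen_facts(reply)
for row in screened:
    print(row)
```

A flag here only routes the fact to a reviewer; plenty of legitimate single facts contain "and", so treat it as a triage signal, not a verdict.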
Per-fact verification
Each atomic fact is judged on its own against the retrieved context. Three labels are useful: SUPPORTED (the context entails the fact), UNSUPPORTED (the context does not entail the fact), and CONTRADICTED (the context entails the opposite). Many implementations collapse UNSUPPORTED and CONTRADICTED into a single failure label (SAFE, for instance, uses a single "not supported" label [4]); keeping them separate enables finer-grained error analysis.
# faithfulness.py
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-7"

# Literal JSON braces are doubled so str.format() fills only {claim} and {context}.
VERIFY_PROMPT = """\
You are given a CLAIM and a CONTEXT (a list of retrieved passages).
Decide whether the CONTEXT entails the CLAIM.
Reply with JSON only:
{{"label": "SUPPORTED" | "UNSUPPORTED" | "CONTRADICTED",
  "evidence": "<verbatim span from CONTEXT or empty string>"}}

CLAIM: {claim}

CONTEXT:
{context}
"""


def verify_claim(claim: str, context_chunks: list[str]) -> dict:
    """Ask the judge model whether the retrieved context entails one claim."""
    context = "\n---\n".join(context_chunks)
    msg = client.messages.create(
        model=MODEL,
        max_tokens=200,
        temperature=0,
        messages=[{"role": "user", "content": VERIFY_PROMPT.format(
            claim=claim, context=context)}],
    )
    # Raises json.JSONDecodeError if the model replies with anything but JSON.
    return json.loads(msg.content[0].text)


def faithfulness_score(facts: list[str], context_chunks: list[str]) -> dict:
    """Verify every fact and report the supported fraction plus a per-label breakdown."""
    judgments = [verify_claim(f, context_chunks) for f in facts]
    supported = sum(1 for j in judgments if j["label"] == "SUPPORTED")
    return {
        "score": supported / len(facts) if facts else 0.0,
        "total_facts": len(facts),
        "supported": supported,
        "unsupported": sum(1 for j in judgments if j["label"] == "UNSUPPORTED"),
        "contradicted": sum(1 for j in judgments if j["label"] == "CONTRADICTED"),
        "judgments": list(zip(facts, judgments)),
    }
The score is the fraction of facts labeled SUPPORTED. The per-fact breakdown is what lets you debug a regression: when the score drops from 0.92 to 0.85, you scan the contradicted and unsupported rows and almost always find a single class of error (the model started attributing a quote to the wrong speaker, the model started inferring dates that were not in the context, the model started filling in product names from parametric memory).
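Scanning the failed rows is easier with them grouped by label. The helper below is a small sketch that assumes the result shape returned by `faithfulness_score` above (a `judgments` list of fact/judgment pairs); the name `failed_facts` and the sample data are made up for illustration.

```python
def failed_facts(result: dict) -> dict[str, list[str]]:
    """Group the facts that were not SUPPORTED by their failure label."""
    failures: dict[str, list[str]] = {"UNSUPPORTED": [], "CONTRADICTED": []}
    for fact, judgment in result["judgments"]:
        label = judgment["label"]
        if label in failures:
            failures[label].append(fact)
    return failures


# Hand-written stand-in for the dict returned by faithfulness_score().
example = {
    "judgments": [
        ("The bake temperature is 350 degrees F.",
         {"label": "SUPPORTED", "evidence": "bake at 350 degrees F"}),
        ("The dough rises for two hours.",
         {"label": "UNSUPPORTED", "evidence": ""}),
        ("Smith said the bridge was safe.",
         {"label": "CONTRADICTED", "evidence": "Jones said the bridge was safe"}),
    ]
}
failures = failed_facts(example)
for label, facts in failures.items():
    print(label, facts)
```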
FActScore mechanics
FActScore formalized the recipe on biographies of public figures: an InstructGPT generation, an atomic-fact decomposer, a knowledge-source-grounded verifier, and a final score equal to the fraction of supported facts [1]. The paper's two contributions that generalize are the decomposition discipline (single-subject, single-predicate, independently verifiable) and the policy of penalizing the model for being too vague to verify, which prevents the trivial gaming where the model stops making claims to avoid being marked wrong.
The "penalty for vagueness" is implemented as a minimum-fact-count threshold. An answer that produces five facts where twenty are expected is discounted even if all five are supported, just as an answer that produces twenty facts with a quarter of them unsupported loses points for the unsupported ones. Both failure modes (fabrication and abstention) cost the model score.
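A sketch of one way to implement such a penalty: an exponential discount applied when the fact count falls below a threshold. The function name, the `gamma` parameter, and the specific discount `exp(1 - gamma / n)` are assumptions for illustration; how heavily abstention is weighted against fabrication depends on which form and threshold you choose.

```python
import math


def length_penalized_score(num_supported: int, num_facts: int, gamma: int = 10) -> float:
    """Supported fraction, discounted when fewer than `gamma` facts are produced."""
    if num_facts == 0:
        return 0.0
    precision = num_supported / num_facts
    # No discount once the answer reaches gamma facts; exponential discount below that.
    penalty = 1.0 if num_facts >= gamma else math.exp(1 - gamma / num_facts)
    return precision * penalty


# Twenty facts with a quarter unsupported vs. five facts, all supported.
full = length_penalized_score(15, 20, gamma=20)
sparse = length_penalized_score(5, 5, gamma=20)
print(round(full, 3), round(sparse, 3))
```

Under this particular form the short answer is discounted steeply; a gentler curve, such as a linear ramp, would bring the two cases closer together.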
The SAFE pattern
SAFE extended the FActScore idea to long-form factuality with an automated verifier that issues Google search queries per claim [4]. The two extensions that matter for RAG evals are the iterative query refinement (one claim can need multiple search queries to find evidence) and the three-way labeling of "supported," "irrelevant," and "not supported," where "irrelevant" marks claims that do not bear on the question, so they are set aside rather than counted as failures.
For RAG, the SAFE pattern translates as: instead of one verification call per claim with all retrieved chunks pasted together, do the verification call against each chunk individually, then take the most favorable judgment. The reason is granularity. If the relevant evidence is in chunk three out of five, the noise from chunks one, two, four, and five can flip the judge toward UNSUPPORTED even though the evidence exists. Per-chunk verification recovers the signal.
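Per-chunk verification can be sketched as a thin wrapper around any single-claim verifier. The wrapper below takes the verifier as a callable with the same shape as `verify_claim` (claim plus a list of chunks, returning a dict with a `label`); the name `verify_per_chunk`, the label ranking, and the stub verifier used in the demo are assumptions for illustration.

```python
from typing import Callable

# Higher rank = more favorable to the claim.
_RANK = {"CONTRADICTED": 0, "UNSUPPORTED": 1, "SUPPORTED": 2}


def verify_per_chunk(
    claim: str,
    context_chunks: list[str],
    verify: Callable[[str, list[str]], dict],
) -> dict:
    """Judge the claim against each chunk alone and keep the most favorable judgment."""
    if not context_chunks:
        return {"label": "UNSUPPORTED", "evidence": ""}
    judgments = [verify(claim, [chunk]) for chunk in context_chunks]
    return max(judgments, key=lambda j: _RANK[j["label"]])


def stub_verify(claim: str, chunks: list[str]) -> dict:
    # Stand-in for an LLM call: "supported" only if the claim text appears in the chunk.
    hit = claim.lower() in chunks[0].lower()
    return {"label": "SUPPORTED" if hit else "UNSUPPORTED",
            "evidence": chunks[0] if hit else ""}


chunks = ["Proof the loaf overnight.", "Bake at 350 degrees F.", "Cool on a rack."]
best = verify_per_chunk("bake at 350 degrees F", chunks, stub_verify)
print(best["label"], "|", best["evidence"])
```

Two trade-offs are worth noting: the judge is called once per chunk instead of once per claim, and taking the maximum hides the case where one chunk supports a claim and another contradicts it. If that conflict matters for your product, return the full list of per-chunk judgments as well.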
The Ragas implementation
Ragas' faithfulness metric implements the pattern with two LLM calls per sample: one to extract claims, one to verify each claim against the retrieved context [5]. The result is a fraction between zero and one, plus per-claim labels and rationale strings.
# ragas_faithfulness.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

samples = [
    {
        "user_input": "What is the bake temperature for gingerbread castle cookies?",
        "response": "Bake them at 350F for 8 to 10 minutes. Cool on a wire rack.",
        "retrieved_contexts": [
            "Gingerbread Castle Cookies: bake at 350 degrees F for 8 to 10 minutes. "
            "Let cool on a wire rack before icing.",
        ],
    },
    {
        "user_input": "How long should the bread dough rise?",
        "response": "Let the dough rise for two hours in a warm spot.",
        "retrieved_contexts": [
            "Rustic Sourdough: bulk ferment 1 hour, then shape and proof 1 hour.",
        ],
    },
]

result = evaluate(
    dataset=Dataset.from_list(samples),
    metrics=[faithfulness],
)
df = result.to_pandas()
print(df[["user_input", "faithfulness"]])
# faithfulness on row 1 should be ~1.0 (every claim supported)
# faithfulness on row 2 should be lower (rise time conflicts with context)
The Ragas docs walk through tuning the judge model, the temperature, and the prompt; the Langfuse cookbook wires the metric into a production trace pipeline so the score is computed online [6]. Use the off-the-shelf metric for the first version of the eval; swap in a custom decomposer and verifier if your domain has unusual decomposition rules (medical claims, legal claims, multi-step arithmetic).
Calibration discipline
A faithfulness judge is still a judge, and judges need calibration against humans on a held-out set [7]. The minimum viable calibration: hand-label one hundred answer/context pairs as faithful or not, label each atomic fact individually, then compare the LLM judge's per-fact labels to the human labels. Target ninety-percent agreement on both SUPPORTED and UNSUPPORTED (true positive rate and true negative rate, not raw accuracy on an imbalanced set).
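The per-class comparison is a few lines of arithmetic. The sketch below treats SUPPORTED as the positive class and folds every other label into the negative class; the function name `agreement_rates` and the toy labels are assumptions for illustration.

```python
def agreement_rates(human: list[str], judge: list[str]) -> dict:
    """Per-class agreement between judge and human labels, with SUPPORTED as positive."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pos = [j for h, j in zip(human, judge) if h == "SUPPORTED"]
    neg = [j for h, j in zip(human, judge) if h != "SUPPORTED"]
    tpr = sum(j == "SUPPORTED" for j in pos) / len(pos) if pos else float("nan")
    tnr = sum(j != "SUPPORTED" for j in neg) / len(neg) if neg else float("nan")
    return {"tpr": tpr, "tnr": tnr, "n_supported": len(pos), "n_not_supported": len(neg)}


# Toy set: humans mark 8 facts SUPPORTED and 2 UNSUPPORTED; the judge misses one failure.
human = ["SUPPORTED"] * 8 + ["UNSUPPORTED"] * 2
judge = ["SUPPORTED"] * 8 + ["SUPPORTED", "UNSUPPORTED"]
rates = agreement_rates(human, judge)
print(rates)
```

In this toy set the judge agrees with humans on nine of ten facts, yet it catches only one of the two unsupported facts. That gap is why the target is stated per class.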
If the judge consistently flags too many facts as UNSUPPORTED, tighten the setup: require an evidence span, lower the temperature, increase the model size. If the judge consistently passes facts that humans flag as wrong, you probably have an over-eager judge (often when the judge is the same model as the generator, or a smaller one); switch to a larger or out-of-family judge.
[!WARNING]
A judge from the same model family as the generator over-estimates faithfulness in production. The self-preference bias that shows up in MT-Bench shows up here too. Use a different family, or distill a small specialized verifier and pair it with a frontier-model judge on samples.
Reporting faithfulness, not "hallucination"
The headline number on the dashboard should be the per-fact faithfulness score, not a binary "is this hallucinated." Display the score with two breakdowns: the fraction of answers with score below 1.0 (any unsupported claim) and the average per-answer score. Both numbers matter. The first tells you how often any unsupported fact slips through; the second reflects how severe the failures are when they happen. A system that fails on five percent of answers with one unsupported fact each is in much better shape than a system that fails on five percent of answers with six unsupported facts each.
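Both dashboard numbers fall out of the list of per-answer scores. The sketch below uses a hypothetical helper, `headline_numbers`, and two made-up systems of twenty answers with ten facts each (the fact count is an assumption chosen to make the arithmetic easy).

```python
def headline_numbers(scores: list[float]) -> dict:
    """Share of answers with any unsupported fact, and the mean per-answer score."""
    if not scores:
        raise ValueError("need at least one per-answer score")
    n = len(scores)
    return {
        "frac_below_1": sum(s < 1.0 for s in scores) / n,
        "mean_score": sum(scores) / n,
    }


# Two hypothetical systems, 20 answers each, 10 facts per answer (assumed).
system_a = [1.0] * 19 + [0.9]  # one answer with 1 of 10 facts unsupported
system_b = [1.0] * 19 + [0.4]  # one answer with 6 of 10 facts unsupported
a = headline_numbers(system_a)
b = headline_numbers(system_b)
print(a)
print(b)
```

The two systems share the same failure frequency and differ only in the mean, which is why neither number is enough alone.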
The aggregate also benefits from per-slice cuts. If multi-document syntheses score 0.71 while single-document lookups score 0.94, that gap tells you the prompt is fine for one source and breaks down on synthesis. Tag your eval examples with the slice (single-source, multi-source, requires-arithmetic, requires-temporal-reasoning) and report the faithfulness per slice on every release.
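A per-slice report is a group-by over tagged eval rows. The sketch below assumes each row carries a `slice` tag and a per-answer `score`; both key names and the sample values are hypothetical.

```python
from collections import defaultdict


def faithfulness_by_slice(rows: list[dict]) -> dict[str, float]:
    """Mean per-answer faithfulness for each slice tag."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        buckets[row["slice"]].append(row["score"])
    return {name: sum(scores) / len(scores) for name, scores in sorted(buckets.items())}


rows = [
    {"slice": "single-source", "score": 1.0},
    {"slice": "single-source", "score": 0.9},
    {"slice": "multi-source", "score": 0.75},
    {"slice": "multi-source", "score": 0.5},
]
by_slice = faithfulness_by_slice(rows)
for name, mean in by_slice.items():
    print(f"{name}: {mean:.3f}")
```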
Common failure modes the atomic frame surfaces
A few patterns recur enough to be worth flagging.
Speaker attribution drift is when the model attributes the right words to the wrong person. The atomic fact "Smith said the bridge was safe" decomposes cleanly; the verification step against the context shows the words came from Jones, not Smith. The single-judge frame misses this because the overall answer "feels right."
Numerical specificity loss is when the model rounds, paraphrases, or extrapolates numbers. "About three hours" when the context says "two hours forty minutes." The atomic fact verification step catches the divergence as UNSUPPORTED when the verifier is tight on numerical claims.
Implicit-knowledge insertion is when the model adds a fact from parametric memory that the context does not mention. The fact may be correct in the world (the actual answer is fine) but the answer is unfaithful to the source, which is a real failure for a RAG product where the user expects citations to ground every claim.
Hedge inflation is when the model adds "may" or "could" to facts that the context states definitively. The decomposed fact reads "Aspirin may reduce inflammation" when the context says "Aspirin reduces inflammation." Strict verifiers correctly flag this as UNSUPPORTED because the context does not support the uncertain claim.
Each of these is invisible in a single-judgment score and immediately visible in the atomic-fact breakdown.
What good looks like
A team that takes faithfulness seriously has a fact-decomposer prompt, a per-fact verifier prompt, a calibrated judge on a hundred labeled examples, a per-fact score on every answer, a per-slice breakdown on every release, and a separate contradicted-claim counter that fires alerts in production. The headline metric is the per-fact rate, not a binary. The dashboard shows the breakdown of SUPPORTED, UNSUPPORTED, and CONTRADICTED counts over time. When a release moves the contradicted-claim curve, the team rolls back. That is the discipline that turns "we have hallucinations" into "we shipped a change last Tuesday that increased the rate of numerical-specificity loss by three points on multi-source queries," and that is the discipline that lets you fix it.