Groundedness asks whether every claim in a generated answer can be traced to a specific span of the retrieved evidence. It is near-synonymous with faithfulness but usually framed at the span level, and its exact definition varies by vendor.
The core definition
A claim is grounded when a concrete span of the retrieved context supports it. Groundedness is the fraction of claims in the answer that carry such a supporting span. The unit is the same atomic claim used for faithfulness: the answer is decomposed into claims with a single subject and a single predicate, and each claim is then attributed to a span rather than merely checked for non-contradiction [1]. The difference from faithfulness is the direction of the test: faithfulness verifies that the context does not contradict the claim; groundedness demands that the context positively support the claim with a locatable span [2].
Is groundedness the same as faithfulness?
Almost, and the gap is the whole point. Faithfulness passes a claim that the context does not contradict. Groundedness passes a claim only when a supporting span exists. A plausible inference the model draws from the context, with no textual support for the specific assertion, is faithful-ish (nothing in the context says otherwise) but ungrounded (nothing in the context says so either) [3]. Example: the context lists a product's release date as March 2024 and its price. The answer says "the product is over a year old." Nothing contradicts that. But no span states the age; the model computed it. Faithfulness leans toward pass, groundedness fails, and for a RAG product that promises a citation per claim, the groundedness verdict is the one you want.
The practical consequence: groundedness is the stricter gate. The set of ungrounded claims is a superset of the set of unfaithful claims, because a contradicted claim also has no supporting span. Set your alert on groundedness and you catch both the contradictions and the unsupported inferences.
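The two pass conditions can be put side by side in a few lines. This is a minimal sketch, assuming a hypothetical `ClaimCheck` record that a verifier would fill in; the record, the function names, and the sample span text are illustrative and not taken from any framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimCheck:
    """Hypothetical per-claim verifier output (names are assumptions)."""
    claim: str
    contradicted: bool               # does the context contradict the claim?
    supporting_span: Optional[str]   # verbatim supporting span, if one was found


def passes_faithfulness(check: ClaimCheck) -> bool:
    # Permissive gate: pass anything the context does not contradict.
    return not check.contradicted


def passes_groundedness(check: ClaimCheck) -> bool:
    # Strict gate: additionally require a non-empty supporting span.
    span = (check.supporting_span or "").strip()
    return not check.contradicted and bool(span)


# Invented sample data mirroring the release-date example.
checks = [
    ClaimCheck("The product was released in March 2024.", False, "Release date: March 2024"),
    ClaimCheck("The product is over a year old.", False, None),    # inferred, no span
    ClaimCheck("The product was released in 2022.", True, None),   # contradicted
]

unfaithful = {c.claim for c in checks if not passes_faithfulness(c)}
ungrounded = {c.claim for c in checks if not passes_groundedness(c)}
```

With this data, `unfaithful` holds only the contradicted claim while `ungrounded` holds both the contradicted claim and the inferred one, so `unfaithful <= ungrounded` is true.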
How is groundedness measured?
Three steps, the same decomposition machinery as faithfulness.
- Decompose the answer into atomic claims.
- Attribute each claim to a supporting span in the retrieved context. If no span supports it, the claim is ungrounded.
- Report the supported fraction.
$$
\text{Groundedness} = \frac{\left|\{\, c \in C : \exists\, s \in S,\ \text{supports}(s, c) \,\}\right|}{|C|}
$$
C is the set of atomic claims in the answer, S is the set of candidate spans across the retrieved chunks, and supports(s,c) is true when span s entails claim c. When C is empty the metric is undefined; report it as such rather than defaulting to 1.0, because an answer that makes no claims should not score perfect.
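The definition translates almost directly into code. The sketch below assumes the caller supplies the `supports` predicate (in practice an entailment model or LLM judge); `toy_supports` is a deliberately naive stand-in used only to make the example runnable, not a recommended verifier.

```python
from typing import Callable, Optional, Sequence


def groundedness(
    claims: Sequence[str],
    spans: Sequence[str],
    supports: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of claims with at least one supporting span; None if no claims."""
    if not claims:
        return None  # undefined, not 1.0
    supported = sum(1 for c in claims if any(supports(s, c) for s in spans))
    return supported / len(claims)


def toy_supports(span: str, claim: str) -> bool:
    # Assumption for illustration only: "support" means the claim text
    # (minus a trailing period) appears verbatim in the span.
    return claim.lower().rstrip(".") in span.lower()


spans = ["Note: the bake time is 8 to 10 minutes, then cool."]
claims = ["The bake time is 8 to 10 minutes.", "The recipe yields two dozen cookies."]
score = groundedness(claims, spans, toy_supports)
```

Here one of the two claims has a matching span, so `score` is 0.5, and an empty claim list returns `None` rather than a perfect score.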
The attribution step is what separates groundedness from a plain entailment check. A faithfulness verifier can return SUPPORTED without naming where; a groundedness verifier must return the span, which makes the failure mode ("no span found") a first-class output [4].
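One way to make "no span found" a first-class output is to check that the evidence string the judge returns can actually be located in the retrieved context. This is a sketch under the assumption that the judge is asked to quote its evidence verbatim; `locate_span` is a hypothetical helper, and exact substring matching will reject paraphrased evidence.

```python
from typing import Optional, Tuple


def locate_span(context: str, evidence: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the evidence in the context,
    or None when the evidence is empty or cannot be found verbatim."""
    evidence = evidence.strip()
    if not evidence:
        return None
    start = context.find(evidence)
    if start == -1:
        return None
    return (start, start + len(evidence))


context = "The public API allows 100 requests per minute per API key."
found = locate_span(context, "100 requests per minute")
missing = locate_span(context, "1000 requests per minute")
```

`found` is a pair of offsets that slices back to the quoted evidence, and `missing` is `None`, which a groundedness verifier can report as the failure case instead of silently accepting a SUPPORTED label.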
Vendor definitions do not agree
The label "groundedness" is not standardized. Read the metric card before you trust a cross-tool comparison.
| Tool / metric | What it computes | Span-level? | Failure signal |
|---|---|---|---|
| Ragas faithfulness | Fraction of claims entailed by the retrieved context, two LLM calls (extract, verify) | No, chunk-level entailment | claim not entailed [4] |
| TruLens groundedness | Each statement scored against source; returns per-statement support and reasons | Yes, statement-to-source | statement lacks supporting source [2] |
| DeepEval faithfulness | Extracts truths and claims, checks claims against retrieval context via LLM judge | No, context-level | contradiction or unsupported claim |
| RAGTruth annotation | Human word-span labels of hallucinated text within the response | Yes, word-span in the answer | annotated hallucination span [5] |
Two takeaways. First, Ragas and DeepEval ship "faithfulness," not "groundedness," and both operate at the context level, so a plausible unsupported inference can slip through as faithful. Second, TruLens groundedness and RAGTruth's span annotation are the genuinely span-level references: TruLens attributes each statement to a source, and RAGTruth gives you human word-level hallucination spans to calibrate any automated verifier against [5].
A grounded but wrong answer
Groundedness measures attribution, not truth. The two come apart when the retrieved evidence is itself wrong.
Setup: the query is "What is the current API rate limit?" The retriever pulls one chunk, an internal doc from 2022:
Chunk: "The public API allows 100 requests per minute per API key."
The answer: "The public API allows 100 requests per minute per API key."
Groundedness: one atomic claim, one exactly-matching span, supported fraction 1 / 1 = 1.0. Perfect.
Correctness: the current limit was raised to 1000 in 2025. The chunk is stale. The answer is wrong.
Groundedness is 1.0 and answer correctness is 0.0 on the same answer. This is not a bug in the metric. Groundedness measures whether the generator was loyal to its evidence, which it was; correctness measures whether the evidence and the answer match the world, which they do not. A high-groundedness, low-correctness gap points the finger at retrieval and the corpus, not the generator. Track both, and never let a groundedness number stand in for correctness.
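The triage rule in this paragraph can be written as a small check. This is a sketch only: the function name and the 0.8 threshold are assumptions chosen for illustration, and the right cutoffs depend on your own score distributions.

```python
from typing import Optional


def suspect_retrieval(
    groundedness: Optional[float],
    correctness: float,
    threshold: float = 0.8,  # hypothetical cutoff, tune per product
) -> bool:
    """True when the answer is loyal to its evidence but still wrong,
    which points at the retriever or corpus rather than the generator."""
    if groundedness is None:  # answer made no claims; nothing to triage
        return False
    return groundedness >= threshold and correctness < threshold


stale_doc_case = suspect_retrieval(groundedness=1.0, correctness=0.0)
healthy_case = suspect_retrieval(groundedness=1.0, correctness=1.0)
```

The rate-limit example above lands in the first case: `stale_doc_case` is `True`, while an answer that is both grounded and correct returns `False`.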
TIP
Surface the supporting span next to each claim in the UI. If you cannot find the span for a claim, the claim is probably ungrounded. This doubles as the fastest manual debugging move: skim an answer, try to point at the source line for each sentence, and the ones you cannot point at are your ungrounded claims before you run a single judge.
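A plain-text version of that display might look like the sketch below. The function name and the output layout are assumptions; a real UI would render the span as a highlight or a link into the source document.

```python
from typing import Iterable, Optional, Tuple


def render_with_sources(claims_with_spans: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Put each claim next to its supporting span, flagging claims with none."""
    lines = []
    for claim, span in claims_with_spans:
        span = (span or "").strip()
        source = f'source: "{span}"' if span else "source: NOT FOUND"
        lines.append(f"{claim}  [{source}]")
    return "\n".join(lines)


report = render_with_sources([
    ("The bake temperature is 350 F.", "Bake at 350 F"),   # span text is invented
    ("The recipe yields two dozen cookies.", None),
])
```

Claims that render with `NOT FOUND` are the ones to inspect first.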
A minimal span-attribution scorer
The faithfulness verifier already returns an evidence span. Turn it into a groundedness scorer by requiring that span to be non-empty. This reuses the atomic-fact decomposition and per-claim verification from the faithfulness chapter; the only change is the pass condition.
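```python
# groundedness.py


def is_grounded(judgment: dict) -> bool:
    """A claim is grounded only if a concrete supporting span was found."""
    evidence = judgment.get("evidence") or ""  # tolerate a missing or None span
    return judgment["label"] == "SUPPORTED" and bool(evidence.strip())


def groundedness_score(facts: list[str], judgments: list[dict]) -> dict:
    # One judgment per fact; zip() would otherwise silently drop the extras.
    if len(facts) != len(judgments):
        raise ValueError("facts and judgments must have the same length")
    flags = [is_grounded(j) for j in judgments]
    grounded = sum(flags)
    return {
        # None (not 1.0) when the answer makes no claims.
        "score": grounded / len(facts) if facts else None,
        "total_claims": len(facts),
        "grounded": grounded,
        "ungrounded": [f for f, ok in zip(facts, flags) if not ok],
    }
```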
judgments is the output of the per-claim verify_claim call from faithfulness; the evidence field is the verbatim span. An empty span with a SUPPORTED label means the judge asserted support without locating it, which for groundedness counts as ungrounded. Returning None for an empty answer keeps the "no claims" case out of the average instead of inflating it to 1.0.
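When per-answer scores are rolled up into a dataset-level number, the `None` cases have to be skipped explicitly. A minimal sketch, assuming each answer's score is either a float or `None`; the function name is hypothetical.

```python
from typing import Iterable, Optional


def mean_groundedness(scores: Iterable[Optional[float]]) -> Optional[float]:
    """Average the defined scores; answers with no claims (None) are excluded."""
    defined = [s for s in scores if s is not None]
    return sum(defined) / len(defined) if defined else None


dataset_score = mean_groundedness([1.0, None, 0.5])
```

Here the empty answer is left out and `dataset_score` is 0.75; treating `None` as 1.0 would have produced a higher average, and treating it as 0.0 a lower one. It is worth reporting the count of skipped answers alongside the mean.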
Worked arithmetic
Four atomic claims, judged against the retrieved context with spans required.
| Claim | Label | Span found? | Grounded? |
|---|---|---|---|
| The bake temperature is 350 F. | SUPPORTED | yes | yes |
| The bake time is 8 to 10 minutes. | SUPPORTED | yes | yes |
| The cookies should cool on a wire rack. | SUPPORTED | yes | yes |
| The recipe yields two dozen cookies. | SUPPORTED | no | no |
The fourth claim reads as supported to a lenient judge (it is plausible and not contradicted) but no span states the yield; the model supplied it. Groundedness = 3 / 4 = 0.75. A context-level faithfulness judge that skipped the span requirement would likely score this 4 / 4 = 1.0. The 0.25 gap is exactly the class of unsupported inference that groundedness catches and faithfulness lets pass.
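The same arithmetic in code, as a sketch: the `span_found` flag mirrors the "Span found?" column of the table rather than carrying real span text, and the field names are assumptions for illustration.

```python
# Rows mirror the table above; span_found stands in for a non-empty evidence span.
judged = [
    {"claim": "The bake temperature is 350 F.", "label": "SUPPORTED", "span_found": True},
    {"claim": "The bake time is 8 to 10 minutes.", "label": "SUPPORTED", "span_found": True},
    {"claim": "The cookies should cool on a wire rack.", "label": "SUPPORTED", "span_found": True},
    {"claim": "The recipe yields two dozen cookies.", "label": "SUPPORTED", "span_found": False},
]

# Lenient, context-level pass condition: the label alone.
lenient = sum(j["label"] == "SUPPORTED" for j in judged) / len(judged)

# Groundedness pass condition: the label and a located span.
strict = sum(j["label"] == "SUPPORTED" and j["span_found"] for j in judged) / len(judged)

gap = lenient - strict
ungrounded_claims = [j["claim"] for j in judged if not j["span_found"]]
```

`lenient` is 1.0, `strict` is 0.75, and `ungrounded_claims` contains only the yield claim.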
Where it fits the eval program
Run groundedness alongside faithfulness, not instead of it. Faithfulness is the cheaper, more permissive gate and the one every framework ships; groundedness is the strict cousin you reach for when the product contract is "every sentence cites a source." Wire either into a trace pipeline using the "RAG eval with Ragas" recipe, and read the generation metrics hub for how groundedness, faithfulness, and correctness divide the labor. Tool-specific setup lives on the Ragas and DeepEval pages.
CAUTION
Do not report groundedness as your quality number. A 1.0 groundedness answer built on a stale or wrong chunk is confidently wrong with a citation attached. Pair it with answer correctness and retrieval metrics so a corpus problem cannot hide behind a loyal generator.