Groundedness vs faithfulness: span-level attribution

Groundedness asks whether every claim in a generated answer can be traced to a specific span of the retrieved evidence. It is near-synonymous with faithfulness but usually framed at the span level, and its exact definition varies by vendor.

The core definition

A claim is grounded when a concrete span of the retrieved context supports it. Groundedness is the fraction of claims in the answer that carry such a supporting span. The unit is the same atomic claim used for faithfulness: the answer is decomposed into single-subject, single-predicate statements, and each one is attributed to a span rather than merely checked for non-contradiction ^[1]. The difference from faithfulness is the direction of the test: faithfulness verifies that the context does not contradict the claim; groundedness demands that the context positively supports the claim with a locatable span ^[2].

Is groundedness the same as faithfulness?

Almost, and the gap is the whole point. Faithfulness passes a claim that the context does not contradict. Groundedness passes a claim only when a supporting span exists. A plausible inference the model draws from the context, with no textual support for the specific assertion, is faithful-ish (nothing in the context says otherwise) but ungrounded (nothing in the context says so either) ^[3]. Example: the context lists a product's release date as March 2024 and its price. The answer says "the product is over a year old." Nothing contradicts that. But no span states the age; the model computed it. Faithfulness leans toward pass, groundedness fails, and for a RAG product that promises a citation per claim, the groundedness verdict is the one you want.
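The two pass conditions can be put side by side in a few lines. This is a minimal sketch using the section's own lenient reading of faithfulness; the `Judgment` type and both function names are hypothetical, not any library's API, and the judgments are filled in by hand rather than produced by a real verifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    # Hypothetical verifier output for one atomic claim.
    contradicted: bool               # some part of the context contradicts the claim
    supporting_span: Optional[str]   # verbatim span that states the claim, if one was found


def passes_faithfulness(j: Judgment) -> bool:
    # Lenient reading used in this section: pass unless the context contradicts the claim.
    return not j.contradicted


def passes_groundedness(j: Judgment) -> bool:
    # Strict reading: pass only when a non-empty supporting span was located.
    has_span = bool(j.supporting_span and j.supporting_span.strip())
    return not j.contradicted and has_span


# "The product is over a year old": nothing contradicts it, but no span states the age.
age_claim = Judgment(contradicted=False, supporting_span=None)
# A claim that restates the release date: the span text here is an illustrative placeholder.
release_claim = Judgment(contradicted=False, supporting_span="Release date: March 2024")

for name, j in [("age", age_claim), ("release", release_claim)]:
    print(name, passes_faithfulness(j), passes_groundedness(j))
```

The computed-age claim passes the lenient check and fails the strict one, which is the gap described above.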

The practical consequence: groundedness is the stricter gate. The set of ungrounded claims is a superset of the set of unfaithful claims, because a contradicted claim is also unattributable. Set your alert on groundedness and you catch both the contradictions and the unsupported inferences.

How is groundedness measured?

Three steps, the same decomposition machinery as faithfulness.

1. Decompose the answer into atomic claims.
2. Attribute each claim to a supporting span in the retrieved context. If no span supports it, the claim is ungrounded.
3. Report the supported fraction.

$$
\text{Groundedness} = \frac{\lvert \{\, c \in C : \exists\, s \in S,\ \text{supports}(s, c) \,\} \rvert}{\lvert C \rvert}
$$

$C$ is the set of atomic claims in the answer, $S$ is the set of candidate spans across the retrieved chunks, and $\text{supports}(s, c)$ is true when span $s$ entails claim $c$. When $C$ is empty the metric is undefined; report it as such rather than defaulting to 1.0, because an answer that makes no claims should not score perfect.
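The formula translates almost directly into code. The sketch below assumes the claims and candidate spans are already extracted and takes the `supports` predicate as a parameter; `toy_supports` is a deliberately crude string-containment stand-in for illustration only, where a real verifier would use an entailment model or an LLM judge.

```python
from typing import Callable, Optional, Sequence


def groundedness(
    claims: Sequence[str],
    spans: Sequence[str],
    supports: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of claims c for which some span s satisfies supports(s, c)."""
    if not claims:
        # Undefined when there are no claims; never default to 1.0.
        return None
    supported = sum(1 for c in claims if any(supports(s, c) for s in spans))
    return supported / len(claims)


def toy_supports(span: str, claim: str) -> bool:
    # Hypothetical stand-in: case-insensitive containment, NOT real entailment.
    return claim.lower().rstrip(".") in span.lower()


spans = ["The public API allows 100 requests per minute per API key."]
claims = [
    "The public API allows 100 requests per minute per API key.",
    "The limit resets hourly.",
]

print(groundedness(claims, spans, toy_supports))  # one of two claims has a span
print(groundedness([], spans, toy_supports))      # no claims: undefined
```

Passing `supports` in as an argument keeps the scoring arithmetic independent of whichever verifier you end up trusting.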

The attribution step is what separates groundedness from a plain entailment check. A faithfulness verifier can return SUPPORTED without naming where; a groundedness verifier must return the span, which makes the failure mode ("no span found") a first-class output ^[4].
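One way to make "no span found" a first-class output is to check that the span a judge returns actually occurs in the retrieved chunks. The sketch below is an assumption about how such a check could look, not part of any tool listed here: `locate_span` is a hypothetical helper that matches verbatim after collapsing whitespace and returns `None` when nothing matches.

```python
from typing import Optional


def locate_span(evidence: str, chunks: list[str]) -> Optional[tuple[int, int, int]]:
    """Return (chunk_index, start, end) of the first verbatim match, or None.

    Offsets refer to the whitespace-normalized chunk text, not the raw chunk.
    """
    needle = " ".join(evidence.split())
    if not needle:
        # An empty or whitespace-only span cannot support anything.
        return None
    for i, chunk in enumerate(chunks):
        haystack = " ".join(chunk.split())
        start = haystack.find(needle)
        if start != -1:
            return (i, start, start + len(needle))
    return None


# Illustrative chunks; the recipe text is made up for this example.
chunks = [
    "Preheat the oven.  Bake at 350 F for 8 to 10 minutes.",
    "Cool on a wire rack.",
]

print(locate_span("Bake at 350 F", chunks))         # found in the first chunk
print(locate_span("Yields two dozen", chunks))      # None: the judge's span is not in the context
```

Exact matching is strict by design: a judge that paraphrases its evidence will fail this check, which may or may not be what you want.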

Vendor definitions do not agree

The label "groundedness" is not standardized. Read the metric card before you trust a cross-tool comparison.

Tool / metric	What it computes	Span-level?	Failure signal
Ragas `faithfulness`	Fraction of claims entailed by the retrieved context, two LLM calls (extract, verify)	No, chunk-level entailment	claim not entailed ^[4]
TruLens groundedness	Each statement scored against source; returns per-statement support and reasons	Yes, statement-to-source	statement lacks supporting source ^[2]
DeepEval faithfulness	Extracts truths and claims, checks claims against retrieval context via LLM judge	No, context-level	contradiction or unsupported claim
RAGTruth annotation	Human word-span labels of hallucinated text within the response	Yes, word-span in the answer	annotated hallucination span ^[5]

Two takeaways. First, Ragas and DeepEval ship "faithfulness," not "groundedness," and both operate at the context level, so a plausible unsupported inference can slip through as faithful. Second, TruLens groundedness and RAGTruth's span annotation are the genuinely span-level references: TruLens attributes each statement to a source, and RAGTruth gives you human word-level hallucination spans to calibrate any automated verifier against ^[5].

A grounded but wrong answer

Groundedness measures attribution, not truth. The two come apart when the retrieved evidence is itself wrong.

Setup: the query is "What is the current API rate limit?" The retriever pulls one chunk, an internal doc from 2022:

Chunk: "The public API allows 100 requests per minute per API key."

The answer: "The public API allows 100 requests per minute per API key."

Groundedness: one atomic claim, one exactly-matching span, supported fraction 1 / 1 = 1.0. Perfect.

Correctness: the current limit was raised to 1000 in 2025. The chunk is stale. The answer is wrong.

Groundedness is 1.0 and answer correctness is 0.0 on the same answer. This is not a bug in the metric. Groundedness measures whether the generator was loyal to its evidence, which it was; correctness measures whether the evidence and the answer match the world, which they do not. A high-groundedness, low-correctness gap points the finger at retrieval and the corpus, not the generator. Track both, and never let a groundedness number stand in for correctness.
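Tracking both numbers lets you route a failing answer to the right owner. The following is a rough triage sketch, assuming both metrics are on a 0 to 1 scale; the function name, the labels, and the 0.8 threshold are hypothetical choices, and only the "grounded but incorrect" branch comes directly from the argument above.

```python
from typing import Optional


def triage(groundedness: Optional[float], correctness: float, threshold: float = 0.8) -> str:
    # groundedness is None when the answer made no claims (metric undefined).
    if groundedness is None:
        return "no_claims"
    grounded = groundedness >= threshold
    correct = correctness >= threshold
    if grounded and not correct:
        # Loyal generator, wrong evidence: look at retrieval and the corpus.
        return "check_retrieval_and_corpus"
    if not grounded:
        # Claims without supporting spans: look at the generator (assumed routing).
        return "check_generator"
    return "ok"


# The stale rate-limit example: groundedness 1.0, correctness 0.0.
print(triage(1.0, 0.0))
```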

TIP

Surface the supporting span next to each claim in the UI. If you cannot find the span for a claim, the claim is probably ungrounded. This doubles as the fastest manual debugging move: skim an answer, try to point at the source line for each sentence, and the ones you cannot point at are your ungrounded claims before you run a single judge.

A minimal span-attribution scorer

The faithfulness verifier already returns an evidence span. Turn it into a groundedness scorer by requiring that span to be non-empty. This reuses the atomic-fact decomposition and per-claim verification from the faithfulness chapter; the only change is the pass condition.

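# groundedness.py
def is_grounded(judgment: dict) -> bool:
    # A claim is grounded only if a concrete supporting span was found.
    # A missing or None evidence field is treated the same as an empty span.
    evidence = judgment.get("evidence") or ""
    return judgment.get("label") == "SUPPORTED" and bool(evidence.strip())

def groundedness_score(facts: list[str], judgments: list[dict]) -> dict:
    # One judgment per fact; zip() would otherwise silently drop the extras.
    if len(facts) != len(judgments):
        raise ValueError("facts and judgments must have the same length")
    ungrounded = [f for f, j in zip(facts, judgments) if not is_grounded(j)]
    grounded = len(facts) - len(ungrounded)
    return {
        # None, not 1.0, when there are no claims: the metric is undefined.
        "score": grounded / len(facts) if facts else None,
        "total_claims": len(facts),
        "grounded": grounded,
        "ungrounded": ungrounded,
    }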

judgments is the output of the per-claim verify_claim call from faithfulness; the evidence field is the verbatim span. An empty span with a SUPPORTED label means the judge asserted support without locating it, which for groundedness counts as ungrounded. Returning None for an empty answer keeps the "no claims" case out of the average instead of inflating it to 1.0.

Worked arithmetic

Four atomic claims, judged against the retrieved context with spans required.

Claim	Label	Span found?	Grounded?
The bake temperature is 350 F.	SUPPORTED	yes	yes
The bake time is 8 to 10 minutes.	SUPPORTED	yes	yes
The cookies should cool on a wire rack.	SUPPORTED	yes	yes
The recipe yields two dozen cookies.	SUPPORTED	no	no

The fourth claim reads as supported to a lenient judge (it is plausible and not contradicted) but no span states the yield; the model supplied it. Groundedness = 3 / 4 = 0.75. A context-level faithfulness judge that skipped the span requirement would likely score this 4 / 4 = 1.0. The 0.25 gap is exactly the class of unsupported inference that groundedness catches and faithfulness lets pass.
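The same arithmetic can be checked in code. This is a self-contained sketch that scores the four rows twice, once on the label alone and once with the span required; the span texts are illustrative placeholders, since the table only records whether a span was found.

```python
# (claim, judge label, evidence span). Span texts are made-up placeholders;
# an empty string means the judge returned no span.
rows = [
    ("The bake temperature is 350 F.", "SUPPORTED", "Bake at 350 F"),
    ("The bake time is 8 to 10 minutes.", "SUPPORTED", "for 8 to 10 minutes"),
    ("The cookies should cool on a wire rack.", "SUPPORTED", "Cool on a wire rack"),
    ("The recipe yields two dozen cookies.", "SUPPORTED", ""),
]


def label_only_score(rows: list[tuple[str, str, str]]) -> float:
    # Lenient: trust the SUPPORTED label even when no span came back.
    return sum(label == "SUPPORTED" for _, label, _ in rows) / len(rows)


def span_required_score(rows: list[tuple[str, str, str]]) -> float:
    # Strict: the label must be SUPPORTED and the span must be non-empty.
    return sum(
        label == "SUPPORTED" and bool(span.strip()) for _, label, span in rows
    ) / len(rows)


lenient = label_only_score(rows)
strict = span_required_score(rows)
print(lenient, strict, lenient - strict)  # 1.0 0.75 0.25
```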

Where it fits the eval program

Run groundedness alongside faithfulness, not instead of it. Faithfulness is the cheaper, more permissive gate and the one every framework ships; groundedness is the strict cousin you reach for when the product contract is "every sentence cites a source." Wire either into a trace pipeline using the "RAG eval with Ragas" recipe, and read the generation metrics hub for how groundedness, faithfulness, and correctness divide the labor. Tool-specific setup lives on the Ragas and DeepEval pages.

CAUTION

Do not report groundedness as your quality number. A 1.0 groundedness answer built on a stale or wrong chunk is confidently wrong with a citation attached. Pair it with answer correctness and retrieval metrics so a corpus problem cannot hide behind a loyal generator.
