Document understanding and extraction

Extract structured info or answer questions from documents: PDFs, scans, forms. Score with field-level F1 and ANLS, not holistic JSON-looks-right checks.

What "good" looks like

Document understanding splits into extraction (structured fields out of a document) and QA (a free-text answer over a document). Each is scored per unit, not holistically, so a model that nails 9 fields and silently corrupts the 10th is caught.

The primary metrics:

Extraction: field-level F1. Compute precision and recall per field, combine them into F1, then macro-average across the fields you care about.
QA: ANLS (Average Normalized Levenshtein Similarity), the DocVQA metric.
Critical fields (invoice total, party names): above 0.95 field-level F1, with confidence calibrated so that extractions below the confidence threshold fall back to human-in-the-loop review. Less-critical fields can run lower.

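ANLS tolerates minor OCR-style character errors in the predicted answer: each answer scores 1 minus its normalized edit distance to the closest gold answer, and a distance of 0.5 or more scores zero. Human performance on DocVQA is reported at 94.36% accuracy, with strong systems closing the gap^[1].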
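A minimal sketch of the ANLS computation, assuming the standard 0.5 distance threshold and simple normalization (strip and lowercase). The function names are illustrative, not taken from any official scorer.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, gold_answers, tau=0.5):
    """predictions: list[str]; gold_answers: list[list[str]], one list per question."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        p = pred.strip().lower()  # normalization is an assumption of this sketch
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarity is zeroed once the normalized distance reaches tau.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


# One OCR-style slip ("1" for "i") still earns most of the credit.
print(anls(["1nvoice"], [["invoice"]]))
```

A prediction that differs by one character out of seven keeps a score of 6/7, while an unrelated answer scores zero.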

The per-field view is the point. It catches the common failure where the model gets almost everything right and corrupts one field without any signal that it did.
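The per-field scoring can be sketched as follows. This assumes exact-match comparison on already-normalized string values and one common counting convention (a wrong value is both a false positive and a false negative); `field_f1` and the sample data are hypothetical.

```python
from collections import Counter


def field_f1(preds, golds, fields):
    """Return ({field: {"precision", "recall", "f1"}}, macro_f1).

    Assumed convention: a wrong value counts as both a false positive and a
    false negative; a missing prediction (None) counts only as a false negative.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in zip(preds, golds):
        for name in fields:
            p, g = pred.get(name), gold.get(name)
            if p is not None and p == g:
                tp[name] += 1
                continue
            if p is not None:
                fp[name] += 1
            if g is not None:
                fn[name] += 1

    scores = {}
    for name in fields:
        predicted, actual = tp[name] + fp[name], tp[name] + fn[name]
        precision = tp[name] / predicted if predicted else 0.0
        recall = tp[name] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(fields)
    return scores, macro_f1


# Ten documents: vendor is always right, total is silently wrong once.
golds = [{"vendor": "Acme", "total": "100.00"} for _ in range(10)]
preds = [dict(g) for g in golds]
preds[3]["total"] = "108.00"

scores, macro_f1 = field_f1(preds, golds, ["vendor", "total"])
print(scores["total"], macro_f1)
```

The per-field breakdown shows the damage is confined to `total`, which a single aggregate number would blur.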

flowchart TD
    DOC["Document: PDF, scan, or form"] --> SHAPE{"Extraction or QA?"}
    SHAPE -->|extraction| EXT["Field-level F1 per field, macro-averaged"]
    SHAPE -->|QA| QA["ANLS on free-text answer"]
    EXT --> CRIT{"Critical field?"}
    CRIT -->|yes| GATE{"Field F1 above 0.95?"}
    CRIT -->|no| LOWER["Lower target allowed"]
    QA --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| HITL["Route to human-in-the-loop by confidence"]

Secondary metrics

Metric	What it catches	How to score
Layout robustness	Degradation on rotated, skewed, or low-resolution scans	Per-class accuracy across scan conditions
Table extraction fidelity	Corrupted cells inside an "extracted" table	Cell-level score, not table-level pass/fail
Bbox grounding accuracy	Values that are right by luck without the model seeing the region	Overlap check when output includes a page location
OCR error propagation	Upstream OCR errors turning into extraction errors	Rate at which OCR errors translate downstream
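For the bbox grounding row, one plausible overlap check is intersection-over-union (IoU) between the predicted and gold regions. The sketch below assumes axis-aligned boxes as `(x0, y0, x1, y1)` in the same page coordinates; the 0.5 cutoff is an illustrative default, not a value from the source.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred_value, gold_value, pred_box, gold_box, min_iou=0.5):
    """A field counts as grounded only if the value AND the region both match."""
    return pred_value == gold_value and iou(pred_box, gold_box) >= min_iou


# Right value, wrong region: correct by luck, so not grounded.
print(is_grounded("1240.00", "1240.00", (0, 0, 50, 10), (300, 700, 350, 710)))
```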

Offline eval design

Start on DocVQA for QA breadth. It gives coverage across document classes before you invest in domain labels.

Then build a domain extraction set: 100 documents, 5 fields each, gold-annotated. The gold should include character-level spans so substring matching is verifiable. Score field-level F1 (per field, then macro-averaged) and ANLS where the answer is free-text.
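A gold record with character-level spans might look like the sketch below. The `GoldField` schema is hypothetical; the point is that every annotated value can be checked mechanically against the document text before any model is scored.

```python
from dataclasses import dataclass


@dataclass
class GoldField:
    name: str
    value: str
    start: int  # inclusive character offset into the document text
    end: int    # exclusive character offset


def span_errors(doc_text, fields):
    """Return names of fields whose span does not reproduce the gold value."""
    return [f.name for f in fields if doc_text[f.start:f.end] != f.value]


doc_text = "Invoice total: 1,240.00 USD"
start = doc_text.find("1,240.00")
gold = [
    GoldField("total", "1,240.00", start, start + len("1,240.00")),
    GoldField("currency", "USD", 0, 3),  # deliberately wrong span
]
print(span_errors(doc_text, gold))
```

Running this check over the whole gold set catches annotation slips before they show up as phantom model errors.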

Run two pipelines side by side:

OCR-then-LLM: Tesseract or commercial OCR plus an LLM extractor.
Native VLM: Claude or Gemini reading the page directly.

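The two pipelines fail differently: OCR-then-LLM inherits whatever the OCR stage got wrong, while a native VLM has no OCR intermediate and makes its own reading errors. Comparing them field by field shows where each one breaks.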

For documents with tables, add a cell-level table extraction score. The aggregate "table extracted" pass/fail rate is too coarse. Cell-level recall is what predicts whether downstream analytics will be correct.
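A cell-level score can be sketched like this, assuming predicted and gold tables are lists of rows that are already aligned by position (row and column index). Real tables with merged or shifted rows would need an alignment step first, which this sketch leaves out.

```python
def cell_scores(pred_table, gold_table):
    """Return (precision, recall) over cells matched by (row, col) position."""
    def cells(table):
        return {
            (r, c): value.strip()
            for r, row in enumerate(table)
            for c, value in enumerate(row)
        }

    pred_cells, gold_cells = cells(pred_table), cells(gold_table)
    matched = sum(1 for key, value in gold_cells.items() if pred_cells.get(key) == value)
    precision = matched / len(pred_cells) if pred_cells else 0.0
    recall = matched / len(gold_cells) if gold_cells else 0.0
    return precision, recall


gold_table = [["Widget", "2", "10.00"], ["Gadget", "1", "35.50"]]
pred_table = [["Widget", "2", "10.00"], ["Gadget", "1", "85.50"]]  # one corrupted cell
print(cell_scores(pred_table, gold_table))
```

A table-level pass/fail would call this table "extracted"; the cell-level view shows one of six cells is wrong.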

Online eval design

Attach a field-level confidence to every extraction. Route low-confidence extractions to a human-in-the-loop queue. The cost-quality frontier in this category is set by the confidence threshold: lowering it cuts review cost but lets more wrong values through.
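A minimal routing sketch, assuming each extraction carries a calibrated confidence in [0, 1] and that thresholds are set per field. The `Extraction` type, the threshold values, and the 0.9 default are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    doc_id: str
    field: str
    value: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def route(extractions, thresholds, default_threshold=0.9):
    """Split extractions into auto-accepted and human-review lists.

    thresholds maps field name -> minimum confidence for auto-accept.
    """
    accepted, review = [], []
    for e in extractions:
        limit = thresholds.get(e.field, default_threshold)
        (accepted if e.confidence >= limit else review).append(e)
    return accepted, review


batch = [
    Extraction("inv-1", "total", "1240.00", 0.99),
    Extraction("inv-1", "vendor", "Acme", 0.80),
    Extraction("inv-2", "total", "88.10", 0.93),
]
# Critical field gets a stricter bar than the default.
accepted, review = route(batch, {"total": 0.97, "vendor": 0.75})
print(len(accepted), len(review))
```

The share of extractions landing in `review` is the review cost; the error rate among `accepted` is the quality side of the trade-off.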

Alert on per-field accuracy drift by document type. A tax-form pipeline often drifts independently from an invoice pipeline, because the form templates change separately.
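One way to sketch the drift alert: compare recent accuracy per (document type, field) slice against the offline baseline and flag slices that drop by more than a tolerance. The function name, the 0.03 tolerance, and the 30-sample minimum are assumptions for illustration; `recent` is assumed to come from human-reviewed production samples.

```python
from collections import defaultdict


def drift_alerts(baseline, recent, max_drop=0.03, min_samples=30):
    """Return [(doc_type, field, baseline_acc, recent_acc)] for drifted slices.

    baseline: {(doc_type, field): accuracy from the offline eval}
    recent:   iterable of (doc_type, field, is_correct) from reviewed traffic
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for doc_type, field, is_correct in recent:
        counts[(doc_type, field)][0] += int(is_correct)
        counts[(doc_type, field)][1] += 1

    alerts = []
    for key, (correct, total) in sorted(counts.items()):
        if total < min_samples or key not in baseline:
            continue  # too few samples to judge, or no baseline to compare with
        acc = correct / total
        if baseline[key] - acc > max_drop:
            alerts.append((*key, baseline[key], acc))
    return alerts


baseline = {("tax_form", "ssn"): 0.98, ("invoice", "total"): 0.97}
recent = (
    [("tax_form", "ssn", i < 34) for i in range(40)]      # 34 of 40 correct
    + [("invoice", "total", i < 39) for i in range(40)]   # 39 of 40 correct
)
print(drift_alerts(baseline, recent))
```

Slicing by document type is what lets the tax-form drop surface while the invoice slice stays quiet.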

Use bbox grounding as a cheap secondary check. Bbox-grounded fields let human reviewers confirm an extraction by glancing at the boxed region without reading the whole document.

Top benchmarks

DocVQA, https://www.docvqa.org/. 50,000 questions over 12,000+ document images; human accuracy 94.36%.
Your domain extraction set. 100 documents, 5 fields, gold-annotated. The eval that predicts deployment quality.
Per-vendor OCR baseline. Establishes the floor; if the LLM does not beat raw OCR-plus-regex on your fields, the LLM is not earning its keep.
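The OCR-plus-regex floor can be as small as the sketch below, which pulls an invoice total out of raw OCR text. The pattern is an assumption about one invoice layout (a "Total" label followed by an amount with two decimals) and would need adapting per field and template.

```python
import re
from decimal import Decimal

# Matches "Total", "Total due:", "TOTAL  $" etc., but not "Subtotal"
# (the \b word boundary rules out a preceding letter).
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*([\d,]+\.\d{2})", re.IGNORECASE)


def regex_total(ocr_text):
    """Return the first labeled total as a Decimal, or None if not found."""
    match = TOTAL_RE.search(ocr_text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


ocr_text = "Subtotal: 1,000.00\nTax: 240.00\nTotal due: 1,240.00"
print(regex_total(ocr_text))
```

Scoring this baseline with the same field-level metrics as the LLM pipeline gives the number the LLM has to beat.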

Tools especially suited

Marker: PDF-to-markdown and PDF-to-structured-text.
Docling: PDF-to-markdown and PDF-to-structured-text.
Native VLMs (Claude, Gemini, GPT-4V): image-direct extraction without an OCR intermediate.
Azure Document Intelligence: the commercial OCR-plus-extraction stack.
Inspect AI: sandboxed reproducible runs across releases.

Common pitfalls

Pitfall	Why it bites	Fix
Holistic accuracy	"Did the JSON look right?" hides per-field failures	Score field-level precision and recall
No bbox grounding eval	An extracted value can be correct by luck without the model seeing the right region	Add bbox-overlap scoring
OCR baseline missing	You cannot tell whether the LLM adds value beyond OCR	Compare OCR-plus-regex against the LLM pipeline
Treating low-confidence extractions as failures	Headline accuracy penalizes the right behavior	Measure recall at an acceptable false-positive rate at the threshold, and route the rest to human-in-the-loop^[2]

Canonical reading

The DocVQA paper and challenges site are the canonical references, because DocVQA established ANLS (first proposed for scene-text VQA) as the document QA metric and set the per-document-class methodology that domain extensions follow^[1]^[3].

The Hamel field guide's advice to look at the data applies directly. Failure modes in document understanding are heavily template-specific, and the only way to find them is to read 50 failed extractions and group them^[2].

Day-1 starter

100 documents, 5 fields each, gold-annotated. Per-field precision and recall plus ANLS for any QA columns.
A bbox-overlap check for any field where the model is supposed to "find this on the page". Without it, the value can be right by luck.
Confidence threshold and human-in-the-loop routing. Measure recall at an acceptable false-positive (FP) rate at the threshold; this is the cost-quality frontier in production.
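The threshold measurement in the last item can be sketched as a sweep over a labeled eval set. The definitions here are one reading of the text and are assumptions: FP rate is the share of auto-accepted extractions that are wrong, and recall is the share of correct extractions that get auto-accepted. `pick_threshold` is a hypothetical helper.

```python
def pick_threshold(scored, max_fp_rate=0.01):
    """scored: list of (confidence, is_correct) from a labeled eval set.

    Returns (threshold, recall, fp_rate) for the lowest threshold whose
    auto-accepted set stays within the FP budget, or None if none does.
    """
    total_correct = sum(1 for _, ok in scored if ok)
    best = None
    # Walk thresholds from strictest to loosest; the last one that fits the
    # budget accepts the most, so it has the highest recall.
    for threshold in sorted({conf for conf, _ in scored}, reverse=True):
        accepted = [ok for conf, ok in scored if conf >= threshold]
        wrong = accepted.count(False)
        fp_rate = wrong / len(accepted)
        if fp_rate > max_fp_rate:
            continue
        recall = (len(accepted) - wrong) / total_correct if total_correct else 0.0
        best = (threshold, recall, fp_rate)
    return best


scored = [
    (0.99, True), (0.97, True), (0.95, True),
    (0.90, False), (0.85, True), (0.60, False),
]
print(pick_threshold(scored, max_fp_rate=0.0))
print(pick_threshold(scored, max_fp_rate=0.2))
```

Loosening the FP budget moves the threshold down and recall up, which is the frontier to plot per field before choosing an operating point.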

Cross-cutting links

Constructing your eval dataset: building the 100-doc, 5-field gold set without burning a quarter on labeling.
Sampling strategies for production traces: routing low-confidence extractions to humans without flooding them.
Golden dataset methodology: keeping field-level gold annotations maintainable as document templates change.
Inspect AI: scoring OCR-then-LLM and native-VLM pipelines in one harness.
Your first eval in 30 minutes: the labeled-traces-to-scorer loop, applicable per field.