What "good" looks like
Document understanding splits into extraction (structured fields out of a document) and QA (a free-text answer over a document). The primary metrics are field-level F1 for extraction and ANLS (Average Normalized Levenshtein Similarity) for QA. ANLS is the DocVQA metric; it gives partial credit when a predicted answer differs from the gold answer by a few characters, as happens with minor OCR errors. Human ANLS on DocVQA is reported at 94.36%, with strong systems closing the gap[1]. Field-level F1 is computed from per-field precision and recall, then averaged across the fields you care about; the per-field view is what catches the common failure mode where the model nails 9 fields and silently corrupts the 10th.
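As a rough illustration of how ANLS behaves, here is a minimal sketch. It assumes the usual DocVQA convention of a 0.5 cutoff on normalized edit distance and case-insensitive, whitespace-stripped comparison; the function names are illustrative and this is not the official evaluation script.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         gold_answers: Sequence[Sequence[str]],
         tau: float = 0.5) -> float:
    """Average over questions of the best similarity against any gold answer.

    Similarity is 1 - normalized edit distance, zeroed out when the
    normalized distance reaches tau (assumed 0.5 here).
    """
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

A one-character OCR slip such as "lnvoice" for "invoice" keeps most of its credit, while an unrelated answer scores zero.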
The "good" target depends on the field. Required-and-critical fields (invoice total, party names) should score above 0.95 field-level F1, with confidence calibrated well enough that extractions below the confidence threshold can fall back to human-in-the-loop review; less-critical fields can run lower.
Secondary metrics
- Layout robustness. Per-class accuracy on rotated, skewed, or low-resolution scans.
- Table extraction fidelity. Scored at the cell level, not the table level.
- Bbox grounding accuracy. Applies when the output includes a "where on the page" location.
- OCR error propagation. The rate at which OCR errors in upstream text translate to extraction errors downstream.
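Bbox grounding accuracy is commonly scored with intersection-over-union (IoU) between the predicted and gold boxes. The sketch below assumes boxes as `(x0, y0, x1, y1)` tuples in page coordinates and a hypothetical pass threshold of 0.5; both are illustrative choices, not something the benchmarks above prescribe.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred_box: Box, gold_box: Box, min_iou: float = 0.5) -> bool:
    """True when the predicted region overlaps the gold region enough.

    min_iou is an assumed threshold; tune it to your annotation tightness.
    """
    return iou(pred_box, gold_box) >= min_iou
```

Grounding accuracy is then the share of fields where the value is correct and `is_grounded` returns true.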
Offline eval design
Use DocVQA for QA breadth. On top of it, build a domain extraction set: 100 documents, 5 fields each, gold-annotated. The gold should include character-level spans so substring matching is verifiable. Score field-level F1 (per field, then macro-averaged) and ANLS where the answer is free-text. Run two pipelines side by side: OCR-then-LLM (Tesseract or commercial OCR plus an LLM extractor) and native VLM (Claude or Gemini reading the page directly). The two pipelines fail differently, and the comparison is informative.
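A minimal sketch of per-field scoring with a macro average follows. It assumes each document is a dict from field name to string value (or `None` when absent), uses exact string match, and counts a wrong value as both a false positive and a false negative; these conventions and the function names are assumptions, so adapt them to your own normalization rules.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_level_scores(gold: List[Record], pred: List[Record]) -> Dict[str, Dict[str, float]]:
    """Per-field precision, recall, and F1 over aligned gold/predicted documents."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for g_doc, p_doc in zip(gold, pred):
        for field in set(g_doc) | set(p_doc):
            g, p = g_doc.get(field), p_doc.get(field)
            if p is not None and p == g:
                counts[field]["tp"] += 1
            else:
                if p is not None:   # predicted something wrong or spurious
                    counts[field]["fp"] += 1
                if g is not None:   # gold value was missed or corrupted
                    counts[field]["fn"] += 1
    scores = {}
    for field, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[field] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-field F1."""
    return sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0
```

Report the per-field table alongside the macro average; the average alone can hide a single badly corrupted field.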
For documents with tables, add a cell-level table extraction score. The aggregate "table extracted" pass/fail rate is too coarse; cell-level recall is what predicts whether downstream analytics will be correct.
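One simple way to score at the cell level is to compare cells by position. The sketch below assumes tables are lists of rows of strings and that gold and predicted tables share the same row and column indexing; it does not handle shifted rows or merged cells, which a production scorer would need to address.

```python
from typing import Dict, List, Tuple

Table = List[List[str]]


def _cells(table: Table) -> Dict[Tuple[int, int], str]:
    """Flatten a table into {(row, col): stripped cell text}."""
    return {(r, c): value.strip()
            for r, row in enumerate(table)
            for c, value in enumerate(row)}


def cell_level_scores(gold: Table, pred: Table) -> Dict[str, float]:
    """Cell-level precision and recall under exact positional matching."""
    g, p = _cells(gold), _cells(pred)
    correct = sum(1 for pos, value in g.items() if p.get(pos) == value)
    return {
        "precision": correct / len(p) if p else 0.0,
        "recall": correct / len(g) if g else 0.0,
    }
```

A table with one wrong quantity out of six cells scores 5/6 here, whereas a table-level pass/fail would only report a failure.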
Online eval design
Field-level confidence on every extraction. Route low-confidence extractions to a human-in-the-loop queue; the cost-quality frontier in this category is set by the confidence-threshold trade-off. Alert on per-field accuracy drift by document type; a tax-form pipeline often drifts independently from an invoice pipeline because the form templates change separately. Bbox-grounded fields enable a secondary check: human reviewers can confirm the extraction by glancing at the boxed region without reading the whole document.
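The per-document-type drift alert can be sketched as a comparison between a baseline accuracy table and accuracy computed from recently reviewed samples. The input shape, the `max_drop` tolerance of 0.03, and the function names are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (document_type, field)


def accuracy_by_key(reviews: Iterable[Tuple[str, str, bool]]) -> Dict[Key, float]:
    """Accuracy per (document_type, field) from human-reviewed samples.

    Each review is (document_type, field, was_correct).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, seen]
    for doc_type, field, ok in reviews:
        totals[(doc_type, field)][0] += int(ok)
        totals[(doc_type, field)][1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


def drift_alerts(baseline: Dict[Key, float],
                 current: Dict[Key, float],
                 max_drop: float = 0.03) -> List[Key]:
    """Keys whose accuracy fell by more than max_drop relative to baseline."""
    return sorted(key for key, acc in current.items()
                  if key in baseline and baseline[key] - acc > max_drop)
```

Keying on document type as well as field is what lets a tax-form regression surface without being averaged away by a healthy invoice pipeline.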
Top benchmarks
- DocVQA, https://www.docvqa.org/. 50,000 questions over 12,000+ documents; human ANLS 94.36%.
- Your domain extraction set. 100 documents, 5 fields, gold-annotated. The eval that predicts deployment quality.
- Per-vendor OCR baseline. Establishes the floor; if the LLM does not beat raw OCR-plus-regex on your fields, the LLM is not earning its keep.
Tooling
Marker and Docling for PDF-to-markdown and PDF-to-structured-text. Native VLMs (Claude, Gemini, GPT-4V) for image-direct extraction without an OCR intermediate. Azure Document Intelligence for the commercial OCR-plus-extraction stack. Inspect AI for sandboxed reproducible runs across releases.
Common pitfalls
- Holistic accuracy. "Did the JSON look right?" is too coarse; field-level precision and recall catch the per-field failures that holistic scoring hides.
- No bbox grounding eval. An extracted value can be correct by luck without the model actually seeing the right region; bbox-overlap scoring catches this.
- OCR baseline missing. Without comparing OCR-plus-regex to the LLM pipeline, you cannot tell whether the LLM is adding value beyond OCR.
- Treating low-confidence extractions as failures. The right answer is human-in-the-loop routing; the eval should measure recall at an acceptable FP rate at the chosen confidence threshold, not headline accuracy[2].
Canonical reading
The DocVQA paper and challenges site are the canonical references because DocVQA introduced ANLS as the QA metric and set the per-document-class methodology that domain extensions follow[1][3]. The Hamel field guide framing about looking at data applies directly: the failure modes in document understanding are heavily template-specific, and the only way to find them is to read 50 misclassifications and group them[2].
Day-1 starter
- 100 documents, 5 fields each, gold-annotated. Per-field precision and recall plus ANLS for any QA columns.
- A bbox-overlap check for any field where the model is supposed to "find this on the page". Without it, the value can be right by luck.
- Confidence threshold and human-in-the-loop routing. Measure recall conditioned on an acceptable FP rate at the threshold; this is the cost-quality frontier in production.
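The threshold measurement in the last item can be sketched as a sweep over observed confidence values. This sketch assumes "FP rate" means the share of auto-accepted extractions that are wrong, and "recall" means the share of all correct extractions that get auto-accepted; other definitions are possible, so state yours explicitly. The function name is illustrative.

```python
from typing import List, Optional, Tuple


def recall_at_fp_rate(scored: List[Tuple[float, bool]],
                      max_fp_rate: float) -> Optional[Tuple[float, float]]:
    """Lowest confidence threshold whose auto-accepted set stays within max_fp_rate.

    scored holds (confidence, is_correct) pairs from a labeled eval set.
    Returns (threshold, recall), or None if no threshold satisfies the limit.
    """
    total_correct = sum(1 for _, ok in scored if ok)
    best = None
    # Sweep from strictest to loosest; the last threshold that passes is the lowest.
    for threshold in sorted({conf for conf, _ in scored}, reverse=True):
        accepted = [ok for conf, ok in scored if conf >= threshold]
        fp_rate = accepted.count(False) / len(accepted)
        if fp_rate <= max_fp_rate:
            recall = accepted.count(True) / total_correct if total_correct else 0.0
            best = (threshold, recall)
    return best
```

Everything below the returned threshold goes to the human-in-the-loop queue, so one minus the recall approximates the share of correct extractions that still cost reviewer time.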