Structured output and JSON mode

Emit valid JSON that conforms to a schema and carries correct types and values. The score is schema validity times semantic correctness, not validity alone.

What "good" looks like

A structured-output eval scores two things at once: does the output validate against the schema, and are the field values actually right. The primary metric is the product of those two rates, because either alone lies about production quality.

The ship targets on your domain set:

Validity above 99%, which strict-mode constrained decoding will get you to.
Per-field semantic accuracy in the 90s.

The two numbers track different problems, so report both. Validity alone is necessary but not sufficient, and semantic correctness alone says little when validity is 60%, because the outputs that fail to parse have no field values to score.
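As a worked example of the primary metric, the sketch below multiplies the two rates. The function name `combined_score` and the sample rates are illustrative assumptions, not measured values, and semantic accuracy is assumed to be computed only over outputs that validated.

```python
def combined_score(validity_rate: float, semantic_accuracy: float) -> float:
    """Primary metric: validity rate times semantic accuracy.

    Assumes semantic_accuracy is measured on the outputs that validated,
    so the product approximates the share of all outputs that are both
    valid and correct.
    """
    for name, rate in (("validity_rate", validity_rate),
                       ("semantic_accuracy", semantic_accuracy)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {rate}")
    return validity_rate * semantic_accuracy


# Illustrative numbers only: neither axis can compensate for the other.
strict_like = combined_score(0.995, 0.93)  # high validity, decent values
low_validity = combined_score(0.60, 0.98)  # good values, poor validity
print(round(strict_like, 3), round(low_validity, 3))
```

Reporting the product next to its two factors keeps the weaker axis visible instead of hiding it inside one number.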

No standalone benchmark exists because the failure modes are domain-specific: which fields, which types, which enums. The BFCL v4 format-sensitivity subset is the closest public proxy, because it scores both AST-equivalence and value correctness on function-call outputs^[1].

flowchart TD
    IN["Input"] --> GEN["Model generates JSON"]
    GEN --> V{"jsonschema validates?"}
    V -->|no| LOG["Log raw output, retry or reject"]
    V -->|yes| SEM{"Field values correct?"}
    SEM -->|no| FIELD["Per-field failure: enum, date, currency, ID"]
    SEM -->|yes| GATE{"Validity above 99% and per-field in the 90s?"}
    GATE -->|no| FIX["Fix the weaker axis"]
    GATE -->|yes| SHIP["Ship"]

Secondary metrics

Metric	What it catches	How to score
Field omission rate	Optional fields that should have been populated but were left out	Fraction of expected fields missing from the output
Extra-field rate	Fields the model added that are not in the schema	Count of out-of-schema keys per output
Enum hallucination	An enum value not in the allowed set	Membership check against the allowed set
Format adherence	Malformed dates, currencies, and identifiers	Format checks for each typed field
Retry count	Outputs that need repeated attempts, such as constrained decoding looping	Retries per output against a fixed budget; doubles as a cost signal
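Three of the metrics in the table can be computed from a single parsed output. The following is a minimal sketch; the function name, the field names, and the enum sets are hypothetical and stand in for whatever your schema defines.

```python
from typing import Any


def secondary_metrics(
    output: dict[str, Any],
    expected_fields: set[str],
    schema_fields: set[str],
    enums: dict[str, set[str]],
) -> dict[str, Any]:
    """Omission rate, extra-field count, and enum hallucinations for one output."""
    present = set(output)
    missing = expected_fields - present   # expected but left out
    extra = present - schema_fields       # keys the schema does not define
    bad_enums = sorted(
        field
        for field, allowed in enums.items()
        # A non-string value can never be a member of a string enum.
        if field in output
        and not (isinstance(output[field], str) and output[field] in allowed)
    )
    return {
        "omission_rate": len(missing) / len(expected_fields) if expected_fields else 0.0,
        "extra_field_count": len(extra),
        "enum_hallucinations": bad_enums,
    }


# Hypothetical order-extraction output with one problem of each kind.
metrics = secondary_metrics(
    output={"status": "shipped_maybe", "currency": "USD", "note": "rush"},
    expected_fields={"status", "currency", "order_id"},
    schema_fields={"status", "currency", "order_id", "due_date"},
    enums={"status": {"pending", "shipped", "cancelled"}, "currency": {"USD", "EUR"}},
)
print(metrics)
```

Averaging these per-output values across the fixture set gives the rates in the table.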

Offline eval design

Build the fixture set. Assemble 100 inputs with gold JSON. Score with jsonschema validate plus per-field accuracy:

String equality for IDs and enums.
Type-and-range checks for numbers.
ISO-format checks for dates.
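The three checks above can be sketched with the standard library alone. Everything named here (`CHECKS`, the order fields, the 1 to 1000 quantity range) is a hypothetical example, and the sketch assumes schema validation has already run and that every gold record is complete. Comparing numbers to gold on top of the type-and-range check is an added assumption.

```python
from datetime import date
from typing import Any, Callable

Check = Callable[[Any, Any], bool]


def exact(pred: Any, gold: Any) -> bool:
    """String equality, for IDs and enums."""
    return isinstance(pred, str) and pred == gold


def number_in_range(lo: float, hi: float) -> Check:
    """Type-and-range check for numbers, plus equality with gold."""
    def check(pred: Any, gold: Any) -> bool:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(pred, bool) or not isinstance(pred, (int, float)):
            return False
        return lo <= pred <= hi and pred == gold
    return check


def iso_date(pred: Any, gold: Any) -> bool:
    """ISO-format check for dates; gold is assumed to be a valid ISO date."""
    if not isinstance(pred, str):
        return False
    try:
        return date.fromisoformat(pred) == date.fromisoformat(gold)
    except ValueError:
        return False


# Hypothetical field-to-check mapping for an order-extraction schema.
CHECKS: dict[str, Check] = {
    "order_id": exact,
    "status": exact,
    "quantity": number_in_range(1, 1000),
    "due_date": iso_date,
}


def per_field_accuracy(pairs: list[tuple[dict, dict]]) -> dict[str, float]:
    """Fraction of (prediction, gold) pairs in which each field is correct."""
    if not pairs:
        raise ValueError("need at least one (prediction, gold) pair")
    hits = dict.fromkeys(CHECKS, 0)
    for pred, gold in pairs:
        for field, check in CHECKS.items():
            if field in pred and check(pred[field], gold[field]):
                hits[field] += 1
    return {field: hits[field] / len(pairs) for field in CHECKS}


gold = {"order_id": "A-1", "status": "shipped", "quantity": 3, "due_date": "2024-05-01"}
pairs = [
    (dict(gold), gold),  # fully correct prediction
    ({"order_id": "A-1", "status": "shiped", "quantity": 0, "due_date": "05/01/2024"}, gold),
]
accuracy = per_field_accuracy(pairs)
print(accuracy)
```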

Compare strict against unconstrained. Run strict mode (constrained decoding via the provider's schema enforcement) against unconstrained generation (prompt-engineered JSON). Strict mode usually wins on validity and can lose a few points on semantic correctness, because the constraint can force the model into a local optimum that satisfies the schema but is wrong. The eval should make both effects visible.
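One way to make both numbers visible per mode is a small side-by-side summary. This is a sketch under assumptions: the `Result` record and the toy counts are invented for illustration, and semantic accuracy is computed only over outputs that validated.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One eval output: whether it validated and how many fields were right."""
    valid: bool
    fields_correct: int = 0
    fields_total: int = 0


def summarize(results: list[Result]) -> dict[str, float]:
    """Validity rate over all outputs; semantic accuracy over valid ones."""
    if not results:
        raise ValueError("no results to summarize")
    valid = [r for r in results if r.valid]
    total_fields = sum(r.fields_total for r in valid)
    correct_fields = sum(r.fields_correct for r in valid)
    return {
        "validity": len(valid) / len(results),
        "semantic": correct_fields / total_fields if total_fields else 0.0,
    }


def compare(strict: list[Result], unconstrained: list[Result]) -> dict[str, dict[str, float]]:
    """Report both axes for both modes so the trade is not hidden."""
    return {"strict": summarize(strict), "unconstrained": summarize(unconstrained)}


# Toy data, not measurements: strict validates everything but misses a value.
report = compare(
    strict=[Result(True, 3, 4), Result(True, 4, 4)],
    unconstrained=[Result(True, 4, 4), Result(False)],
)
print(report)
```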

Test for format sensitivity. Sclar et al. showed that swapping plausible prompt formats can swing accuracy by up to 76 points on the same underlying task^[2]. Format sensitivity is a real and underestimated source of variance. Run multiple seeds and multiple plausible prompt phrasings, not just one.
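A simple way to quantify this is the spread between the best and worst prompt variant, alongside seed-to-seed noise. The sketch below assumes you already have one accuracy figure per (variant, seed) run; the variant names and scores are made up.

```python
from statistics import mean, pstdev


def format_sensitivity(scores: dict[str, list[float]]) -> dict[str, float]:
    """scores maps a prompt-variant name to its accuracy for each seed."""
    variant_means = {name: mean(runs) for name, runs in scores.items()}
    return {
        # Gap between the best and worst prompt format, averaged over seeds.
        "spread": max(variant_means.values()) - min(variant_means.values()),
        "mean": mean(variant_means.values()),
        # Largest seed-to-seed standard deviation within any one variant.
        "worst_seed_stdev": max(pstdev(runs) for runs in scores.values()),
    }


# Hypothetical accuracies for three phrasings, two seeds each.
sensitivity = format_sensitivity({
    "json_block": [0.90, 0.92],
    "yaml_like": [0.80, 0.84],
    "bare_instruction": [0.70, 0.70],
})
print(sensitivity)
```

If the spread across variants is much larger than the seed noise, the prompt format itself is driving the score, and a single-prompt result should not be trusted.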

Online eval design

Validate at the boundary. The boundary is the system contract: invalid JSON should never propagate past it. Log parse failures together with the raw output so the failure mode is debuggable.
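A minimal sketch of such a boundary follows. The `validate` argument is a hypothetical hook that is assumed to raise `ValueError` on a schema violation; a call to `jsonschema.validate` would need a thin wrapper, since its `ValidationError` is not a `ValueError` subclass. The `require_order` validator is a toy stand-in for a real schema check.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("structured_output")


def parse_at_boundary(raw: str, validate: Callable[[Any], None]) -> Optional[Any]:
    """Return parsed data only if it parses and validates; otherwise log and reject."""
    try:
        data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
        validate(data)
    except ValueError as exc:
        # Keep the raw output in the log so the failure mode is debuggable.
        logger.warning("rejected model output: %s | raw=%r", exc, raw)
        return None
    return data


def require_order(data: Any) -> None:
    """Toy validator standing in for a real schema check."""
    if not isinstance(data, dict) or "order_id" not in data:
        raise ValueError("expected an object with an 'order_id' key")


accepted = parse_at_boundary('{"order_id": "A-1"}', require_order)
truncated = parse_at_boundary('{"order_id": ', require_order)
wrong_shape = parse_at_boundary("[]", require_order)
print(accepted, truncated, wrong_shape)
```

Returning `None` is one design choice; raising a dedicated exception works equally well, as long as nothing downstream ever sees the unvalidated payload.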

Alert on the right signals. Alert on validity-rate drops greater than 1% and on any regression in a field's accuracy. The extra-fields alert is often the leading indicator of a model swap or prompt regression that leaves validity untouched.
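These alert rules can be sketched as a comparison between a baseline window and the current window. The metric dictionary layout is a hypothetical one, and the sketch reads the 1% threshold as one percentage point, which is an assumption about the intended meaning.

```python
def check_alerts(baseline: dict, current: dict,
                 max_validity_drop_pts: float = 1.0) -> list[str]:
    """Compare two metric snapshots and return the alerts that fire."""
    fired = []
    drop_pts = (baseline["validity"] - current["validity"]) * 100
    if drop_pts > max_validity_drop_pts:
        fired.append(f"validity dropped {drop_pts:.1f} points")
    for field, base_acc in baseline["field_accuracy"].items():
        # A field missing from the current snapshot counts as fully regressed.
        cur_acc = current["field_accuracy"].get(field, 0.0)
        if cur_acc < base_acc:
            fired.append(f"{field} accuracy regressed: {base_acc:.2f} -> {cur_acc:.2f}")
    if current["extra_field_rate"] > baseline["extra_field_rate"]:
        fired.append("extra-field rate rose")
    return fired


# Hypothetical snapshots: validity and accuracy hold, but extra fields appear.
baseline = {"validity": 0.995, "extra_field_rate": 0.0,
            "field_accuracy": {"due_date": 0.95, "status": 0.97}}
current = {"validity": 0.995, "extra_field_rate": 0.04,
           "field_accuracy": {"due_date": 0.95, "status": 0.97}}
fired = check_alerts(baseline, current)
print(fired)
```

In practice the per-field comparison likely needs a tolerance or a minimum sample size, otherwise ordinary sampling noise will trigger it.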

Go per-field for critical products. If you ship a critical structured-output product, build a per-field accuracy dashboard. Per-field beats global because the per-field failure modes (date parsing, enum drift, currency formatting) are independent, and the global number aggregates them away.

Top benchmarks

BFCL v4 format-sensitivity subset, https://gorilla.cs.berkeley.edu/leaderboard.html. The closest public proxy; scores both AST-equivalence and execution.
Instructor and Outlines test suites. Domain-specific; useful as scaffolding but not benchmarks per se.
Your domain fixture set. The eval that actually predicts production quality.

Tools especially suited

Instructor: Pydantic-based output validation with auto-retry on validation error (over 3 million monthly downloads)^[3].
Outlines: context-free-grammar-level constraints during decoding^[4].
Provider strict mode (Anthropic, OpenAI, Google): schema-shape guarantees during generation^[5].
jsonschema: the Python library for the validity check.
Inspect AI: sandboxed runs with field-level assertion primitives.

Common pitfalls

Pitfall	Why it bites	Fix
Measuring only "is JSON" without measuring values	The single most common mistake; validity is necessary but not sufficient	Score per-field semantic correctness alongside validity
Letting the model choose the schema	The schema is the contract; if the model can vary it, downstream code cannot rely on anything	Fix the schema and bind both prompt and eval to it
Not testing constrained and unconstrained side by side	Strict mode wins on validity and sometimes loses on value correctness; reporting only one hides the trade	Run both and report both numbers
Ignoring format sensitivity	Accuracy varies widely across plausible prompt rewordings; a single-prompt eval understates the noise	Test multiple seeds and phrasings^[2]

Canonical reading

The Outlines documentation is the most useful single reading, because it frames the structured-output problem as a decoding-time constraint problem rather than a prompt-engineering problem^[4].

The framing is the operational insight: if you want guaranteed validity, constrain decoding; if you want maximum flexibility, accept retries. Pair with Instructor's docs for the Pydantic-plus-retry pattern that most teams use in production^[3].
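The accept-retries side of that trade can be sketched as a validate-and-retry loop. This is a generic illustration, not Instructor's implementation: `generate` is a hypothetical callable wrapping your model call, `validate` is assumed to raise `ValueError` on a bad payload, and the fake generator only simulates one truncated output.

```python
import json
from typing import Any, Callable


def generate_with_retries(
    generate: Callable[[str], str],
    prompt: str,
    validate: Callable[[Any], None],
    max_retries: int = 2,
) -> tuple[Any, int]:
    """Return (validated data, retries used); raise once the budget is spent."""
    last_error = None
    for attempt in range(max_retries + 1):
        # On a retry, feed the previous error back to the model.
        retry_note = "" if last_error is None else f"\n\nPrevious output was rejected: {last_error}"
        raw = generate(prompt + retry_note)
        try:
            data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
            validate(data)
            return data, attempt
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_retries} retries: {last_error}")


def check_status(data: Any) -> None:
    """Toy validator: require a known status value."""
    if not isinstance(data, dict) or data.get("status") not in {"pending", "shipped"}:
        raise ValueError("status must be 'pending' or 'shipped'")


# Fake model: first output is truncated JSON, second is valid.
_canned = iter(['{"status": ', '{"status": "shipped"}'])
data, retries_used = generate_with_retries(lambda p: next(_canned), "Extract the order.", check_status)
print(data, retries_used)
```

The returned retry count is the per-output figure that the retry-count metric tracks, so it is worth logging rather than discarding.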

Day-1 starter

Pydantic models for every output type. The model is the contract; both the prompt and the eval bind to it.
50 golden input-to-output pairs in CI. Per-field accuracy, not aggregate; the per-field view is what catches enum drift and date-format regressions.
Per-field accuracy dashboard with alert thresholds. The dashboard is the production eval; the CI run is the pre-deployment gate.

Cross-cutting links

Tool-call evaluation: AST, schema, execution: the same schema-then-semantics discipline, applied to function arguments.
Wiring evals into CI: running the 50-pair golden set on every PR.
Golden dataset methodology: keeping input-to-gold-JSON fixtures maintainable as the schema evolves.
Promptfoo: assertion-based JSON checks in CI.
TypeScript eval with Vercel AI SDK: Zod-schema evals end to end.