What "good" looks like
There is no widely adopted public benchmark for "JSON mode" per se, because the failure modes are domain-specific: which fields, which types, which enums. The right primary metric is the product of two rates: schema validity (does the output pass jsonschema validation?) and semantic correctness (are the field values right?). Validity alone is necessary but not sufficient; semantic correctness alone is hard to interpret when validity is 60%, because the fields of an output that fails validation cannot be scored reliably. The BFCL v4 format-sensitivity subset is the closest public proxy because it scores both AST-equivalence and value correctness on function-call outputs[1].
The "good" target is validity above 99% (strict-mode constrained decoding will get you there) and per-field semantic accuracy in the 90% range on your domain set. The two numbers track different problems; report both.
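A minimal sketch of how the two rates might be computed and reported side by side. The `Result` record layout is a hypothetical convention for illustration, and scoring semantic accuracy over valid outputs only is one reasonable choice, not the only one.

```python
from dataclasses import dataclass

@dataclass
class Result:
    valid: bool          # did the output pass schema validation?
    fields_correct: int  # fields matching gold (0 when invalid)
    fields_total: int    # fields present in the gold record

def headline_metrics(results):
    """Return (validity, semantic accuracy on valid outputs, product)."""
    if not results:
        raise ValueError("no results to score")
    valid = [r for r in results if r.valid]
    validity = len(valid) / len(results)
    total = sum(r.fields_total for r in valid)
    semantic = sum(r.fields_correct for r in valid) / total if total else 0.0
    return validity, semantic, validity * semantic

# Illustrative data: three valid outputs, one invalid.
results = [
    Result(True, 4, 4),
    Result(True, 3, 4),
    Result(False, 0, 4),
    Result(True, 4, 4),
]
validity, semantic, primary = headline_metrics(results)
print(f"validity={validity:.2f} semantic={semantic:.2f} primary={primary:.2f}")
```

Reporting all three numbers keeps a validity problem from hiding behind a good semantic score, and the reverse.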
Secondary metrics
- Field omission rate. The model leaves out optional fields that should have been populated.
- Extra-field rate. The model adds fields not in the schema.
- Enum hallucination. The model writes an enum value not in the allowed set.
- Format adherence for dates, currencies, and identifiers.
- Retry count under constrained decoding. Constrained decoding can loop, and the per-output retry budget is a cost signal.
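The first three rates can be counted directly from parsed outputs and gold records. The sketch below assumes flat JSON objects; the field names and the `enums` mapping are hypothetical examples.

```python
def secondary_metrics(outputs, golds, schema_fields, enums):
    """Omission, extra-field, and enum-hallucination rates.

    outputs, golds: parallel lists of dicts (parsed JSON objects).
    schema_fields:  set of field names the schema defines.
    enums:          field name -> tuple of allowed values.
    """
    omitted = expected = extra = emitted = bad_enum = enum_seen = 0
    for out, gold in zip(outputs, golds):
        # Omission: a field the gold record populates is missing.
        expected += len(gold)
        omitted += sum(1 for k in gold if k not in out)
        # Extra: an emitted field the schema does not define.
        emitted += len(out)
        extra += sum(1 for k in out if k not in schema_fields)
        # Enum hallucination: an emitted value outside the allowed set.
        for field, allowed in enums.items():
            if field in out:
                enum_seen += 1
                bad_enum += int(out[field] not in tuple(allowed))

    def rate(n, d):
        return n / d if d else 0.0

    return {
        "omission_rate": rate(omitted, expected),
        "extra_field_rate": rate(extra, emitted),
        "enum_hallucination_rate": rate(bad_enum, enum_seen),
    }

schema_fields = {"id", "status", "note"}
enums = {"status": ("open", "closed")}
golds = [{"id": "a1", "status": "open", "note": "x"},
         {"id": "a2", "status": "closed"}]
outputs = [{"id": "a1", "status": "open"},                         # omits "note"
           {"id": "a2", "status": "resolved", "confidence": 0.9}]  # bad enum, extra field
metrics = secondary_metrics(outputs, golds, schema_fields, enums)
print(metrics)
```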
Offline eval design
Build a fixture set of 100 inputs with gold JSON. Score with jsonschema validation plus per-field accuracy (string equality for IDs and enums, type-and-range checks for numbers, ISO-format checks for dates). Compare strict mode (constrained decoding via the provider's schema enforcement) against unconstrained generation (prompt-engineered JSON). Strict mode usually wins on validity and can lose a few points on semantic correctness, because the constraint can force the model into a local optimum that satisfies the schema but is wrong. The eval should make both visible.
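The per-field scoring rules can be sketched as a table of small checker functions. The field names, the numeric range, and the choice to require dates to match gold as well as parse are assumptions for illustration.

```python
from datetime import date

def exact(pred, gold):
    """String equality, for IDs and enums."""
    return pred == gold

def number_in_range(lo, hi):
    """Type-and-range check for numbers."""
    def check(pred, gold):
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(pred, (int, float)) and not isinstance(pred, bool)
        return is_number and lo <= pred <= hi
    return check

def iso_date(pred, gold):
    """ISO-format check: must parse as an ISO date and equal the gold date."""
    if not isinstance(pred, str):
        return False
    try:
        date.fromisoformat(pred)
    except ValueError:
        return False
    return pred == gold

# Hypothetical schema: field name -> checker.
SCORERS = {
    "invoice_id": exact,
    "status": exact,
    "amount": number_in_range(0, 1_000_000),
    "due": iso_date,
}

def per_field_accuracy(preds, golds, scorers):
    """Accuracy per field; a field missing from the prediction counts as wrong."""
    hits, totals = {}, {}
    for pred, gold in zip(preds, golds):
        for field, check in scorers.items():
            if field not in gold:
                continue
            ok = field in pred and check(pred[field], gold[field])
            totals[field] = totals.get(field, 0) + 1
            hits[field] = hits.get(field, 0) + int(ok)
    return {f: hits[f] / totals[f] for f in totals}

golds = [
    {"invoice_id": "INV-1", "status": "paid", "amount": 99.0, "due": "2024-02-01"},
    {"invoice_id": "INV-2", "status": "open", "amount": 120.5, "due": "2024-03-01"},
]
preds = [
    {"invoice_id": "INV-1", "status": "paid", "amount": 99.0, "due": "2024-02-01"},
    {"invoice_id": "INV-2", "status": "open", "amount": "120.50", "due": "03/01/2024"},
]
accuracy = per_field_accuracy(preds, golds, SCORERS)
print(accuracy)
```

Running the same function once on strict-mode outputs and once on unconstrained outputs gives the side-by-side comparison, field by field.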
Sclar et al. showed that swapping plausible prompt formats can swing accuracy by up to 76 points on the same underlying task[2]; format sensitivity is a real and underestimated source of variance. Run multiple seeds and multiple plausible prompt phrasings, not just one.
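One way to make that variance visible is to record accuracy per prompt variant and per seed, then report the spread. The variant names and scores below are made-up placeholders, not measurements.

```python
from statistics import mean

def format_spread(scores):
    """scores: {prompt_variant: [accuracy for each seed, ...]}"""
    per_variant = {name: mean(vals) for name, vals in scores.items()}
    best = max(per_variant, key=per_variant.get)
    worst = min(per_variant, key=per_variant.get)
    return {
        "per_variant": per_variant,
        "best": best,
        "worst": worst,
        # Gap between the best and worst phrasing, averaged over seeds.
        "spread": per_variant[best] - per_variant[worst],
        # Seed-to-seed range within each phrasing.
        "seed_range": {n: max(v) - min(v) for n, v in scores.items()},
    }

# Placeholder numbers for three hypothetical phrasings, three seeds each.
scores = {
    "json_in_prose": [0.90, 0.92, 0.91],
    "bulleted_fields": [0.84, 0.80, 0.82],
    "xml_tags": [0.88, 0.87, 0.89],
}
report = format_spread(scores)
print(report["best"], report["worst"], round(report["spread"], 3))
```

If the spread across phrasings is larger than the difference between two candidate models, a single-prompt comparison of those models is not informative.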
Online eval design
Validate every output at the boundary. The boundary is the system contract: invalid JSON should never propagate. Log parse failures with the raw output so the failure mode is debuggable. Alert on validity-rate drops greater than 1% and on a regression in any field's accuracy. The "extra fields" alert is often the leading indicator of a model swap or prompt regression that does not affect validity.
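A boundary check along these lines might look like the sketch below. The `CONTRACT` mapping is a deliberately small stand-in for a real JSON Schema check, and the field names and logger name are hypothetical.

```python
import json
import logging

logger = logging.getLogger("structured_output")

class InvalidOutput(ValueError):
    """Raised when model output violates the contract."""

# Hypothetical contract: field name -> accepted Python type(s).
CONTRACT = {"id": str, "status": str, "amount": (int, float)}

def parse_at_boundary(raw):
    """Parse and check raw model output; never return an invalid object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Keep the raw text so the failure mode can be debugged later.
        logger.error("parse failure: %s | raw=%r", exc, raw)
        raise InvalidOutput("output is not JSON") from exc
    if not isinstance(data, dict):
        logger.error("not an object | raw=%r", raw)
        raise InvalidOutput("output is not a JSON object")
    problems = [
        name for name, types in CONTRACT.items()
        if name not in data or not isinstance(data[name], types)
    ]
    if problems:
        logger.error("contract violation in %s | raw=%r", problems, raw)
        raise InvalidOutput(f"missing or mistyped fields: {problems}")
    extra = sorted(set(data) - set(CONTRACT))
    if extra:
        # Valid but unexpected: log as a drift signal rather than failing.
        logger.warning("extra fields: %s", extra)
    return data

record = parse_at_boundary('{"id": "a1", "status": "open", "amount": 3}')
print(record)
```

Raising a dedicated exception type lets callers distinguish contract violations from other errors, and the warning on extra fields gives the drift signal a place to be counted.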
If you ship a critical structured-output product, build a per-field accuracy dashboard. Per-field beats global because the per-field failure modes (date parsing, enum drift, currency formatting) are independent and the global number aggregates them away.
Top benchmarks
- BFCL v4 format-sensitivity subset, https://gorilla.cs.berkeley.edu/leaderboard.html. The closest public proxy; scores both AST-equivalence and execution.
- Instructor and Outlines test suites. Domain-specific; useful as scaffolding but not benchmarks per se.
- Your domain fixture set. The eval that actually predicts production quality.
Tooling
- Instructor. Pydantic-based output validation with auto-retry on validation error (over 3 million monthly downloads)[3].
- Outlines. Context-free-grammar-level constraints during decoding[4].
- Provider strict mode (Anthropic, OpenAI, Google). Schema-shape guarantees during generation[5].
- The Python jsonschema library. The validity check.
- Inspect AI. Sandboxed runs with field-level assertion primitives.
Common pitfalls
- Measuring only "is JSON" without measuring values. The most common single mistake; validity is necessary but not sufficient.
- Letting the model choose the schema. The schema is the contract; if the model can vary it, the downstream code cannot rely on anything.
- Not testing constrained and unconstrained side by side. Strict mode wins on validity and sometimes loses on value correctness; reporting only one hides the trade.
- Ignoring format sensitivity. Format-sensitivity findings show large variance across plausible prompt rewordings; a single-prompt eval understates the noise[2].
Canonical reading
The Outlines documentation is the most useful single reading because it frames the structured-output problem as a decoding-time constraint problem rather than a prompt-engineering problem[4]. The framing is the operational insight: if you want guaranteed validity, constrain decoding; if you want maximum flexibility, accept retries. Pair with Instructor's docs for the Pydantic-plus-retry pattern that most teams use in production[3].
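The validate-and-retry pattern can be sketched generically as below. This is not Instructor's API: `call_model` and `validate` are hypothetical callables supplied by the caller, and feeding the error message back into the prompt is one common way to retry, assumed here for illustration.

```python
import json

def generate_with_retries(call_model, validate, prompt, max_retries=2):
    """Return (parsed_output, retries_used); raise once the budget is spent."""
    feedback = ""
    for attempt in range(max_retries + 1):
        raw = call_model(prompt + feedback)
        try:
            data = json.loads(raw)
            validate(data)  # expected to raise ValueError on a bad output
            return data, attempt
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            feedback = (f"\n\nPrevious output was rejected: {exc}. "
                        "Return corrected JSON only.")
    raise RuntimeError(f"no valid output after {max_retries + 1} attempts")

# Stub model: first answer uses a disallowed status, second is valid.
responses = iter(['{"status": "maybe"}', '{"status": "open"}'])

def fake_model(prompt):
    return next(responses)

def validate(data):
    if data.get("status") not in ("open", "closed"):
        raise ValueError("status must be 'open' or 'closed'")

data, retries = generate_with_retries(fake_model, validate, "Extract the ticket status.")
print(data, retries)  # {'status': 'open'} 1
```

Returning the retry count alongside the output makes the retry budget loggable as a cost metric.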
Day-1 starter
- Pydantic models for every output type. The Pydantic model is the contract; both the prompt and the eval bind to it.
- 50 golden input-to-output pairs in CI. Per-field accuracy, not aggregate; the per-field view is what catches enum drift and date-format regressions.
- Per-field accuracy dashboard with alert thresholds. The dashboard is the production eval; the CI run is the pre-deployment gate.
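A per-field gate for CI or alerting can be as small as a comparison against a stored baseline. The field names, accuracy values, and the 0.02 tolerance below are placeholders; pick thresholds from your own data.

```python
from statistics import mean

def regressions(baseline, current, tolerance=0.02):
    """Fields whose accuracy fell more than `tolerance` below baseline.

    A field missing from `current` is treated as accuracy 0.0.
    """
    return {
        field: (baseline[field], current.get(field, 0.0))
        for field in baseline
        if baseline[field] - current.get(field, 0.0) > tolerance
    }

# Placeholder per-field accuracies from a previous and a current run.
baseline = {"status": 0.98, "due": 0.95, "amount": 0.97}
current = {"status": 0.98, "due": 0.88, "amount": 0.96}

failed = regressions(baseline, current)
global_drop = mean(baseline.values()) - mean(current.values())
print(failed)                 # only the "due" field is flagged
print(round(global_drop, 3))  # the aggregate drop is much smaller than the per-field drop
```

In a CI job, a non-empty `failed` dict would fail the build; in production, the same comparison against a rolling baseline would drive the alert.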