What "good" looks like
The primary metric is execution accuracy (EX) on BIRD-bench, the 12,751-pair benchmark across 95 databases (33.4 GB total) that introduces realistic dirty-data and efficiency considerations beyond Spider[1]. BIRD reports human EX at 92.96% and top systems at 81.95%; the gap to humans is real and roughly stable across leading frontier models. BIRD also defines R-VES (Reward-based Valid Efficiency Score), which jointly rewards correctness and query efficiency. Reporting both EX and R-VES is the responsible default because a correct-but-slow query is broken in production.
Spider remains useful for cleaner ablation testing but is largely saturated for frontier models[2]. BIRD is where the discrimination lives in 2026.
Secondary metrics
- Query validity rate. Did the generated SQL parse and execute, regardless of correctness?
- Execution time and cost. A 10x cost increase versus the gold query is a regression even if the rows match.
- Row-set equivalence. This handles ORDER BY ambiguity and column-order differences that exact-match scoring penalizes.
- Explain-plan red flags. The model writes valid SQL that triggers a full table scan or an exploding join.
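The explain-plan red-flag check described above can be sketched against SQLite, whose `EXPLAIN QUERY PLAN` output marks full scans with a detail string beginning with `SCAN`. This is a minimal sketch under assumptions: the `orders` table and the helper name `full_scan_steps` are hypothetical, and other warehouses expose plans in different formats, so the string test would need adapting.

```python
import sqlite3


def full_scan_steps(conn, sql):
    """Return the plan steps that SQLite reports as full scans.

    Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail);
    a full scan has a detail string that starts with "SCAN",
    an index or primary-key lookup starts with "SEARCH".
    """
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in plan if row[3].startswith("SCAN")]


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)

# Filtering on an unindexed column forces a scan of the whole table.
flagged = full_scan_steps(conn, "SELECT * FROM orders WHERE customer = 'acme'")

# Filtering on the primary key is a lookup, so nothing is flagged.
clean = full_scan_steps(conn, "SELECT * FROM orders WHERE id = 1")
```

A scan on a small lookup table is harmless, so in practice the flag is more useful when combined with table size than as a hard failure.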
Offline eval design
Use Spider for breadth and BIRD as the harder discriminator. On top of the public sets, build a warehouse-specific set of your top 100 ad-hoc queries with gold SQL and expected row-set hashes. The row-set hash is the artifact that makes scoring deterministic across ORDER BY ambiguity. Run on a copy of the production schema, sanitized of PII; the dirty-data behavior in BIRD is what makes the score predict production quality, and you only get that if your eval schema looks like production.
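One way to build the row-set hash is to serialize each row, sort the serialized rows, and hash the result, so that two queries returning the same rows in a different order score the same. The sketch below is an illustration rather than a standard: the function name `row_set_hash`, the JSON serialization, and the `ordered` flag are all assumptions.

```python
import hashlib
import json
import sqlite3


def row_set_hash(rows, ordered=False):
    """Hash a result set, ignoring row order unless ordered=True."""
    # Serialize each row to a stable string; default=str covers values
    # that JSON cannot encode directly, such as bytes.
    encoded = [json.dumps(list(row), default=str) for row in rows]
    if not ordered:
        # Sorting gives multiset semantics: duplicate rows still count.
        encoded.sort()
    digest = hashlib.sha256()
    for line in encoded:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

ascending = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()
descending = conn.execute("SELECT id, name FROM t ORDER BY id DESC").fetchall()

same_set = row_set_hash(ascending) == row_set_hash(descending)
same_order = row_set_hash(ascending, ordered=True) == row_set_hash(
    descending, ordered=True
)
```

When the question itself asks for a ranking, pass `ordered=True` so that a wrong sort order still counts as a failure. This sketch does not normalize column order or numeric types (`1` and `1.0` hash differently); both would need handling before using it on a real warehouse.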
For each query, capture EX (correctness), R-VES (efficiency), execution time, and the explain-plan. The composite tells you whether the model is shipping correct, performant queries or correct queries that will melt the warehouse at 9 AM.
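A per-query record along these lines could be captured with a small harness. The sketch below runs against SQLite and uses a tiered reward on the gold-to-predicted time ratio in the spirit of R-VES. The tier boundaries and reward values are illustrative assumptions, not a reproduction of BIRD's official scorer, and the names `QueryScore`, `efficiency_reward`, and `score_query` are hypothetical.

```python
import sqlite3
import time
from dataclasses import dataclass


@dataclass
class QueryScore:
    valid: bool          # did the predicted SQL execute at all?
    ex: bool             # do the result rows match the gold rows?
    gold_seconds: float
    pred_seconds: float
    reward: float        # efficiency-weighted score, 0.0 when incorrect


def timed_rows(conn, sql):
    """Execute sql and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def efficiency_reward(correct, gold_seconds, pred_seconds):
    """Tiered reward on the time ratio; tiers are illustrative assumptions."""
    if not correct:
        return 0.0
    ratio = gold_seconds / max(pred_seconds, 1e-9)
    if ratio >= 2:
        return 1.25   # predicted query is at least twice as fast as gold
    if ratio >= 1:
        return 1.0
    if ratio >= 0.5:
        return 0.75
    if ratio >= 0.25:
        return 0.5
    return 0.25       # correct, but more than 4x slower than gold


def score_query(conn, gold_sql, pred_sql):
    gold_rows, gold_s = timed_rows(conn, gold_sql)
    try:
        pred_rows, pred_s = timed_rows(conn, pred_sql)
    except sqlite3.Error:
        # Invalid SQL: counts against validity and correctness.
        return QueryScore(False, False, gold_s, 0.0, 0.0)
    # Order-insensitive comparison of the two result sets.
    correct = sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))
    reward = efficiency_reward(correct, gold_s, pred_s)
    return QueryScore(True, correct, gold_s, pred_s, reward)
```

Single-run timings are noisy, so a real harness would repeat each query several times and aggregate before computing the ratio.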
Online eval design
Log every natural-language question, the generated SQL, the execution result, and the user's next action. The user-edit rate is the most useful single online signal: if users edit the generated SQL before running it, the model failed even if the SQL was technically correct. Alert on schema-drift breakages (a column rename in the warehouse silently invalidates the prompts). Track per-table accuracy because the model usually understands the tables it has seen often and degrades on rarely-queried tables.
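The user-edit rate and the per-table breakdown can both be computed from the same log. The sketch below assumes a hypothetical `LogEntry` shape that records the generated SQL, the SQL the user actually ran, and the tables touched. Because there is no gold SQL online, the per-table figure here is an acceptance rate (ran successfully without edits), which is a proxy for accuracy rather than accuracy itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogEntry:
    question: str
    generated_sql: str
    executed_sql: str   # what the user actually ran
    tables: tuple       # tables referenced by the generated SQL
    succeeded: bool     # did execution finish without an error?


def normalize(sql):
    """Collapse whitespace and drop a trailing semicolon before comparing."""
    return " ".join(sql.split()).rstrip("; ")


def was_edited(entry):
    return normalize(entry.generated_sql) != normalize(entry.executed_sql)


def user_edit_rate(entries):
    if not entries:
        return 0.0
    return sum(was_edited(e) for e in entries) / len(entries)


def per_table_acceptance(entries):
    """Share of queries per table that ran successfully without edits."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for entry in entries:
        ok = entry.succeeded and not was_edited(entry)
        for table in entry.tables:
            totals[table] += 1
            accepted[table] += int(ok)
    return {table: accepted[table] / totals[table] for table in totals}


# Hypothetical log entries, for illustration only.
log = [
    LogEntry("all orders", "SELECT * FROM orders", "SELECT * FROM orders;",
             ("orders",), True),
    LogEntry("order ids", "SELECT id FROM orders",
             "SELECT id, total FROM orders", ("orders",), True),
    LogEntry("archived returns", "SELECT * FROM returns_archive",
             "SELECT * FROM returns_archive", ("returns_archive",), False),
]

edit_rate = user_edit_rate(log)
by_table = per_table_acceptance(log)
```

Tables with a low acceptance rate and a low query count are the rarely-queried ones worth adding to the offline set first.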
Top benchmarks
The OpenAI cookbook SQL eval notebook (How_to_evaluate_LLMs_for_SQL_generation.ipynb) is a good template for the execution-accuracy harness[3]. Vanna and Defog are open-source RAG-on-SQL stacks with built-in eval harnesses. Inspect AI is suited to sandboxed, deterministic runs against a reference schema.
Common pitfalls
- Exact-match scoring. Multiple valid SQL queries exist for almost every non-trivial question; exact match penalizes the model for choosing a different correct path. Use execution accuracy.
- Ignoring efficiency. A correct query that costs 10x is broken in production; BIRD's R-VES exists to fix this[1].
- Not testing on dirty data. BIRD's contribution is realistic schemas with messy values; if your eval schema is clean, your score does not predict production quality.
- Treating user-edit rate as a vanity metric. It is the most actionable online signal in this category[4].
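The exact-match pitfall is easy to reproduce. In the sketch below, a join and a subquery answer the same question over a hypothetical two-table schema: string comparison marks the prediction wrong, while comparing executed results marks it right. The schema and the helper names are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops');
    INSERT INTO emp VALUES (1, 'ana', 1), (2, 'bo', 2), (3, 'cy', 1);
""")

gold_sql = (
    "SELECT e.name FROM emp e JOIN dept d ON e.dept_id = d.id "
    "WHERE d.name = 'eng'"
)
pred_sql = (
    "SELECT name FROM emp "
    "WHERE dept_id IN (SELECT id FROM dept WHERE name = 'eng')"
)


def exact_match(a, b):
    """String comparison after whitespace and case normalization."""
    return " ".join(a.split()).lower() == " ".join(b.split()).lower()


def execution_match(conn, a, b):
    """Compare executed results, ignoring row order."""
    rows_a = sorted(conn.execute(a).fetchall(), key=repr)
    rows_b = sorted(conn.execute(b).fetchall(), key=repr)
    return rows_a == rows_b


by_string = exact_match(gold_sql, pred_sql)
by_execution = execution_match(conn, gold_sql, pred_sql)
```

Execution matching on a single small database can also pass a wrong query by coincidence, which is one more reason to run the eval on production-shaped data.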
Canonical reading
The BIRD-bench paper and site are the canonical reading because BIRD is explicitly designed as "the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient"[1]. The efficiency framing is the operational thesis: in production, the metric is execution-accuracy-conditioned-on-acceptable-cost, not pure correctness. Pair with the OpenAI cookbook for the practical harness implementation[3].
Day-1 starter
- 50 natural-language questions, each paired with gold SQL and an expected row-set hash. The hash is the artifact that makes scoring robust to ORDER BY ambiguity.
- Execution harness on a sanitized copy of the production schema. The schema shape is what makes the eval predict production quality.
- Track EX and R-VES per release. EX is correctness; R-VES is correctness-given-acceptable-cost; reporting both keeps the model from optimizing for the headline at the expense of the warehouse bill.