Text-to-SQL

From natural-language question to SQL query to correct rows. Score execution accuracy and efficiency on BIRD-bench, not exact match against the gold query text.

What "good" looks like

A text-to-SQL eval runs the generated query against a real database and checks the rows it returns, not the query text. Correctness is measured by execution accuracy (EX): do the returned rows match the gold query's rows? Efficiency is measured alongside it, because a correct-but-slow query is broken in production.
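A minimal sketch of an EX check, assuming a SQLite database and treating each result as a set of rows (so row order and duplicate rows are ignored). The `orders` table, the helper names, and the three example pairs are hypothetical, chosen only to show one match, one wrong answer, and one invalid query.

```python
import sqlite3


def run_query(conn, sql):
    """Return (ok, rows); ok is False when the SQL fails to parse or execute."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False, []


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same set of rows, ignoring row order."""
    ok_gold, gold_rows = run_query(conn, gold_sql)
    if not ok_gold:
        raise ValueError("gold query failed to execute")
    ok_pred, pred_rows = run_query(conn, predicted_sql)
    return ok_pred and set(pred_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose returned rows match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'bob', 25.0), (3, 'ada', 5.0);
""")

gold = "SELECT customer FROM orders WHERE total > 8"
pairs = [
    (gold + " ORDER BY id DESC", gold),                        # same rows, different text
    ("SELECT customer FROM orders WHERE total > 20", gold),    # executes, wrong rows
    ("SELEC customer FROM orders", gold),                      # does not parse
]
ex = execution_accuracy(conn, pairs)
```

The first pair would fail exact-match scoring yet counts as correct here, which is the point of scoring on rows rather than text.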

The targets on BIRD-bench:

Human EX: 92.96%.
Top systems: 81.95%. The roughly 11-point gap to humans is real and roughly stable across leading frontier models.
Report both EX and R-VES (Reward-based Valid Efficiency Score), which jointly rewards correctness and query efficiency.

The primary metric is EX on BIRD-bench, a 12,751-pair benchmark across 95 databases (33.4 GB total) that adds realistic dirty-data and efficiency considerations beyond Spider^[1]. Report R-VES next to EX by default, so that an efficiency regression stays visible even when the rows still match.
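To make "jointly rewards correctness and query efficiency" concrete, here is a sketch of a reward-bucketed score in the spirit of R-VES. The bucket thresholds, the reward values, and the plain-mean aggregation are illustrative assumptions, not the official BIRD definition; use BIRD's own evaluation scripts for reportable numbers.

```python
# Hypothetical buckets: (minimum gold_time / predicted_time ratio, reward).
# These values are assumptions for illustration, not BIRD's official table.
REWARD_BUCKETS = [(2.0, 1.25), (1.0, 1.0), (0.5, 0.75), (0.25, 0.5)]
SLOWEST_REWARD = 0.25  # correct, but more than 4x slower than gold


def efficiency_reward(correct, gold_seconds, predicted_seconds):
    """Zero for a wrong answer; otherwise a reward that grows with relative speed."""
    if not correct:
        return 0.0
    ratio = gold_seconds / max(predicted_seconds, 1e-9)  # avoid division by zero
    for min_ratio, reward in REWARD_BUCKETS:
        if ratio >= min_ratio:
            return reward
    return SLOWEST_REWARD


def reward_score(results):
    """Mean reward over (correct, gold_seconds, predicted_seconds), scaled to 0-100."""
    if not results:
        return 0.0
    rewards = [efficiency_reward(c, g, p) for c, g, p in results]
    return 100.0 * sum(rewards) / len(rewards)
```

Because a wrong answer earns zero regardless of speed, a score of this shape can never be raised by a fast but incorrect query.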

Spider remains useful for cleaner ablation testing but is largely saturated for frontier models^[2]. BIRD is the benchmark that still separates frontier models in 2026.

flowchart LR
    Q["NL question"] --> SQL["Generated SQL"]
    SQL --> VALID{"Parses and executes?"}
    VALID -->|no| FAIL["Invalid: fix generation"]
    VALID -->|yes| ROWS["Row-set hash vs gold"]
    ROWS --> EX{"EX: rows match?"}
    EX -->|no| FAIL2["Wrong answer"]
    EX -->|yes| EFF{"R-VES: efficient? explain-plan clean?"}
    EFF -->|no| SLOW["Correct but slow: regression"]
    EFF -->|yes| SHIP["Ship"]

Secondary metrics

Metric	What it catches	How to score
Query validity rate	SQL that fails to parse or execute	Did the generated SQL parse and execute, regardless of correctness?
Execution time and cost	Correct rows at a runaway price	A 10x cost increase versus the gold query is a regression even if the rows match
Row-set equivalence	ORDER BY ambiguity and column-order differences that exact-match scoring penalizes	Compare row sets, not query text
Explain-plan red flags	Valid SQL that triggers a full table scan or an exploding join	Inspect the explain-plan for the pattern
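The explain-plan check in the table can be sketched for SQLite as follows. It relies on `EXPLAIN QUERY PLAN`, whose last column is human-readable text that begins with `SCAN` for a full scan and `SEARCH` for an indexed lookup. The `events` table and index are hypothetical, and a production warehouse will have its own plan format, so treat the string test as SQLite-specific.

```python
import sqlite3


def full_scan_steps(conn, sql):
    """Return the plan steps that scan a whole table or index rather than seek into it."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the textual description of the step.
    return [row[-1] for row in plan if row[-1].startswith("SCAN")]


# Hypothetical schema: user_id is indexed, kind is not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT);
CREATE INDEX idx_events_user ON events(user_id);
""")

indexed = full_scan_steps(conn, "SELECT * FROM events WHERE user_id = 7")
unindexed = full_scan_steps(conn, "SELECT * FROM events WHERE kind = 'click'")
```

A non-empty result is a flag to review, not an automatic failure: a scan over a small dimension table is often fine.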

Offline eval design

Public sets first. Spider for breadth, BIRD for the harder discriminator.

Then build a warehouse-specific set from your top 100 ad-hoc queries, each with gold SQL and an expected row-set hash. Because the hash is computed over the returned rows independent of their order, scoring stays deterministic when the question does not fix an ORDER BY.
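One way to compute such a row-set hash, as a sketch: canonicalize each row to text, sort the rows, and hash the result. This version is insensitive to row order but still sensitive to column order, duplicate rows, and float formatting; those choices are assumptions to adjust for your warehouse. The `products` table is hypothetical.

```python
import hashlib
import sqlite3


def row_set_hash(rows):
    """Order-insensitive digest of a result set; duplicate rows still count."""
    canonical = sorted(repr(tuple(row)) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO products VALUES (1, 'bolt'), (2, 'nut'), (3, 'washer');
""")

ascending = row_set_hash(conn.execute("SELECT id, name FROM products ORDER BY id").fetchall())
descending = row_set_hash(conn.execute("SELECT id, name FROM products ORDER BY id DESC").fetchall())
subset = row_set_hash(conn.execute("SELECT id, name FROM products WHERE id > 1").fetchall())
```

For questions where the order is part of the answer (for example a ranked top-N), store an order-sensitive hash instead by skipping the sort.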

Run on a copy of the production schema, sanitized of PII. The dirty-data behavior in BIRD is what makes the score predict production quality, and you only get that if your eval schema looks like production.

Capture four things per query: EX (correctness), R-VES (efficiency), execution time, and the explain-plan. The composite tells you whether the model is shipping correct, performant queries or correct queries that will melt the warehouse at 9 AM.

Online eval design

Log the full funnel: every natural-language question, the generated SQL, the execution result, and the user's next action.

User-edit rate is the most useful single online signal. If users edit the generated SQL before running it, the model failed even if the SQL was technically correct.
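A sketch of the funnel record and the edit-rate computation. The `FunnelEvent` fields and the whitespace-only normalization are assumptions; the idea is that reformatting does not count as an edit, while any change to the query's tokens does.

```python
from dataclasses import dataclass


@dataclass
class FunnelEvent:
    # Hypothetical log schema covering the four funnel stages.
    question: str
    generated_sql: str
    executed_sql: str      # what the user actually ran; "" if they abandoned
    execution_ok: bool


def normalize(sql):
    """Collapse whitespace and drop a trailing semicolon so reformatting is not an edit."""
    return " ".join(sql.strip().rstrip(";").split())


def user_edit_rate(events):
    """Share of executed queries whose SQL differs from what the model generated."""
    ran = [e for e in events if e.executed_sql]
    if not ran:
        return 0.0
    edited = sum(normalize(e.executed_sql) != normalize(e.generated_sql) for e in ran)
    return edited / len(ran)


events = [
    FunnelEvent("all order ids", "SELECT id FROM orders;", "SELECT id\nFROM orders", True),
    FunnelEvent("paid order ids", "SELECT id FROM orders",
                "SELECT id FROM orders WHERE total > 0", True),
    FunnelEvent("refunds last week", "SELECT * FROM refunds", "", False),
]
rate = user_edit_rate(events)
```

Abandoned questions are excluded from the denominator here; tracking the abandonment rate separately keeps the two signals from masking each other.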

Alert on schema-drift breakages. A column rename in the warehouse silently invalidates every prompt that still references the old name.
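A schema-drift alert can start as a diff between a stored schema snapshot and the live one. The sketch below does this for SQLite using `sqlite_master` and `PRAGMA table_info`; the function names and the `orders` table are hypothetical, and a real warehouse would read its own information schema instead.

```python
import sqlite3


def schema_snapshot(conn):
    """Map each user table name to the set of its column names."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {t: {row[1] for row in conn.execute(f'PRAGMA table_info("{t}")')}
            for t in tables}


def schema_drift(baseline, current):
    """Per table, list columns that disappeared and columns that appeared."""
    drift = {}
    for table in baseline.keys() | current.keys():
        old, new = baseline.get(table, set()), current.get(table, set())
        if old != new:
            drift[table] = {"missing": sorted(old - new), "added": sorted(new - old)}
    return drift


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
baseline = schema_snapshot(conn)

# Simulate a rename of total to amount by recreating the table.
conn.executescript("DROP TABLE orders; CREATE TABLE orders (id INTEGER, amount REAL);")
drift = schema_drift(baseline, schema_snapshot(conn))
```

Any `missing` column that appears in a prompt or in gold SQL is a breakage worth alerting on before a user hits it.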

Track per-table accuracy. The model usually understands the tables it has seen often and degrades on rarely-queried tables.
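Per-table accuracy is a simple group-by once each scored question is tagged with the tables its gold SQL touches. The tagging is assumed to exist already (for instance, recorded when the gold set is built); the sample results are hypothetical.

```python
from collections import defaultdict


def per_table_accuracy(results):
    """results: iterable of (tables, correct); returns {table: fraction correct}."""
    totals = defaultdict(lambda: [0, 0])  # table -> [correct count, seen count]
    for tables, correct in results:
        for table in set(tables):  # count each table once per question
            totals[table][0] += int(correct)
            totals[table][1] += 1
    return {table: hits / seen for table, (hits, seen) in totals.items()}


# Hypothetical scored questions: (tables used by the gold SQL, EX result).
results = [
    (["orders"], True),
    (["orders", "customers"], False),
    (["customers"], False),
    (["orders"], True),
]
accuracy = per_table_accuracy(results)
```

Sorting the output by accuracy, lowest first, surfaces the rarely-queried tables where the model degrades.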

Top benchmarks

BIRD-bench, https://bird-bench.github.io/. 12,751 pairs, 95 databases; human EX 92.96%, top systems 81.95%.
Spider, https://yale-lily.github.io/spider. 10,181 questions, 200 databases; cleaner, largely saturated for frontier models^[2].
Your warehouse top-100. The eval that predicts whether your data team will ship the model.

Tools especially suited

OpenAI cookbook SQL eval: the How_to_evaluate_LLMs_for_SQL_generation.ipynb notebook, a good template for the execution-accuracy harness^[3].
Vanna: an open-source RAG-on-SQL stack with a built-in eval harness.
Defog: an open-source RAG-on-SQL stack with a built-in eval harness.
Inspect AI: sandboxed deterministic runs against a reference schema.

Common pitfalls

Pitfall	Why it bites	Fix
Exact-match scoring	Multiple valid SQL queries exist for almost every non-trivial question; exact match penalizes the model for choosing a different correct path	Use execution accuracy
Ignoring efficiency	A correct query that costs 10x is broken in production	Report R-VES, which BIRD defines to fix this^[1]
Not testing on dirty data	If your eval schema is clean, your score does not predict production quality	Use BIRD-style realistic schemas with messy values^[1]
Treating user-edit rate as a vanity metric	It is the most actionable online signal in this category	Track and alert on it^[4]

Canonical reading

The BIRD-bench paper and site are the canonical reading because BIRD is explicitly designed as "the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient"^[1].

The efficiency framing is the operational thesis: in production, the metric is execution-accuracy-conditioned-on-acceptable-cost, not pure correctness. Pair with the OpenAI cookbook for the practical harness implementation^[3].

Day-1 starter

50 natural-language questions, each mapped to gold SQL and an expected row-set hash. The hash is the artifact that makes scoring robust to ORDER BY ambiguity.
Execution harness on a sanitized copy of the production schema. The schema shape is what makes the eval predict production quality.
Track EX and R-VES per release. EX is correctness; R-VES is correctness-given-acceptable-cost; reporting both keeps the model from optimizing for the headline at the expense of the warehouse bill.

Cross-cutting links

Constructing your eval dataset: turning warehouse query logs into the 50-question gold set.
Drift detection for production AI: catching schema-drift breakages before users do.
Golden dataset methodology: gold SQL plus row-set hashes as a maintained artifact.
Langfuse: logging the NL-to-SQL-to-execution-to-user-action funnel.
Text-to-SQL eval with execution: the day-1 starter; execution match against a seeded SQLite sandbox.