Text-to-SQL is the cleanest task type to score because there is a deterministic check: execute the generated query against the warehouse and compare its result set to the gold result set. The common mistake is scoring by string match on the SQL text, which penalizes correct answers because most questions admit several equivalent queries [1]. This recipe walks through an execution-grounded eval on a sanitized copy of your warehouse, scoring on row-set equivalence with order handling and reporting query cost as a secondary metric. The OpenAI cookbook SQL eval notebook is the closest published analog and a good starting reference [2].
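To make the string-match problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and both queries are hypothetical, chosen only for illustration. The two queries differ as text but return the same rows, so a string comparison would mark a correct answer wrong.

```python
import sqlite3

# Hypothetical toy table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('ada', 10.0), ('ada', 5.0), ('bob', 7.5);
""")

gold_sql = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
pred_sql = (
    "SELECT o.customer, SUM(o.amount) AS total "
    "FROM orders AS o GROUP BY o.customer"
)

def run(sql):
    """Execute a query and return its rows sorted, so row order is ignored."""
    return sorted(conn.execute(sql).fetchall())

# Naive scoring: compare the SQL text after light normalization.
string_match = gold_sql.strip().lower() == pred_sql.strip().lower()
# Execution-grounded scoring: compare what the queries actually return.
execution_match = run(gold_sql) == run(pred_sql)

print(string_match, execution_match)  # prints: False True
```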
Planned outline
- Sanitize a copy of your schema. The eval executes model-generated queries, so never point it at production. Use Postgres with sample data, or SQLite for portability.
- Build the 50-question gold set. NL question, gold SQL, expected row-set hash. Cover at least: simple SELECT, JOIN, GROUP BY, window functions, and one ambiguous question.
- Generate SQL. The agent gets the schema in the system prompt; we use Anthropic's strict tool use with a single run_sql tool for cleaner failure modes.
- Execute and compare. Run both queries; canonicalize each result set (sort the rows when the gold query has no ORDER BY, since row order is then unspecified); compare. Use the Spider and BIRD eval scripts as references [3].
- Score. Execution Accuracy (EX) as the primary metric. Cost (execution time, rows scanned) as secondary.
- Failure modes. Wrong table, wrong join condition, missing GROUP BY, wrong filter, ambiguous question without disambiguation.
- Production hook. Log every NL-to-SQL pair the agent generates; review a sample weekly; add the reviewer-edited queries back to the gold set.
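The execute, compare, and score steps in the outline can be sketched as below. This is a minimal illustration on SQLite, not the Spider or BIRD scoring logic: the function names, the `ordered` flag on each example, and the toy `orders` table are all assumptions made for the sketch. Cost is approximated by wall-clock time only; rows scanned would have to come from the database itself (for example, from Postgres `EXPLAIN ANALYZE` output).

```python
import hashlib
import sqlite3
import time

def canonical_rows(rows, ordered):
    """Normalize a result set; keep row order only when the gold query has ORDER BY."""
    # repr() yields a sortable, type-aware key even when rows mix NULLs and numbers.
    # Assumption: this is strict about types, so 15 and 15.0 count as different.
    keyed = [tuple(repr(value) for value in row) for row in rows]
    return keyed if ordered else sorted(keyed)

def rowset_hash(rows, ordered):
    """Hash of the canonicalized rows; this is what a gold-set record could store."""
    payload = repr(canonical_rows(rows, ordered)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_timed(conn, sql):
    """Execute sql and return (rows, seconds); rows is None if execution fails."""
    start = time.perf_counter()
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        rows = None
    return rows, time.perf_counter() - start

def score(conn, examples):
    """Return Execution Accuracy (EX) and mean latency of the predicted queries."""
    correct, latencies = 0, []
    for ex in examples:
        gold_rows, _ = run_timed(conn, ex["gold_sql"])
        if gold_rows is None:
            raise ValueError(f"gold query failed: {ex['gold_sql']}")
        pred_rows, seconds = run_timed(conn, ex["pred_sql"])
        latencies.append(seconds)
        if pred_rows is not None and (
            rowset_hash(pred_rows, ex["ordered"]) == rowset_hash(gold_rows, ex["ordered"])
        ):
            correct += 1
    return {
        "ex": correct / len(examples),
        "mean_seconds": sum(latencies) / len(latencies),
    }

# Hypothetical sanitized database and three illustrative gold/prediction pairs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('ada', 10.0), ('ada', 5.0), ('bob', 7.5);
""")

examples = [
    {   # Gold has no ORDER BY, so the prediction's row order is ignored: correct.
        "gold_sql": "SELECT customer FROM orders",
        "pred_sql": "SELECT customer FROM orders ORDER BY id DESC",
        "ordered": False,
    },
    {   # Gold has ORDER BY and the prediction sorts the other way: incorrect.
        "gold_sql": "SELECT amount FROM orders ORDER BY amount DESC",
        "pred_sql": "SELECT amount FROM orders ORDER BY amount ASC",
        "ordered": True,
    },
    {   # Prediction references a table that does not exist: incorrect.
        "gold_sql": "SELECT COUNT(*) FROM orders",
        "pred_sql": "SELECT COUNT(*) FROM order_items",
        "ordered": False,
    },
]

report = score(conn, examples)
print(round(report["ex"], 3))  # prints: 0.333
```

Sorting a list rather than building a set keeps duplicate rows, so a prediction that drops or adds duplicates still fails. A query that errors counts as a miss instead of crashing the run, which keeps the wrong-table failure mode visible in the EX number.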
What this will skip in v1
- Multi-step SQL agents (one-shot only).
- LLM-generated explanations (focus on the SQL).
- Dialect translation (Postgres only in the walkthrough; BIRD spans dialects).
- Query-plan analysis as a metric (worth doing in production; out of scope for the smoke recipe).
For broader context, see the text-to-SQL task type. BIRD is the right benchmark to add once your eval set crosses 100 questions [1].