What "good" looks like
Tool use looks like a single capability but is actually four nested ones: picking the right function, filling in correct arguments, handling the result, and chaining calls across turns. The primary metric is a composite: AST-match (do the function name and argument structure match the gold?) plus execution-grounded success (when the mocked or real endpoint runs the call, does it return the right result?). Berkeley's BFCL v4, updated April 2026, is the public benchmark; it scores unweighted accuracy across simple, parallel, multiple, multi-turn, irrelevance, and live categories[1]. The Gorilla paper, which introduced APIBench, showed that retrieval plus fine-tuning "substantially mitigates hallucination" of nonexistent function names, which is the most-cited single failure mode in this category[2].
The "good" target is high AST-match on simple cases (above 90%), an honest measurement of irrelevance detection (the model should not invoke a tool when none is needed), and execution-grounded success on real or mocked endpoints, not just JSON validity.
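As a rough illustration of the AST-match half of the composite, the sketch below compares a predicted call to a gold call by function name and argument structure. The call shape (`name`, `arguments`) and the helper names `values_equal` and `ast_match` are assumptions made for this example; public evaluators may apply looser or stricter matching rules than this one.

```python
from typing import Any


def values_equal(pred: Any, gold: Any) -> bool:
    """Structural comparison: 3 and 3.0 count as equal, True and 1 do not."""
    if isinstance(pred, bool) or isinstance(gold, bool):
        return type(pred) is type(gold) and pred == gold
    if isinstance(pred, (int, float)) and isinstance(gold, (int, float)):
        return float(pred) == float(gold)
    if isinstance(pred, dict) and isinstance(gold, dict):
        # Key order is ignored; every key must be present on both sides.
        return pred.keys() == gold.keys() and all(
            values_equal(pred[k], gold[k]) for k in gold
        )
    if isinstance(pred, list) and isinstance(gold, list):
        return len(pred) == len(gold) and all(
            values_equal(p, g) for p, g in zip(pred, gold)
        )
    return type(pred) is type(gold) and pred == gold


def ast_match(pred_call: dict, gold_call: dict) -> bool:
    """True when both the function name and the arguments match the gold call."""
    return pred_call.get("name") == gold_call["name"] and values_equal(
        pred_call.get("arguments", {}), gold_call["arguments"]
    )
```

With a gold call of `{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}`, a prediction that reorders the keys or sends `days` as `3.0` still matches, while one that sends `"paris"` does not. AST-match alone says nothing about whether the endpoint returns the right result, which is why execution-grounded success is scored alongside it.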
Secondary metrics
- Hallucinated function names or parameters. This is the Gorilla failure mode; track it separately.
- Parameter type errors, especially for numeric, date, and enum fields.
- Cost from extra tool calls. An agent that makes 4 calls when 2 would do is paying 2x for the same outcome.
- Irrelevance detection: the rate at which the model declines to call a tool when no tool is needed. Without it, the AST-match number is misleading.
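One way to track these error classes separately is to label each call against a tool registry. The sketch below is a minimal version under stated assumptions: the `REGISTRY` layout, the `book_flight` tool, and the label strings are all hypothetical, and required-parameter checks are left out for brevity.

```python
from datetime import date

# Hypothetical registry: tool name -> parameter name -> type spec.
REGISTRY = {
    "book_flight": {
        "passengers": {"type": "integer"},
        "depart_on": {"type": "date"},
        "cabin": {"type": "enum", "values": ["economy", "business"]},
    }
}


def type_ok(value, spec: dict) -> bool:
    """Check one argument value against its numeric, date, or enum spec."""
    kind = spec["type"]
    if kind == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    if kind == "enum":
        return value in spec["values"]
    return True  # kinds this sketch does not know about are not checked


def classify_call(call: dict, registry: dict = REGISTRY) -> list[str]:
    """Return error labels for one call; an empty list means no error found."""
    params = registry.get(call["name"])
    if params is None:
        return ["hallucinated_function"]
    errors = []
    for name, value in call["arguments"].items():
        if name not in params:
            errors.append(f"hallucinated_parameter:{name}")
        elif not type_ok(value, params[name]):
            errors.append(f"type_error:{name}")
    return errors


def call_overhead(calls_made: int, calls_needed: int) -> float:
    """Ratio of calls made to the gold minimum; 4 made vs 2 needed gives 2.0."""
    return calls_made / calls_needed
```

Counting the labels per category across an eval run gives one rate per failure mode, so a rise in type errors is not hidden behind a flat hallucination rate.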
Offline eval design
Use BFCL v4 for breadth and ToolBench from Tsinghua and OpenBMB for scale: 3,451 tools, 16,464 REST APIs, 126,486 instructions[3]. On top of public sets, build a harness over your own tools with mocked endpoints; mocking is what enables deterministic scoring. A 50-task set in which every task has an explicit gold tool, gold arguments, and an expected response is the minimum useful artifact. Add 10 "no tool needed" negative cases; the model should answer from its own knowledge. Score AST-equivalence and execution outcome separately so the failure mode is debuggable.
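A harness along these lines can be quite small. The sketch below assumes a hypothetical `get_weather` tool with a canned mock, and invents the `Task` and `Score` shapes for illustration; the model under test is left out, so `score_task` takes the predicted call directly.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical mocked endpoint: a pure function, so every run scores identically.
def mock_get_weather(city: str) -> dict:
    canned = {"Paris": {"temp_c": 18}, "Oslo": {"temp_c": 7}}
    return canned.get(city, {"error": "unknown city"})


MOCKS: dict[str, Callable[..., dict]] = {"get_weather": mock_get_weather}


@dataclass
class Task:
    prompt: str
    gold_call: Optional[dict]          # None marks a "no tool needed" negative case
    expected_response: Optional[dict]  # what the mock should return for the gold call


@dataclass
class Score:
    ast_ok: bool
    exec_ok: bool


def score_task(task: Task, pred_call: Optional[dict]) -> Score:
    """Score one predicted call, keeping AST-equivalence and execution apart."""
    if task.gold_call is None:
        # Negative case: the only correct behavior is to make no call at all.
        declined = pred_call is None
        return Score(ast_ok=declined, exec_ok=declined)
    if pred_call is None:
        return Score(ast_ok=False, exec_ok=False)
    ast_ok = pred_call == task.gold_call
    endpoint = MOCKS.get(pred_call["name"])
    try:
        result = endpoint(**pred_call["arguments"]) if endpoint else None
    except TypeError:  # wrong or missing argument names
        result = None
    return Score(ast_ok=ast_ok, exec_ok=result == task.expected_response)
```

A positive task would look like `Task("Weather in Paris?", {"name": "get_weather", "arguments": {"city": "Paris"}}, {"temp_c": 18})`, and a negative one like `Task("What is 2 + 2?", None, None)`. The exact-equality AST check here is deliberately strict; a more tolerant comparison can be swapped in without touching the execution path.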
Strict mode (Anthropic's term, used by other providers under different names) eliminates one whole failure class by constraining the schema during decoding[4]; evaluate with and without strict mode and report both.
Online eval design
Log every tool call, result, and final answer. Track three rates: schema-conformance (does the call validate?), retry rate (how often does the model re-call after a tool error?), and latency by tool (long-tail tools are usually the next regression). Strict-mode tool use eliminates schema errors at the cost of some flexibility; the right default is strict mode plus a non-strict fallback for known edge cases. Alert on hallucinated-function-name detections and on any tool that suddenly drops out of the call distribution (deletion in the registry, schema drift, or a model regression that makes the tool look unattractive).
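The three rates and the two alerts can be computed from plain log records. The sketch below assumes a hypothetical record shape (`tool`, `schema_valid`, `is_retry`, `latency_ms`), defines retry rate as the share of all calls that are re-calls after a tool error, and uses a nearest-rank p95; each of those is a choice made for this example, not a standard.

```python
import math
from collections import Counter, defaultdict

# Assumed log record shape (field names are hypothetical):
# {"tool": str, "schema_valid": bool, "is_retry": bool, "latency_ms": float}


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]


def online_rates(records: list[dict]) -> dict:
    """Schema-conformance rate, retry rate, and p95 latency per tool."""
    if not records:
        raise ValueError("no records to aggregate")
    total = len(records)
    latencies = defaultdict(list)
    for r in records:
        latencies[r["tool"]].append(r["latency_ms"])
    return {
        "schema_conformance": sum(r["schema_valid"] for r in records) / total,
        "retry_rate": sum(r["is_retry"] for r in records) / total,
        "p95_latency_ms": {tool: p95(vals) for tool, vals in latencies.items()},
    }


def hallucinated_names(records: list[dict], registry: set[str]) -> set[str]:
    """Tool names that appear in the logs but not in the registry."""
    return {r["tool"] for r in records} - registry


def dropped_tools(baseline: Counter, current: Counter) -> set[str]:
    """Tools called in the baseline window but absent from the current one."""
    return {t for t, n in baseline.items() if n > 0 and current.get(t, 0) == 0}
```

In practice `baseline` and `current` would be call counts over two time windows, for example last week and the last hour, and the window sizes would need tuning so that rarely used tools do not trigger false alerts.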
Top tools
- Instructor (over 3 million monthly downloads) for Pydantic-based output validation with automatic retry on schema failure[5].
- Outlines for grammar-level constrained decoding[6].
- The BFCL evaluator for parity with public scores.
- Inspect AI for sandboxed mock-endpoint evals with strong assertion primitives.
- Langfuse or Braintrust for production logging with per-tool slicing.
Common pitfalls
- Treating valid JSON as success when arguments are wrong. This is the single most common scoring error; AST-equivalence beats JSON validity, but execution-grounded success is the truth.
- No "tool-shouldn't-have-been-called" negatives. Without irrelevance cases, your AST-match number is biased upward.
- Single-turn only when the product is multi-turn. BFCL v4 has a multi-turn category for a reason.
- Forgetting strict mode. Provider strict modes (Anthropic, OpenAI, Google) eliminate a whole class of schema-shape errors; the eval should compare with and without[4].
Canonical reading
Gorilla (Patil et al., 2023) is the canonical paper because it introduced APIBench and articulated the hallucination-of-API-names problem that motivates this whole category[2]. The paper's headline result, that retrieval plus fine-tuning substantially reduces fabricated API calls, is still the operational pattern. Read it alongside the BFCL leaderboard for the current state of the art[1].
Day-1 starter
- Mock every tool. Build a 50-task set with gold function names and arguments and expected responses. Mocking is what enables deterministic scoring.
- Add 10 "no tool needed" negative cases. The model should answer from its own knowledge or refuse the tool politely.
- Score AST-equivalence and execution outcome separately. When they diverge, the failure mode is named and debuggable; when you report only one composite, it is not.
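To make the divergence between the two scores visible, each task result can be sorted into one of four buckets. The bucket names and their suggested interpretations below are assumptions for this sketch, not established terminology.

```python
from collections import Counter


def failure_bucket(ast_ok: bool, exec_ok: bool) -> str:
    """Name the outcome of one task from its two independent scores."""
    if ast_ok and exec_ok:
        return "pass"
    if ast_ok:
        # Call matches the gold but the endpoint disagrees: check the mock or the gold.
        return "gold_or_mock_suspect"
    if exec_ok:
        # Different call, same outcome: the gold may be stricter than it needs to be.
        return "equivalent_call"
    return "wrong_call"


def summarize(results: list[tuple[bool, bool]]) -> Counter:
    """Count tasks per bucket from (ast_ok, exec_ok) pairs."""
    return Counter(failure_bucket(a, e) for a, e in results)
```

A single composite would fold the two middle buckets into "fail" and lose the hint about whether the model, the gold set, or the mock needs attention.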