Tool use and function calling

Choose the right function, fill correct arguments, handle errors, chain calls. AST-match plus execution-grounded success on BFCL v4 is the bar.

What "good" looks like

Tool use looks like a single capability and is actually four nested ones: pick the right function, fill correct arguments, handle the result, and chain across turns. A tool-use eval scores whether the model produced the right call and whether that call actually works, not whether it emitted valid JSON.

The primary metric is a composite of two checks:

AST-match: do the function name and argument structure match the gold call?
Execution-grounded success: when the mocked or real endpoint runs the call, does it return the right result?
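The two checks can be sketched as follows. This is a minimal illustration, not the BFCL evaluator: `ToolCall`, `get_weather`, and `MOCK_TOOLS` are hypothetical names, and strict dictionary equality stands in for a real AST comparison, which would usually accept several equivalent argument values.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, Any]


def ast_match(pred: ToolCall, gold: ToolCall) -> bool:
    """Structural check: same function name, same argument keys and values."""
    return pred.name == gold.name and pred.arguments == gold.arguments


def execution_success(
    pred: ToolCall, tools: dict[str, Callable[..., Any]], expected: Any
) -> bool:
    """Run the call against a mocked endpoint and compare with the expected result."""
    fn = tools.get(pred.name)
    if fn is None:  # the model invented a function name
        return False
    try:
        return fn(**pred.arguments) == expected
    except (TypeError, ValueError):  # unknown or malformed arguments
        return False


# Hypothetical mocked endpoint: deterministic, so scoring is repeatable.
def get_weather(city: str, unit: str = "celsius") -> dict[str, Any]:
    return {"city": city.strip().title(), "unit": unit, "temp": 21}


MOCK_TOOLS = {"get_weather": get_weather}

gold = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
pred = ToolCall("get_weather", {"city": "paris"})  # different surface form
expected = get_weather(**gold.arguments)

print(ast_match(pred, gold))                          # False
print(execution_success(pred, MOCK_TOOLS, expected))  # True
```

The example is chosen so the two checks disagree: the predicted call differs structurally from the gold call but produces the same result, which is why both numbers are worth reporting.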

The "good" targets:

AST-match on simple cases above 90%.
Honest irrelevance detection: the model should not invoke a tool when none is needed.
Execution-grounded success on real or mocked endpoints, not just JSON validity.

The public bar. Berkeley's BFCL v4, updated April 2026, is the public benchmark. It scores unweighted accuracy across simple, parallel, multiple, multi-turn, irrelevance, and live categories^[1].

The named failure mode. The Gorilla paper, which introduced API-Bench, showed that retrieval plus fine-tuning "substantially mitigates hallucination" of nonexistent function names, which is the most-cited single failure mode in this category^[2].

flowchart TD
    REQ["User request"] --> CALL["Model tool call"]
    CALL --> AST{"AST-match: name and args match gold?"}
    AST -->|no| FAIL["Fail: wrong call"]
    AST -->|yes| EXEC{"Execution-grounded: endpoint returns right result?"}
    EXEC -->|no| FAIL
    EXEC -->|yes| IRR{"Irrelevance: tool skipped when none needed?"}
    IRR -->|no| FAIL
    IRR -->|yes| GATE{"AST-match above 90% on simple cases?"}
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix the weaker axis"]

Secondary metrics

Metric	What it catches	How to score
Hallucinated function names or parameters	The Gorilla failure mode: invented names or params	Track separately from AST-match^[2]
Parameter type errors	Wrong numeric, date, and enum fields	Type-check arguments against the schema
Cost from extra tool calls	An agent making 4 calls when 2 would do, paying 2x for the same outcome	Count calls per task against a gold minimum
Irrelevance detection	The model calling a tool when no tool is needed	Rate of correctly declining to call on no-tool cases; without it the AST-match number is misleading
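A rough sketch of how the first three rows might be scored per call. The `REGISTRY` layout, the `book_flight` tool, and the issue labels are assumptions made for illustration; a production harness would more likely validate against the tools' real JSON Schemas.

```python
from typing import Any

# Hypothetical registry: tool name -> {parameter: (Python type, allowed enum values or None)}
REGISTRY = {
    "book_flight": {
        "origin": (str, None),
        "passengers": (int, None),
        "cabin": (str, {"economy", "business"}),
    }
}


def check_call(name: str, args: dict[str, Any]) -> list[str]:
    """Return issue labels for one call; an empty list means the call is clean."""
    schema = REGISTRY.get(name)
    if schema is None:
        return ["hallucinated_function"]
    issues = []
    for key, value in args.items():
        if key not in schema:
            issues.append(f"hallucinated_param:{key}")
            continue
        expected_type, enum = schema[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            issues.append(f"type_error:{key}")
        elif enum is not None and value not in enum:
            issues.append(f"enum_error:{key}")
    return issues


def extra_call_ratio(calls_made: int, gold_minimum: int) -> float:
    """Calls made relative to the gold minimum; 2.0 means twice the necessary calls."""
    return calls_made / gold_minimum


print(check_call("book_flght", {}))
print(check_call("book_flight", {"origin": "SFO", "passengers": "2", "cabin": "first"}))
print(extra_call_ratio(4, 2))
```

Keeping the labels separate from the AST-match score means a drop in AST-match can be traced to invented names, wrong types, or bad enum values rather than read as one opaque number.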

Offline eval design

Public sets for breadth and scale. Use BFCL v4 for breadth. Use ToolBench from Tsinghua and OpenBMB for scale: 3,451 tools, 16,464 REST APIs, 126,486 instructions^[3].

Your own harness. On top of public sets, build a harness over your own tools with mocked endpoints. Mocking is what enables deterministic scoring.

A minimum useful artifact is a 50-task set with, per task:

The explicit gold tool.
The gold arguments.
The expected response.

Add 10 "no tool needed" negative cases; the model should answer from its own knowledge. Score AST-equivalence and execution outcome separately so the failure mode is debuggable.
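One possible shape for the task set and its scorer is sketched below. The `Task` and `Outcome` classes and the three sample tasks are hypothetical; a real set would hold the 50 positive tasks and 10 negatives described above, with outcomes produced by running the model against mocked endpoints.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Task:
    prompt: str
    gold_tool: Optional[str]  # None marks a "no tool needed" negative case
    gold_args: Optional[dict[str, Any]] = None
    expected_response: Any = None


@dataclass
class Outcome:
    called_tool: Optional[str]  # None means the model answered without a tool
    args: Optional[dict[str, Any]] = None
    response: Any = None


def score(tasks: list[Task], outcomes: list[Outcome]) -> dict[str, Optional[float]]:
    """Report AST-equivalence, execution outcome, and irrelevance as separate rates."""
    pairs = list(zip(tasks, outcomes))
    pos = [(t, o) for t, o in pairs if t.gold_tool is not None]
    neg = [(t, o) for t, o in pairs if t.gold_tool is None]
    ast_ok = sum(o.called_tool == t.gold_tool and o.args == t.gold_args for t, o in pos)
    exec_ok = sum(o.response == t.expected_response for t, o in pos)
    declined = sum(o.called_tool is None for _, o in neg)
    return {
        "ast_match": ast_ok / len(pos) if pos else None,
        "execution": exec_ok / len(pos) if pos else None,
        "irrelevance": declined / len(neg) if neg else None,
    }


tasks = [
    Task("Weather in Paris?", "get_weather", {"city": "Paris"}, {"temp": 21}),
    Task("Weather in Rome?", "get_weather", {"city": "Rome"}, {"temp": 27}),
    Task("What is 2 + 2?", None),
]
outcomes = [
    Outcome("get_weather", {"city": "Paris"}, {"temp": 21}),
    Outcome("get_weather", {"city": "rome"}, {"temp": 27}),   # AST differs, result agrees
    Outcome("get_weather", {"city": "Paris"}, {"temp": 21}),  # tool called when none needed
]
report = score(tasks, outcomes)
print(report)
```

In this toy run the three rates come apart, which is the situation a single composite would hide.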

Strict mode. Strict mode (Anthropic's term, used by other providers under different names) eliminates one whole failure class by constraining the schema during decoding^[4]. Evaluate with and without strict mode and report both.

Online eval design

Log everything. Log every tool call, result, and final answer.

Track three rates:

Schema-conformance: does the call validate?
Retry rate: how often does the model re-call after a tool error?
Latency by tool: long-tail tools are usually the next regression.
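The three rates can be computed from call logs along these lines. The record fields (`tool`, `valid`, `is_retry`, `latency_ms`) are an assumed log format, retry rate is read here as the share of calls that are re-calls after a tool error, and a nearest-rank p95 stands in for whatever percentile your logging backend provides.

```python
import math
from collections import defaultdict
from typing import Any

# Hypothetical log records, one per tool call.
logs = [
    {"tool": "search", "valid": True, "is_retry": False, "latency_ms": 120},
    {"tool": "search", "valid": True, "is_retry": False, "latency_ms": 150},
    {"tool": "lookup", "valid": False, "is_retry": False, "latency_ms": 900},
    {"tool": "lookup", "valid": True, "is_retry": True, "latency_ms": 2400},
]


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[index]


def summarize(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate schema-conformance, retry rate, and per-tool tail latency."""
    latencies = defaultdict(list)
    for record in records:
        latencies[record["tool"]].append(record["latency_ms"])
    total = len(records)
    return {
        "schema_conformance": sum(r["valid"] for r in records) / total,
        "retry_rate": sum(r["is_retry"] for r in records) / total,
        "p95_latency_ms": {tool: p95(vals) for tool, vals in latencies.items()},
    }


summary = summarize(logs)
print(summary)
```

Slicing latency per tool rather than globally is what lets a slow long-tail tool show up before it dominates the aggregate.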

Default to strict. Strict-mode tool use eliminates schema errors at the cost of some flexibility. The right default is strict mode plus a non-strict fallback for known edge cases.

Alert on two signals. Alert on hallucinated-function-name detections, and on any tool that suddenly drops out of the call distribution (deletion in the registry, schema drift, or a model regression that makes the tool look unattractive).
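Both alerts reduce to comparing the current call distribution with a baseline window and with the tool registry. The sketch below assumes calls are available as flat lists of tool names; `min_baseline_share` is a hypothetical threshold that keeps rarely used tools from triggering drop-out alerts.

```python
from collections import Counter


def alerts(
    baseline_calls: list[str],
    current_calls: list[str],
    registry: set[str],
    min_baseline_share: float = 0.05,
) -> list[str]:
    """Flag unregistered tool names and tools that vanished from the call distribution."""
    baseline = Counter(baseline_calls)
    current = Counter(current_calls)
    total_baseline = sum(baseline.values())
    found = []
    # Signal 1: names the model called that do not exist in the registry.
    for name in sorted(set(current) - registry):
        found.append(f"hallucinated:{name}")
    # Signal 2: tools with a meaningful baseline share that now receive no calls.
    for name, count in sorted(baseline.items()):
        if total_baseline and count / total_baseline >= min_baseline_share:
            if current[name] == 0:
                found.append(f"dropped:{name}")
    return found


baseline = ["search"] * 6 + ["refund"] * 3 + ["lookup"] * 1
current = ["search"] * 8 + ["lookup"] * 1 + ["get_refnd"] * 1
registry = {"search", "refund", "lookup"}

print(alerts(baseline, current, registry))
```

A drop-out alert only says that something changed; deciding between registry deletion, schema drift, and model regression still needs a look at the registry and recent deploys.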

Top benchmarks

BFCL v4, https://gorilla.cs.berkeley.edu/leaderboard.html. The current public standard; scored on accuracy, cost, and latency.
ToolBench / ToolLLM, https://github.com/OpenBMB/ToolBench. 16,464 REST APIs; the scale benchmark.
Gorilla API-Bench, https://arxiv.org/abs/2305.15334. The original API-grounded benchmark; useful for hallucination measurement^[2].

Tools especially suited

Instructor (over 3 million monthly downloads): Pydantic-based output validation with auto-retry on schema failure^[5].
Outlines: grammar-constrained decoding, so the schema is enforced while tokens are generated^[6].
BFCL evaluator: run the same evaluator as the leaderboard so your numbers are comparable with public scores.
Inspect AI: sandboxed mock-endpoint eval with strong assertion primitives.
Langfuse or Braintrust: production logging with per-tool slicing.

Common pitfalls

Pitfall	Why it bites	Fix
Treating valid JSON as success when arguments are wrong	The single most common scoring error	Score AST-equivalence rather than JSON validity, and treat execution-grounded success as the ground truth
No "tool-shouldn't-have-been-called" negatives	Without irrelevance cases, your AST-match number is biased upward	Add negative cases where the model should not call a tool
Single-turn only when the product is multi-turn	Misses the multi-turn failures the product will hit	Add multi-turn tasks to the eval; BFCL v4 has a multi-turn category for this reason^[1]
Forgetting strict mode	Provider strict modes eliminate a whole class of schema-shape errors	Compare with and without strict mode (Anthropic, OpenAI, Google)^[4]

Canonical reading

Gorilla (Patil et al., 2023) is the canonical paper because it introduced API-Bench and articulated the hallucination-of-API-names problem that motivates this whole category^[2].

The paper's headline result, that retrieval plus fine-tuning substantially reduces fabricated API calls, is still the operational pattern. Read it with the BFCL leaderboard for the current state of the art^[1].

Day-1 starter

Mock every tool. Build a 50-task set with gold function names and arguments and expected responses. Mocking is what enables deterministic scoring.
Add 10 "no tool needed" negative cases. The model should answer from its own knowledge or refuse the tool politely.
Score AST-equivalence and execution outcome separately. When they diverge, the failure mode is named and debuggable; when you report only one composite, it is not.

Cross-cutting links

Tool-call evaluation: AST, schema, execution: the chapter-length treatment of the primary metric.
Golden dataset methodology: building the 50-task gold set with mocked endpoints.
Langfuse: per-call logging of tool name, arguments, and result.
Inspect AI: AST-equivalence and execution outcome scored in one harness.
Agent eval with τ-bench: tool calls graded inside full trajectories.
TypeScript eval with Vercel AI SDK: schema-bound tool-call evals in a TypeScript stack.