A tool call is a structured action with a name, an argument schema, and a consequence. Evaluating one means asking three independent questions: did the call parse, did it match the expected call, did it produce the right effect on the real system. Most teams ship the first answer and pretend it covers the other two. It does not.
The three layers correspond to escalating cost and signal. Schema conformance is cheap and catches malformed JSON, missing required fields, and type errors. AST-equivalence checks whether the call matches a reference modulo argument order and noise; the Berkeley Function-Calling Leaderboard uses this approach as a scalable proxy for execution [1]. Execution-grounded evaluation runs the call against a sandboxed or simulated environment and inspects the resulting state; τ-bench's retail and airline domains are the reference here [2].
Layer 1: schema conformance
The cheapest check. Given the tool's declared input schema (JSON Schema, OpenAPI, or the provider's tool spec), validate the model's emitted call against it: required fields present, types correct, enums respected, and additional properties rejected if the schema is closed. Several providers offer a strict mode that enforces the schema server-side, so malformed calls are not emitted in the first place [3].
Schema conformance catches the failures that would surface as exceptions in production: a missing customer_id, a string where an integer was required, a date in the wrong format. It does not catch wrong semantics. A call to cancel_order(order_id="42") parses cleanly even when order 42 belongs to a different customer, and the wrong order gets cancelled and refunded.
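A minimal sketch of the conformance check, assuming a hypothetical cancel_order tool with a closed schema. To stay dependency-free it hand-rolls a small subset of JSON Schema (required, type, enum, additionalProperties); the schema, the field names, and the helper `conformance_errors` are illustrative assumptions, and a full validator would normally do this job.

```python
from typing import Any

# Hypothetical schema for an illustrative cancel_order tool (not a real API).
CANCEL_ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string", "enum": ["customer_request", "fraud"]},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}

# Maps JSON Schema type names to Python types (subset only).
_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def conformance_errors(arguments: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the call conforms."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required field: {name}")
    for name, value in arguments.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        expected = _TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        if expected is not None and (
            not isinstance(value, expected)
            or (expected is int and isinstance(value, bool))
        ):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"value not allowed for {name}: {value!r}")
    return errors


print(conformance_errors({"order_id": 42}, CANCEL_ORDER_SCHEMA))
# ['wrong type for order_id: expected string']
print(conformance_errors({"order_id": "42"}, CANCEL_ORDER_SCHEMA))
# [] : the call conforms, yet it may still name the wrong order
```

The second call is the point of the section: an empty error list says nothing about whether "42" is the order the user meant.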
Layer 2: AST-equivalence
The middle layer. Given a reference call (the labeled correct one) and the candidate call, compare structure modulo argument order and a controlled set of normalisations. Two calls are AST-equivalent if they share the same function name and their argument bags compare equal after stringification, optional whitespace stripping, and number-format coercion.
BFCL popularised AST matching as a scalable proxy for execution: its AST-graded categories score simple, multiple, and parallel function calls without spinning up a live API for each test case [1]. The earlier API-Bank and ToolBench evaluations used similar string-and-tree comparisons across thousands of API descriptions [4][5]. AST equivalence has two blind spots: it rejects calls that are correct but phrased differently from the reference, and it approves calls that match the reference but are wrong for the user's actual goal. It is a proxy, not a verdict.
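The normalisations can be sketched as follows. This is an illustrative comparison, not BFCL's grader: the function names and the get_weather example are assumptions, and the sketch deliberately keeps the string "3" distinct from the number 3, which is a stricter choice than blanket stringification.

```python
from typing import Any


def normalise(value: Any) -> Any:
    """Canonicalise a JSON-like value so superficial differences compare equal."""
    if isinstance(value, bool):
        # Tag booleans so True does not compare equal to the number 1.
        return ("bool", value)
    if isinstance(value, (int, float)):
        # Number-format coercion: 3 and 3.0 match. Very large integer IDs
        # lose precision as floats and are better passed as strings.
        return float(value)
    if isinstance(value, str):
        return value.strip()  # strip surrounding whitespace only
    if isinstance(value, dict):
        return {key: normalise(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalise(item) for item in value]  # list order stays significant
    return value


def ast_equivalent(
    cand_name: str,
    cand_args: dict[str, Any],
    ref_name: str,
    ref_args: dict[str, Any],
) -> bool:
    """Same tool name and equal argument bags after normalisation."""
    # Python dict equality already ignores key order.
    return cand_name == ref_name and normalise(cand_args) == normalise(ref_args)


print(ast_equivalent(
    "get_weather", {"city": " Paris ", "days": 3},
    "get_weather", {"days": 3.0, "city": "Paris"},
))  # True
print(ast_equivalent(
    "get_weather", {"city": "Paris", "days": "3"},
    "get_weather", {"city": "Paris", "days": 3},
))  # False
```

Every normalisation added here widens what counts as a match, so each one is worth listing explicitly in the eval's documentation.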
Layer 3: execution-grounded
The expensive check, and the only one that catches calls that are well formed, match the reference on paper, and still do the wrong thing. Run the candidate call against a sandboxed environment (a Docker container with a fake database, a τ-bench retail simulator, a mocked API gateway), then compare the resulting environment state against the expected state.
τ-bench's setup is the cleanest published example: each task has an initial state, a user simulator with policy documents, and a final-state assertion [2]. Success is not "the model emitted the right call." Success is "the database row for the customer ends in the right state and no other rows changed." Execution grounding catches the silent failures the first two layers miss: a refund that succeeds but to the wrong account, a cancellation that fires the right tool with the right shape and the wrong row.
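A toy version of a final-state assertion, to make the idea concrete. It is a sketch under stated assumptions, not τ-bench's harness: the in-memory orders table, the `apply_call` sandbox, and the cancel_order tool are all hypothetical.

```python
import copy
from typing import Any

State = dict[str, Any]

# Hypothetical initial state: two open orders owned by different customers.
INITIAL_STATE: State = {
    "orders": {
        "A1": {"customer": "alice", "status": "open"},
        "B7": {"customer": "bob", "status": "open"},
    }
}


def apply_call(state: State, name: str, arguments: dict[str, Any]) -> State:
    """Hypothetical sandbox: applies one call to a copy of the state."""
    new_state = copy.deepcopy(state)
    if name == "cancel_order":
        new_state["orders"][arguments["order_id"]]["status"] = "cancelled"
        return new_state
    raise ValueError(f"unknown tool: {name}")


def execution_ok(state: State, name: str, arguments: dict[str, Any], expected: State) -> bool:
    """True only if the whole final state equals the expected state."""
    try:
        return apply_call(state, name, arguments) == expected
    except (KeyError, ValueError):
        return False  # the sandbox rejected the call


# Expected final state: only Alice's order is cancelled, nothing else changes.
EXPECTED = copy.deepcopy(INITIAL_STATE)
EXPECTED["orders"]["A1"]["status"] = "cancelled"

print(execution_ok(INITIAL_STATE, "cancel_order", {"order_id": "A1"}, EXPECTED))  # True
print(execution_ok(INITIAL_STATE, "cancel_order", {"order_id": "B7"}, EXPECTED))  # False
```

Both calls have the same tool name and the same argument shape; only the comparison against the full expected state separates them.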
A runnable rubric
```python
"""
Three-layer tool-call eval. No external dependencies beyond jsonschema.
Requires Python 3.10+ for the `bool | None` annotations.
"""
from dataclasses import dataclass
from typing import Any, Callable
import json

from jsonschema import Draft202012Validator, ValidationError


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


@dataclass
class Eval:
    schema_ok: bool
    ast_ok: bool
    exec_ok: bool | None  # None = no execution harness available


def schema_check(call: ToolCall, schema: dict) -> bool:
    """Layer 1: validate the arguments against the tool's input schema."""
    try:
        Draft202012Validator(schema).validate(call.arguments)
        return True
    except ValidationError:
        return False


def ast_equivalent(candidate: ToolCall, reference: ToolCall) -> bool:
    """Layer 2: same tool, same argument bag (modulo key order).

    json.dumps(sort_keys=True) makes key order irrelevant but applies no
    number coercion, so 1 and 1.0 still compare unequal here.
    """
    if candidate.name != reference.name:
        return False
    return json.dumps(candidate.arguments, sort_keys=True) == json.dumps(
        reference.arguments, sort_keys=True
    )


def execution_check(
    call: ToolCall,
    apply: Callable[[ToolCall], dict],
    expected_state: dict,
) -> bool:
    """Layer 3: run the call against a sandbox and compare resulting state."""
    final = apply(call)
    return final == expected_state


def evaluate(
    candidate: ToolCall,
    reference: ToolCall,
    schema: dict,
    sandbox: Callable[[ToolCall], dict] | None,
    expected_state: dict | None,
) -> Eval:
    sch = schema_check(candidate, schema)
    # AST is only credited when the schema check passes.
    ast = sch and ast_equivalent(candidate, reference)
    exe = None
    if sandbox is not None and expected_state is not None:
        try:
            exe = execution_check(candidate, sandbox, expected_state)
        except Exception:
            # A sandbox exception counts as an execution failure.
            exe = False
    return Eval(schema_ok=sch, ast_ok=ast, exec_ok=exe)
```
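Report all three. A run where 100% of calls pass schema, 80% pass AST, and 55% pass execution tells you something specific: the model formats calls correctly and usually reproduces the reference, yet at least a quarter of its calls match the reference and still leave the environment in the wrong state. An AST match already implies matching argument values, so that gap points at references that are stale against the sandbox state or at side effects the reference never encoded. That is a targeted signal, not a single number to argue about.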
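One way to produce that report, sketched under the assumption that each result carries the same three fields as the rubric's Eval record. The `summarise` helper and the `ast_pass_exec_fail` metric name are illustrative, not part of any benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Eval:
    schema_ok: bool
    ast_ok: bool
    exec_ok: Optional[bool]  # None = no execution harness for this case


def summarise(results: list[Eval]) -> dict[str, float]:
    """Per-layer pass rates, plus how often AST passed while execution failed."""
    if not results:
        raise ValueError("no results to summarise")
    n = len(results)
    summary = {
        "schema_rate": sum(r.schema_ok for r in results) / n,
        "ast_rate": sum(r.ast_ok for r in results) / n,
    }
    # Execution rates are computed only over cases that had a sandbox.
    executed = [r for r in results if r.exec_ok is not None]
    if executed:
        m = len(executed)
        summary["exec_rate"] = sum(bool(r.exec_ok) for r in executed) / m
        summary["ast_pass_exec_fail"] = (
            sum(r.ast_ok and not r.exec_ok for r in executed) / m
        )
    return summary


results = [
    Eval(True, True, True),
    Eval(True, True, False),   # matches the reference, wrong final state
    Eval(True, False, False),
    Eval(False, False, None),  # no sandbox for this tool
]
print(summarise(results))
```

Reporting the overlap between layers, and not only the three marginal rates, is what shows whether the reference set or the model is at fault.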
What most evals undertest
Three failure modes that the AST layer alone cannot see.
The first is should-not-have-used. The agent calls a tool when it should have answered from memory, or picks a slower tool when a cheaper equivalent exists. AST equivalence treats every comparison as positive-against-reference; a labeled set must include negative cases where the correct action is no tool call at all.
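Negative cases need a grader that accepts "no call" on both sides of the comparison. A minimal sketch, assuming a call is represented as a (name, arguments) tuple and that None means no tool call; the label strings are hypothetical.

```python
from typing import Any, Optional

Call = tuple[str, dict[str, Any]]  # (tool name, arguments)


def grade_with_negatives(candidate: Optional[Call], reference: Optional[Call]) -> str:
    """Grade a candidate against a reference where None means 'no tool call'."""
    if reference is None:
        # Negative case: the correct action was to answer without a tool.
        return "correct_abstain" if candidate is None else "spurious_call"
    if candidate is None:
        return "missed_call"
    return "match" if candidate == reference else "mismatch"


print(grade_with_negatives(None, None))                              # correct_abstain
print(grade_with_negatives(("search", {"q": "capital of France"}), None))  # spurious_call
print(grade_with_negatives(None, ("get_order", {"order_id": "A1"})))  # missed_call
```

Tracking spurious_call and missed_call separately keeps over-calling and under-calling from cancelling out in a single accuracy figure.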
The second is multi-turn recovery. Tool A returns a 5xx, the agent retries blindly, fails again, then proceeds as if the call had succeeded. BFCL's v3 multi-turn split surfaces this kind of brittleness; τ-bench's tool errors do the same in the execution domain [1][2]. A single-turn AST match cannot model the recovery behaviour at all.
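A trace-level check for this failure can be sketched as below. The `Step` record, the HTTP-style status codes, and the `claimed_success` flag are assumptions for illustration; how to derive that flag from the agent's final message is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    status: int  # HTTP-style status returned by the sandbox


def unrecovered_failures(trace: list[Step]) -> list[str]:
    """Tools whose most recent attempt in the trace ended in an error status."""
    last_status: dict[str, int] = {}
    for step in trace:
        last_status[step.tool] = step.status  # later attempts overwrite earlier ones
    return [tool for tool, status in last_status.items() if status >= 400]


def recovery_ok(trace: list[Step], claimed_success: bool) -> bool:
    """Fail the episode if the agent reports success while a tool is still failed."""
    return not (claimed_success and unrecovered_failures(trace))


blind = [Step("refund", 503), Step("refund", 503), Step("notify", 200)]
recovered = [Step("refund", 503), Step("refund", 200), Step("notify", 200)]

print(recovery_ok(blind, claimed_success=True))      # False
print(recovery_ok(recovered, claimed_success=True))  # True
print(recovery_ok(blind, claimed_success=False))     # True
```

The third case passes because an agent that reports the failure honestly has recovered in the sense that matters, even though the refund never went through.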
The third is silent state pollution. The call succeeds, returns the expected response, and also flips an unrelated boolean somewhere in the state. Only state diffing on the full snapshot catches this. The Gorilla and ToolBench benchmarks predate the execution-grounded turn and do not exercise it [6][5].
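Full-snapshot diffing can be sketched as a recursive comparison that reports every changed path, which is then checked against the paths the task was allowed to touch. The snapshot layout, the dotted-path convention, and both helper names are assumptions made for this example.

```python
from typing import Any


def diff_paths(before: Any, after: Any, prefix: str = "") -> set[str]:
    """Return the dotted paths whose values differ between two nested dict snapshots."""
    if isinstance(before, dict) and isinstance(after, dict):
        changed: set[str] = set()
        for key in before.keys() | after.keys():
            path = f"{prefix}.{key}" if prefix else str(key)
            if key not in before or key not in after:
                changed.add(path)  # key added or removed
            else:
                changed |= diff_paths(before[key], after[key], path)
        return changed
    # Leaves (and lists) are compared wholesale.
    return set() if before == after else {prefix}


def unexpected_changes(before: Any, after: Any, allowed: set[str]) -> set[str]:
    """Changed paths that the task did not permit."""
    return diff_paths(before, after) - allowed


before = {
    "orders": {"A1": {"status": "open"}},
    "flags": {"email_opt_in": True},
}
after = {
    "orders": {"A1": {"status": "cancelled"}},
    "flags": {"email_opt_in": False},  # the silent side effect
}

print(unexpected_changes(before, after, allowed={"orders.A1.status"}))
# {'flags.email_opt_in'}
```

An allow-list of paths is a design choice: it fails closed, so any state the task author did not think about is treated as pollution when it changes.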
What to do this week
- For every tool in your catalogue, add three labelled examples: a positive call, a negative case where the correct action is no call, and an error-recovery case where the first tool response is a transient failure.
- Enforce schema validation, either through your provider's strict mode or with a validator that runs before the call is dispatched. Refuse to emit invalid calls; do not log them as eval failures, because they should never have been emitted [3].
- Stand up a sandbox for the top three tools by volume. Execution grounding on the long tail can wait; execution grounding on the highest-traffic tools cannot.