Tool-call evaluation: schema, arguments, sequence, state

The four-layer tool-call eval, the AST-vs-execution tradeoff, BFCL's category structure, and an error taxonomy worth stealing.

A tool call is a structured action with a name, an argument schema, and a consequence. Evaluating one means answering four independent questions: did the call parse, were the arguments right, did the calls happen in a workable order, and did the world end up in the right state. Most teams ship the first answer and report it as "tool-call accuracy". That number is not tool-call accuracy.

The two failure modes that motivated Gorilla, the first serious API-calling benchmark, are still the dominant ones: inaccurate input arguments and hallucinated API usage ¹. Neither is visible to a JSON parser, and end-state failures are invisible to both a parser and a reference match. Hence the layered stack.

flowchart LR
    A[Emitted call] --> L1[1. Schema validity<br/>parses, types, enums]
    L1 --> L2[2. Argument correctness<br/>AST match vs reference]
    L2 --> L3[3. Call sequencing<br/>order, dependencies, recovery]
    L3 --> L4[4. End-state verification<br/>sandbox state diff]

Cost rises left to right; so does signal. Run layers 1 and 2 on every CI run, layer 3 on every multi-turn suite, layer 4 on your top tools by traffic.

Layer 1: schema validity

Validate the emitted call against the tool's declared input schema: required fields present, types correct, enums respected, unknown properties rejected when the schema is closed. Some provider APIs offer a strict mode that enforces the schema server-side, so malformed calls are never emitted in the first place ². Turn it on, and stop logging schema failures as eval failures; calls that cannot parse should never have been emitted.

Layer 1 catches what would have been a production exception: a missing `customer_id`, a string where an integer belongs, an undeclared tool name. It says nothing about semantics. `cancel_order(order_id="42")` parses cleanly even when the user meant order 47, and the refund lands on the wrong customer.
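The checks listed above fit in a few lines. This is a minimal stdlib-only sketch, a deliberately simplified stand-in for a real JSON Schema validator; the schema layout, the `closed` flag, the `schema_errors` helper, and the `cancel_order` tool are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical closed schema for a hypothetical cancel_order tool.
CANCEL_ORDER_SCHEMA = {
    "required": ["order_id"],
    "properties": {
        "order_id": {"type": str},
        "reason": {"type": str, "enum": ["customer_request", "fraud", "duplicate"]},
    },
    "closed": True,  # reject undeclared properties
}


def schema_errors(arguments: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return human-readable layer-1 violations; an empty list means valid."""
    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required field: {name}")
    for name, value in arguments.items():
        spec = props.get(name)
        if spec is None:
            if schema.get("closed", False):
                errors.append(f"unknown property: {name}")
            continue
        # Exact type check (not isinstance), so a bool never passes as an int.
        if type(value) is not spec["type"]:
            errors.append(f"wrong type for {name}: {type(value).__name__}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"value not in enum for {name}: {value!r}")
    return errors


# A well-formed call passes even if it targets the wrong order.
print(schema_errors({"order_id": "42"}, CANCEL_ORDER_SCHEMA))
print(schema_errors({"order_id": 42, "note": "x"}, CANCEL_ORDER_SCHEMA))
```

The first call returns an empty list whatever order the user actually meant, which is the limit of this layer.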

Layer 2: argument correctness

Compare the candidate call against a labeled reference: same function, equivalent argument values after controlled normalization. This is AST matching, the approach the Berkeley Function-Calling Leaderboard built its main metric on: parse the call into a tree, then compare name and arguments against a set of acceptable answers per test case, strict on types (no "true" for true), lenient on string case and whitespace ³.

The design detail worth copying is the set of acceptable answers. A single golden call makes the eval brittle to harmless variation (date written differently, optional argument included); a possible-answers list per case keeps the match honest without hand-waving the comparison.
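A possible-answers match might look like the sketch below. It assumes one list of acceptable values per argument plus a sentinel for "may be omitted"; the `OMITTED` sentinel, the function names, and the example arguments are illustrative and are not BFCL's actual data format. Nested values are compared with plain `==`, which a production matcher would likely tighten.

```python
from typing import Any

OMITTED = object()  # sentinel: the argument may be left out entirely


def _same(candidate: Any, accepted: Any) -> bool:
    """Strict on type, lenient on string case and surrounding whitespace."""
    if type(candidate) is not type(accepted):
        return False  # rejects "true" for True, and True for 1
    if isinstance(candidate, str):
        return candidate.strip().lower() == accepted.strip().lower()
    return candidate == accepted


def matches_possible_answers(
    arguments: dict[str, Any],
    possible: dict[str, list[Any]],
) -> bool:
    """Every argument must hit one acceptable value; undeclared extras fail."""
    if any(name not in possible for name in arguments):
        return False
    for name, accepted_values in possible.items():
        if name not in arguments:
            # Leaving the argument out is fine only if the case allows it.
            if not any(v is OMITTED for v in accepted_values):
                return False
            continue
        if not any(v is not OMITTED and _same(arguments[name], v)
                   for v in accepted_values):
            return False
    return True


possible = {
    "date": ["2024-03-01", "March 1, 2024"],  # two harmless spellings
    "notify": [True, OMITTED],                # optional argument
}
print(matches_possible_answers({"date": " 2024-03-01 "}, possible))                  # True
print(matches_possible_answers({"date": "2024-03-01", "notify": "true"}, possible))  # False
```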

Layer 3: call sequencing

Single-call grading cannot see order. Real tasks need reads before writes, lookups before mutations, and recovery when a tool returns an error. BFCL V3 is the benchmark anchor here: alongside state-based checks, its multi-turn split grades trajectories with response-based evaluation, where a trajectory is correct if it contains the labeled minimal path as a subset, even when the model takes extra steps or a different route ⁴.

Its augmented categories are a checklist of sequencing failures worth testing on your own agent: missing-parameter cases (the agent must ask, not guess), missing-function cases (the agent must say it cannot, not improvise), and long-context cases that bury the relevant state in noise ⁴.
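Subset matching tolerates detours, so it helps to pair it with an explicit ordering rule. The sketch below flags writes to a record the agent never read first. The `Call` type, the read/write split of the catalog, and the `order_id` key are assumptions for illustration, not part of any benchmark.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Call:
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


# Hypothetical catalog split: which tools read a record, which mutate it.
READ_TOOLS = {"get_order"}
WRITE_TOOLS = {"cancel_order", "update_address"}


def dependency_violations(calls: list[Call], key: str = "order_id") -> list[int]:
    """Return indices of write calls whose record was never read earlier."""
    seen: set[Any] = set()
    violations = []
    for index, call in enumerate(calls):
        record = call.arguments.get(key)
        if call.name in READ_TOOLS:
            seen.add(record)
        elif call.name in WRITE_TOOLS and record not in seen:
            violations.append(index)
    return violations


trajectory = [
    Call("get_order", {"order_id": "42"}),
    Call("cancel_order", {"order_id": "47"}),  # mutates a record it never read
]
print(dependency_violations(trajectory))  # [1]
```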

Layer 4: end-state verification

Run the calls against a sandboxed environment and compare the resulting state with an annotated goal state. τ-bench is the cleanest published example: each task in its retail and airline domains ends with a comparison of the database state against the goal state, so success means "the right rows changed and nothing else did", not "the transcript looked right" ⁵.
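A goal-state comparison is more useful when it reports what differed, not only whether something did. A minimal sketch, assuming the sandbox can dump its state as a dict of rows keyed by string id; the row and field names are invented for the example, and this is not τ-bench's own implementation.

```python
from typing import Any


def state_diff(
    actual: dict[str, dict[str, Any]],
    goal: dict[str, dict[str, Any]],
) -> list[str]:
    """List every row or field where the sandbox state differs from the goal."""
    diffs = []
    for row_id in sorted(actual.keys() | goal.keys()):
        got, want = actual.get(row_id), goal.get(row_id)
        if got is None:
            diffs.append(f"{row_id}: row missing")
        elif want is None:
            diffs.append(f"{row_id}: unexpected row")
        else:
            for name in sorted(got.keys() | want.keys()):
                if got.get(name) != want.get(name):
                    diffs.append(
                        f"{row_id}.{name}: got {got.get(name)!r}, want {want.get(name)!r}"
                    )
    return diffs


goal = {
    "order_42": {"status": "shipped", "gift_wrap": True},
    "order_47": {"status": "cancelled", "gift_wrap": False},
}
actual = {
    "order_42": {"status": "shipped", "gift_wrap": False},  # untouched order changed
    "order_47": {"status": "cancelled", "gift_wrap": False},
}
print(state_diff(actual, goal))  # ['order_42.gift_wrap: got False, want True']
```

Because the diff walks every row rather than only the rows the task was supposed to touch, the flipped flag on order 42 shows up even though the cancellation itself succeeded.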

The numbers explain why this layer is non-negotiable. Frontier function-calling agents succeeded on under 50 percent of τ-bench tasks, and reliability collapses under repetition: pass^8 lands below 25 percent in the retail domain ⁵. Agents that ace single-call AST suites fail end-state checks constantly; the pass^k page covers the repetition math, and the trajectory vs outcome page covers where state checks live in the wider harness.
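For a sense of why repetition is so punishing, here is a sketch of a pass^k estimate. It assumes the combinatorial form described in the τ-bench paper ⁵: with c successes in n trials of a task, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names are mine.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Probability that k trials drawn without replacement all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks; each item is (successes, trials)."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# One task that succeeds on 6 of 8 trials.
print(pass_hat_k(6, 8, 1))  # 0.75
print(pass_hat_k(6, 8, 4))  # 15/70, roughly 0.214
print(pass_hat_k(6, 8, 8))  # 0.0
```

A task that looks 75 percent reliable on a single attempt survives four consecutive attempts about one time in five, and never survives eight.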

AST match vs execution: the actual tradeoff

AST matching is cheap, deterministic, and infinitely repeatable; execution is the ground truth. The BFCL team is explicit about why they default to AST: ideally you would execute everything, but not all generated calls are executable in a harness (their example: Java functions), and REST APIs explode into an immense space of acceptable concrete responses ³. Execution also drags in flaky dependencies; BFCL's executable categories need live API responses, which is exactly the nondeterminism you do not want in CI.

The failure directions differ, which is the real reason to run both:

AST false negatives: the call is correct but phrased differently from every listed answer. Your eval punishes a working agent.
AST false positives: the call matches the reference and is still wrong for the user's actual goal, because the reference was labeled against an assumed state that drifted.
Execution false negatives: the sandbox diverges from production behavior, or a flaky dependency fails the run.

History ran the same argument at scale. API-Bank hand-annotated 73 tools across 314 dialogues so it could grade calling, retrieving, and planning separately ⁶; ToolBench scaled to 16,464 RapidAPI endpoints by accepting an LLM-based evaluator (ToolEval) instead of exact matching ⁷. Fidelity and coverage trade off; the four-layer stack is how you stop choosing.

BFCL as the benchmark anchor

One leaderboard family now spans all four layers, which is why it is the reference point ⁸:

Version	What it added	Structure
V1	AST and executable evaluation	simple, multiple, parallel, and parallel-multiple calls; relevance detection for "no tool applies"
V2	Live data	enterprise and OSS-contributed functions and test cases
V3	Multi-turn	200 base cases plus 800 augmented: missing parameters, missing functions, long context, composite; state-based and response-based grading
V4	Agentic	holistic agentic evaluation, including web-search tasks

Two BFCL habits transfer directly to private evals. Relevance detection (V1) means your labeled set must include cases where the correct action is no call at all; an agent graded only on positive cases learns that calling something always beats calling nothing. State-based grading (V3) means asserting on backend state after each turn, not on which calls were made; it is the smaller sibling of τ-bench's final-state diff and catches wrong-but-plausible trajectories that subset matching forgives ⁴.
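Grading negative cases needs a reference that can say "no call". A minimal sketch, assuming each labeled case stores either an expected tool name or `None`; the outcome labels are my own and are chosen to line up with the selection failures a report would want to count separately.

```python
from typing import Optional


def relevance_outcome(emitted: list[str], expected: Optional[str]) -> str:
    """Classify tool selection for one case.

    emitted:  names of the tools the agent called, in order.
    expected: the labeled tool name, or None when the right action is no call.
    """
    if expected is None:
        return "ok" if not emitted else "call_when_none_needed"
    if not emitted:
        return "no_call_when_needed"
    return "ok" if expected in emitted else "wrong_tool"


print(relevance_outcome([], None))                       # ok
print(relevance_outcome(["get_order"], None))            # call_when_none_needed
print(relevance_outcome([], "cancel_order"))             # no_call_when_needed
print(relevance_outcome(["get_order"], "cancel_order"))  # wrong_tool
```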

CITATION

Berkeley Function-Calling Leaderboard (Patil, Mao et al., ICML 2025). The benchmark anchor for tool-call evaluation: AST and executable grading, relevance detection, multi-turn state-based evaluation, agentic extensions. Live leaderboard and category definitions: gorilla.cs.berkeley.edu/leaderboard.html

An error taxonomy worth stealing

Label your failures with this and your eval report becomes a training signal instead of a single number to argue about.

Class	Failure	Example	Caught by
Selection	Hallucinated tool	calls `cancel_orders`, not in the catalog	layer 1
Selection	Wrong tool	`get_order` when the goal needs `cancel_order`	layer 2
Selection	Call when none needed	looks up what the user already stated	layer 2, negative cases
Selection	No call when needed	fabricates an order status from nothing	layer 2
Arguments	Missing required field	`cancel_order()` with no `order_id`	layer 1
Arguments	Wrong type or format	integer where the schema wants a string	layer 1
Arguments	Wrong value, valid shape	cancels order 42 instead of 47	layers 2, 4
Arguments	Guessed missing value	invents the user's address instead of asking	layer 3
Sequencing	Dependency violation	mutates a record it never read	layer 3
Sequencing	Redundant repeats	retries a succeeded call five times	layer 3
Sequencing	No error recovery	proceeds as if a failed call had succeeded	layers 3, 4
Effects	Right call, wrong row	refund lands on the wrong customer	layer 4
Effects	Silent state pollution	unrelated field flipped during a write	layer 4

The last row is the one nobody tests for: the call succeeds, the response looks right, and an unrelated boolean changed. Only a full-snapshot state diff sees it.

A runnable four-layer scorer

"""
Four-layer tool-call eval. Only external dependency: jsonschema.
"""
from dataclasses import dataclass
from typing import Any, Callable
import json
from jsonschema import Draft202012Validator, ValidationError

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]

@dataclass
class LayerScores:
    schema_ok: bool
    args_ok: bool
    sequence_ok: bool | None  # None = single-call task
    state_ok: bool | None     # None = no sandbox wired up

def check_schema(call: ToolCall, schemas: dict[str, dict]) -> bool:
    """Layer 1. Unknown tool name = hallucination = fail."""
    schema = schemas.get(call.name)
    if schema is None:
        return False
    try:
        Draft202012Validator(schema).validate(call.arguments)
        return True
    except ValidationError:
        return False

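def _norm(value: Any) -> tuple[str, str]:
    """Lenient on string case and whitespace, strict on types: the tag
    keeps the string "true" from matching the boolean True."""
    if isinstance(value, str):
        return ("str", value.strip().lower())
    return ("json", json.dumps(value, sort_keys=True))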

def check_args(candidate: ToolCall, accepted: list[ToolCall]) -> bool:
    """Layer 2. AST-style match against ANY accepted reference call."""
    return any(
        candidate.name == ref.name
        and candidate.arguments.keys() == ref.arguments.keys()
        and all(_norm(candidate.arguments[k]) == _norm(v)
                for k, v in ref.arguments.items())
        for ref in accepted
    )

def check_sequence(calls: list[ToolCall], minimal_path: list[ToolCall]) -> bool:
    """Layer 3. BFCL-style subset match: the minimal viable path must
    appear in order; detours and extra calls are tolerated."""
    i = 0
    for call in calls:
        if i < len(minimal_path) and check_args(call, [minimal_path[i]]):
            i += 1
    return i == len(minimal_path)

def check_state(
    run_in_sandbox: Callable[[list[ToolCall]], dict],
    calls: list[ToolCall],
    goal_state: dict,
) -> bool:
    """Layer 4. Compare the FULL snapshot, not just touched rows;
    that is what catches silent state pollution."""
    try:
        return run_in_sandbox(calls) == goal_state
    except Exception:
        return False

Report the four rates separately. A run with 100 percent schema, 81 percent arguments, 64 percent sequencing, and 52 percent end-state is a diagnosis: the model formats calls and mostly picks the right tool, but it orders operations badly and the damage shows up in state. Averaging those into one "tool score" destroys exactly the information you ran the eval to get.
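Separate reporting can be as small as the sketch below. It assumes a per-task record with the same four fields as the scorer's `LayerScores` (redefined here so the snippet runs on its own) and skips tasks where a layer did not apply; `layer_rates` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerScores:
    schema_ok: bool
    args_ok: bool
    sequence_ok: Optional[bool]  # None = single-call task
    state_ok: Optional[bool]     # None = no sandbox wired up


def layer_rates(scores: list[LayerScores]) -> dict[str, Optional[float]]:
    """Pass rate per layer, over only the tasks where that layer was graded."""
    rates: dict[str, Optional[float]] = {}
    for layer in ("schema_ok", "args_ok", "sequence_ok", "state_ok"):
        graded = [getattr(s, layer) for s in scores if getattr(s, layer) is not None]
        rates[layer] = sum(graded) / len(graded) if graded else None
    return rates


scores = [
    LayerScores(True, True, True, None),
    LayerScores(True, True, False, None),
    LayerScores(True, False, None, None),
    LayerScores(True, True, None, None),
]
print(layer_rates(scores))
# {'schema_ok': 1.0, 'args_ok': 0.75, 'sequence_ok': 0.5, 'state_ok': None}
```

Returning `None` for an ungraded layer, rather than 0 or 1, keeps "not measured" from being read as a score.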

What to do this week

For every tool in your catalog, write three labeled cases: a positive call with a possible-answers list, a negative case where the correct action is no call, and a recovery case where the first tool response is a transient error.
Turn on provider strict mode so layer 1 is enforced before emission ², then re-base your dashboards: schema validity stops being a metric and becomes a precondition.
Stand up a sandbox with full-state diffing for your top three tools by traffic, with goal-state assertions in the τ-bench style ⁵. The long tail can wait; the refund path cannot.
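A small coverage lint keeps the three-cases-per-tool rule from decaying. This is a sketch under assumed names: the `EvalCase` fields, the three kind labels, and `missing_case_kinds` are illustrative, not a standard format.

```python
from dataclasses import dataclass, field
from typing import Any

CASE_KINDS = ("positive", "negative", "recovery")


@dataclass
class EvalCase:
    tool: str
    kind: str    # one of CASE_KINDS
    prompt: str
    # Possible-answers list for positive cases; stays empty for negative cases.
    accepted_calls: list[dict[str, Any]] = field(default_factory=list)


def missing_case_kinds(catalog: list[str], cases: list[EvalCase]) -> dict[str, list[str]]:
    """Map each tool to the case kinds it still lacks; fully covered tools are omitted."""
    covered: dict[str, set[str]] = {tool: set() for tool in catalog}
    for case in cases:
        if case.tool in covered:
            covered[case.tool].add(case.kind)
    return {
        tool: [kind for kind in CASE_KINDS if kind not in kinds]
        for tool, kinds in covered.items()
        if any(kind not in kinds for kind in CASE_KINDS)
    }


cases = [
    EvalCase("cancel_order", "positive", "Cancel order 47.",
             [{"name": "cancel_order", "arguments": {"order_id": "47"}}]),
    EvalCase("cancel_order", "negative", "What is your cancellation policy?"),
    EvalCase("cancel_order", "recovery", "Cancel order 47.",
             [{"name": "cancel_order", "arguments": {"order_id": "47"}}]),
    EvalCase("get_order", "positive", "Where is order 47?",
             [{"name": "get_order", "arguments": {"order_id": "47"}}]),
]
print(missing_case_kinds(["cancel_order", "get_order"], cases))
# {'get_order': ['negative', 'recovery']}
```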

Once the four layers report separately, the next failure class to chase is the agent gaming the grader itself; the reward hacking page covers that arms race.

¹ Patil et al., "Gorilla" (2023). arXiv:2305.15334.
² Anthropic, "Tool Use Overview (strict mode)."
³ BFCL blog, "Berkeley Function Calling Leaderboard" (AST vs executable methodology). https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
⁴ BFCL blog, "BFCL V3: Multi-Turn and Multi-Step Function Calling". https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html
⁵ Yao et al., "τ-bench" (2024). arXiv:2406.12045.
⁶ Li et al., "API-Bank" (2023). arXiv:2304.08244.
⁷ Qin et al., "ToolLLM / ToolBench" (2023). arXiv:2307.16789.
⁸ Berkeley Function-Calling Leaderboard (Patil, Mao et al., ICML 2025). https://gorilla.cs.berkeley.edu/leaderboard.html