What "good" looks like
Evaluate MCP agents by execution-verified end-to-end task success against real servers, reported per server and per tool; an aggregate number over a registry of third-party tools hides which server, which tool, and which layer failed. The unit of evaluation is the agent-plus-registry system, not the individual function call.
Scope first. AST-match versus execution-grounded scoring, hallucinated function names, and strict-mode decoding are single-registry problems, covered in Tool use and function calling[1][2][3]. This page owns what changes at registry scale with third-party servers: tool discovery across dozens of servers, schemas you do not control, and tool results you did not write.
MCP-Universe is the reference benchmark: 6 domains across 11 real MCP servers, covering maps, repo management, financial analysis, 3D design, browser automation, and web search. Paper-time scores (August 2025) put GPT-5 at 43.72%, Grok-4 at 33.33%, and Claude-4.0-Sonnet at 29.44%; even the best frontier model fails a majority of tasks. Two findings matter more than the leaderboard. Enterprise agent harnesses did not beat plain ReAct, and the dominant failure drivers are long-context token growth and unfamiliarity with unknown tools[4]. Its three evaluator types are the design lesson: format evaluators check output compliance, static evaluators match content, and dynamic evaluators fetch real-time ground truth at grading time. The dynamic evaluator is the key trick for live-server evals, because the correct answer to "cheapest flight Tuesday" changes hourly. Date-stamp every result and link the live leaderboard rather than quoting stale numbers[5].
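The three evaluator types can be sketched as plain functions. This is a minimal illustration, not MCP-Universe's actual code: the function names, the JSON answer shape, and the `fetch_truth` callable are all assumptions made for the example.

```python
import json
from typing import Callable


def format_evaluator(output: str, required_keys: set[str]) -> bool:
    """Format check: the answer must be a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def static_evaluator(output: str, key: str, expected: object) -> bool:
    """Static check: a field must equal a fixed answer that does not change over time.

    Assumes the output already passed the format check.
    """
    return json.loads(output).get(key) == expected


def dynamic_evaluator(
    output: str,
    key: str,
    fetch_truth: Callable[[], float],  # hypothetical: queries the live source
    tolerance: float = 0.0,
) -> bool:
    """Dynamic check: ground truth is fetched at grading time, not stored with the task."""
    truth = fetch_truth()
    value = json.loads(output).get(key)
    return isinstance(value, (int, float)) and abs(value - truth) <= tolerance
```

In use, the format check would typically gate the other two, and `fetch_truth` would call the same live source the agent had access to, so the grade reflects the state of the world at grading time.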
Secondary metrics
- Tool routing at registry scale. LiveMCPBench runs 95 tasks over 70 servers and 527 tools, and retrieval errors account for nearly half of all failures; the best model, Claude-Sonnet-4, reaches 78.95% while most models land in the 30-50% band[6]. If your agent sees more than a handful of servers, measure routing before anything else.
- The per-server operational matrix. Task accuracy crossed with latency and token consumption under a fixed agent and LLM, the MCPBench design; it is how you compare, say, two alternative search servers over local STDIO versus remote SSE transports[7].
- The security triple. AgentDojo reports benign utility, utility under attack, and targeted attack success rate across 97 realistic tasks and 629 security cases; frontier models fail many tasks even unattacked[8]. Report all three, since a defense that halves utility is not a defense.
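Routing can be scored on its own, before end-to-end success, by checking whether the tools a task needs appear among the candidates the retriever surfaces. The sketch below uses recall@k as one reasonable routing metric; it is not LiveMCPBench's scoring code, and the `(ranked_tools, gold_tools)` task shape and the tool names are assumptions made for illustration.

```python
def recall_at_k(ranked_tools: list[str], gold_tools: set[str], k: int) -> float:
    """Fraction of the tools a task needs that appear in the retriever's top k."""
    if not gold_tools:
        raise ValueError("each task needs at least one gold tool")
    top_k = set(ranked_tools[:k])
    return len(gold_tools & top_k) / len(gold_tools)


def mean_recall_at_k(tasks: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (ranked_tools, gold_tools) pairs."""
    scores = [recall_at_k(ranked, gold, k) for ranked, gold in tasks]
    return sum(scores) / len(scores)


# Hypothetical tasks: retriever ranking on the left, required tools on the right.
tasks = [
    (["maps.geocode", "web.search", "repo.list_issues"], {"web.search"}),
    (["maps.geocode", "repo.search"], {"repo.search", "repo.list_issues"}),
]
print(mean_recall_at_k(tasks, k=2))
```

A low recall@k alongside a low end-to-end score points at the retrieval layer rather than at parameter construction or the model's reasoning.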
Offline eval design
Build a per-tool eval matrix as a comparison table: tool, accuracy, latency, tokens, and date of last schema change. Three tasks per tool with static or dynamic evaluators is enough to make the matrix real; the matrix is what turns "the agent got worse" into "the repo server's search tool got worse."
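A minimal sketch of how such a matrix could be aggregated from raw run records. The `RunRecord` fields and the `schema_changed` lookup are hypothetical names chosen for the example, not part of any benchmark's harness.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    server: str
    tool: str
    passed: bool
    latency_s: float
    tokens: int


def build_matrix(
    records: list[RunRecord],
    schema_changed: dict[tuple[str, str], str],  # (server, tool) -> ISO date
) -> list[dict]:
    """Group runs by (server, tool) and compute one matrix row per tool."""
    grouped: dict[tuple[str, str], list[RunRecord]] = defaultdict(list)
    for record in records:
        grouped[(record.server, record.tool)].append(record)

    rows = []
    for (server, tool), runs in sorted(grouped.items()):
        rows.append({
            "server": server,
            "tool": tool,
            "n_tasks": len(runs),
            "accuracy": sum(r.passed for r in runs) / len(runs),
            "mean_latency_s": mean(r.latency_s for r in runs),
            "mean_tokens": mean(r.tokens for r in runs),
            "last_schema_change": schema_changed.get((server, tool), "unknown"),
        })
    return rows
```

Keeping `n_tasks` in each row makes it obvious when an accuracy figure rests on only three tasks.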
Then add schema-drift regression. Snapshot the full tool schemas from every server you depend on, diff the snapshots in CI, and replay recorded traces after any server, schema, or prompt change. This is tooling practice rather than an academic benchmark: none of the benchmarks cited on this page measures drift, but third-party servers can change under you without a release note, and the diff job may be the only warning you get. Borrow MCP-Universe's evaluator split for grading: format checks for structure, static checks for stable facts, dynamic ground-truth fetches for anything time-sensitive[4].
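The snapshot-and-diff step can be sketched as follows. The example assumes each snapshot is a dict mapping tool name to that tool's JSON Schema for its inputs; how the snapshot is collected from a server is left out, and the function names are illustrative.

```python
import hashlib
import json


def fingerprint(schema: dict) -> str:
    """Stable hash of a schema: key order and whitespace do not affect it."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """Return human-readable changes between two schema snapshots."""
    changes = []
    for name in sorted(old.keys() - new.keys()):
        changes.append(f"removed tool: {name}")
    for name in sorted(new.keys() - old.keys()):
        changes.append(f"added tool: {name}")
    for name in sorted(old.keys() & new.keys()):
        if fingerprint(old[name]) == fingerprint(new[name]):
            continue
        old_props = set(old[name].get("properties", {}))
        new_props = set(new[name].get("properties", {}))
        for prop in sorted(old_props - new_props):
            changes.append(f"{name}: parameter removed: {prop}")
        for prop in sorted(new_props - old_props):
            changes.append(f"{name}: parameter added: {prop}")
        if old_props == new_props:
            # Same parameter names, so the change is in types, required, or text.
            changes.append(f"{name}: schema changed (types, required, or descriptions)")
    return changes
```

A CI job could fail whenever `diff_snapshots` returns a non-empty list and pass again only after the recorded traces have been replayed against the new schemas.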
Online eval design
Three production signals.
- Schema-drift alerts: run the snapshot-diff job on a schedule against live servers, not just in CI, and page when a depended-on tool changes shape.
- Injection canaries: seed a small stream of AgentDojo-style poisoned tool results through the real agent and alert on any canary action executing[8].
- Token and latency budgets per server: MCP-Universe's failure analysis says long-context growth is a primary killer, so track tokens consumed per server per task and alert on budget breaches before they become truncation failures[4].
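The budget and canary signals reduce to small checks over trace data. The sketch below assumes a trace format invented for the example: token usage as `(task_id, server, tokens)` tuples and tool calls as dicts with `tool` and `arguments` keys. The canary marker is a unique string planted in the poisoned tool result.

```python
import json
from collections import defaultdict


def budget_breaches(
    usage: list[tuple[str, str, int]],  # (task_id, server, tokens) per tool call
    budgets: dict[str, int],            # per-server token budget per task
    default_budget: int,
) -> list[tuple[str, str, int, int]]:
    """Return (task_id, server, total_tokens, limit) for every budget breach."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for task_id, server, tokens in usage:
        totals[(task_id, server)] += tokens
    breaches = []
    for (task_id, server), total in sorted(totals.items()):
        limit = budgets.get(server, default_budget)
        if total > limit:
            breaches.append((task_id, server, total, limit))
    return breaches


def canary_fired(tool_calls: list[dict], marker: str) -> list[str]:
    """Return the names of tools the agent called with the canary marker in the arguments."""
    return [
        call["tool"]
        for call in tool_calls
        if marker in json.dumps(call.get("arguments", {}))
    ]
```

Any non-empty result from `canary_fired` means the agent acted on an injected instruction, which is the condition to alert on.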
Top benchmarks
- MCP-Universe: the harness for execution-verified tasks against real servers[4].
- LiveMCPBench: routing at registry scale, with its copilot-style tool retriever as a baseline[6].
- MCPBench: the per-server accuracy-latency-tokens matrix across transports[7].
- AgentDojo: the injection-probe suite[8].
- Inspect AI: the general harness when you write your own tasks; its sandboxing and scorer primitives fit the static-plus-dynamic evaluator pattern well.
Common pitfalls
- Treating live-server scores as reproducible. The servers are non-stationary; the same task set scores differently next month with zero agent changes. Use dynamic evaluators and date-stamp every result[4].
- Reading only the end-to-end number. Once tool count grows, failures concentrate in tool retrieval and parameter construction; LiveMCPBench attributes nearly half of failures to retrieval alone[6]. The end-to-end score hides the layer you can actually fix.
- Assuming clean-run success implies safety. A benign pass rate says nothing about behavior when a tool result carries an injected instruction; only the under-attack numbers do[8].
- Ignoring silent schema drift. A third-party server renames a parameter and your agent breaks with zero code change on your side. Without schema snapshots and diffs, the first alert is a customer.
Canonical reading
MCP-Universe (Luo et al., 2025) is the canonical paper because it measures what this category actually is: agents against 11 real servers with execution-based evaluators, not mocks[4]. Its two sobering results, GPT-5 at 43.72% and enterprise harnesses failing to beat plain ReAct, set honest expectations for anyone shipping MCP agents. Read it for the evaluator taxonomy as much as the scores.
Day-1 starter
- Pick your top 10 tools across servers and write 3 tasks each, graded by static evaluators for stable facts and dynamic ground-truth fetches for anything time-sensitive.
- Snapshot every tool schema you depend on and add a CI job that diffs the snapshots on each run; a changed schema fails the build until traces are replayed.
- Add 10 AgentDojo-style injection probes to the suite and report the triple: benign utility, utility under attack, and targeted attack success rate.
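The security triple in the last item can be computed from per-episode outcomes. This is a simplified sketch, not AgentDojo's scoring code: the `Episode` fields are hypothetical, and counting utility under attack as "user task done with the attacker's goal not achieved" is an assumption that should be aligned with the benchmark's exact definitions before results are compared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    attacked: bool                    # was an injection present in a tool result?
    user_task_done: bool              # did the agent complete the user's task?
    attacker_goal_done: bool = False  # did the injected instruction get executed?


def _rate(hits: int, total: int) -> Optional[float]:
    """Return hits/total, or None when there is nothing to measure."""
    return hits / total if total else None


def security_triple(episodes: list[Episode]) -> dict[str, Optional[float]]:
    """Compute benign utility, utility under attack, and targeted attack success rate."""
    benign = [e for e in episodes if not e.attacked]
    attacked = [e for e in episodes if e.attacked]
    return {
        "benign_utility": _rate(sum(e.user_task_done for e in benign), len(benign)),
        "utility_under_attack": _rate(
            sum(e.user_task_done and not e.attacker_goal_done for e in attacked),
            len(attacked),
        ),
        "targeted_asr": _rate(sum(e.attacker_goal_done for e in attacked), len(attacked)),
    }
```

Reporting the three numbers side by side makes it visible when a defense lowers the attack success rate only by lowering utility as well.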
Cross-cutting links