Multi-agent systems

Evaluating multi-agent LLM systems: milestone KPIs, the MAST 14-failure-mode taxonomy, orchestrator vs worker splits, and cost explosion.

What "good" looks like

A multi-agent system is evaluated on two axes at once: whether the team hit its milestones, and where the collaboration broke when it did not. The primary metric is a milestone-based task score with partial credit, paired with a failure-mode profile of sampled traces; either one alone is uninterpretable. MultiAgentBench (MARBLE) is the reference for the first axis: it scores explicit sub-goal KPIs alongside collaboration quality, communication efficiency, and protocol compliance, so a team that finishes most of its milestones gets credit for what it finished rather than a zero^[1]. Two of its findings are worth carrying into your own harness: graph topology performs best in the research scenarios, and cognitive planning improves milestone achievement by about 3%^[1]. Topology and planning strategy are therefore eval variables, not fixed architecture decisions.
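A minimal sketch of milestone scoring with partial credit, assuming each milestone is a pass/fail check with an optional weight. The `Milestone` class, the equal default weights, and the milestone names are illustrative assumptions, not MARBLE's actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    achieved: bool
    weight: float = 1.0  # assumption: equal weights unless the scenario says otherwise


def milestone_score(milestones: list[Milestone]) -> float:
    """Weighted fraction of milestones achieved, in [0, 1]."""
    total = sum(m.weight for m in milestones)
    if total <= 0:
        raise ValueError("need at least one milestone with positive weight")
    return sum(m.weight for m in milestones if m.achieved) / total


# Hypothetical run: three of four milestones completed.
run = [
    Milestone("collect_sources", True),
    Milestone("draft_outline", True),
    Milestone("write_report", True),
    Milestone("verify_citations", False),
]
print(f"{milestone_score(run):.2f}")  # 0.75 rather than a binary 0
```

A binary pass/fail harness would score this run as a failure; the partial-credit score keeps the three completed milestones visible.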

This page covers what is specific to N agents sharing a task: credit assignment, communication failure, orchestrator quality, and cost normalization. Pass^k, deterministic state checks, and single-agent error attribution are covered in Long-horizon planning; do not re-derive them here.

Secondary metrics

Cost-adjusted success against a strong single-agent baseline at equal token budget. Multi-agent systems often match single-agent quality at a multiple of the cost, and quality-only reporting hides it.
Communication rounds per milestone. A rising round count with flat milestone completion is the signature of agents talking past each other.
Per-agent error attribution. Which agent introduced the error that sank the run; there is no standard automatic metric for this, so trace annotation carries the load. Agent-as-judge gives step-level verdicts on sampled trajectories at reasonable cost and is the current best tool for scaling that annotation^[2].
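The first two metrics reduce to simple ratios. The sketch below assumes you already have success counts, token totals, and round counts from your traces; normalizing successes per million tokens is one possible convention for when budgets are not exactly equal, not a standard. All numbers are placeholders.

```python
def successes_per_million_tokens(successes: int, total_tokens: int) -> float:
    """Cost-adjusted success: tasks solved per one million tokens spent."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return successes / (total_tokens / 1_000_000)


def rounds_per_milestone(rounds: int, milestones_completed: int) -> float:
    """Communication rounds spent per completed milestone."""
    if milestones_completed == 0:
        return float("inf")  # agents talked but finished nothing
    return rounds / milestones_completed


# Placeholder numbers for illustration only.
multi_agent = successes_per_million_tokens(successes=14, total_tokens=4_000_000)
single_agent = successes_per_million_tokens(successes=13, total_tokens=1_000_000)
print(multi_agent, single_agent)  # 3.5 13.0
print(rounds_per_milestone(rounds=24, milestones_completed=6))  # 4.0
```

In this made-up case the multi-agent system solves one more task but at a quarter of the cost efficiency, which quality-only reporting would hide.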

Offline eval design

Start from the MAST taxonomy: 14 failure modes in 3 categories (specification and system design, inter-agent misalignment, task verification), built from more than 1,600 annotated traces across 7 frameworks, with human agreement at kappa 0.88^[3]. The authors ship an open LLM-as-judge annotator pipeline and the dataset, so you can profile your own traces without building a rubric from scratch^[4]. Be honest about credit assignment: no standard automatic metric exists, and MAST-style trace annotation is the current best practice, not a stopgap.
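Once traces are annotated, the failure-mode profile is a tally over the three categories. This sketch assumes each annotation is a dict with a `category` field; that record shape and the snake_case category keys are hypothetical and are not the output format of the MAST annotator.

```python
from collections import Counter

# The three MAST categories named above, as hypothetical snake_case keys.
MAST_CATEGORIES = (
    "specification_and_system_design",
    "inter_agent_misalignment",
    "task_verification",
)


def category_profile(annotations: list[dict]) -> dict[str, float]:
    """Share of annotated failures falling in each category."""
    counts = Counter({category: 0 for category in MAST_CATEGORIES})
    for record in annotations:
        category = record["category"]
        if category not in MAST_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        counts[category] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in MAST_CATEGORIES}


# Hypothetical annotations for five failed traces.
annotations = [
    {"trace_id": "t1", "category": "specification_and_system_design"},
    {"trace_id": "t2", "category": "inter_agent_misalignment"},
    {"trace_id": "t3", "category": "specification_and_system_design"},
    {"trace_id": "t4", "category": "task_verification"},
    {"trace_id": "t5", "category": "inter_agent_misalignment"},
]
print(category_profile(annotations))
```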

For architecture, evaluate the orchestrator separately from the workers. Magentic-One is the reference design: an Orchestrator maintains a task ledger and a progress ledger over WebSurfer, FileSurfer, Coder, and ComputerTerminal workers, and the system is statistically competitive with strong single-agent systems on GAIA, AssistantBench, and WebArena^[5] ^[6]. Score the orchestrator on task decomposition and re-planning; score workers on step execution. A failed run where the plan was sound but a worker botched a step is a different bug from a sound execution of a bad plan.
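The plan-versus-execution split can be recorded as a two-flag label per failed run. In this sketch the flags `plan_sound` and `worker_steps_ok` are assumed to come from human annotation or an agent-as-judge pass; the enum and function names are illustrative and not part of Magentic-One.

```python
from collections import Counter
from enum import Enum


class FailureOwner(Enum):
    ORCHESTRATOR = "orchestrator"  # bad plan, sound execution
    WORKER = "worker"              # sound plan, botched step
    BOTH = "both"
    UNATTRIBUTED = "unattributed"  # failed, yet both judged fine: needs review


def attribute_failure(plan_sound: bool, worker_steps_ok: bool) -> FailureOwner:
    """Map the two judgments for a failed run onto an owner."""
    if not plan_sound and not worker_steps_ok:
        return FailureOwner.BOTH
    if not plan_sound:
        return FailureOwner.ORCHESTRATOR
    if not worker_steps_ok:
        return FailureOwner.WORKER
    return FailureOwner.UNATTRIBUTED


# Hypothetical judgments for four failed runs: (plan_sound, worker_steps_ok).
failed_runs = [(True, False), (False, True), (True, False), (False, False)]
tally = Counter(attribute_failure(p, w) for p, w in failed_runs)
print({owner.value: n for owner, n in tally.items()})
```

Keeping an explicit unattributed bucket avoids forcing every failure onto one side when the judgments do not explain it.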

Compare topologies (star, chain, tree, graph) per scenario rather than picking one globally; MARBLE's scenario-dependent results say the best wiring changes with the task^[1]. And every offline table needs an ablation row: same task, single strong agent, same budget. If that row wins, the multi-agent architecture is the regression.
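A per-scenario comparison with the ablation row included might look like the sketch below. It assumes every configuration was run at the same token budget and that scores are milestone scores in [0, 1]; the `single_agent` key and all numbers are placeholders, not benchmark results.

```python
BASELINE = "single_agent"  # the ablation row: one strong agent, same budget


def best_config_per_scenario(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Pick the highest-scoring configuration separately for each scenario."""
    return {
        scenario: max(scores, key=scores.get)
        for scenario, scores in results.items()
    }


def regressions(results: dict[str, dict[str, float]]) -> list[str]:
    """Scenarios where the single-agent baseline beats every topology."""
    best = best_config_per_scenario(results)
    return sorted(s for s, config in best.items() if config == BASELINE)


# Placeholder scores for illustration only; these are not benchmark results.
results = {
    "research": {"star": 0.62, "chain": 0.55, "tree": 0.60, "graph": 0.70, BASELINE: 0.58},
    "coding": {"star": 0.50, "chain": 0.48, "tree": 0.47, "graph": 0.49, BASELINE: 0.53},
}
print(best_config_per_scenario(results))  # {'research': 'graph', 'coding': 'single_agent'}
print(regressions(results))               # ['coding']
```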

Online eval design

Log full traces with per-agent spans so every message, tool call, and hand-off is attributable to the agent that produced it. Dashboards should slice by agent: per-agent error attribution, token spend per milestone, and communication-round counts per milestone. Alert on retry loops (the same agent pair exchanging near-identical messages) and on cost-per-milestone drift, which is where multi-agent budgets quietly explode. Observability frameworks for stage-attributable scoring of agentic workflows exist and are worth adopting rather than rebuilding^[7]. Run periodic MAST annotation over a trace sample using the open annotator; the category tally over time is your system-design health metric^[4].
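The retry-loop alert can be approximated with a string-similarity check over consecutive messages on the same sender-to-receiver edge. This sketch uses `difflib.SequenceMatcher` from the Python standard library; the message tuple shape, the 0.9 threshold, and the three-repeat cutoff are assumptions to tune against your own traces.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def find_retry_loops(messages, threshold=0.9, min_repeats=3):
    """Return {(sender, receiver): longest run of near-identical messages}.

    messages: ordered list of (sender, receiver, text) tuples.
    Only edges whose longest run reaches min_repeats are reported.
    """
    by_edge = defaultdict(list)
    for sender, receiver, text in messages:
        by_edge[(sender, receiver)].append(text)

    flagged = {}
    for edge, texts in by_edge.items():
        streak = longest = 1
        for prev, cur in zip(texts, texts[1:]):
            if SequenceMatcher(None, prev, cur).ratio() >= threshold:
                streak += 1
                longest = max(longest, streak)
            else:
                streak = 1
        if longest >= min_repeats:
            flagged[edge] = longest
    return flagged


# Hypothetical trace: the planner keeps re-sending the same request.
messages = [
    ("planner", "coder", "Please fix the failing test in utils.py"),
    ("coder", "planner", "Fixed it."),
    ("planner", "coder", "Please fix the failing test in utils.py."),
    ("coder", "planner", "The test passes now on my side."),
    ("planner", "coder", "Please fix the failing test in utils.py!"),
    ("coder", "planner", "I changed the import order."),
]
print(find_retry_loops(messages))  # {('planner', 'coder'): 3}
```

Grouping by directed edge rather than by unordered pair keeps a request and its reply from being compared with each other.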

Top benchmarks

MultiAgentBench (MARBLE), https://arxiv.org/abs/2503.01935. Milestone KPIs plus collaboration, communication, and protocol scores across topologies^[1].
MAST dataset and annotator, https://github.com/multi-agent-systems-failure-taxonomy/MAST. Not a leaderboard; the failure-profiling toolkit this page's primary metric depends on^[4].
GAIA, https://arxiv.org/abs/2311.12983. The generalist-assistant set where multi-agent systems are most often compared against single-agent baselines^[6].

Tools especially suited

MARBLE for milestone-scored scenarios across topologies.
The MAST annotator for failure-mode profiling of your own traces; it is the fastest path to a taxonomy-grounded diagnosis^[4].
Langfuse for production trace logging with per-agent spans and annotation queues.
Inspect AI for harness-level control when you need isolated, repeatable runs of the same scenario.
Agent-as-judge pipelines for step-level verdicts on samples where deterministic checks do not reach^[2].

Common pitfalls

Reading end-task pass rate as model quality. The MAST result cuts the other way: most failures are system-design and coordination problems, not model problems^[3]. Swapping in a stronger model to fix a specification bug is an expensive way to change nothing.
Trusting the checker agent. Verification weakness is its own MAST category; a wrong answer confidently approved by a weak checker scores as success in a naive harness, so the verifier needs its own eval.
Comparing runs without isolation. Agent side-effects contaminate reruns; without isolated environments plus repetition, two runs of the same task are not comparable and your deltas are noise.
Reporting quality without cost. A multi-agent system that matches a single agent at several times the token spend is a worse system; the equal-budget ablation row is the honest comparison.
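Giving the verifier its own eval can start with two error rates measured against ground-truth labels. The sketch assumes you have a labeled sample of (checker verdict, actual correctness) pairs; the function name and the sample are illustrative.

```python
def checker_error_rates(verdicts):
    """verdicts: iterable of (approved_by_checker, actually_correct) booleans."""
    wrong = correct = false_approvals = false_rejections = 0
    for approved, is_correct in verdicts:
        if is_correct:
            correct += 1
            false_rejections += not approved  # good answer turned away
        else:
            wrong += 1
            false_approvals += approved  # bad answer waved through
    return {
        "false_approval_rate": false_approvals / wrong if wrong else 0.0,
        "false_rejection_rate": false_rejections / correct if correct else 0.0,
    }


# Hypothetical labeled sample of five checker decisions.
sample = [(True, True), (True, False), (False, False), (True, True), (False, True)]
print(checker_error_rates(sample))
```

The false-approval rate is the one that inflates a naive harness, since every approved wrong answer is counted as a success.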

Canonical reading

MAST (Cemri et al., 2025) is the canonical read because it is the first systematic answer to "why do multi-agent systems fail": 14 failure modes from 1,600+ traces across 7 frameworks, annotated at kappa 0.88, with an open annotator you can run today^[3]. Its core finding, that most failures trace to system design and inter-agent coordination rather than model capability, reframes the whole eval problem: profile the organization, not just the employees.

Day-1 starter

Log 20 full multi-agent traces end to end, with every message and tool call attributed to the agent that produced it.
Annotate those 20 traces with the MAST taxonomy using the open annotator and tally the three categories; the tally is your first diagnosis^[4].
Run the same 20 tasks through your strongest single agent and compare pass rate and total tokens. If the single agent wins the cost-adjusted comparison, fix the architecture before the prompts.

Cross-cutting links

Agent-as-judge: the step-level verdict machinery this page samples with.
Sub-goal decomposition: how milestone definitions get derived and scored.
Long-horizon planning: pass^k, state checks, and single-agent attribution live here.
Agent trajectory evaluation: the technique reference for scoring what happens between milestones.
Langfuse: per-agent spans and annotation queues for the trace sample.
Agent eval with τ-bench: the single-agent harness to run your ablation row on.
