Production tools change. Fields get deprecated, error rates spike during incidents, payload shapes shift between API versions. An agent benchmarked only on a frozen happy-path environment is graded on a version of reality your users will never see. TRAJECT-Bench's contribution is to pair 1,228 production APIs with varied response shapes and show that model rankings shuffle when the environment includes realistic adversariality [1].
The fix is to evaluate agents against a small set of injected environmental perturbations and report graceful degradation as an explicit dimension. An agent that succeeds at pass^k on the happy path but produces confidently wrong answers under a 429-with-retry pattern is not production-ready, even if its leaderboard number is high.
The three perturbation classes
Three orthogonal axes are enough to catch most environmental drift in practice. Each axis maps to a real production failure mode and to a one-line test fixture.
Latency injection. Sleep on the critical-path tool call for 3-10 seconds. Does the agent time out, fall back, retry, or hang? Many agents have no notion of "this call is taking too long"; the failure mode is either an indefinite wait or a silent abandonment with no user-visible signal.
Schema shift. Rename a response field, reorder a list, change a nullable to non-nullable. Does the agent surface a clean error, fall back to a different tool, or proceed with a missing or wrong value? The most damaging variant in practice is a renamed field whose old name silently returns null; the agent reads the null and continues as if the field were absent.
Transient error injection. Swap a successful 200 for a 429-with-retry-after, a 503, or a malformed JSON body. Does the agent retry with backoff, escalate to a different tool, or report a graceful "I could not complete this; here is what I tried"? This class is where the τ-bench retail and airline domains shine: the user simulator pushes the agent into recovery paths that frozen benchmarks miss entirely [2].
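The renamed-field case is easier to catch when the tool-reading code separates "key absent" from "key present but null" and treats both as errors instead of defaults. The following is a minimal sketch of that idea; `SchemaDriftError` and `require_field` are hypothetical names introduced for illustration, not part of any benchmark or library mentioned here.

```python
from typing import Any


class SchemaDriftError(Exception):
    """Raised when a tool response no longer matches the expected shape."""


# Sentinel so "absent" and "present but null" stay distinguishable.
_MISSING = object()


def require_field(body: dict, name: str, expected_type: type) -> Any:
    """Return body[name], or raise instead of silently continuing."""
    value = body.get(name, _MISSING)
    if value is _MISSING:
        raise SchemaDriftError(f"field {name!r} is absent; got keys {sorted(body)}")
    if value is None:
        raise SchemaDriftError(f"field {name!r} is null; possible rename or deprecation")
    if not isinstance(value, expected_type):
        raise SchemaDriftError(
            f"field {name!r} has type {type(value).__name__}, "
            f"expected {expected_type.__name__}"
        )
    return value
```

An agent tool adapter built this way turns a schema shift into an explicit error the agent can report or route around, which is the behaviour the schema-shift perturbation is meant to test.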
The three pass rates
For each task in your regression set, run the agent against the happy path and against each perturbation class. Report three pass rates side by side.
- Happy-path pass: the baseline.
- Perturbed pass: the rate at which the agent reaches the right outcome despite the injected perturbation.
- Graceful-degradation rate: the rate at which the agent fails safely (explicit error, no state pollution, no silent wrong answer) when it cannot recover.
A model that drops from 90% happy-path to 30% perturbed with a high graceful-degradation rate is brittle but safe. A model that drops only to 50% perturbed with a low graceful-degradation rate is dangerous: half the time it succeeds, and the other half it produces a confidently wrong answer or leaves polluted state that nobody notices. The agent-eval surveys both flag environmental adversariality as an under-represented evaluation axis and recommend it as a primary deployment gate [3] [4].
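As a worked example of the three numbers, the sketch below computes them from per-task outcomes. The `Outcome` record and `robustness_report` function are illustrative names, and the ten-task data is made up to mirror the "brittle but safe" case: 9 of 10 happy-path passes, 3 of 10 perturbed passes, and 6 of the 7 perturbed failures graceful.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    passed: bool
    graceful: bool = False  # only meaningful when passed is False


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def robustness_report(happy: list, perturbed: list) -> dict:
    """Report the three pass rates side by side."""
    failures = [o for o in perturbed if not o.passed]
    return {
        "happy_pass": rate(sum(o.passed for o in happy), len(happy)),
        "perturbed_pass": rate(sum(o.passed for o in perturbed), len(perturbed)),
        # The denominator is perturbed failures only, not all perturbed runs.
        "graceful_degradation": rate(sum(o.graceful for o in failures), len(failures)),
    }


# Made-up data for illustration.
happy_runs = [Outcome(True)] * 9 + [Outcome(False)]
perturbed_runs = (
    [Outcome(True)] * 3
    + [Outcome(False, graceful=True)] * 6
    + [Outcome(False, graceful=False)]
)
report = robustness_report(happy_runs, perturbed_runs)
```

Returning `None` rather than 0 when there are no failures is a design choice in this sketch: it keeps "no failures observed" from being read as "zero graceful failures".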
A perturbation harness
```python
"""
Wrap a tool with three perturbation modes. Run the agent against each
mode and report happy / perturbed / graceful rates.
"""
from dataclasses import dataclass
import random
import time
from typing import Any, Callable


@dataclass
class ToolResponse:
    status: int
    body: Any


def latency_inject(seconds: float) -> Callable[[Callable], Callable]:
    """Delay every call by a fixed number of seconds."""
    def wrap(fn):
        def wrapped(*a, **kw):
            time.sleep(seconds)
            return fn(*a, **kw)
        return wrapped
    return wrap


def schema_shift(rename: dict[str, str]) -> Callable[[Callable], Callable]:
    """Rename keys in the response body to simulate an API version bump."""
    def wrap(fn):
        def wrapped(*a, **kw):
            r = fn(*a, **kw)
            if isinstance(r.body, dict):
                shifted = {rename.get(k, k): v for k, v in r.body.items()}
                return ToolResponse(status=r.status, body=shifted)
            return r
        return wrapped
    return wrap


def error_inject(error_rate: float, status: int) -> Callable[[Callable], Callable]:
    """Replace a fraction of calls with a transient error response."""
    def wrap(fn):
        def wrapped(*a, **kw):
            if random.random() < error_rate:
                return ToolResponse(status=status, body={"error": "transient"})
            return fn(*a, **kw)
        return wrapped
    return wrap


# One call evaluates one mode. Call it once with perturb=None for the
# happy path, then once per perturbation class.
def evaluate_robustness(
    agent_run: Callable[[dict, Callable | None], tuple[bool, dict]],
    tasks: list[dict],
    perturb: Callable | None,
) -> dict:
    outcomes = []
    for task in tasks:
        # Assumption: agent_run applies `perturb` to the tool under test.
        ok, trace = agent_run(task, perturb)
        graceful = (
            (not ok)
            and trace.get("user_visible_error_clean", False)
            and not trace.get("state_pollution", False)
        )
        outcomes.append({"pass": ok, "graceful": graceful})
    failures = [o for o in outcomes if not o["pass"]]
    # max(1, ...) avoids division by zero on empty inputs.
    return {
        "pass_rate": sum(o["pass"] for o in outcomes) / max(1, len(outcomes)),
        "graceful_degradation_rate":
            sum(o["graceful"] for o in failures) / max(1, len(failures)),
    }
```
The harness above wraps a tool function; in practice you wrap the network client and inject perturbations on the calls you choose. Apply one perturbation at a time so the diagnostic signal stays clean.
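Wrapping at the client level might look like the sketch below. It assumes a client object exposing a `call(tool, **kwargs)` method, which is a hypothetical interface; `PerturbingClient`, `slow`, `flaky`, and `EchoClient` are illustrative names. The wrapper holds exactly one perturbation, so each run exercises a single mode.

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResponse:
    status: int
    body: Any


# A perturbation receives the real call as a zero-argument thunk.
Perturbation = Callable[[Callable[[], ToolResponse]], ToolResponse]


def slow(seconds: float) -> Perturbation:
    def apply(call):
        time.sleep(seconds)
        return call()
    return apply


def flaky(error_rate: float, status: int, rng: random.Random) -> Perturbation:
    def apply(call):
        if rng.random() < error_rate:
            # The real call is skipped, so no side effect happens.
            return ToolResponse(status=status, body={"error": "transient"})
        return call()
    return apply


class PerturbingClient:
    """Perturb only the named tools, with a single perturbation mode."""

    def __init__(self, inner, targets: set, perturbation: Optional[Perturbation] = None):
        self._inner = inner
        self._targets = targets
        self._perturbation = perturbation

    def call(self, tool: str, **kwargs) -> ToolResponse:
        def real() -> ToolResponse:
            return self._inner.call(tool, **kwargs)
        if self._perturbation is not None and tool in self._targets:
            return self._perturbation(real)
        return real()


class EchoClient:
    """Stand-in for a real network client (hypothetical)."""

    def __init__(self):
        self.calls = []

    def call(self, tool: str, **kwargs) -> ToolResponse:
        self.calls.append(tool)
        return ToolResponse(status=200, body={"tool": tool, **kwargs})


inner = EchoClient()
client = PerturbingClient(
    inner, targets={"refund"}, perturbation=flaky(1.0, 429, random.Random(0))
)
```

Passing a seeded `random.Random` keeps a perturbed run reproducible, which matters when a failing trace has to be replayed.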
State pollution is the failure that ships
The most damaging environmental failure mode is not "the agent fails"; it is "the agent succeeds in name and pollutes state." A refund that is recorded twice because the agent retried after a 429 that actually succeeded. A row created twice because the agent did not check before creating. A status flipped twice because the agent treated a 5xx as a no-op.
The fix is idempotency on the tool side and read-before-write on the agent side, both of which need to be exercised by the eval. Add an idempotency-violation check to the harness: for each successful run under transient-error injection, verify the database does not contain duplicate side effects. If duplicates appear, the trace fails on the safety axis even if the user-visible outcome was correct [5].
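One way to express the idempotency-violation check is to compare the side effects observed after a run with the side effects the task was supposed to produce. The sketch below assumes side effects can be read back as simple records; `SideEffect`, `idempotency_violations`, and `passes_safety_axis` are hypothetical names, and how the records are queried from the database is left out.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class SideEffect(NamedTuple):
    kind: str          # e.g. "refund"
    entity_id: str     # e.g. an order id
    amount_cents: int


def idempotency_violations(
    observed: Iterable[SideEffect], expected: Iterable[SideEffect]
) -> Counter:
    """Count side effects that occurred more often than the task called for.

    Counter subtraction keeps only positive counts, so the result holds
    exactly the surplus effects.
    """
    return Counter(observed) - Counter(expected)


def passes_safety_axis(
    observed: Iterable[SideEffect], expected: Iterable[SideEffect]
) -> bool:
    return not idempotency_violations(observed, expected)


# Made-up example: the agent retried and the refund landed twice.
refund = SideEffect("refund", "order-17", 500)
email = SideEffect("email", "order-17", 0)
surplus = idempotency_violations([refund, refund, email], [refund, email])
```

Comparing against expected effects, rather than flagging any repeated record, leaves room for tasks that legitimately perform the same action twice.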
> [!IMPORTANT]
> A graceful failure is a successful outcome on the safety axis. Score the two axes separately, and weight them by what your product cares about. A refund agent should refuse hard when in doubt; a search agent can return weaker results and still be useful.
The workflow comparison
Environmental robustness is one of the places where the workflow-vs-agent tradeoff sharpens. A workflow with deterministic retry, backoff, and idempotency is straightforward to make robust because its control flow is statically inspectable. An agent with the same goal must learn the same behaviour, a capability that TRAJECT-Bench shows is unevenly distributed across models [1]. Anthropic's guide to building effective agents argues for the simplest architecture that solves the problem, and environmental drift is one of the strongest arguments for the simpler one when it suffices [6].
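To make "statically inspectable" concrete, here is a minimal sketch of the workflow side: a fixed retry loop that honours a retry-after hint, backs off exponentially otherwise, and reuses one idempotency key across attempts. The `send(payload, idempotency_key)` callable and the `retry_after` field are assumptions about the tool interface, and jitter is omitted to keep the example deterministic.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResponse:
    status: int
    body: Any
    retry_after: Optional[float] = None  # seconds, if the server sent a hint


RETRYABLE = {429, 503}


def call_with_retry(
    send: Callable[[dict, str], ToolResponse],
    payload: dict,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResponse:
    """Retry transient failures with one idempotency key for all attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Same key on every attempt, so a tool that deduplicates by key
    # applies the side effect at most once.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        resp = send(payload, key)
        if resp.status not in RETRYABLE:
            return resp
        if attempt < max_attempts - 1:
            if resp.retry_after is not None:
                delay = resp.retry_after
            else:
                delay = base_delay * (2 ** attempt)
            sleep(delay)
    # Attempts exhausted: return the last transient error to the caller.
    return resp
```

Every branch of this loop can be read and unit-tested without running a model, which is the property the comparison with an agent turns on.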
What to do this week
- Pick the top three tools by traffic. For each, build an "evil twin": a wrapper that injects latency, schema shifts, and transient errors at configurable rates.
- Re-run your regression set with the evil twin in place. Report happy-path pass, perturbed pass, and graceful-degradation rate.
- Add an idempotency check to the perturbed runs. Treat duplicate side effects as a safety violation regardless of the user-visible outcome.