Computer and browser-use agents

An agent drives a real OS or browser through screenshots, keyboard, and mouse. OSWorld and WebArena are the public benchmarks; production evals are not yet a packaged category.

What "good" looks like

A computer-use or browser-use agent drives a real screen: inputs are screenshots, outputs are keyboard, mouse, and (sometimes) accessibility-tree actions. It is scored by execution-grounded success, meaning a programmatic check confirms the agent actually reached the goal state, not that it looked done.

The primary metric is task success rate on OSWorld (369 desktop tasks across Ubuntu apps) or WebArena (4 self-hosted web domains). The public numbers put the target in context:

OSWorld: humans hit 72.36%; the best agent in the original paper hit 12.24%^[1].
OSWorld, computer-use launch: Anthropic's Claude 3.5 Sonnet reported 14.9% in the screenshot-only setting and 22.0% when allowed more steps per task^[2].
WebArena: humans hit 78.24%; GPT-4 hit 14.41% in the original paper^[3].

The headline gap is large: in both original papers, humans succeed roughly five to six times as often as the best agent. The category is still nascent.

Production evals do not yet exist as a packaged category here. Most teams roll their own with pytest plus screenshot diffs. Treat the public benchmarks as the floor, not the bar; your domain workflows are the actual eval.

flowchart TD
    TASK["Workflow task"] --> AGENT["Agent: screenshots in, mouse/keyboard out"]
    AGENT --> CAP{"Actions under the cap?"}
    CAP -->|no| LOOP["Kill run: likely a loop"]
    CAP -->|yes| STATE{"State-check: goal state reached?"}
    STATE -->|no| FAIL["Fail, keep the screen recording"]
    STATE -->|yes| SAFE{"Any dangerous action: delete, send-money, mass-email?"}
    SAFE -->|yes| REVIEW["Route to safety classifier review"]
    SAFE -->|no| PASS["Count as success"]
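The gates in the flowchart can be sketched as a single classification function. This is a minimal illustration, not a reference implementation: the `Action` record, the `Outcome` labels, the action-kind strings, and the default cap of 30 are all assumptions chosen for the example, and `goal_reached` stands in for whatever state-check your workflow defines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Outcome(Enum):
    KILLED_LOOP = "killed_loop"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"
    SUCCESS = "success"


# Hypothetical action labels; replace with whatever your harness emits.
DANGEROUS_KINDS = frozenset({"delete", "send_money", "mass_email"})


@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "click", "type", "delete"
    detail: str = ""


def classify_run(
    actions: Sequence[Action],
    goal_reached: Callable[[], bool],
    max_actions: int = 30,
) -> Outcome:
    """Apply the three gates in order: cap, state-check, dangerous action."""
    if len(actions) > max_actions:
        return Outcome.KILLED_LOOP
    if not goal_reached():
        return Outcome.FAILED
    if any(a.kind in DANGEROUS_KINDS for a in actions):
        return Outcome.NEEDS_REVIEW
    return Outcome.SUCCESS
```

The order matters: the state-check runs before the safety gate, so a run only reaches review if it would otherwise have counted as a success.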

Secondary metrics

Metric	What it catches	How to score
Steps to completion	Inefficient or looping runs	Median and 95th percentile actions per successful task
Mis-clicks	Wasted or wrong actions	Clicks on coordinates that do not advance the workflow
False completion	Agent claims success when the goal state is not reached; the most dangerous mode for destructive workflows	Compare claimed success against the deterministic state-check
Screen-state divergence	Drift from the expected end state	Diff against a known target state, when the workflow has one
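As a rough sketch of how two of these rows might be computed from run logs: the `RunRecord` fields below are hypothetical, the 95th percentile uses the nearest-rank definition (other definitions interpolate), and false completion is taken here as the share of claimed successes that the state-check rejects.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunRecord:
    steps: int              # actions the agent took
    claimed_success: bool   # what the agent reported
    verified_success: bool  # what the deterministic state-check found


def nearest_rank_percentile(values: list[int], pct: float) -> int:
    """Smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    # Step statistics are computed over verified successes only.
    ok_steps = [r.steps for r in runs if r.verified_success]
    claimed = [r for r in runs if r.claimed_success]
    false_completions = [r for r in claimed if not r.verified_success]
    return {
        "median_steps": median(ok_steps) if ok_steps else float("nan"),
        "p95_steps": nearest_rank_percentile(ok_steps, 95) if ok_steps else float("nan"),
        "false_completion_rate": len(false_completions) / len(claimed) if claimed else 0.0,
    }
```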

Offline eval design

Start with public breadth. OSWorld, WebArena, and Mind2Web for coverage, plus VisualWebArena for multimodal web tasks^[4].

Then build the set that predicts whether you ship: an internal "top 20 workflows our users do" set with deterministic completion checks. The check should be programmatic:

A row appearing in Salesforce with specific fields.
An email landing in a specific folder.
A file existing at a specific path.

LLM-as-a-judge (LaaJ) scoring on screenshots is unreliable; deterministic state checks are the only honest score.
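A state check of this kind can be a small function that returns True only when the goal state holds. The sketch below assumes two shapes: a file check against the local filesystem, and a generic record check where `fetch_record` is a hypothetical callable you would write around your own CRM or mail client. No real Salesforce or email API is used here.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional

# A state check takes no arguments and returns True only if the goal state holds.
StateCheck = Callable[[], bool]


def file_exists_check(path: Path, min_bytes: int = 1) -> StateCheck:
    """Goal: a non-empty file exists at `path`."""
    def check() -> bool:
        return path.is_file() and path.stat().st_size >= min_bytes
    return check


def record_fields_check(
    fetch_record: Callable[[], Optional[Mapping[str, object]]],
    expected: Mapping[str, object],
) -> StateCheck:
    """Goal: a record exists and every expected field matches exactly.

    `fetch_record` is a hypothetical callable supplied by the caller; it
    returns None when the record is absent.
    """
    def check() -> bool:
        record = fetch_record()
        if record is None:
            return False
        return all(record.get(key) == value for key, value in expected.items())
    return check
```

Building checks as closures keeps the eval set declarative: each workflow carries one `StateCheck`, and the harness calls it after the run without knowing what it inspects.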

Cap actions per run. The cap doubles as a cost control and a damage control; an agent that loops indefinitely is both expensive and dangerous. Most workflows should be solvable in 10 to 30 actions.
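One way to enforce the cap is in the run loop itself, so the agent cannot take action 31 at all. In this sketch `next_action` and `execute` are hypothetical callables standing in for the agent and the environment, and the agent signals completion by returning None.

```python
from typing import Callable, Optional


def run_with_cap(
    next_action: Callable[[int], Optional[str]],
    execute: Callable[[str], None],
    max_actions: int = 30,
) -> tuple[int, bool]:
    """Drive the agent for at most `max_actions` steps.

    Returns (actions_taken, hit_cap). `hit_cap` is True when the budget ran
    out before the agent declared itself done.
    """
    for step in range(max_actions):
        action = next_action(step)
        if action is None:  # agent says it is finished
            return step, False
        execute(action)
    return max_actions, True
```

A run that returns `hit_cap=True` should be scored as a failure regardless of screen state, since the agent never declared completion inside its budget.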

Online eval design

Screen-record every run. Storage cost is real (hours of video per day per agent), but the record is the only way to debug failures. Sample-review every Nth failed run.
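Sampling every Nth failure can be as simple as a counter over the run stream. This sketch assumes runs arrive as hypothetical `(run_id, succeeded)` pairs; the choice of N is left to the team's review capacity.

```python
from typing import Iterable, Iterator


def every_nth_failure(runs: Iterable[tuple[str, bool]], n: int) -> Iterator[str]:
    """Yield the run id of every n-th failed run, in arrival order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    failures = 0
    for run_id, succeeded in runs:
        if succeeded:
            continue
        failures += 1
        if failures % n == 0:
            yield run_id
```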

Alert on two triggers: success-rate drops, and dangerous-action triggers (delete, send-money, mass-email). Anthropic shipped new classifiers specifically for the dangerous-action category at computer-use launch^[2]; the same post explicitly recommends "lower-risk applications".
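The two triggers can be sketched as a small monitor that is fed one run at a time. Everything here is an assumption for illustration: the class name, the action-kind strings, the rolling window of 50 runs, and the 10-point drop threshold. The keyword match on action kinds is a stand-in and does not represent how a safety classifier works.

```python
from collections import deque

# Hypothetical action labels; a real deployment would use a classifier.
DANGEROUS_KINDS = frozenset({"delete", "send_money", "mass_email"})


class RunMonitor:
    def __init__(self, baseline: float, max_drop: float = 0.10, window: int = 50):
        self.baseline = baseline    # expected success rate, 0..1
        self.max_drop = max_drop    # tolerated absolute drop before alerting
        self._recent: deque[bool] = deque(maxlen=window)

    def record(self, succeeded: bool, action_kinds: list[str]) -> list[str]:
        """Record one finished run and return any alerts it triggers."""
        alerts: list[str] = []
        hits = DANGEROUS_KINDS & set(action_kinds)
        if hits:
            alerts.append("dangerous-action: " + ", ".join(sorted(hits)))
        self._recent.append(succeeded)
        # Only judge the rate once the window is full, to avoid noisy alerts.
        if len(self._recent) == self._recent.maxlen:
            rate = sum(self._recent) / len(self._recent)
            if rate < self.baseline - self.max_drop:
                alerts.append(
                    f"success-rate drop: {rate:.2f} vs baseline {self.baseline:.2f}"
                )
        return alerts
```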

The hardest part is the action sandbox. With real production credentials, every failed run is a potential incident. With staging credentials, the eval distribution shifts away from production. There is no clean answer; most teams run staging for most evals and reserve production runs for a small monitored sample.
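The staging-versus-production split can be made deterministic by hashing the run id, so the same run always lands in the same environment and the production sample is auditable. The function name and the 2% default below are illustrative assumptions, not a recommendation from any of the cited sources.

```python
import hashlib


def choose_environment(run_id: str, production_fraction: float = 0.02) -> str:
    """Route a small, stable fraction of runs to production credentials."""
    if not 0.0 <= production_fraction <= 1.0:
        raise ValueError("production_fraction must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "production" if bucket < production_fraction else "staging"
```

Runs routed to production are the ones to pair with full screen recording and dangerous-action alerting, since those are the runs where a failure can become an incident.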

Top benchmarks

OSWorld, https://os-world.github.io/. 369 desktop tasks with 134 unique evaluation functions; the standard for OS-level computer-use.
WebArena, https://arxiv.org/abs/2307.13854. 4 self-hosted web domains; the standard for browser-use^[3].
VisualWebArena, https://arxiv.org/abs/2401.13649. Multimodal extension of WebArena^[4].

Tools especially suited

OSWorld harness: parity with public scores.
Browser-Use and Playwright: custom verifiers and screenshot capture.
Inspect AI: sandboxed runs with strong logging primitives.
WebArena self-hosted environment: reproducible browser evals.

Common pitfalls

Pitfall	Why it bites	Fix
No deterministic verifiers	LaaJ on screenshots is unreliable	Build state-check functions for every workflow you eval
No safety eval	Computer-use can do destructive things	Add safety classifiers; Anthropic shipped new ones at launch for exactly this reason^[2]
Headline OSWorld numbers without the subset	Some teams exclude the 8 Google Drive tasks because credential issues make them flaky	State which of the 369 tasks were scored
Conflating "reached the end" with "reached the goal"	A run can finish without hitting the target state	The state-check is the only honest signal
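For the subset pitfall, one hedge is to make the reporting function incapable of printing a score without its denominator. A minimal sketch, with hypothetical task ids and a hypothetical `report_success` helper:

```python
def report_success(results: dict[str, bool], total_tasks: int, excluded: set[str]) -> str:
    """Format a success rate together with the subset it was computed on."""
    scored = {task: ok for task, ok in results.items() if task not in excluded}
    passed = sum(scored.values())
    rate = 100 * passed / len(scored) if scored else 0.0
    return (
        f"{rate:.2f}% ({passed}/{len(scored)} scored; "
        f"{len(excluded)} of {total_tasks} excluded)"
    )
```

Called on a toy result set, it returns a string such as `66.67% (2/3 scored; 1 of 4 excluded)`, which keeps the exclusion visible wherever the number is quoted.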

Canonical reading

Anthropic's computer-use launch post is the most useful single read because it includes both the OSWorld score and an explicit recommendation about deployment risk^[2]. The line that matters: the capability "remains experimental and imperfect", and the post recommends "lower-risk applications". The OSWorld paper itself is the canonical benchmark reference^[1].

Day-1 starter

20 workflows with state-check functions. The state check is the eval; without it, scoring is impressionistic.
Screenshot logger plus a diff viewer. The viewer is the debugging surface; without it, you cannot understand why a run failed.
Hard cap on actions per run. Bounds cost and bounds damage. Most workflows should be solvable in 10 to 30 actions; longer runs are usually loops.

Cross-cutting links

Trajectory-level evaluation: scoring action sequences, not just task completion.
Environmental drift (robustness to tool and API change): what breaks when the OS, site, or app under the agent updates.
Agent trajectory evaluation: the technique reference for step-level scoring.
Inspect AI: sandboxed runs with screenshot logging.
Agent eval with τ-bench: the nearest runnable pattern for state-check agent evals.