What "good" looks like
Computer-use and browser-use agents drive real screens. Inputs are screenshots; outputs are keyboard, mouse, and (sometimes) accessibility-tree actions. The primary metric is execution-grounded success on OSWorld (369 desktop tasks across Ubuntu apps) or WebArena (4 self-hosted web domains). On OSWorld, humans reach 72.36%; the best agent in the original paper reached 12.24%, and Anthropic's Claude 3.5 Sonnet computer-use launch reported 14.9% screenshot-only and 22.0% when given more steps[1][2]. On WebArena, humans reach 78.24%; GPT-4 reached 14.41% in the original paper[3]. The gap between human and agent scores is large enough that the category is still nascent.
Production evals for this area do not yet exist as a packaged category. Most teams roll their own with pytest plus screenshot diffs. Treat the public benchmarks as the floor, not the bar; your domain workflows are the actual eval.
Secondary metrics
- Steps to completion: median and 95th-percentile actions per successful task.
- Mis-clicks: clicks on coordinates that do not advance the workflow.
- False completion: the agent claims success when the goal state is not reached. This is the most dangerous failure mode for destructive workflows.
- Screen-state divergence from expected: measurable when the workflow has a known target state.
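As a rough illustration, the secondary metrics above can be computed from per-run records. This is a minimal sketch: the `RunRecord` fields and the `summarize` helper are hypothetical names, and it assumes your harness already labels wasted clicks and records both the agent's claim and the verified outcome.

```python
import math
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass
class RunRecord:
    """One agent run. All field names are hypothetical."""
    actions: int            # total actions the agent took
    clicks: int             # total clicks
    wasted_clicks: int      # clicks that did not advance the workflow
    claimed_success: bool   # what the agent reported
    verified_success: bool  # what the deterministic state check found


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run")
    # Step statistics are computed over verified successes only.
    steps = sorted(r.actions for r in runs if r.verified_success)
    if len(steps) >= 2:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(steps, n=20, method="inclusive")[-1]
    elif steps:
        p95 = float(steps[0])
    else:
        p95 = math.nan
    total_clicks = sum(r.clicks for r in runs)
    wasted = sum(r.wasted_clicks for r in runs)
    false_done = sum(r.claimed_success and not r.verified_success for r in runs)
    return {
        "success_rate": len(steps) / len(runs),
        "median_steps": float(median(steps)) if steps else math.nan,
        "p95_steps": p95,
        "misclick_rate": wasted / total_clicks if total_clicks else 0.0,
        "false_completion_rate": false_done / len(runs),
    }
```

Tracking the false-completion rate separately from the success rate keeps the most dangerous failure mode visible even when overall success looks healthy.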
Offline eval design
Use OSWorld, WebArena, and Mind2Web for breadth, and VisualWebArena for multimodal web tasks[4]. On top of those, build an internal "top 20 workflows our users do" set with deterministic completion checks. Each check should be programmatic: a row appearing in Salesforce with specific fields, an email landing in a specific folder, a file existing at a specific path. LLM-as-a-judge (LaaJ) scoring on screenshots is unreliable; deterministic state checks are the only honest score.
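A deterministic completion check can be as small as a function that returns a boolean. The sketch below shows two such checks, assuming the target state is reachable from the test process; `sqlite3` stands in for whatever system of record the workflow writes to (a CRM would need its own client), and the function names are illustrative.

```python
import sqlite3
from pathlib import Path


def file_exists_check(path: Path, min_bytes: int = 1) -> bool:
    """Pass only if the expected artifact exists and is non-empty."""
    return path.is_file() and path.stat().st_size >= min_bytes


def row_exists_check(conn: sqlite3.Connection, table: str,
                     expected: dict[str, object]) -> bool:
    """Pass only if at least one row matches every expected field exactly.

    `table` and the keys of `expected` come from the eval definition, not
    from the agent, so they are treated as trusted identifiers here.
    """
    if not expected:
        raise ValueError("expected must name at least one field")
    where = " AND ".join(f"{col} = ?" for col in expected)
    query = f"SELECT 1 FROM {table} WHERE {where} LIMIT 1"
    return conn.execute(query, tuple(expected.values())).fetchone() is not None
```

Because each check is a plain function, it can be called from a pytest test after the agent run finishes, which keeps the score independent of what the agent says it did.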
Cap actions per run. The cap doubles as a cost control and a damage control; an agent that loops indefinitely is both expensive and dangerous. Most workflows should be solvable in 10 to 30 actions.
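One way to enforce the cap is in the loop that drives the agent, so no policy error can exceed it. This is a minimal sketch: `next_action` and `execute` are hypothetical callables standing in for the model call and the input driver, and the default of 30 simply mirrors the upper end of the range above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    actions_taken: int
    finished: bool  # the agent declared it was done
    hit_cap: bool   # the run was cut off by the cap


def run_with_cap(next_action: Callable[[int], Optional[str]],
                 execute: Callable[[str], None],
                 max_actions: int = 30) -> RunResult:
    """Drive the agent until it returns None (done) or the cap is reached."""
    for step in range(max_actions):
        action = next_action(step)
        if action is None:
            return RunResult(actions_taken=step, finished=True, hit_cap=False)
        execute(action)
    return RunResult(actions_taken=max_actions, finished=False, hit_cap=True)
```

Recording `hit_cap` separately from failure makes looping runs easy to count and review on their own.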
Online eval design
Screen-record every run. Storage cost is real (hours of video per day per agent) but the record is the only way to debug failures. Sample-review every Nth failed run. Alert on success-rate drops and on dangerous-action triggers (delete, send-money, mass-email). Anthropic shipped new classifiers specifically for the dangerous-action category at computer-use launch[2]; the recommendation for "lower-risk applications" is explicit.
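The sampling and alerting rules above fit in a small in-process monitor. The sketch below is illustrative only: it assumes each action already carries a label such as `delete`, and the class name, thresholds, and alert strings are hypothetical choices, not a reference to any vendor's classifier.

```python
from collections import deque

# Labels taken from the examples in the text; real labels depend on your harness.
DANGEROUS_ACTIONS = frozenset({"delete", "send-money", "mass-email"})


class RunMonitor:
    def __init__(self, review_every_n: int = 10, window: int = 50,
                 min_success_rate: float = 0.5) -> None:
        self.review_every_n = review_every_n
        self.min_success_rate = min_success_rate
        self.recent = deque(maxlen=window)  # rolling success/failure window
        self.failed_runs = 0
        self.review_queue: list[str] = []

    def record(self, run_id: str, success: bool,
               action_labels: list[str]) -> list[str]:
        """Record one finished run and return any alerts it triggers."""
        alerts = [f"dangerous-action:{a}" for a in action_labels
                  if a in DANGEROUS_ACTIONS]
        if not success:
            self.failed_runs += 1
            # Queue every Nth failed run for human review of its recording.
            if self.failed_runs % self.review_every_n == 0:
                self.review_queue.append(run_id)
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) < self.min_success_rate:
            alerts.append("success-rate-drop")
        return alerts
```

The success-rate alert only fires once the window is full, which avoids paging on the first few runs after a deploy.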
The hardest part of online eval here is the action sandbox. If your agent has real production credentials, every failed run is a potential incident; if it has staging credentials, the eval distribution shifts away from production. There is no clean answer; most teams run staging for most evals and reserve production runs for a small monitored sample.
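The staging-versus-production split can be made reproducible by deriving it from the run ID rather than from a random draw. This is a sketch under assumptions: the 2% default is an arbitrary placeholder, and `choose_environment` is a hypothetical helper, not part of any harness named here.

```python
import hashlib


def choose_environment(run_id: str, production_fraction: float = 0.02) -> str:
    """Route a small, deterministic sample of runs to production.

    Hashing the run ID gives the same routing decision on every retry,
    so a production run can be audited and replayed later.
    """
    if not 0.0 <= production_fraction <= 1.0:
        raise ValueError("production_fraction must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "production" if bucket < production_fraction else "staging"
```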
Top tools
The OSWorld harness for parity with public scores. Browser-Use and Playwright for custom verifiers and screenshot capture. Inspect AI for sandboxed runs with strong logging primitives. WebArena's self-hosted environment for reproducible browser evals.
Common pitfalls
- No deterministic verifiers. LLM-as-a-judge scoring on screenshots is unreliable; build state-check functions for every workflow you eval.
- No safety eval. Computer-use agents can take destructive actions; Anthropic shipped new safety classifiers at launch for exactly this reason[2].
- Reporting headline OSWorld numbers without specifying which of the 369 tasks were scored. Some teams exclude the 8 Google Drive tasks because credential issues make them flaky; the published score should say which subset was scored.
- Conflating "agent reached the end" with "agent reached the goal". The state check is the only honest signal.
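The subset-reporting pitfall is easy to see with arithmetic: the same number of passed tasks yields a different headline depending on the denominator. The sketch below uses a hypothetical pass count of 80 purely for illustration, and `format_score` is an invented helper that forces the subset into the reported string.

```python
def format_score(passed: int, scored: int, full_size: int = 369,
                 excluded: str = "") -> str:
    """Render a score that always states which subset was scored."""
    if not 0 <= passed <= scored <= full_size:
        raise ValueError("expected 0 <= passed <= scored <= full_size")
    pct = 100 * passed / scored
    if scored == full_size:
        subset = "full set"
    else:
        subset = f"excluding {excluded or 'unspecified tasks'}"
    return f"{pct:.1f}% ({passed}/{scored} of {full_size} tasks, {subset})"


# Hypothetical pass count, shown against both denominators.
print(format_score(80, 369))
print(format_score(80, 361, excluded="8 Google Drive tasks"))
```

With these illustrative inputs the two lines differ by about half a percentage point, which is enough to reorder entries on a leaderboard.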
Canonical reading
Anthropic's computer-use launch post is the most useful single read because it includes both the score on OSWorld and an explicit recommendation about deployment risk[2]. The line that matters: the capability "remains experimental and imperfect", and the post recommends "lower-risk applications". The OSWorld paper itself is the canonical benchmark reference[1].
Day-1 starter
- 20 workflows with state-check functions. The state check is the eval; without it, scoring is impressionistic.
- Screenshot logger plus a diff viewer. The viewer is the debugging surface; without it, you cannot understand why a run failed.
- Hard cap on actions per run. Bounds cost and bounds damage. Most workflows should be solvable in 10 to 30 actions; longer runs are usually loops.
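For the screenshot logger item above, a standard-library sketch is enough to get started. It assumes the harness hands over each frame as PNG bytes; `ScreenshotLogger` and the manifest layout are hypothetical, and the hash comparison is only a cheap "did the screen change" signal, not a pixel-level diff.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


class ScreenshotLogger:
    """Write one PNG per step plus a JSON-lines manifest for a later viewer."""

    def __init__(self, run_dir: Path) -> None:
        self.run_dir = run_dir
        run_dir.mkdir(parents=True, exist_ok=True)
        self.manifest = run_dir / "manifest.jsonl"
        self._last_hash: Optional[str] = None
        self._step = 0

    def log(self, png_bytes: bytes, action: str) -> dict:
        digest = hashlib.sha256(png_bytes).hexdigest()
        name = f"step_{self._step:04d}.png"
        (self.run_dir / name).write_bytes(png_bytes)
        entry = {
            "step": self._step,
            "action": action,
            "file": name,
            "sha256": digest,
            # False means the frame is byte-identical to the previous one,
            # a hint that the last action had no visible effect.
            "changed": digest != self._last_hash,
        }
        with self.manifest.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
        self._last_hash = digest
        self._step += 1
        return entry
```

Runs of consecutive `"changed": false` entries are a quick way to spot loops before opening the recording.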