In April 2024, the best AI agent scored 12.24% on OSWorld, a benchmark that tests whether models can actually operate a real computer. Humans scored 72.36%. Eighteen months later, multiple agents have crossed that human baseline. Simular's Agent S hit 72.6%. OpenAI's Computer-Using Agent landed at 38.1% in January 2025 and the field kept climbing. That 12-to-72 trajectory is the steepest improvement curve on any major agent benchmark, and it's worth asking what it actually proves.
The Benchmark That Actually Matters
OSWorld, introduced by Xie et al. at NeurIPS 2024, does something most agent benchmarks don't bother with: it drops models into a real operating system. Not a sandboxed API. An actual Ubuntu or Windows desktop with 369 tasks spanning file management, web browsing, and multi-app workflows. The agent either completed the task or it didn't. No partial credit.
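To make the setup concrete, here is a minimal sketch of what an OSWorld-style harness does: the agent gets raw screenshots, emits GUI actions, and a per-task checker returns a binary pass or fail at the end. Everything below (the class names, the fake environment, the random agent, the pass rate) is illustrative and assumed for the sketch, not the benchmark's actual API; the real thing drives a full VM and runs a task-specific evaluation script.

```python
# Minimal sketch of an OSWorld-style evaluation loop. Names are illustrative,
# not the benchmark's real API.
import random
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                          # "click", "type", "hotkey", "done", ...
    args: dict = field(default_factory=dict)

class FakeDesktopEnv:
    """Stand-in for the real Ubuntu/Windows VM, used only to show the protocol."""
    def reset(self, task_id: str) -> bytes:
        self.task_id = task_id
        return b"<initial screenshot>"

    def step(self, action: Action) -> bytes:
        return b"<next screenshot>"

    def evaluate(self) -> bool:
        # Real OSWorld runs a task-specific checker against the final OS state.
        return random.random() < 0.12  # stand-in for a launch-era pass rate

class RandomAgent:
    def act(self, screenshot: bytes) -> Action:
        return random.choice([Action("click", {"x": 100, "y": 200}), Action("done")])

def run_task(agent, env, task_id: str, max_steps: int = 15) -> bool:
    obs = env.reset(task_id)
    for _ in range(max_steps):
        action = agent.act(obs)        # model maps pixels to the next GUI action
        if action.kind == "done":
            break
        obs = env.step(action)
    return env.evaluate()              # binary outcome, no partial credit

tasks = [f"task_{i}" for i in range(369)]
score = sum(run_task(RandomAgent(), FakeDesktopEnv(), t) for t in tasks) / len(tasks)
print(f"success rate: {score:.2%}")
```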
When it launched, GPT-4V managed 12.24%. Claude 3.5 Sonnet posted 14.9% in the screenshot-only setting and 22% when allowed more steps. These numbers established a baseline for a capability nobody had seriously measured before: can a model use a computer the way you and I do?
The answer was "barely." The models struggled with GUI grounding, the ability to look at a screen and figure out where to click. They struggled harder with operational knowledge: that you need to save a file before closing an application, or that a dropdown menu requires a specific click sequence.
How Agents Closed a 60-Point Gap
The jump from 12% to 72% didn't come from a single breakthrough. It came from a stack of engineering fixes applied on top of better foundation models.
Agent S2, from Agashe et al., introduced two ideas that mattered: Mixture-of-Grounding for precise GUI element localization, and Proactive Hierarchical Planning that breaks tasks into sub-goals. This pushed scores past 34% on 50-step evaluations and beat Claude Computer Use by 32.7%. It also improved on the previous best WindowsAgentArena result by 52.8%, suggesting the techniques generalize across operating systems.
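In code, the two ideas reduce to a planner that commits to sub-goals up front and a router over grounding specialists. The sketch below is a loose illustration under my own naming, not Simular's implementation; the experts, coordinates, and sub-goals are toy stand-ins.

```python
# Toy illustration of the two Agent S2 ideas named above: a planner that
# decomposes the task into sub-goals, and a mixture of grounding "experts"
# that turns each target element into screen coordinates.
from typing import Callable, Optional

Coords = tuple[int, int]
GroundingExpert = Callable[[bytes, str], Optional[Coords]]

def ocr_expert(screenshot: bytes, target: str) -> Optional[Coords]:
    # Pretend OCR located a visible text label.
    return (640, 410) if "button" in target else None

def vision_expert(screenshot: bytes, target: str) -> Optional[Coords]:
    # Pixel-level fallback for icons and unlabeled widgets.
    return (120, 48)

def ground(screenshot: bytes, target: str,
           experts: list[GroundingExpert]) -> Coords:
    """Mixture-of-grounding in miniature: route to the first expert with a hit."""
    for expert in experts:
        hit = expert(screenshot, target)
        if hit is not None:
            return hit
    raise LookupError(f"could not ground {target!r}")

def plan(task: str) -> list[str]:
    """Hierarchical planning in miniature: commit to sub-goals up front,
    then revise them as observations arrive (revision omitted here)."""
    return [f"locate the Save button for: {task}",
            f"confirm the overwrite dialog for: {task}"]

screenshot = b"<fake screenshot>"
for subgoal in plan("export the spreadsheet as PDF"):
    x, y = ground(screenshot, subgoal, [ocr_expert, vision_expert])
    print(f"{subgoal} -> click at ({x}, {y})")
```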
Then came Agent S3 and a paper with a title that doesn't oversell itself: "The Unreasonable Effectiveness of Scaling Agents for Computer Use." Simular's team simplified the Agent S2 framework, added a native coding agent, and introduced Behavior Best-of-N (bBoN). The idea is straightforward: run the same task multiple times with different agent instances, then pick the best result. This brute-force approach pushed accuracy from 62.6% to 69.9%. Eventually, the full Agent S system crossed 72.6%.
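In sketch form, bBoN fits in a couple dozen lines: launch N independent attempts, summarize what each one did, and let a judge pick the winner. The judge and the numbers below are placeholders I've made up; Agent S3 uses a model to compare behavior narratives, which the toy heuristic here only gestures at.

```python
# Behavior Best-of-N (bBoN), reduced to a toy: run N independent rollouts
# of the same task and select the best trajectory after the fact.
import random

def rollout(task: str, seed: int) -> dict:
    """One full attempt at the task; returns a behavior summary and outcome."""
    rng = random.Random(seed)
    steps = rng.randint(8, 30)
    return {
        "seed": seed,
        "steps": steps,
        "behavior": f"attempt {seed}: worked on '{task}' for {steps} steps",
        "succeeded": rng.random() < 0.6,   # made-up per-attempt success rate
    }

def judge(candidates: list[dict]) -> dict:
    """Pick the best behavior; a stand-in for a model comparing narratives."""
    return max(candidates, key=lambda c: (c["succeeded"], -c["steps"]))

def behavior_best_of_n(task: str, n: int = 10) -> dict:
    # Run n independent agent instances on the same task, then select one result.
    return judge([rollout(task, seed) for seed in range(n)])

best = behavior_best_of_n("rename every .txt file in the reports folder")
print(best["behavior"], "| succeeded:", best["succeeded"])
```

The reason brute force keeps paying is arithmetic: if a single attempt succeeds with probability p and the judge reliably recognizes a success, N independent attempts give roughly a 1 - (1 - p)^N chance that at least one good trajectory exists to pick.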
OpenAI took a different path. Their Computer-Using Agent combined GPT-4o's vision with reinforcement learning, scoring 38.1% on OSWorld and 87% on WebVoyager. The gap between those two numbers tells the story: WebVoyager tests curated websites with predictable layouts, while OSWorld throws the entire messiness of a desktop OS at the agent.
This tracks with a pattern I keep seeing across different types of agents: performance looks impressive on constrained tasks and falls apart as the environment gets more open-ended.

The Score Hides the Cost
Here's where the numbers get uncomfortable. A team from UC San Diego published OSWorld-Human in June 2025, and their finding deserves more attention than it got: even the highest-scoring agents take 1.4 to 2.7 times more steps than a human would need. Later steps also take roughly three times as long as the early ones, because the model burns tokens on planning and reflection calls.
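A back-of-envelope version of that accounting, using made-up numbers and a simple steps-and-seconds comparison rather than OSWorld-Human's exact metrics, looks like this:

```python
# Compare an agent trajectory against a human reference in both step count
# and wall-clock time. The example numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: int
    seconds: float          # wall-clock time, including model calls

def overhead(agent: Trajectory, human: Trajectory) -> dict:
    return {
        "step_ratio": agent.steps / human.steps,       # paper reports 1.4-2.7x
        "time_ratio": agent.seconds / human.seconds,
        "sec_per_step_agent": agent.seconds / agent.steps,
        "sec_per_step_human": human.seconds / human.steps,
    }

human = Trajectory(steps=9, seconds=120)    # the two-minute human task
agent = Trajectory(steps=22, seconds=600)   # the ten-minute agent run
print(overhead(agent, human))
# step_ratio ~2.4, time_ratio 5.0: some waste is extra steps, most is time per step
```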
A task a human completes in two minutes can take an agent ten. That's the difference between a useful tool and a tech demo.
The OS Agents survey, accepted as an ACL 2025 Oral, confirms the pattern across computers, phones, and browsers: accuracy improves faster than efficiency. Models finish more tasks, but they waste steps on actions a human would never consider.
This is the when-agents-meet-reality problem playing out in a new domain. Benchmark scores measure whether the job gets done. Production systems care about whether it gets done fast enough to be worth doing.
What This Means for Agent Builders
The 12-to-72 trajectory tells us something real: multimodal models can learn to operate graphical interfaces. The grounding problem, which looked unsolvable two years ago, now has multiple working solutions. The bBoN scaling result from Agent S3 suggests that throwing more compute at inference time continues to pay off for GUI tasks, even without better models.
But the efficiency gap should worry anyone building products on top of this. Users won't wait eight minutes for an agent to fill out a form they could complete in ninety seconds. The path forward isn't just higher accuracy. It's fewer wasted steps.
The 72% number will keep climbing. The number that matters more, the one nobody puts in their press release, is how many minutes the agent burns getting there.

Sources
Research Papers:
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie et al. (2024)
- Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents — Agashe et al. (2025)
- The Unreasonable Effectiveness of Scaling Agents for Computer Use — Simular AI (2025)
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents — Abhyankar et al. (2025)
- Large Language Model-Brained GUI Agents: A Survey — Zhang et al. (2024)
- OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use — Hu et al. (ACL 2025)
Industry:
- Computer-Using Agent — OpenAI (2025)
- Developing a Computer Use Model — Anthropic (2024)