Evidence base: OSWorld, VisualWebArena, OSWorld-Human, OpenAI CUA, Anthropic computer use, OpenCUA, tau-squared-bench, and CoVe.

Multimodal agents have made real progress. In October 2024, Anthropic said Claude 3.5 Sonnet scored 14.9% on OSWorld in the screenshot-only category, or 22.0% with more steps. OpenAI later reported CUA at 38.1% on OSWorld, 58.1% on WebArena, and 87.0% on WebVoyager. The open-source side has moved too: OpenCUA-72B reports 45.0% on OSWorld-Verified, 60.8% on ScreenSpot-Pro, and 37.4% on UI-Vision.

That is improvement. It is not a production story yet.

Key takeaways

  • Multimodal agents are better at perception and tool use, but action reliability still breaks on messy interface paths.
  • Workflow grounding matters more than one-off visual understanding once an agent touches real software state.
  • Verification is becoming the more useful frontier than another headline image-understanding score.

What multimodal agents improved

The jump is easiest to see in computer-use tests. OSWorld tests agents across 369 real computer tasks involving desktop apps, file operations, web apps, and multi-application workflows. VisualWebArena added 910 visually grounded web tasks and found the best VLM agents at 16.4% success against 88.7% human performance. These tests are imperfect, but they ask the right class of question: can the model turn pixels into useful actions?

The answer is less embarrassing than it was. That matters for the /models-frontiers/ track because multimodality is no longer just captioning, diagram reading, or video summary. It is becoming an interface layer for agents. The older Swarm Signal read, "When Models See and Speak," was about the gap between seeing and acting. That gap has narrowed. The failure mode has changed.

Why multimodal agents still miss workflows

The problem is that workflows punish almost-correct actions. A model can identify the right button, click the wrong duplicate, recover with a backtrack, then make a stale assumption three steps later. Visual perception helped. The task still failed.

OSWorld-Human makes this concrete. Its authors evaluated 16 computer-use agents, found leading systems still take 2.7 to 4.3 more steps than human reference trajectories, and tied planning, reflection, and judging calls to runtime bottlenecks. That is the uncomfortable part: extra cognition can make the agent slower without making the workflow safer.

The same pattern shows up outside GUIs. MCP-Bench connects agents to 28 MCP servers and 250 tools, then tests cross-tool workflows rather than isolated API calls. tau-squared-bench tests a dual-control setting where both the agent and the user act in a shared environment, and performance drops when agents have to guide users. These are workflow-grounding tests, not eye tests.

That is why computer-use agents should be read alongside agent evals and agent verification, not as a branch of vision alone.

Operator takeaway

Treat multimodal agents as workflow components, not autonomous staff. Start with bounded tasks where the UI state is observable, the allowed actions are explicit, and every state-changing action can be checked before execution.

CoVe points in the right direction. Its March 2026 paper reports CoVe-4B at 43.0% success in Airline and 59.4% in Retail on tau-squared-bench by using task constraints as deterministic verifiers. The lesson is not that small models suddenly beat frontier labs. The lesson is that explicit constraints improve tool use because they give the agent something firmer than a screenshot and a hope.
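
To make "task constraints as deterministic verifiers" concrete, here is a minimal Python sketch. It is not CoVe's implementation. The tool names (issue_refund, cancel_order), the state fields, and both rules are hypothetical, retail-flavored stand-ins. The point is only that each constraint is plain code that approves or rejects a proposed action before anything executes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ProposedAction:
    """A tool call the agent wants to make (hypothetical shape)."""
    tool: str
    args: dict


# A constraint returns None when satisfied, or a reason string when violated.
Constraint = Callable[[ProposedAction, dict], Optional[str]]


def refund_within_order_total(action: ProposedAction, state: dict) -> Optional[str]:
    # Hypothetical rule: a refund may not exceed what the customer paid.
    if action.tool != "issue_refund":
        return None
    if action.args.get("amount", 0) > state["order_total"]:
        return "refund exceeds order total"
    return None


def cancel_only_if_unshipped(action: ProposedAction, state: dict) -> Optional[str]:
    # Hypothetical rule: shipped orders cannot be cancelled.
    if action.tool == "cancel_order" and state["shipped"]:
        return "cannot cancel a shipped order"
    return None


CONSTRAINTS = [refund_within_order_total, cancel_only_if_unshipped]


def violations(action: ProposedAction, state: dict, constraints=CONSTRAINTS) -> list:
    """Run every constraint; an empty list means the action may proceed."""
    reasons = []
    for check in constraints:
        reason = check(action, state)
        if reason is not None:
            reasons.append(reason)
    return reasons


# Assumed example state for one order.
order = {"order_total": 100.0, "shipped": True}
print(violations(ProposedAction("issue_refund", {"amount": 120.0}), order))
print(violations(ProposedAction("cancel_order", {}), order))
print(violations(ProposedAction("issue_refund", {"amount": 40.0}), order))
```

The checks never consult the model, so the same action against the same state always gets the same verdict. That is the property a screenshot cannot offer.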

The practical build order is dull and correct: map the workflow, define permissible actions, instrument the UI trace, validate state changes, and then add multimodal perception where it removes friction. If the system cannot verify what changed, better vision only helps it make cleaner-looking mistakes.
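
The build order above can be sketched in a few dozen lines of Python. Everything here is an assumption for illustration: the allowlist contents, the Step and Trace shapes, and the toy form state are hypothetical, and a real system would read state from the application rather than a dict. The sketch shows three pieces in order: reject actions outside the allowlist, record every step in a trace, and compare the observed state change with the expected one.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of permissible UI actions.
ALLOWED_ACTIONS = {"click", "type", "select"}


@dataclass
class Step:
    action: str
    target: str
    expected: dict  # state keys that should change, with their new values


@dataclass
class Trace:
    records: list = field(default_factory=list)


def changed_keys(before: dict, after: dict) -> dict:
    """Return only the keys whose values differ, with their new values."""
    keys = before.keys() | after.keys()
    return {k: after.get(k) for k in keys if before.get(k) != after.get(k)}


def run_step(step: Step, read_state, execute, trace: Trace) -> bool:
    # 1. Permissible actions: refuse anything outside the allowlist.
    if step.action not in ALLOWED_ACTIONS:
        trace.records.append((step, "rejected", {}))
        return False
    # 2. Snapshot, act, snapshot again.
    before = read_state()
    execute(step)
    after = read_state()
    # 3. Validate: the observed change must match the expected change exactly.
    changed = changed_keys(before, after)
    ok = changed == step.expected
    trace.records.append((step, "verified" if ok else "mismatch", changed))
    return ok


# Toy environment standing in for a real application (assumption).
state = {"status": "draft", "title": "Q3 report"}


def read_state() -> dict:
    return dict(state)  # copy, so later mutation cannot alter the snapshot


def execute(step: Step) -> None:
    if step.action == "click" and step.target == "submit_button":
        state["status"] = "submitted"


trace = Trace()
submitted = run_step(Step("click", "submit_button", {"status": "submitted"}),
                     read_state, execute, trace)
dragged = run_step(Step("drag", "title_field", {}), read_state, execute, trace)
missed = run_step(Step("click", "archive_button", {"status": "archived"}),
                  read_state, execute, trace)
print(submitted, dragged, missed)
```

The third step is the instructive one. The click itself is permitted, nothing in the state changes, and the validator reports a mismatch instead of assuming success. Perception slots in later as a better way to choose the target, not as a replacement for the check.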

Related: Browser-Use Agents After the Computer-Use Benchmarks.
