Multimodal Agents Are Still Missing the Workflow

Evidence base: OSWorld, VisualWebArena, OSWorld-Human, OpenAI CUA, Anthropic computer use, OpenCUA, tau-squared-bench, and CoVe.

Multimodal agents have made real progress. In October 2024, Anthropic said Claude 3.5 Sonnet scored 14.9% on OSWorld in the screenshot-only category, or 22.0% with more steps. OpenAI later reported CUA at 38.1% on OSWorld, 58.1% on WebArena, and 87.0% on WebVoyager. The open-source side has moved too: OpenCUA-72B reports 45.0% on OSWorld-Verified, 60.8% on ScreenSpot-Pro, and 37.4% on UI-Vision.

That is improvement. It is not a production story yet.

Key takeaways

Multimodal agents are better at perception and tool use, but action reliability still breaks on messy interface paths.
Workflow grounding matters more than one-off visual understanding once an agent touches real software state.
Verification is becoming the more useful frontier than another headline image-understanding score.

What multimodal agents improved

The jump is easiest to see in computer-use tests. OSWorld tests agents across 369 real computer tasks involving desktop apps, file operations, web apps, and multi-application workflows. VisualWebArena added 910 visually grounded web tasks and found the best VLM agents at 16.4% success against 88.7% human performance. These tests are imperfect, but they ask the right class of question: can the model turn pixels into useful actions?

The answer is less embarrassing than it was. That matters for the /models-frontiers/ track because multimodality is no longer just captioning, diagram reading, or video summary. It is becoming an interface layer for agents. The older Swarm Signal read, "When Models See and Speak," was about the gap between seeing and acting. The gap has narrowed. The failure mode has changed.

Why multimodal agents still miss workflows

The problem is that workflows punish almost-correct actions. A model can identify the right button, click the wrong duplicate, recover with a backtrack, then make a stale assumption three steps later. Visual perception helped. The task still failed.

OSWorld-Human makes this concrete. Its authors evaluated 16 computer-use agents, found leading systems still take 2.7 to 4.3 more steps than human reference trajectories, and tied planning, reflection, and judging calls to runtime bottlenecks. That is the uncomfortable part: extra cognition can make the agent slower without making the workflow safer.
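
One way to track this kind of inefficiency in your own traces is a step-overhead ratio. The sketch below assumes overhead is measured as agent steps divided by human reference steps, which may not match the paper's exact metric. The `Trajectory` type, the task names, and the step counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    agent_steps: int   # actions the agent actually took
    human_steps: int   # actions in the human reference trajectory
    succeeded: bool


def step_overhead(t: Trajectory) -> float:
    """Ratio of agent steps to human reference steps (1.0 means parity)."""
    if t.human_steps <= 0:
        raise ValueError("human reference must have at least one step")
    return t.agent_steps / t.human_steps


def mean_overhead(trajs, successes_only=True):
    """Average overhead; by default failed runs are excluded."""
    pool = [t for t in trajs if t.succeeded or not successes_only]
    if not pool:
        return None
    return sum(step_overhead(t) for t in pool) / len(pool)


# Illustrative numbers only, not benchmark data.
runs = [
    Trajectory("rename-file", agent_steps=12, human_steps=4, succeeded=True),
    Trajectory("export-pdf", agent_steps=10, human_steps=5, succeeded=True),
    Trajectory("merge-sheets", agent_steps=30, human_steps=6, succeeded=False),
]
```

Reporting the ratio separately for successful and failed runs keeps a slow success from being confused with a long failure.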

The same pattern shows up outside GUIs. MCP-Bench connects agents to 28 MCP servers and 250 tools, then tests cross-tool workflows rather than isolated API calls. tau-squared-bench tests a dual-control setting where both the agent and the user act in a shared environment, and performance drops when agents have to guide users. These are workflow-grounding tests, not eye tests.

That is why computer-use agents should be read with agent evals and agent verification, not as a branch of vision alone.

Operator takeaway

Treat multimodal agents as workflow components, not autonomous staff. Start with bounded tasks where the UI state is observable, the allowed actions are explicit, and every state-changing action can be checked before execution.
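
A minimal sketch of what "checked before execution" could look like, assuming the UI state is available as a dictionary of element ids to properties. The `Action` type, the `ALLOWED` set, and the `enabled` property are hypothetical names chosen for illustration, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    target: str
    changes_state: bool = False


# Hypothetical allowlist: anything not named here is rejected outright.
ALLOWED = {"read_field", "click", "submit_form"}


def precheck(action: Action, ui_state: dict) -> tuple:
    """Return (ok, reason) before anything is executed."""
    if action.name not in ALLOWED:
        return False, f"action '{action.name}' is not allowed"
    if action.target not in ui_state:
        return False, f"target '{action.target}' is not in the observed UI state"
    # State-changing actions also require the target to be enabled.
    if action.changes_state and not ui_state[action.target].get("enabled", False):
        return False, f"target '{action.target}' is not enabled"
    return True, "ok"


ui = {"save_button": {"enabled": True}, "delete_button": {"enabled": False}}
print(precheck(Action("submit_form", "save_button", changes_state=True), ui))
print(precheck(Action("click", "delete_button", changes_state=True), ui))
```

The gate is deliberately boring: it never consults the model, so a rejected action is rejected for a reason a human can read in the trace.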

CoVe points in the right direction. Its March 2026 paper reports CoVe-4B at 43.0% success in Airline and 59.4% in Retail on tau-squared-bench by using task constraints as deterministic verifiers. The lesson is not that small models suddenly beat frontier labs. The lesson is that explicit constraints improve tool use because they give the agent something firmer than a screenshot and a hope.
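
To make "task constraints as deterministic verifiers" concrete, here is a small sketch of the general idea. It is not CoVe's implementation. The tool names (`get_order`, `issue_refund`), the refund cap, and the helper functions are assumptions invented for this example.

```python
from typing import Callable, NamedTuple


class ToolCall(NamedTuple):
    tool: str
    args: dict


def called_before(first: str, second: str) -> Callable:
    """Constraint: `second` must not appear unless `first` came earlier."""
    def check(calls):
        seen_first = False
        for c in calls:
            if c.tool == first:
                seen_first = True
            elif c.tool == second and not seen_first:
                return False
        return True
    return check


def arg_at_most(tool: str, arg: str, limit: float) -> Callable:
    """Constraint: every call to `tool` keeps `arg` at or below `limit`."""
    def check(calls):
        return all(c.args.get(arg, 0) <= limit for c in calls if c.tool == tool)
    return check


def verify(calls, constraints: dict) -> list:
    """Return the names of violated constraints; an empty list means pass."""
    return [name for name, check in constraints.items() if not check(calls)]


# Hypothetical retail-style rules.
constraints = {
    "lookup_before_refund": called_before("get_order", "issue_refund"),
    "refund_cap": arg_at_most("issue_refund", "amount", 100.0),
}

good = [
    ToolCall("get_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 40.0}),
]
bad = [ToolCall("issue_refund", {"order_id": "A1", "amount": 250.0})]
```

Because each check is a plain function over the call trace, the same trajectory always gets the same verdict, which is the property a screenshot-based judgment cannot offer.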

The practical build order is dull and correct: map the workflow, define permissible actions, instrument the UI trace, validate state changes, and then add multimodal perception where it removes friction. If the system cannot verify what changed, better vision only helps it make cleaner-looking mistakes.
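
The "validate state changes" step can be sketched as a diff between application state before and after an action, compared against what the step was supposed to change. This assumes the state can be read as a flat dictionary. The function names and the sample record are hypothetical.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Map each changed key to its (old, new) pair; missing keys read as None."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


def validate_change(before: dict, after: dict, expected: dict) -> list:
    """Return a list of problems; empty means the observed change matches."""
    observed = diff_state(before, after)
    problems = []
    for key, want in expected.items():
        if key not in observed:
            problems.append(f"expected change to '{key}' did not happen")
        elif observed[key][1] != want:
            problems.append(f"'{key}' became {observed[key][1]!r}, expected {want!r}")
    # Anything that changed but was not expected is flagged as a side effect.
    for key in sorted(observed.keys() - expected.keys()):
        problems.append(f"unexpected change to '{key}'")
    return problems


before = {"status": "draft", "owner": "ana"}
after_ok = {"status": "sent", "owner": "ana"}
after_bad = {"status": "sent", "owner": "bob"}
```

Flagging unexpected changes matters as much as confirming expected ones: a click on the wrong duplicate button often shows up only as a side effect elsewhere in the state.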

Source trail

Research papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments - 369 tasks, 72.36% human baseline, 12.24% initial best model.
VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks - 910 tasks, 16.4% best VLM success, 88.7% human performance.
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents - 16 agents, 2.7 to 4.3 more steps than human paths.
tau-squared-bench: Evaluating Conversational Agents in a Dual-Control Environment - shared-state user-agent tasks.
CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification - CoVe-4B at 43.0% Airline and 59.4% Retail.

Vendor and project claims

Computer-Using Agent - OpenAI, 2025; 38.1% OSWorld, 58.1% WebArena, 87.0% WebVoyager.
Introducing computer use - Anthropic, 2024; 14.9% screenshot-only OSWorld, 22.0% with more steps.
OpenCUA: Open Foundations for Computer-Use Agents - OpenCUA project; 45.0% OSWorld-Verified, 60.8% ScreenSpot-Pro, 37.4% UI-Vision.

Related Swarm Signal analysis

Execution tooling is separate