Multimodal Agents Score 40% Where Humans Score 72%
Every frontier lab now ships models that see, hear, and read. The assumption is that more modalities mean more capable agents. The benchmarks tell a different story.
On OSWorld, the most realistic test of agents operating in real computer environments, the best multimodal agents complete roughly 40% of tasks. Humans score 72% on the same tasks. That gap isn't closing as fast as the marketing suggests.
Where the Eyes Fail
The bottleneck isn't understanding. It's grounding: the act of mapping visual perception to precise action.
GPT-4V connected to a GUI clicks the wrong button because it misidentifies an icon. It misses pop-up dialogs. It mistakes a loading spinner for a static element. These are not reasoning failures. They're perception-to-action failures, and they compound across multi-step tasks. A January 2026 survey of agentic architectures found that nondeterminism in long-horizon tasks makes evaluation itself unreliable, because agents fail differently each time they run.
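That nondeterminism has a practical consequence for anyone measuring these systems: a single pass/fail run tells you little. Here is a minimal sketch of a repeated-trial harness; the function names and the flaky example task are hypothetical, assumed for illustration, not drawn from any cited benchmark.

```python
import random
import statistics
from typing import Callable

def evaluate_with_retrials(run_task: Callable[[], bool], k: int = 10) -> dict:
    """Run a nondeterministic agent task k times and summarize the outcomes.

    A single trial reports pass/fail; k trials expose the run-to-run
    variance that makes one-shot evaluation of long-horizon agents unreliable.
    """
    outcomes = [run_task() for _ in range(k)]
    return {
        "pass_rate": sum(outcomes) / k,       # mean success over k trials
        "always_passes": all(outcomes),       # reliable under repetition
        "ever_passes": any(outcomes),         # succeeds at least once
        "spread": statistics.pstdev(float(o) for o in outcomes),
    }

# Example: a task that succeeds on roughly 40% of runs, echoing the OSWorld number.
print(evaluate_with_retrials(lambda: random.random() < 0.4, k=20))
```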
Text-only agents that parse the DOM or raw HTML directly sidestep the problem entirely. They trade generality for reliability. For structured web tasks, a text agent reading the page source often outperforms a multimodal agent looking at the screenshot.
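For illustration, a minimal sketch of the text-only approach using Python's standard-library html.parser; the notion of "actionable elements" here is an assumption for this sketch, not any particular agent framework's API.

```python
from html.parser import HTMLParser

class ActionableElements(HTMLParser):
    """Collect elements a text-only agent can act on, straight from page
    source, with no screenshot and no visual grounding step."""
    ACTIONABLE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ACTIONABLE:
            attr_map = dict(attrs)
            self.elements.append({
                "tag": tag,
                "id": attr_map.get("id"),
                "name": attr_map.get("name"),
                "href": attr_map.get("href"),
            })

page = '<form><input id="q" type="text"><button id="go">Search</button></form>'
parser = ActionableElements()
parser.feed(page)
for el in parser.elements:
    print(el)  # exact, unambiguous targets: no icon to misidentify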
The Open-Weight Surprise
One development worth watching: Qwen3-VL-235B, an open-weight model from Alibaba, now matches or surpasses GPT-5 and Gemini-2.5-Pro on multimodal benchmarks. The smaller Qwen2.5-Omni-7B is already deployed in BMW vehicles for in-car assistance. Open-weight multimodal models reached parity with proprietary ones faster than most forecasts predicted.
This matters because multimodal agent deployment has been gated partly by cost. Running vision models at inference is expensive. Open-weight alternatives that teams can host and optimize internally change the economics.
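A toy back-of-envelope comparison makes the point; every number below is an illustrative placeholder, not a measured token count or a quoted price.

```python
# Back-of-envelope inference cost per agent task.
# ALL numbers are illustrative placeholders, not real vendor pricing.
SCREENSHOT_TOKENS = 1500     # assumed tokens to encode one GUI screenshot
DOM_TEXT_TOKENS = 300        # assumed tokens for the relevant page source
PRICE_PER_1K_TOKENS = 0.01   # hypothetical hosted-model price, USD
STEPS_PER_TASK = 12          # multi-step tasks multiply the difference

vision_cost = SCREENSHOT_TOKENS * STEPS_PER_TASK * PRICE_PER_1K_TOKENS / 1000
text_cost = DOM_TEXT_TOKENS * STEPS_PER_TASK * PRICE_PER_1K_TOKENS / 1000
print(f"vision agent: ${vision_cost:.3f} per task")
print(f"text agent:   ${text_cost:.3f} per task")
# Self-hosting an open-weight model swaps the per-token price for a fixed
# GPU cost, which is exactly where the economics shift.
```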
The Evaluation Problem Nobody Solved
Microsoft published a formal Multimodal Agent Score standard in February 2026. The fact that a standardized evaluation metric needed to be invented this late tells you how immature the field remains. AgentArch, an enterprise-focused benchmark, found that few existing evaluation suites target real enterprise workflows at all. Most benchmarks test toy tasks.
The International AI Safety Report flagged a related concern: hallucination rates of 20-30% in language models persist and compound in multimodal settings where inputs are noisy or ambiguous.
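Two lines of arithmetic show why "compound" matters; the 25% figure is the midpoint of the range above, and the ten-step horizon is an assumption.

```python
p_error = 0.25   # midpoint of the 20-30% hallucination range cited above
steps = 10       # assumed length of a multi-step agent task
print(f"{(1 - p_error) ** steps:.1%}")  # ~5.6%: ten clean steps in a row is rare
```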
Meanwhile, the production numbers remain sobering. Gartner projects that 40% of enterprise apps will embed AI agents by the end of 2026, up from less than 5% in 2025. But fewer than one in four organizations experimenting with agents have successfully scaled them to production. Adding vision doesn't simplify that transition.
What This Changes
The multimodal agent story for 2026 isn't "agents can now see." They could see in 2025. The story is that seeing and acting remain fundamentally different capabilities, and the gap between them is wider than most product roadmaps acknowledge.
For practitioners, the calculus is straightforward. If your agent operates in a structured digital environment with accessible DOM or APIs, a text-only agent will be more reliable and cheaper. Multimodal perception is worth the cost when the input is genuinely unstructured: scanned documents, physical environments, interfaces without programmatic access.
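That calculus can be written down as a routing rule. A sketch under those assumptions; the environment probe and the agent labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    has_api: bool   # programmatic endpoints available
    has_dom: bool   # page source or accessibility tree available
    # Anything else (scanned docs, physical scenes) is unstructured input.

def choose_agent(env: Environment) -> str:
    """Route to the cheapest agent that can act reliably.

    Hypothetical rule following the calculus above: pay for multimodal
    perception only when no structured interface exists.
    """
    if env.has_api:
        return "text-agent/api"   # most reliable: act on endpoints directly
    if env.has_dom:
        return "text-agent/dom"   # parse page source, skip visual grounding
    return "multimodal-agent"     # genuinely unstructured input

print(choose_agent(Environment(has_api=False, has_dom=True)))  # text-agent/dom
```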
The Agentic AI Foundation, formed under the Linux Foundation in late 2025, is standardizing agent infrastructure through MCP and related protocols. But infrastructure for how agents communicate doesn't solve the harder question of whether they can reliably act on what they see.
Forty percent task completion is progress. It's also an honest number in a field that doesn't produce enough of them.
Related reading:
- When Models See and Speak: The Multimodal Agent Arrives
- Robots With Reasoning: When Language Models Meet the Physical World
- Scaling Laws Explained for Practitioners
- Agent Reliability Scores Are Getting Worse, Not Better