LISTEN TO THIS ARTICLE

Nobody Knows If Deployed AI Agents Are Safe

The 2025 AI Agent Index just cataloged 30 deployed agentic AI systems, and the finding that should alarm everyone isn't about capability. It's about documentation. Most of these agents ship with incomplete or entirely absent safety disclosures. We're not talking about experimental research prototypes. These are production systems handling financial transactions, managing personal calendars, writing and executing code, and interacting with external APIs on behalf of real users. And the companies deploying them can't consistently tell you what guardrails are in place.

The Index That Exposes the Gap

The 2025 AI Agent Index, published by Staufer, Feng, Wei, and collaborators, is the most comprehensive attempt yet to systematically document what's actually deployed in the agent space. The team surveyed 30 commercially deployed agentic systems — chat applications with agentic tools, browser-based agents, and enterprise workflow platforms — tracking 45 information fields across their origins, design architectures, capabilities, and safety features. The picture it paints isn't reassuring.

Agents are proliferating fast. They're booking flights, managing codebases, conducting web research, and orchestrating multi-step workflows with minimal human oversight. But the safety documentation across these systems is wildly inconsistent. The Index found that 25 out of 30 agents disclose no internal safety results, and 23 out of 30 have no third-party testing information. Only four agents provide dedicated safety documentation. Some vendors publish detailed model cards and safety evaluations. Others ship agents with nothing more than a marketing page and a terms-of-service document that mentions "responsible AI" once. There's no shared taxonomy for what "safe" even means in this context, no agreed-upon set of properties an agent should demonstrate before it's allowed to touch a user's email or bank account.

Think of it like this: we're building an entire airline industry where each manufacturer gets to define its own crash-test standards, run them internally, and publish only the results it likes. That's where agent safety evaluation sits right now.

I've now read the Index cover to cover alongside four adjacent papers published in the same month, and the convergence is striking. Every single one identifies the same core problem from a different angle: the gap between benchmark performance and real-world reliability is massive, and nobody has a credible plan to close it.

The attack surface here is qualitatively different from chatbot-era prompt injection.

Benchmarks Are Testing the Wrong Thing

Rabanser, Kapoor, Kirgis, and their collaborators at Princeton make the case bluntly in "Towards a Science of AI Agent Reliability." Rising accuracy scores on standard benchmarks suggest rapid progress. Agents are smashing leaderboard numbers on tool-use evaluations and multi-step reasoning tasks. But they keep failing in production. The math doesn't lie.

The paper identifies a fundamental mismatch: benchmarks test agents against well-specified tasks with clear success criteria, while real-world deployment is dominated by ambiguity, partial information, and edge cases that no benchmark author anticipated. An agent that scores 92% on a structured tool-use benchmark can still catastrophically mishandle a request it's never seen before, because the benchmark never tested its ability to recognize its own limits.

This connects directly to work by Sirdeshmukh and Wetter on "Implicit Intelligence," which tackles the problem of underspecification. When humans talk to AI agents, they leave enormous amounts unsaid. They assume shared context, unstated constraints, and common-sense inferences that current agents handle poorly. A user who says "book me a flight to Chicago next week" expects the agent to know they probably mean O'Hare, not Midway, that they prefer aisle seats, that they don't want a 5am departure, and that the corporate travel policy caps airfare at $600. None of that is in the prompt.

Current evaluation frameworks don't test for this at all. They test whether an agent can use a flight-booking API correctly, which is the easy part. The hard part is inferring what the user actually wants. Nobody tests that in production.

The Security Dimension Everyone's Underweighting

While the safety evaluation conversation focuses on reliability and alignment, a parallel threat is growing that most frameworks barely address. Wang, Zhang, and colleagues published AdapTools, demonstrating adaptive tool-based indirect prompt injection attacks against agentic LLMs. Their approach exploits the exact integration points that make agents useful: connections to external data services, APIs, and protocols like MCP.

The attack surface here is qualitatively different from chatbot-era prompt injection. When an agent can read your email, execute code, and call external APIs, a successful injection doesn't just produce a wrong answer. It can exfiltrate data, execute unauthorized transactions, or propagate compromised instructions to other agents in a chain. AdapTools showed that adaptive attacks, those that adjust their injection strategy based on the agent's behavior, achieve significantly higher success rates than static injection attempts.

The 2025 AI Agent Index found that security evaluations are among the least consistently documented safety features across deployed systems. Some agents mention input filtering. Fewer describe output monitoring. Almost none document testing against adaptive adversarial attacks. This isn't a theoretical concern anymore. As we covered in our breakdown of the OWASP Top 10 for agent security, these attack vectors are well-understood. The tools to exploit them are getting more sophisticated. The defenses aren't keeping pace.

The Trust Problem Gets Weirder

Here's a wrinkle that makes the evaluation problem harder than it looks. Bo, Mok, and Anderson found that language models exhibit inconsistent biases when processing information from algorithmic agents versus human experts. In some contexts, models defer to algorithmic sources; in others, they privilege human-generated content. The pattern isn't predictable.

This matters because modern agentic systems often involve LLMs arbitrating between multiple information sources. If a coding agent queries both a documentation API and a Stack Overflow scraper, the underlying model's inconsistent source preferences can silently skew outputs. No current evaluation framework tests for this kind of meta-bias. We're evaluating agents as if they're monolithic decision-makers when they're actually mediating between multiple information streams with hidden and inconsistent preferences about which streams to trust.

The part that actually worries me is that this bias isn't stable. It shifts depending on the domain, the phrasing of the query, and the specific model version. You can't just characterize it once and compensate. It's a moving target, which means any safety evaluation that doesn't continuously re-test for source-preference drift will produce stale results.

Labs like Anthropic and OpenAI run internal red-teaming exercises and publish responsible scaling policies.

What Would a Real Framework Look Like

The current approach to agent safety evaluation is fragmented across three mostly disconnected tracks. Labs like Anthropic and OpenAI run internal red-teaming exercises and publish responsible scaling policies. Academic groups build benchmarks that test narrow capability slices. And policy bodies like NIST develop risk management frameworks that are comprehensive on paper but vague on implementation specifics for agentic systems.

What's missing is the connective tissue. A credible evaluation framework for deployed agents would need to test at minimum: reliability under underspecified instructions, graceful degradation when facing unfamiliar tasks, resistance to adaptive adversarial attacks through tool interfaces, consistency of behavior across information sources with different trust profiles, and transparency about what the agent can and cannot do.

The AI Agent Index is a necessary first step because you can't evaluate what you haven't cataloged. But cataloging isn't evaluating. The OpenPort Protocol paper proposes a security governance specification for agent tool access, which addresses one piece. The "Right to History" work on verifiable agent execution traces addresses another. None of them, individually, constitute a framework. And the accountability gaps we've previously covered remain wide open.

The EU AI Act will force some of this work to happen for systems deployed in Europe. But regulatory timelines and agent deployment timelines aren't synchronized. Agents are shipping now. Compliance frameworks won't be enforced until August 2026 at the earliest.

What This Actually Changes

The honest assessment: the 2025 AI Agent Index is the most useful document we've gotten this year for understanding the state of deployed agent safety, precisely because it reveals how little systematic documentation exists. It changes the conversation from "are agents safe?" to "we literally don't know, and here's the evidence for that ignorance."

For practitioners, the immediate implication is defensive. If you're deploying agents in production, you can't rely on vendor safety claims that aren't backed by documented evaluation methodology. You need your own testing regime, and it needs to go beyond the happy-path demonstrations that benchmark suites provide. The reliability science paper from Rabanser and colleagues offers a reasonable starting taxonomy for what to test.

For the field, the bigger shift is conceptual. Agent evaluation can't be a pre-deployment checkbox. Agents interact with dynamic environments, face adaptive adversaries, and serve users who communicate with radical underspecification. Static evaluations will always be stale by the time the agent hits production. The industry needs continuous monitoring frameworks that treat safety evaluation as an ongoing process, not a gate you pass once. We're not close to having those. Acknowledging that gap clearly is worth more right now than pretending any existing framework has it covered.

Sources

Research Papers:

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems, Staufer, Feng, Wei et al. (2026)
Towards a Science of AI Agent Reliability, Rabanser, Kapoor, Kirgis et al. (2026)
Implicit Intelligence, Evaluating Agents on What Users Don't Say, Sirdeshmukh, Wetter (2026)
AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs, Wang, Zhang, Zhang et al. (2026)
Language Models Exhibit Inconsistent Biases Towards Algorithmic Agents and Human Experts, Bo, Mok, Anderson (2026)
OpenPort Protocol: A Security Governance Specification for AI Agent Tool Access, Zhu, Wang, Wang et al. (2026)

Industry / Policy:

NIST AI Risk Management Framework, NIST
Anthropic Research & Safety Updates, Anthropic
Stanford HAI Policy Research, Stanford Institute for Human-Centered AI

Related Swarm Signal Coverage: