The AI Agent Security Playbook
By Tyler Casey · AI-assisted research & drafting · Human editorial oversight
@getboski
NVD describes CVE-2025-32711 as AI command injection in Microsoft 365 Copilot that can disclose information over a network. Aim Labs/Cato calls the disclosed chain EchoLeak. In that EchoLeak write-up, the chain starts with a crafted email and uses Copilot context plus auto-fetched URLs as an exfiltration path. The NVD record treats the issue as high severity from multiple scoring angles. Aim Labs/Cato describes EchoLeak as zero-click and says Microsoft confirmed no customers were affected. This article focuses on that intersection: untrusted content, private context, and outbound channels.
Recent public data points:
As of June 2026, treat the incident counts and attack catalogs below as current evidence, not stable baselines.
- Stanford HAI's 2025 AI Index reports a rising number of AI-related incidents in 2024 compared with 2023.
- Gartner's August 2025 forecast suggests task-specific AI agents will become common in enterprise applications by the end of 2026.
- BigID's June 2025 release says only a small minority of surveyed organizations had an advanced AI security strategy or defined AI TRiSM framework.
Agents Are a Different Threat Surface
A chatbot takes a prompt and returns text. An agent takes a prompt, reasons about it, calls tools, reads files, executes code, sends emails, and queries databases. The attack surface isn't a conversation. It's every tool the agent can touch.
OWASP recognized this distinction in December 2025 by publishing a dedicated Top 10 for Agentic Applications, separate from its existing LLM Top 10. The agentic list emphasizes risks that become much sharper when models can plan and call tools: agent goal hijacking, tool misuse and exploitation, memory and context poisoning, insecure inter-agent communication, cascading failures across agent networks, and agents that drift from intended behavior. MITRE ATLAS added new agent-specific attack techniques in October 2025, including AI agent context poisoning, memory manipulation, and exfiltration via agent tool invocation.
The common thread: agents have permissions that chatbots don't. Permissions create attack surfaces. For a framework on how to constrain those permissions, see the guardrails guide.
Prompt Injection Remains the Top Threat
OWASP treats prompt injection as a top vulnerability for LLM applications in 2025, and it's worse for agents because agents act on their instructions rather than just generating text.
Anthropic published quantified prompt injection failure rates in their February 2025 system card. The numbers depend heavily on context and safeguards. In a constrained coding environment, injection did not land in the reported tests. In a GUI-based system without safeguards, success was much higher. With safeguards enabled on a browser-based system, success dropped sharply. That spread is a good illustration of the value of defense-in-depth.
Indirect injection is the bigger concern for agents. Attackers embed instructions in documents, web pages, or emails that the agent processes. The GitHub Copilot RCE (CVE-2025-53773) demonstrated this precisely: malicious prompts in source code comments instructed Copilot to modify VS Code settings, enable auto-approve mode, and achieve arbitrary code execution. The injection was self-replicating: during code refactoring, the compromised instructions propagated to other files. The extension had been installed at massive scale.
A financial services AI banking assistant was exploited through prompt injection to bypass transaction verification, resulting in fraudulent transfers. A state-backed threat actor manipulated Claude Code to conduct AI-orchestrated espionage across multiple organizations.
Security researcher Michael Bargury proposed the "Rule of Two" as the simplest design heuristic: an agent session should have at most two of three properties at the same time: processing untrustworthy inputs, accessing sensitive data, and communicating externally. If your agent does all three, it's a prompt injection waiting to happen. For a deeper analysis of how indirect injection attacks work in agent pipelines, see the dedicated coverage.
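The Rule of Two can be expressed as a configuration check that runs before a session starts. The following is a minimal sketch, not a reference implementation: the SessionCapabilities class and its field names are hypothetical and simply mirror the three properties named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    """Hypothetical description of what one agent session is allowed to do."""
    processes_untrusted_input: bool  # e.g. reads inbound email or web pages
    accesses_sensitive_data: bool    # e.g. private files or customer records
    communicates_externally: bool    # e.g. sends email or makes outbound HTTP calls


def violates_rule_of_two(caps: SessionCapabilities) -> bool:
    """Return True when all three risky properties are enabled at once."""
    enabled = sum([
        caps.processes_untrusted_input,
        caps.accesses_sensitive_data,
        caps.communicates_externally,
    ])
    return enabled > 2


# A triage agent that reads inbound mail and a private inbox but cannot
# send anything out stays within the heuristic.
triage = SessionCapabilities(True, True, False)

# Giving the same session an outbound channel enables all three properties.
triage_with_send = SessionCapabilities(True, True, True)

print(violates_rule_of_two(triage))
print(violates_rule_of_two(triage_with_send))
```

A check like this is only as good as the capability declarations it reads, so it works best when those declarations are derived from the tools actually wired into the session rather than written by hand.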
Tool Misuse and Excessive Agency
The Amazon Q Developer incident in July 2025 demonstrated how tool vulnerabilities compound with agent capabilities. An attacker used an unverified GitHub account to gain admin access to the aws-toolkit-vscode repository and injected malicious prompts into the Amazon Q Developer Extension. The compromised version instructed the AI to delete the file system, clear user config, discover AWS profiles, and delete S3 buckets, EC2 instances, and IAM users via AWS CLI. The extension had been installed at massive scale, and the malicious version shipped without tamper detection.
Trend Micro found a classic SQL injection vulnerability in Anthropic's reference SQLite MCP server implementation, which had been forked widely before being archived. In agent environments, SQL injection becomes a springboard for stored-prompt injection: agents treat database content as trusted, so embedded prompts in query results can trigger email sends, cloud API calls, and lateral movement. The MCP protocol's security model hasn't caught up with its adoption, making every MCP server a potential attack vector.
OWASP distinguishes between two related risks. Excessive Agency is when an agent is granted too many permissions. Tool Misuse is when an agent operates within authorized privileges but applies a legitimate tool unsafely. Both require the same defense: treat tool access like production IAM. Start from deny-all. Allowlist only the specific commands, directories, and API endpoints each agent needs. Use short-lived credentials scoped per task.
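A deny-all starting point with an explicit allowlist might look like the sketch below. The ToolPolicy class, the command names, and the /workspace/docs directory are illustrative assumptions, and a real deployment would also need to resolve symlinks and enforce the policy outside the agent process.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-agent policy: anything not listed is denied."""
    allowed_commands: frozenset = field(default_factory=frozenset)
    allowed_dirs: tuple = ()


def is_allowed(policy: ToolPolicy, command: str, path: str) -> bool:
    """Deny by default; allow only listed commands inside listed directories."""
    if command not in policy.allowed_commands:
        return False
    target = PurePosixPath(path)
    # Reject relative paths and any ".." segment so a request cannot
    # climb out of an allowed directory.
    if not target.is_absolute() or ".." in target.parts:
        return False
    return any(target.is_relative_to(PurePosixPath(d)) for d in policy.allowed_dirs)


# Example policy for a documentation agent: read-only, one directory.
docs_agent = ToolPolicy(
    allowed_commands=frozenset({"read_file"}),
    allowed_dirs=("/workspace/docs",),
)

print(is_allowed(docs_agent, "read_file", "/workspace/docs/guide.md"))
print(is_allowed(docs_agent, "delete_file", "/workspace/docs/guide.md"))
print(is_allowed(docs_agent, "read_file", "/workspace/docs/../secrets.env"))
```

An empty ToolPolicy() permits nothing, which is the point: every capability has to be added deliberately.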
Data Flows Through and Out of Agents
GenAI tools are now among the biggest uncontrolled channels for corporate data exfiltration, and public reporting continues to point to substantial unauthorized data movement through them. A large share of files uploaded into GenAI tools contain PII or payment card data, and many employees paste data into GenAI tools through unmanaged personal accounts.
The risk compounds for agents that process documents, emails, and databases. Zero-click exfiltration attacks like EchoLeak hide instructions in documents and images that trigger data extraction without user interaction. MITRE ATLAS documents using agent "write" tools such as email composers and CRM updaters to encode sensitive data into outbound tool parameters.
Training data extraction is a separate vector. Fine-tuning attacks have extracted training examples from ChatGPT in a notable share of attempts, reconstructing proprietary documents. Membership inference attacks on Google-trained models have shown high accuracy in determining whether specific data was used in training, according to NIST.
The defense starts with data minimization: design agent workflows to capture only what's strictly necessary. Use tools like Microsoft Presidio for automatic PII detection and redaction in agent inputs and outputs. Never use real PII in development or testing environments. And implement output filtering that catches sensitive data patterns before the agent can send them anywhere. Shadow AI breaches can cost materially more than traditional incidents, and many organizations that experienced AI-related breaches still lacked basic access controls.
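As a rough illustration of output filtering, the sketch below redacts email addresses and Luhn-valid card numbers before text leaves the agent. It uses only standard-library regular expressions as a stand-in and does not show Presidio's API; the patterns and placeholder strings are assumptions chosen for the example and would miss many real-world formats.

```python
import re

# Illustrative patterns only; production filters need far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace card numbers and email addresses with fixed placeholders."""
    def card_sub(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED_CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(card_sub, text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


# 4111 1111 1111 1111 is a widely used test card number that passes Luhn.
print(redact("Send card 4111 1111 1111 1111 to bob@example.com"))
```

The filter belongs on the path between the agent and every outbound tool, so that a successful injection still has to get its payload past it.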
Multi-Agent Systems Multiply the Risk
Researchers have demonstrated self-replicating prompt infections that propagate between agents through compromised inter-agent communication. In one setup using GPT-4o, harmful actions including data exfiltration, scam creation, and content manipulation succeeded at high rates. A separate study showed AI worm behavior: a prompt injected into email triggers an LLM to replicate itself and exfiltrate data, and the infection spreads when the email assistant drafts new messages.
The M-Spoiler framework demonstrated that a single malicious agent can manipulate collective decisions in multi-agent debates. Contaminated shared context is equally dangerous: an attack on one agent's RAG memory compromises downstream decisions across the entire system. Small errors amplify through collaborative agents, with repeated reads and writes reinforcing the corrupted state.
The practical defense for multi-agent systems: validate inter-agent communications with the same rigor you apply to external inputs. Isolate memory and context between agents. Monitor token and secret proliferation across the system, because one compromised credential cascades through every agent that shares it. For the empirical evidence on how deception propagates through agent networks, see the multi-agent deception analysis. The alignment implications are equally relevant since agents that understand instructions well enough to follow them can also understand them well enough to subvert them.
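One way to treat inter-agent messages like external input is to authenticate the sender and check the message shape before acting on it. The sketch below assumes a hypothetical setup in which each agent has its own HMAC key; the agent names, message types, and hard-coded demo keys are placeholders, and real keys would come from a secret store.

```python
import hashlib
import hmac
import json

# Hypothetical per-agent keys, hard-coded here for demonstration only.
AGENT_KEYS = {
    "planner": b"planner-demo-key",
    "researcher": b"researcher-demo-key",
}
ALLOWED_TYPES = {"task_request", "task_result"}


def sign(sender: str, msg_type: str, body: str) -> dict:
    """Serialize a message and attach an HMAC tag computed with the sender's key."""
    payload = json.dumps({"sender": sender, "type": msg_type, "body": body}, sort_keys=True)
    tag = hmac.new(AGENT_KEYS[sender], payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify(envelope: dict) -> dict:
    """Return the parsed message, or raise ValueError if any check fails."""
    payload = envelope.get("payload")
    if not isinstance(payload, str):
        raise ValueError("payload must be a string")
    try:
        msg = json.loads(payload)
        key = AGENT_KEYS[msg["sender"]]
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("malformed message or unknown sender") from exc
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    supplied = str(envelope.get("tag", ""))
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        raise ValueError("signature mismatch")
    if msg.get("type") not in ALLOWED_TYPES or not isinstance(msg.get("body"), str):
        raise ValueError("message fails schema check")
    return msg


envelope = sign("planner", "task_request", "summarize the quarterly report")
print(verify(envelope)["body"])
```

Signing proves which agent sent a message, not that its content is safe. A compromised but legitimate agent can still sign a poisoned message, so the body should still pass through the same injection filtering as any external input.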
The Defense Playbook
OWASP's AI Agent Security Cheat Sheet provides the structure. Here's what it looks like in practice.
Input and output validation at every boundary. Validate and sanitize all external inputs and inter-agent communications. Apply content filtering for known injection patterns. On the output side, validate agent outputs before execution and filter for sensitive data leakage. Use structured outputs with schema validation whenever possible. These checks stop many injection attempts and accidental data leaks, but pattern matching cannot catch everything, so treat them as one layer rather than a complete defense.
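Schema validation of structured output can be sketched with the standard library alone. In the example below, the tool names and argument schemas are hypothetical; teams often use a dedicated validation library instead, but the principle is the same: reject anything that does not match a known shape before it reaches an executor.

```python
import json

# Hypothetical schemas: tool name -> required argument names and exact types.
TOOL_SCHEMAS = {
    "search_docs": {"query": str, "limit": int},
    "read_file": {"path": str},
}


def parse_tool_call(raw: str) -> dict:
    """Parse model output and reject anything that does not match a known schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise ValueError("expected exactly the keys 'tool' and 'args'")
    tool, args = call["tool"], call["args"]
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        raise ValueError("unknown tool")
    schema = TOOL_SCHEMAS[tool]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError("unexpected or missing arguments")
    for name, expected_type in schema.items():
        # Exact type match, so that True is not accepted where an int is expected.
        if type(args[name]) is not expected_type:
            raise ValueError(f"argument {name!r} has the wrong type")
    return call


print(parse_tool_call('{"tool": "search_docs", "args": {"query": "vpn setup", "limit": 5}}'))
```

Rejecting unexpected extra arguments matters as much as checking required ones, since an injected instruction may try to smuggle additional parameters into an otherwise valid call.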
Sandboxed execution. Production-grade options range from Docker containers (baseline) to gVisor (user-space kernel with system call interception), Kata Containers (VM-level isolation with container ergonomics), and Firecracker MicroVMs (strongest isolation with fast boot time). The isolation boundary prevents a compromised agent from accessing the host system or other agents' resources.
Least privilege for everything. Give agents access only to tools they absolutely need. Use short-lived credentials with limited scope per task. Require explicit confirmations for high-risk actions like data deletion, financial transactions, and external communications. Microsoft's Entra Agent ID system gives each agent its own identity for visibility and auditability.
Behavioral monitoring. Define normal patterns per agent role. Alert on deviation: unexpected network connections, excessive API calls, unusual resource consumption, permission denials, and policy violations. Log every tool call, every API request, with immutable audit trails. The four pillars from OWASP's agentic mitigation framework: access verification for every tool invocation, behavioral monitoring for anomalies, explicit operational boundaries, and tamper-proof execution logs.
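A hash chain is one common way to make a tool-call log tamper-evident. The sketch below is an in-memory illustration under that assumption; the AuditLog class and its field names are hypothetical, and a production log would also be written to append-only storage outside the agent's reach.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _digest(record: dict) -> str:
    """Stable SHA-256 over a record, using sorted keys for determinism."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Hypothetical in-memory log in which each entry commits to the previous one."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, agent: str, tool: str, args: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        record = {"agent": agent, "tool": tool, "args": args, "prev": prev_hash}
        record["hash"] = _digest(record)  # hash covers every field except itself
        self.entries.append(record)

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("support-agent", "read_file", {"path": "/workspace/docs/faq.md"})
log.append("support-agent", "send_email", {"to": "customer@example.com"})
print(log.verify())
```

The chain detects edits but does not prevent them: an attacker who can rewrite the entire log can recompute every hash, which is why the latest hash should be anchored somewhere the agent cannot write.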
Human-in-the-loop for sensitive actions. Full autonomy is a design choice, not a requirement. For actions with real-world consequences, a human approval step is the most reliable security control that exists.
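An approval gate can be as small as a wrapper around the tool executor. In the sketch below, the set of high-risk action names and the run and approve callables are hypothetical placeholders; in practice approve would route to a ticket, chat prompt, or review queue rather than a function defined in the same process.

```python
from typing import Callable

# Hypothetical action names that require a human decision.
HIGH_RISK_ACTIONS = {"delete_data", "transfer_funds", "send_external_email"}


def execute_with_approval(
    action: str,
    params: dict,
    run: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run low-risk actions directly; ask a human before high-risk ones."""
    if action in HIGH_RISK_ACTIONS and not approve(action, params):
        return "blocked: approval denied"
    return run(action, params)


def demo_run(action: str, params: dict) -> str:
    """Stand-in executor that only reports what it would have done."""
    return f"executed {action}"


def deny_everything(action: str, params: dict) -> bool:
    """Stand-in reviewer that rejects every request."""
    return False


print(execute_with_approval("search_docs", {"query": "refund policy"}, demo_run, deny_everything))
print(execute_with_approval("transfer_funds", {"amount": 500}, demo_run, deny_everything))
```

Passing the full parameters to the reviewer is deliberate: approving "send an email" is meaningless unless the reviewer can see the recipient and the body.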
Testing Your Security
Four open-source tools cover the testing spectrum. Promptfoo supports many vulnerability types with adaptive attack generation and maps results directly to OWASP, NIST, MITRE ATLAS, and EU AI Act compliance requirements. NVIDIA's Garak offers a broad set of probe modules for model endpoint testing. Microsoft's PyRIT was built from their Bing Chat and Copilot red teaming experience and handles deep custom assessments. DeepTeam from Confident AI covers many attack types with minimal setup for smaller teams.
One practical integration strategy is to run Promptfoo in your CI pipeline as a pre-merge security gate, Garak against staging in nightly builds, and PyRIT for deeper periodic assessments. The automated red teaming analysis covers how small models systematically probe larger ones for vulnerabilities, and the same approach applies to agent security testing.
Continuous monitoring completes the picture. It combines real-time attack simulation, autonomous agents that simulate attacker behavior at scale, and integration with existing SIEM and SOAR workflows. Many organizations would consider AI penetration testing, but relatively few run it continuously.
The Compliance Timeline
The cited EU AI Act timeline points to August 2, 2026 for high-risk AI system requirements. That means conformity assessments, technical documentation, risk management frameworks, human oversight features, and incident reporting. Penalties can reach the tens of millions of euros or a percentage of global annual turnover for prohibited practices. Article 50 transparency obligations, including disclosure that users are interacting with AI and labeling of synthetic content, are tied to the same date in the cited timeline.
In January 2026, NIST published a Request for Information specifically on Security Considerations for AI Agents, seeking practices for measuring and improving secure development and deployment. Its December 2025 Cybersecurity Framework Profile for AI overlays three AI-specific focus areas on the existing CSF 2.0 framework.
ISO 42001, the world's first AI-specific management system standard, explicitly addresses AI risk, transparency, accountability, and bias mitigation. It fills the gap that SOC 2 leaves: SOC 2 evaluates security and privacy but has no AI-specific governance requirements. Organizations deploying agents in regulated industries should treat ISO 42001 as the baseline standard.
The pattern across all three frameworks is that you need to document what your agents can do, monitor what they actually do, and prove you can control them when they do something wrong. The security playbook described above is the operational implementation of that requirement.
Related: The Agent Project That Should Have Been One
Sources
Research Papers:
- Open Challenges in Multi-Agent Security -- (2025)
- Self-Replicating Prompt Infections in Multi-Agent Systems -- Lee and Tiwari (2025)
- MASpi: Evaluating Prompt Injection Robustness in Multi-Agent Systems -- (2025)
- Extracting Training Data from ChatGPT -- ICLR (2025)
- NIST AI 100-2: Adversarial Machine Learning -- NIST (2025)
Industry / Case Studies:
- OWASP Top 10 for LLM Applications 2025 -- OWASP (2025)
- OWASP Top 10 for Agentic Applications 2026 -- OWASP (2025)
- MITRE ATLAS: Agentic AI Techniques -- MITRE (2025)
- EchoLeak: Zero-Click Microsoft 365 Copilot Vulnerability -- Cato / Aim Labs (2025)
- CVE-2025-32711 Detail -- NIST NVD (2025; modified 2026)
- CVE-2025-53773: GitHub Copilot RCE via Prompt Injection -- Embrace The Red (2025)
- Amazon Q Developer Extension Security Incident -- DevOps.com (2025)
- Prompt Injection Defenses: Quantified Failure Rates -- Anthropic (2025)
- Stanford HAI 2025 AI Index Report -- Stanford HAI (2025)
- Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026 -- Gartner (2025)
- AI Risk & Readiness in the Enterprise: 2025 Report -- BigID (2025)
- NIST AI Risk Management Framework -- NIST (2023)
Commentary:
- The Rule of Two: A Practical Approach to AI Agent Security -- Michael Bargury (2025)
- AI Agent Attacks in Q4 2025 Signal New Risks for 2026 -- eSecurity Planet (2025)
- The Year of the Agent: What Q4 2025 Attacks Reveal -- Lakera (2025)
- How to Sandbox AI Agents -- Northflank (2025)
Related Swarm Signal Coverage:
- AI Guardrails for Agents: How to Build Safe, Validated LLM Systems
- AI Alignment Explained: What It Actually Means to Make AI Do What We Want
- When Agents Lie to Each Other: Deception in Multi-Agent Systems
- The Red Team That Never Sleeps: When Small Models Attack Large Ones
- AI Agent Security in 2026: Prompt Injection, Memory Poisoning, and the OWASP Top 10