🎧 LISTEN TO THIS ARTICLE

In December 2025, a research team demonstrated that a single malicious prompt hidden in a web page could instruct an AI agent to search a user's email, extract credentials, and exfiltrate them through an image URL. The agent followed every instruction. It had the tools, the permissions, and no mechanism to distinguish the attacker's intent from a legitimate request. The attack required zero interaction from the victim.

That scenario is precisely what red teaming is designed to find before production users do. But red teaming AI agents is fundamentally different from red teaming standalone language models. Models generate text. Agents take actions. When an agent fails, it doesn't just say something harmful. It does something harmful: deleting files, sending unauthorized transactions, leaking data to external servers. This guide covers how to build a structured red teaming practice for AI agent systems, grounded in the frameworks and attack taxonomies that have emerged through 2025 and 2026.

Why Agent Red Teaming Differs from Model Red Teaming

Traditional model red teaming focuses on eliciting harmful outputs. You craft adversarial prompts, try jailbreaks, probe for toxicity, and document what the model says. The failure mode is a bad response.

Agent red teaming operates on a different plane. OWASP recognized this distinction by publishing a dedicated Top 10 for Agentic Applications in December 2025, separate from their existing LLM Top 10. The agentic list includes risks that simply don't exist for standalone models: agent goal hijacking, cascading failures across agent networks, privilege escalation through delegated workflows, and rogue agents that drift from intended behavior.

Three properties make agent red teaming harder than model red teaming.

Agents have tool access. A chatbot that gets jailbroken says something toxic. An agent that gets jailbroken executes code, queries databases, sends emails, and calls external APIs. The blast radius scales with the agent's permissions.

Agents chain actions. A single compromised step can cascade. The OWASP agentic framework emphasizes that a minor vulnerability like a simple prompt injection can quickly escalate into system-wide compromise, data exfiltration, or financial loss when the agent chains multiple tool calls in sequence.

Agents operate across trust boundaries. A model processes a prompt in a single context. An agent reads documents, browses the web, processes emails, and communicates with other agents. Each of those interactions crosses a trust boundary, and each boundary is an injection surface.

The UK AI Security Institute (AISI) found this directly in their evaluations. They assess agent capabilities by embedding AI systems in scaffolding that includes tools, prompting procedures, and error handling. Between 2023 and 2025, success rates on self-replication evaluations went from 5% to 60%. Agents are getting more capable, and the attack surface is expanding with them.

NIST's AI Risk Management Framework recommends continuous adversarial testing throughout the AI system lifecycle, not one-time assessments. The US Executive Order on AI defines red teaming as "a structured testing effort to find flaws and vulnerabilities in an AI system using adversarial methods." For agents, that structure needs to account for tool use, multi-step reasoning, and cross-boundary data flows that traditional model testing ignores.

The Agent Attack Surface Map

When an agent fails, it doesn't just say something harmful. It does something harmful: deleting files, sending unauthorized transactions, leaking data.

Agent vulnerabilities cluster into four categories. Each requires different testing approaches and produces different failure modes.

Prompt Injection

OWASP ranks prompt injection as the number one vulnerability for LLM applications, and it's worse for agents because agents act on injected instructions rather than just generating text.

There are two variants. Direct injection is when an attacker crafts a prompt that overrides the agent's system instructions. Indirect injection is when malicious instructions are embedded in content the agent processes: documents, web pages, emails, database records, or API responses.

Indirect injection is the bigger concern for agents. The GitHub Copilot vulnerability (CVE-2025-53773) demonstrated this: malicious prompts in source code comments instructed the agent to modify IDE settings, enable auto-approve mode, and achieve arbitrary code execution. The injection was self-replicating. During code refactoring, the compromised instructions propagated to other files.

A recent meta-analysis synthesizing findings from 78 studies between 2021 and 2026 found that attack success rates against state-of-the-art defenses exceed 85% when adaptive attack strategies are employed. OpenAI's own researchers have acknowledged that AI browsers may always be vulnerable to prompt injection attacks, since the fundamental issue of mixing trusted and untrusted content in a single context window has no complete solution.

When red teaming for prompt injection, test both direct and indirect vectors. For indirect injection, embed adversarial instructions in every type of content your agent processes: PDFs, emails, web pages, database records, API responses, and code comments. Test whether the agent can distinguish between its instructions and attacker-planted instructions in retrieved content. For a detailed breakdown of how indirect injection attacks work in agent pipelines, see the dedicated coverage.

Tool Abuse and Unsafe Tool Use

OWASP's agentic list distinguishes between Excessive Agency (granting too many permissions) and Unsafe Tool Use (applying legitimate tools in unintended ways). Both are testing targets.

The Amazon Q Developer incident in July 2025 showed how tool vulnerabilities compound with agent capabilities. An attacker injected malicious prompts into the Amazon Q Developer Extension that instructed the AI to delete the file system, discover AWS profiles, and delete S3 buckets, EC2 instances, and IAM users. The extension had been installed over 950,000 times, and the malicious version shipped without tamper detection.

A SQL injection vulnerability in Anthropic's reference SQLite MCP server implementation had been forked over 5,000 times before being archived. In agent environments, SQL injection becomes a springboard for stored-prompt injection, where agents treat database content as trusted and embedded prompts in query results can trigger further tool calls.

Red team test cases for tool abuse should probe whether agents can be tricked into misusing their authorized tools. Can a research agent be made to use its code execution capability to access the file system? Can a customer service agent be manipulated into sending emails to external addresses? Can a database agent be coerced into running destructive queries?

Data Exfiltration

Agents process sensitive information and have output channels. That combination creates exfiltration risk. Zero-click attacks like EchoLeak hide instructions in documents and images that trigger data extraction without any user interaction. MITRE ATLAS documents techniques for encoding sensitive data into outbound tool parameters like email composers and CRM updaters.

Testing for data exfiltration requires probing whether the agent will include sensitive information in external-facing outputs. Plant instructions in processed documents asking the agent to append internal data to its responses, include credentials in email drafts, or encode information in image URLs. The "Rule of Two" heuristic from security researcher Michael Bargury provides a design principle: an agent session should satisfy only two of three properties simultaneously. Processing untrustworthy inputs, accessing sensitive data, and communicating externally. If your agent does all three, it's an exfiltration risk.

Privilege Escalation and Cascading Failures

OWASP's ASI05 (Privilege Escalation) addresses agents that inherit, misuse, or retain privileges improperly across sessions, users, or delegated workflows. This is particularly relevant for enterprise agents with single sign-on, multi-role systems, and delegated task chains.

In multi-agent systems, researchers have demonstrated self-replicating prompt infections that propagate between agents through compromised inter-agent communication. Using GPT-4o, harmful actions including data exfiltration, scam creation, and content manipulation succeeded over 80% of the time.

Red team testing should verify that agents cannot escalate their own privileges through tool chaining, retain permissions from previous sessions, or propagate compromised context to other agents in the system.

A Structured Testing Methodology

Attack success rates against state-of-the-art defenses exceed 85% when adaptive attack strategies are employed.

Anthropic's Frontier Red Team uses a control evaluation framework where a red team designs attack policies that let an AI model intentionally pursue hidden, harmful goals. Their modular scaffold approach makes it faster and less costly to build strong red teams for real-world evaluations. The following methodology adapts that structure for practitioner teams.

Phase 1: Scope and Threat Model

Before testing anything, document what the agent can do. Map every tool, API, data source, and output channel. Identify the trust boundaries: where does the agent receive untrusted input? Where does it produce output that crosses a security boundary?

Build a threat model with three actor profiles: an external attacker who can only reach the agent through its normal input channels, a compromised data source (document, web page, or API) that contains adversarial content, and a malicious insider who has legitimate access but seeks to abuse the agent's capabilities.

For each actor, identify the highest-impact actions they could cause the agent to take. This produces your test case priority list.

Phase 2: Automated Scanning

Run automated vulnerability scanners against your agent to establish a baseline. The best red teaming tools for 2026 cover this category in detail. Three tools handle the automated scanning layer effectively.

NVIDIA's Garak provides 37+ probe modules for prompt injection, jailbreaks, data leakage, and encoding-based attacks. Point it at your agent endpoint and run a full scan. Promptfoo supports 50+ vulnerability types and maps results directly to OWASP, NIST, MITRE ATLAS, and EU AI Act compliance requirements. It integrates into CI/CD pipelines as a pre-commit security gate. Microsoft's PyRIT orchestrates sophisticated multi-turn attack strategies including Crescendo (gradually escalating harmful requests) and Skeleton Key (bypassing safety training through system-level manipulation).

The integration strategy: run Promptfoo in CI on every commit. Run Garak against staging in nightly builds. Use PyRIT for quarterly deep assessments with human oversight.

Phase 3: Manual Adversarial Testing

Automated tools find known vulnerability patterns. Manual testing finds novel attack chains that scanners miss.

Structure manual testing around the attack surface map. For each category (prompt injection, tool abuse, data exfiltration, privilege escalation), build test cases specific to your agent's tools and data flows. Test multi-step attack chains where the first step is benign but the sequence produces a harmful outcome.

Test the boundaries that matter most for your deployment. If your agent processes customer documents, embed adversarial instructions in realistic-looking documents. If it has email access, test whether injections can trigger unauthorized email sends. If it operates in a multi-agent system, test whether a compromised agent can poison the shared context.

Anthropic's research found that in a constrained coding environment, prompt injection succeeded 0% of the time across 200 attempts. In a GUI-based system without safeguards, success reached 78.6%. The delta shows that architecture and safeguards determine the outcome more than the model itself.

Phase 4: Report and Remediate

Document every finding with the attack chain that produced it, the impact if exploited, and a recommended fix. Classify findings using the OWASP agentic categories (ASI01 through ASI10) for consistency across assessments.

Prioritize by exploitability and impact. A prompt injection that causes an agent to reveal its system prompt is lower priority than one that triggers data exfiltration. A privilege escalation in a sandboxed environment is lower priority than one that reaches production databases.

Re-test after remediation. Fixes that work against the specific test case may not generalize against variants of the same attack. Run regression tests to confirm that mitigations hold against adaptive adversaries.

📬 THE SIGNAL

Get the best AI agent research delivered weekly. No spam, just signal.

Subscribe Free →

Building a Red Team Program

In a constrained coding environment, prompt injection succeeded 0% of the time. In a GUI-based system without safeguards, success reached 78.6%.

Team Composition

You don't need a dedicated red team to start. A single security-aware engineer running automated scans weekly and manual tests monthly catches the majority of vulnerabilities. As your agent fleet grows, dedicated roles become more valuable.

Effective agent red teams combine three skill sets: traditional application security (understanding of injection, authentication, and authorization attacks), AI/ML knowledge (prompt engineering, model behavior, and adversarial machine learning), and domain expertise specific to the agent's function (financial services, healthcare, software development).

The Cloud Security Alliance published an Agentic AI Red Teaming Guide in 2025 that provides organizational templates for building red team programs, including role definitions, engagement scoping, and reporting structures.

Cadence

Continuous is better than periodic, but periodic is better than nothing. A practical cadence for most teams:

Every commit: Automated scans via Promptfoo or equivalent in CI/CD. This catches regressions and new vulnerabilities introduced by code changes.

Weekly: Review scan results. Triage new findings. Update test cases based on emerging attack techniques.

Monthly: Manual adversarial testing focused on the highest-risk attack surfaces. Test new tools, data sources, or integrations added during the month.

Quarterly: Deep assessment using PyRIT or equivalent. Bring in external perspective if possible. Review and update the threat model based on changes to the agent's capabilities, new OWASP advisories, and published attack research.

Compliance Alignment

The EU AI Act requires documented adversarial testing for high-risk AI systems, with full requirements taking effect August 2, 2026. NIST AI RMF recommends continuous evaluation. ISO 42001 includes red teaming in its AI management system requirements. Build your red team reporting to map directly to these frameworks, so compliance is a byproduct of testing rather than a separate effort.

For teams subject to the EU AI Act requirements, red team reports should document the testing methodology, findings, remediation steps, and residual risk assessment in a format that satisfies conformity assessment requirements.

Common Findings

Between 2023 and 2025, success rates on self-replication evaluations went from 5% to 60%.

After reviewing published vulnerability reports, OWASP case studies, and disclosed incidents from 2025 and 2026, certain findings appear repeatedly across agent red team engagements. Knowing these patterns helps teams focus their testing.

Over-permissioned tool access. Agents granted broad file system access, unrestricted API credentials, or admin-level database permissions when they need far less. The fix is straightforward: start from deny-all and allowlist only the specific resources each agent needs. Use short-lived credentials scoped per task.

No input validation on retrieved content. Agents that fetch documents, browse the web, or read emails without sanitizing the content for adversarial instructions. Every piece of external content should be treated as untrusted input, regardless of its apparent source.

Shared context across trust levels. Multi-agent systems where a low-trust agent can write to a shared memory or context store that a high-trust agent reads. The security implications of agent-to-agent communication require the same isolation principles used in microservice architectures.

Missing output filtering. Agents that can include sensitive data in their responses without detection or redaction. Output filtering for PII patterns, credentials, and internal data should be the last line of defense before any external communication.

Persistent sessions without re-authentication. Agents that maintain credentials across sessions, allowing an attacker who compromises one session to inherit privileges for future interactions. Session isolation and credential rotation limit the blast radius of any single compromise.

Frequently Asked Questions

How is AI agent red teaming different from traditional penetration testing?

Traditional penetration testing targets deterministic software with predictable inputs and outputs. Agent red teaming targets probabilistic systems where the same input can produce different outputs across runs. Agents also introduce novel attack surfaces: prompt injection, tool misuse, and context manipulation don't have direct equivalents in traditional application security. The testing methodology needs to account for non-deterministic behavior, test across multiple runs, and evaluate the agent's reasoning process, not just its outputs.

What's the minimum viable red team engagement for a new agent deployment?

Run automated scans with Garak or Promptfoo against your agent endpoint to establish a baseline. Then spend two to four hours on manual testing focused on indirect prompt injection through every input channel the agent processes, tool misuse scenarios specific to the agent's authorized tools, and data exfiltration through the agent's output channels. Document findings and re-test after fixes. That covers the highest-risk attack surfaces for most deployments. As the AI agent security playbook details, the defense foundation is input validation, sandboxed execution, least privilege, and behavioral monitoring.

Can automated red teaming tools replace human testers?

No. Automated tools are good at scanning for known vulnerability patterns at scale. They catch regressions, cover broad attack surfaces, and integrate into CI/CD pipelines. But they don't discover novel attack chains, understand business context, or think creatively about how an agent's specific tools and permissions could be combined in unexpected ways. The automated red teaming analysis covers how small models systematically probe larger ones, but even those approaches work best as a supplement to human adversarial thinking, not a replacement.

How does red teaming fit into AI compliance requirements?

The EU AI Act, NIST AI RMF, and ISO 42001 all reference adversarial testing as a component of AI risk management. The EU AI Act requires conformity assessments that include documented security testing for high-risk systems. Red team reports that follow the OWASP agentic categories (ASI01 through ASI10) map cleanly to these frameworks. Building compliance into your reporting format means you produce audit-ready documentation as a natural output of testing rather than a separate workstream.

Enjoyed this article?

Join 500+ AI practitioners getting weekly breakdowns of agent research, tools, and real-world case studies.

Subscribe Free Browse Premium Products →

Sources