▶️ LISTEN TO THIS ARTICLE

Red Teaming AI Agents: A Practitioner's Guide

By Tyler Casey · AI-assisted research & drafting · Human editorial oversight
@getboski

Jailbreaking a chatbot is only one part of the problem. Many AI agents can call tools, chain multi-step actions, coordinate with other agents, and persist state across sessions. The attack surface is not just the prompt window. It can include the execution environment around the model. This guide covers the threat models, testing methodologies, and open-source tooling practitioners should consider for agentic systems in 2026.

In March 2026, researchers published T-MAP, a trajectory-aware evolutionary search method for discovering multi-step attack paths through AI agents' tool-use chains. Treat its reported success rates as benchmark-specific, not as universal field rates. The important practitioner lesson is narrower and better supported: agent red teaming has to evaluate full execution trajectories, not only single prompt-response pairs.

That captures the core challenge of red teaming AI agents in the 2025-2026 literature. The attack surface has expanded from text generation toward action execution. A model that refuses to write malware might still be tricked into running a script that downloads it. A system hardened against direct jailbreaks might leak credentials through its memory store. An agent that individually passes every safety benchmark might still be compromised through a manipulated message from another agent in the same system.

Traditional LLM red teaming tests whether a model will generate harmful text. For agentic systems, red teaming also has to test whether a system can take harmful actions through tools, memory, or inter-agent communication. The difference is structural enough to require different threat models, different tools, and different thinking.

Why Agent Red Teaming Is Different

LLM red teaming, as practiced since 2023, focuses on the input-output boundary. You craft an adversarial prompt. The model generates a response. You evaluate whether the response violates a safety policy. The attack surface is the context window. The failure mode is text generation.

Agents break this model in four ways.

Tool access creates real-world consequences. When an agent can execute code, send emails, query databases, or modify files, a successful attack is not a harmful string in a chat log. It is a harmful action in a production environment. The OWASP Top 10 for Agentic Applications identifies "Excessive Agency" as a top risk: agents granted more permissions than their task requires become force multipliers for any successful exploitation.

Multi-step execution creates trajectory-level vulnerabilities. A single prompt-response pair might be safe at every step while the overall trajectory is harmful. T-MAP demonstrated this systematically: attacks that succeed through sequences of individually benign tool calls, where the harm emerges from the combination, not from any single step. Red teaming must evaluate trajectories, not just individual outputs.

Inter-agent communication creates new injection surfaces. In multi-agent systems, agents pass messages to each other. These messages are another input channel, and they are exploitable. The Agent-in-the-Middle (AiTM) attack, published in February 2025 and accepted at ACL 2025, demonstrated that an adversary who can manipulate inter-agent messages can compromise an entire multi-agent system without ever touching an individual agent's prompt. The adversarial agent uses a reflection mechanism to generate contextually aware malicious instructions that other agents follow because they arrive through a trusted communication channel.

Persistent memory creates poisoning opportunities. Agents that maintain state across sessions can be attacked through their memory. A malicious instruction injected into an agent's memory during one session may execute during a future session, long after the original attack vector is gone. This is a fundamentally different threat model from stateless LLM interactions, and standard prompt injection defenses do not address it.

In multi-agent architectures, agents communicate through message passing, shared memory stores, or orchestrator-mediated channels.

The Threat Model: Five Attack Surfaces

Before running a single test, you need a threat model specific to your agent architecture. The following five surfaces cover the primary attack categories for agentic systems in 2026. They build on OWASP's Agentic Top 10 and MITRE ATLAS, adapted for practitioner use.

1. Goal Hijacking

The attacker redirects the agent's objective. Unlike prompt injection, which targets a single response, goal hijacking targets the agent's persistent planning loop. A financial analysis agent might be redirected to exfiltrate portfolio data instead of summarizing it. OWASP classifies this as ASI01 (Agent Goal Hijacking) and treats it as one of the highest-severity agentic risks.

Test approach: Embed conflicting instructions in tool outputs, retrieved documents, and inter-agent messages. Verify whether the agent's stated goal changes or whether its actions diverge from its stated goal while the goal description remains unchanged (the more dangerous variant).

2. Tool Misuse and Escalation

The agent uses its tools in unintended ways, either through direct manipulation or through emergent behavior from ambiguous instructions. An agent with filesystem access might be induced to read configuration files containing secrets. An agent with code execution might be tricked into installing packages.

Test approach: For each tool the agent can access, enumerate the maximum harm it could cause. Then test whether the agent can be induced to approach that maximum through indirect instruction, social engineering within tool outputs, or chained multi-step sequences where each individual step appears benign.

3. Inter-Agent Message Injection

In multi-agent architectures, agents communicate through message passing, shared memory stores, or orchestrator-mediated channels. Each of these is an injection surface. The AiTM attack proved that a single compromised message channel can cascade through an entire multi-agent system. Recent work on multi-agent security architecture shows that one analysis counted 193 distinct threat items across nine risk categories, and estimated current frameworks cover roughly two-thirds of them.

Test approach: Inject adversarial payloads into inter-agent messages at each communication point. Test whether agents validate the source and content of messages from other agents, or whether they treat all inter-agent communication as trusted input. Test with both direct instruction injection and subtler approaches: encoded instructions, instructions spread across multiple messages, and instructions that exploit the receiving agent's specific role.

4. Memory and Context Poisoning

Agents with persistent memory (conversation history, RAG knowledge bases, learned preferences) can be attacked through their memory layer. The RAG reliability gap shows that retrieval systems have inherent failure modes. When those systems also serve as an agent's long-term memory, poisoning the knowledge base poisons the agent's future decisions.

Test approach: Insert adversarial content into every data source the agent reads from: documents in its RAG corpus, prior conversation history, shared state stores, and configuration files. Test both immediate exploitation (the agent acts on the poisoned data now) and delayed exploitation (the poisoned data sits in memory and activates when a future query triggers retrieval).

5. Output and Data Exfiltration

The agent is tricked into including sensitive information in its outputs, tool calls, or external communications. This can be direct (the agent is told to include the data) or indirect (the agent encodes data in URLs, API parameters, or seemingly innocuous text that an external observer can decode).

Test approach: Provide the agent with sensitive information in its context, then test whether adversarial instructions in tool outputs or retrieved documents can cause the agent to leak that information through any available output channel. Test covert channels: URL parameters, encoded strings in email bodies, metadata fields in API calls.

The Practitioner's Toolkit

Three open-source frameworks dominate agent red teaming in 2026. Each covers a different part of the problem.

PyRIT (Python Risk Identification Toolkit) is Microsoft's open-source framework for AI red teaming. It supports multi-turn attack strategies, prompt injection testing, and automated evaluation of model outputs. PyRIT's orchestrators can chain attacks across multiple interaction turns, making it suitable for testing agents that maintain state across a conversation. It includes converters that transform attack payloads (encoding, translation, obfuscation) and scorers that automatically evaluate whether an attack succeeded.

Garak is NVIDIA's vulnerability scanner for LLMs. It ships with a library of probes covering the OWASP Top 10, known jailbreak patterns, prompt injection variants, and data extraction techniques. Garak is strongest at breadth: scanning a model across hundreds of known vulnerability patterns in a single run. For agent testing, it covers the model layer but requires extension for tool-use and multi-agent scenarios.

DeepTeam is a red teaming framework from Confident AI that maps directly to the OWASP Top 10 for Agentic Applications. It provides automated attack generation for agent-specific risks including goal hijacking, tool misuse, and excessive agency. DeepTeam integrates with evaluation pipelines, making it practical for CI/CD integration where every deployment is red-teamed before release.

For teams building custom tooling, AGENTICRED (January 2026) provides a research framework for optimizing agentic systems specifically for automated red teaming. It treats the red teaming process itself as an agentic workflow, using LLM agents to systematically explore attack strategies.

For multi-agent systems, instrument the communication channels between agents and run AiTM-style injection tests.

Running a Red Team Engagement: A Step-by-Step Process

The CSA Agentic AI Red Teaming Guide and OWASP GenAI Red Teaming Guide converge on a four-phase methodology. Here is a condensed version adapted for engineering teams.

Phase 1: Scope and Threat Model (1-2 days)

Document the agent's architecture: what tools it can access, what data it can read, what actions it can take, what other agents or systems it communicates with. Map each capability to the five attack surfaces above. Prioritize by impact: which capabilities would cause the most harm if exploited?

Define your adversary model. Are you testing against external attackers who can only interact through the agent's user-facing interface? Internal attackers who can poison data sources? Compromised components in a multi-agent pipeline? Each adversary model requires different test strategies.

Phase 2: Automated Scanning (2-3 days)

Run Garak and DeepTeam against the agent's model layer to establish a baseline. This catches known vulnerability patterns: common jailbreaks, prompt injection variants, data extraction techniques. These tools run hundreds of probes automatically and produce structured reports.

For multi-agent systems, instrument the communication channels between agents and run AiTM-style injection tests. Monitor whether injected payloads propagate between agents and whether any agent validates messages from other agents.

Phase 3: Manual Trajectory Testing (3-5 days)

This is where automated tools fall short and human creativity matters. Design multi-step attack scenarios specific to your agent's domain. For a financial agent, test whether it can be induced to execute unauthorized trades through a sequence of individually reasonable instructions. For a code execution agent, test whether it can be guided toward installing malicious dependencies through a sequence of seemingly helpful suggestions.

Use T-MAP's core insight: evaluate the full trajectory, not just individual steps. Record the complete sequence of tool calls for each test. A trajectory where every individual tool call passes safety checks but the overall sequence produces harm is the most dangerous finding, because it will also pass any step-level monitoring you deploy.

Phase 4: Report and Remediate (1-2 days)

Structure findings by attack surface and severity. For each finding, document: the attack surface exploited, the specific technique used, the trajectory of tool calls, the harm produced, and a recommended mitigation. Prioritize mitigations by impact-to-effort ratio.

Common mitigations include: principle of least privilege for tool access (most agents have more permissions than they need), input validation on inter-agent messages (most multi-agent systems treat all internal messages as trusted), memory integrity checks (verify that retrieved context hasn't been tampered with), and trajectory-level monitoring (evaluate sequences of actions, not just individual steps).

What the Research Says About Defensive Effectiveness

The early assessment is sobering. Automated red teaming appears to be improving faster than automated defense. Capability-based scaling research from 2025 showed that as red teaming models get larger, they find vulnerabilities that smaller red teamers miss, and the relationship scales predictably. Meanwhile, defensive alignment techniques show diminishing returns at scale.

The AiTM attack against multi-agent systems achieved high success rates across multiple frameworks and communication structures. The researchers noted that existing multi-agent frameworks have minimal message validation, treating inter-agent communication as a trusted channel by default. This is the equivalent of building a web application that trusts all HTTP headers without validation, a mistake the web security community learned to avoid two decades ago.

T-MAP's benchmark results against frontier models are concerning because they show how trajectory-level attacks can succeed even when individual steps look benign. The attacks succeeded because safety evaluation at the step level missed harm that only emerged at the trajectory level. Treat the exact rate as benchmark-specific, but the architectural blind spot is current and practical.

For regulated use cases, EU AI Act implementation and related security guidance make adversarial testing documentation increasingly relevant in compliance reviews. This is not legal advice, and obligations depend on the system and jurisdiction, but red teaming is increasingly treated as governance evidence in regulated settings.

The Practitioner's Takeaway

Red teaming AI agents is not an extension of red teaming LLMs. It is a different discipline with different threat models, different tools, and different success criteria. The minimum viable red teaming program for an agentic system in 2026 includes:

A threat model specific to your agent architecture. Map every tool, every data source, every communication channel, and every persistence mechanism to the five attack surfaces.
Automated baseline scanning. Use PyRIT, Garak, and DeepTeam to cover known vulnerability patterns. This is table stakes, not the finish line.
Trajectory-level manual testing. Design multi-step attack scenarios specific to your domain. Evaluate full action sequences, not individual steps. The most dangerous vulnerabilities are the ones where every step looks safe and the outcome is harmful.
Continuous testing, not point-in-time assessment. Agents change behavior as their models update, their tools change, and their memory accumulates. A red team engagement that ran last month may not reflect today's attack surface. Automated red teaming in CI/CD catches regressions. Manual red teaming on a quarterly cadence catches novel attack categories.
Inter-agent message validation. If you are running a multi-agent system, treat inter-agent communication as untrusted input. This single change blocks the entire class of AiTM attacks.

The attack surface for AI agents is large and growing. The tooling for testing it is maturing rapidly. The gap between what attackers can do and what defenders routinely test for remains wide. Closing that gap starts with recognizing that agent security is not model security, and that the playbook for one does not cover the other.

Sources

Research Papers:

T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search — (2026)
Red-Teaming LLM Multi-Agent Systems via Communication Attacks (AiTM) — He et al. (2025), ACL 2025
AGENTICRED: Optimizing Agentic Systems for Automated Red-teaming — (2026)
AutoInject: Automated Universal Transferable Adversarial Suffix Attack — (2026)
Capability-Based Scaling Trends for LLM-Based Red-Teaming — (2025)
Red Teaming AI Red Teaming — (2025)

Frameworks and Guidelines:

OWASP Top 10 for Agentic Applications 2026 — OWASP
OWASP GenAI Red Teaming Guide — OWASP
CSA Agentic AI Red Teaming Guide — Cloud Security Alliance
MITRE ATLAS — MITRE

Tools:

PyRIT (Python Risk Identification Toolkit) — Microsoft
Garak — NVIDIA
DeepTeam — Confident AI

Related Swarm Signal Coverage:

Red Teaming AI Agents: A Practitioner's Guide

Key finding

Why it matters

Evidence base

Operator takeaway

Where this breaks

Use this if

Avoid this if

Red Teaming AI Agents: A Practitioner's Guide

Why Agent Red Teaming Is Different

The Threat Model: Five Attack Surfaces

1. Goal Hijacking

2. Tool Misuse and Escalation

3. Inter-Agent Message Injection

4. Memory and Context Poisoning

5. Output and Data Exfiltration

The Practitioner's Toolkit

Running a Red Team Engagement: A Step-by-Step Process

Phase 1: Scope and Threat Model (1-2 days)

Phase 2: Automated Scanning (2-3 days)

Phase 3: Manual Trajectory Testing (3-5 days)

Phase 4: Report and Remediate (1-2 days)

What the Research Says About Defensive Effectiveness

The Practitioner's Takeaway

Sources

Execution tooling is separate