The Observability Gap in Production AI Agents
46,000 AI agents spent 12 days posting on a Reddit-style platform called Moltbook. They generated 3 million comments across 369,000 posts. Not a single human was involved. When researchers analyzed the data, they found something unsettling: the agents exhibited heavy-tailed distributions of activity, power-law scaling of popularity metrics, and temporal decay patterns consistent with limited attention dynamics, mirroring many of the same patterns seen in human online communities. The structural similarities were striking, though not identical: agents showed a sublinear relationship between upvotes and discussion size, a pattern that diverges from human behavior.
Here's what keeps me up at night about that study: nobody was watching those agents in real-time. The analysis happened after the fact. If one agent had started spamming racial slurs or coordinating a botnet, the researchers would've found out days later while cleaning the dataset.
That's the observability problem with AI agents. The tooling exists for monitoring traditional software. It even exists for tracking individual LLM calls. But production agent systems (multi-step, tool-using, context-switching, potentially running for days) operate in a monitoring blind spot. When an agent derails, you find out when the damage report lands on your desk.
Why Agent Observability Isn't Just API Monitoring
Traditional observability tools were built for request-response architectures. You log the input, track latency, capture the output. Done. But agents don't work that way.
An agent handling a customer support ticket might make 40 LLM calls across three different models, query two internal databases, scrape a pricing page, update a CRM record, and send an email. That's not a request. That's a workflow with branching logic, error recovery, and multi-step reasoning. If something goes wrong on step 23, your standard API monitoring dashboard shows you... nothing useful.
The AgentCgroup paper from researchers at UC Santa Cruz, Virginia Tech, and UConn quantified this problem in multi-tenant cloud environments. Analyzing 144 software engineering tasks from the SWE-rebench benchmark, they found that tool calls within agent workflows exhibit unpredictable resource demands, with memory spikes reaching peak-to-average ratios of up to 15.4x. OS-level execution, including tool calls and container initialization, accounted for 56-74% of end-to-end task latency. Memory, not CPU, turned out to be the concurrency bottleneck, and standard container monitoring couldn't track which tool call was burning through resources.
Their solution was AgentCgroup, an intent-driven eBPF-based resource controller that treats each tool call as a distinct control group with its own resource budget. It reduced high-priority P95 latency by 29% under multi-tenant memory contention. But here's the kicker: they had to build custom instrumentation to even measure the problem. Off-the-shelf monitoring didn't reveal which agent was thrashing memory.
That's the gap. Agents are workflows pretending to be API calls.
The Distributed Trace That Nobody Sees
When a web request hits your backend, distributed tracing tools like Jaeger or Zipkin can follow it across services, databases, and queues. You get a waterfall diagram showing every hop with timestamps and latencies. It works because the request has a trace ID that propagates through the stack.
Agents don't naturally produce trace IDs. Or rather, they produce dozens of them, one per LLM call, but nothing ties them together into a causal chain. An agent planning a task, executing three tool calls, reflecting on the results, and re-planning looks like five unrelated API hits in your monitoring system.
I've now worked with four production agent systems where the answer to "why did this agent do that?" required reconstructing the trace from application logs after the fact. One of them was a coding agent that deleted a production database table. The trace showing how it decided to run DROP TABLE users existed, but only if you manually stitched together 17 LangChain callback events spread across three log files.
The current generation of agent frameworks (LangChain, LlamaIndex, Semantic Kernel) all emit structured logs. But they emit them as linear streams of events, not as hierarchical traces. To reconstruct causality, you need to parse execution IDs, timestamps, and parent-child relationships yourself. It's archaeology, not observability.
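To make the archaeology concrete, here's a minimal sketch of rebuilding a trace tree from a flat event stream. The `span_id`/`parent_id` field names are hypothetical; each framework names these differently.

```python
from collections import defaultdict

def build_trace_tree(events):
    """Group a flat stream of callback events into an indented trace tree.

    Assumes each event dict carries hypothetical 'span_id', 'parent_id',
    'kind', and 'ts' fields; real frameworks use different names.
    """
    children = defaultdict(list)
    roots = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev.get("parent_id") is None:
            roots.append(ev)
        else:
            children[ev["parent_id"]].append(ev)

    def render(ev, depth=0):
        lines = ["  " * depth + f"{ev['kind']} ({ev['span_id']})"]
        for child in children[ev["span_id"]]:
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in roots for line in render(root)]
```

This is exactly the parent-child stitching that purpose-built tools do for you; doing it by hand across three log files is the archaeology.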
Tools like LangSmith, Helicone, and Arize Phoenix are trying to fix this. They're purpose-built observability platforms for LLM applications that understand the difference between a prompt, a tool call, and an agent step. LangSmith in particular has a trace view that reconstructs the full execution graph of an agent run, with latencies and token counts at every node. It's the closest thing we have to proper distributed tracing for agents.
But adoption is low. Most production agent systems I see are still using print statements and hoping for the best.
This is the same infrastructure gap we've covered before in When Agents Meet Reality: The Friction Nobody Planned For. The theoretical capabilities exist, but the practical tooling lags behind by 18 months. Production engineers end up building their own solutions because the off-the-shelf options don't match their needs.
The Three Missing Metrics
Traditional software monitoring revolves around RED metrics: Rate, Errors, Duration. For agents, those metrics are necessary but insufficient. You need three more.
Tool Success Rate. An agent that calls 10 tools and gets 9 successful responses might still produce a terrible output if the failed tool was the one that mattered. LLMs are weirdly resilient to missing information: they'll hallucinate a plausible answer rather than admit ignorance. Monitoring tool-level success separately from agent-level success catches this. The ReplicatorBench paper demonstrated this vividly: while agents proved capable of designing and executing computational experiments, they consistently struggled with retrieving the resources needed for replication, such as locating new datasets online. When agents couldn't find the right data, they didn't stop. In one case, an agent hallucinated a filename ('data_clean.rds' instead of 'data_clean_5pct.rds'), leading to incorrect results despite technically successful execution.
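A minimal sketch of tracking the two levels separately. The `RunMetrics` class and its fields are my own invention for illustration, not any framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    """Track tool-level outcomes separately from the run's final verdict."""
    tool_calls: list = field(default_factory=list)  # (tool_name, ok, critical)

    def record(self, tool_name: str, ok: bool, critical: bool = False):
        self.tool_calls.append((tool_name, ok, critical))

    def tool_success_rate(self) -> float:
        if not self.tool_calls:
            return 1.0
        return sum(ok for _, ok, _ in self.tool_calls) / len(self.tool_calls)

    def critical_failure(self) -> bool:
        # A 90% success rate still hides a single failed critical call.
        return any(critical and not ok for _, ok, critical in self.tool_calls)
```

The point is the second method: a run with a high aggregate success rate can still warrant an alert if the one call that informed the final answer failed.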
Context Staleness. Agents that run for more than a few seconds accumulate stale context. A Wikipedia article retrieved 30 seconds ago is fine. A stock price retrieved 30 seconds ago is useless. I've seen support agents cite product documentation that was updated mid-conversation, producing answers that contradicted the current pricing page. The agent didn't know. There was no metric tracking how old its context was.
Decision Path Length. This is the number of steps between the user query and the agent's output. Short paths are usually good: the agent knew what to do. Long paths can indicate thrashing, indecision, or getting stuck in a loop. The Moltbook study found that AI agent activity on the platform followed heavy-tailed distributions, meaning a small fraction of agents generated a disproportionate share of content. This pattern mirrors human social media behavior and suggests that monitoring for outlier path lengths could catch agents stuck in unproductive loops. An unusually long path is a red flag that should trigger an alert.
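A crude outlier check on path lengths might look like this. The threshold and minimum baseline size are illustrative, and genuinely heavy-tailed data would likely call for a quantile-based cutoff instead of mean plus standard deviations:

```python
import statistics

def path_length_alerts(path_lengths, threshold_sigma=3.0):
    """Return indices of runs whose step count is an outlier vs the baseline."""
    if len(path_lengths) < 10:  # too little data for a baseline
        return []
    mean = statistics.fmean(path_lengths)
    stdev = statistics.pstdev(path_lengths)
    cutoff = mean + threshold_sigma * stdev
    return [i for i, n in enumerate(path_lengths) if n > cutoff]
```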
None of these metrics exist in standard observability tools. You have to build them.

The Runtime Analysis Problem
Even when you have full traces, interpreting them is hard. An agent that executes 40 steps might have 10 different valid execution paths depending on what it finds. Distinguishing "the agent explored three options before picking the best one" from "the agent got confused and retried the same thing twice" requires semantic understanding of the task.
Traditional anomaly detection doesn't help. Statistical outliers in agent traces are often just... agents doing creative things. A study by researchers at LUMS (Lahore University of Management Sciences) analyzed 2,901 AI-authored pull requests across 193 Android and iOS open-source GitHub repositories. They found significant variation in PR acceptance rates, 71% for Android vs. 63% for iOS, with structural changes like refactoring achieving notably lower success than routine tasks like bug fixes. But the key insight for observability is that PR acceptance metrics don't capture code quality or long-term reliability. The traces of accepted vs. rejected agent contributions looked statistically similar in terms of execution patterns. The agents that produced problematic code didn't take longer, use more tokens, or trigger more errors. They just made bad decisions that passed code review.
That's the runtime analysis problem. You can't detect bad decisions by monitoring execution statistics. You need to evaluate the content of the decisions, which requires either a human in the loop or another LLM acting as a judge.
Arize Phoenix attempts this with "LLM-as-a-judge" evaluation built into their traces. After each agent run, they can trigger a second LLM call that scores the agent's reasoning quality, tool selection, and output correctness. It's expensive (you're doubling your inference costs) and only as good as your judge prompt. But it's the only automated approach to runtime semantic analysis I've seen that works.
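The mechanics of a judge pass are simple enough to sketch. Here `call_llm` is a stand-in for whatever client function you actually use, and the prompt and scoring dimensions are illustrative:

```python
import json

JUDGE_PROMPT = """You are auditing an AI agent's run. Score each
dimension from 1-5 and reply as JSON:
{{"reasoning": n, "tool_selection": n, "output_correctness": n}}

Task: {task}
Trace: {trace}
Final output: {output}"""

def judge_run(call_llm, task: str, trace: str, output: str) -> dict:
    """Score one agent run with a second model acting as judge.

    `call_llm` takes a prompt string and returns the model's text reply;
    it is an assumption about your own client, not a real library call.
    """
    reply = call_llm(JUDGE_PROMPT.format(task=task, trace=trace, output=output))
    scores = json.loads(reply)
    # Flag the run for human review if any dimension scores low.
    scores["needs_review"] = min(scores.values()) <= 2
    return scores
```

The expensive part isn't the code; it's that every production run now costs a second inference, and the scores are only as trustworthy as the judge prompt.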
The alternative is sampling. You can't evaluate every agent run, so you sample aggressively and hope the sample is representative. The problem is that the interesting failures, the edge cases where agents derail in novel ways, are precisely the ones that don't show up in random samples.
The coding agent data reveals a deeper issue: semantic correctness doesn't correlate with execution patterns. A problematic agent run and a perfect agent run can have identical traces when measured by traditional metrics. The issue might be a single wrong parameter in a tool call, invisible to statistical analysis. This is why sampling fails: you're filtering for execution anomalies when the real problems are reasoning anomalies.
The Memory State Blindness
One blind spot that rarely gets discussed: agent memory state. Most production agents maintain some form of working memory: a conversation history, retrieved context, intermediate reasoning steps. This memory evolves as the agent executes. But observability tools treat memory as opaque state.
When an agent starts hallucinating facts or contradicting itself, the root cause is often corrupted memory state. Maybe it retrieved a document, extracted the wrong information, and that bad extraction poisoned every subsequent reasoning step. Traditional traces show the tool calls but not the memory corruption.
From Goldfish to Elephant: How Agent Memory Finally Got an Architecture covered the architectural patterns for agent memory, but the monitoring gap remains. You need visibility into what the agent "remembers" at each decision point, not just what tools it called. This requires logging serialized memory state at key checkpoints, which balloons storage costs and raises new privacy concerns.
The loan underwriting agents I'll discuss later solved this by logging memory diffs rather than full snapshots. At each tool call, they capture what changed in the agent's working memory. It's enough to reconstruct the reasoning chain without storing 40 full memory snapshots per agent run. Storage costs dropped 80%. But they had to build the diffing logic themselves.
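The diffing idea can be sketched in a few lines, assuming working memory is a flat dict of named slots (nested state would need a recursive diff); this is my illustration, not the fintech team's actual implementation:

```python
def memory_diff(before: dict, after: dict) -> dict:
    """Capture what changed in working memory across one tool call."""
    diff = {"added": {}, "removed": [], "changed": {}}
    for key, value in after.items():
        if key not in before:
            diff["added"][key] = value
        elif before[key] != value:
            diff["changed"][key] = {"old": before[key], "new": value}
    diff["removed"] = [k for k in before if k not in after]
    return diff
```

Logging one of these per tool call is enough to replay how a bad extraction propagated, without storing dozens of full snapshots per run.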
The Protocol Explosion and What It Means for Monitoring
Four communication protocols have emerged in the agent ecosystem: Model Context Protocol (MCP), Agent2Agent (A2A), Agora, and Agent Network Protocol (ANP). A security analysis paper developed a structured threat modeling framework comparing all four and identified twelve protocol-level risks across their creation, operation, and maintenance phases.
Each protocol has different security postures and gaps. MCP lacks built-in authentication and is vulnerable to tool poisoning and sandbox escape. A2A uses OAuth 2.0 but has token lifetime limitations and insufficiently granular scopes. Agora's reliance on natural language protocol semantics opens it to manipulation. ANP faces privilege escalation risks through poor meta-protocol negotiation.
This is what early-stage technology environments look like. Every protocol designer approaches security differently, but nobody has consensus on what "observability for agents" even means. The result is fragmentation. If you're running agents that communicate via multiple protocols, you need multiple monitoring solutions.
The security analysis found that MCP and A2A are vulnerable to context explosion and resource exhaustion attacks where one agent floods another with expensive requests. These attacks are hard to detect without protocol-level monitoring that tracks request rates per agent pair. None of the protocols ship with this monitoring. You have to build it.
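A sliding-window counter per (caller, callee) pair is enough to sketch that missing protocol-level monitoring; the window and limit values here are illustrative:

```python
import time
from collections import defaultdict, deque

class PairRateMonitor:
    """Track request rates per (caller, callee) agent pair.

    A sketch of the monitoring none of the protocols ship with;
    window and limit are illustrative defaults.
    """

    def __init__(self, window_s: float = 60.0, limit: int = 100):
        self.window_s = window_s
        self.limit = limit
        self._hits = defaultdict(deque)  # (caller, callee) -> timestamps

    def record(self, caller: str, callee: str, now=None) -> bool:
        """Record one request; return False if the pair is over budget."""
        if now is None:
            now = time.time()
        q = self._hits[(caller, callee)]
        q.append(now)
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) <= self.limit
```

A resource-exhaustion attack shows up here as one pair's rate climbing while every aggregate dashboard still looks healthy.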
The fragmentation problem compounds as these protocols gain adoption. An agent system using MCP for tool calls, A2A for inter-agent messaging, and ANP for external API access needs three separate observability stacks. Each stack has its own data format, query language, and visualization tools. Correlating events across protocols requires yet another layer of custom integration.
The lack of a minimal observability standard across these protocols is telling. Ideally, every protocol would emit structured events with trace IDs, timestamps, agent identifiers, and resource costs. Simple in concept. But as the agent protocol ecosystem matures, we still don't have it.
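What such a minimal event might look like, sketched as a dataclass; the field names are illustrative, not taken from any protocol spec:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class AgentEvent:
    """A minimal cross-protocol observability event (hypothetical schema)."""
    trace_id: str
    agent_id: str
    event_kind: str   # e.g. "tool_call", "message", "llm_call"
    ts: float
    tokens: int = 0
    cost_usd: float = 0.0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

def new_event(agent_id: str, event_kind: str, **kw) -> AgentEvent:
    """Mint an event with a fresh trace ID and the current timestamp."""
    return AgentEvent(trace_id=uuid.uuid4().hex, agent_id=agent_id,
                      event_kind=event_kind, ts=time.time(), **kw)
```

If every protocol emitted something this shaped, the cross-protocol correlation layer would be a join on `trace_id` instead of a custom integration project.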
What Production Actually Looks Like
I talked to an engineering lead at a fintech company running production agent systems for loan underwriting. Their agents analyze bank statements, tax returns, and credit reports to make lending decisions. Mistakes cost real money.
Their observability stack is a mess of duct tape. They log every LLM call to DataDog with custom tags for agent ID, task type, and decision stage. They use LangSmith for trace visualization during development but can't afford the per-trace costs in production, so they sample 2% of runs. They built a custom dashboard that aggregates tool success rates and flags anomalies. They have a weekly review meeting where a team member manually inspects 50 randomly sampled agent traces looking for weird behavior.
It works. Sort of. They catch obvious failures. They miss subtle drift. They have no way to know if an agent is slowly getting worse at edge cases because those cases don't show up in the sample.
When I asked what they'd build if they had unlimited time, the answer was immediate: "A trace replay system." They want to capture full traces in production, archive them cheaply, and then replay them through an evaluation suite later. The idea is that you can't afford to evaluate every decision in real-time, but you can afford to batch-evaluate them overnight. If the overnight evaluations find problems, you trigger a deeper investigation of recent production traffic.
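The skeleton of such a replay loop is straightforward. Everything here (`archived_traces`, `evaluate`, `mock_tools`, the 0.8 threshold) is an assumption about your own stack, not the fintech team's system:

```python
def replay_overnight(archived_traces, evaluate, mock_tools):
    """Batch-evaluate archived production traces and flag low scorers.

    `archived_traces` yields recorded runs; `evaluate` scores one
    replayed run in [0, 1]; `mock_tools` is a context-manager factory
    that substitutes recorded tool outputs so replays never hit real APIs.
    """
    flagged = []
    for trace in archived_traces:
        with mock_tools(trace["tool_outputs"]):
            score = evaluate(trace)
        if score < 0.8:  # illustrative threshold
            flagged.append((trace["run_id"], score))
    return flagged
```

The hard engineering lives inside `mock_tools`: faithfully replaying recorded tool outputs without touching production APIs is what makes batch evaluation safe.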
That system doesn't exist. They're building it themselves.
The fintech team also mentioned alert fatigue. They started with aggressive alerting on any tool failure or timeout. Within a week, they were getting 200 alerts per day, most of them false positives. An agent retrying a Wikipedia lookup because of a transient network error isn't a crisis. But their monitoring system treated it like one.
They've since tuned their alerts to only fire on critical paths, tool calls that directly inform lending decisions. But identifying critical paths required manually annotating their agent workflows. Every time they add a new feature or refactor an agent's reasoning loop, they have to update the annotations. It's maintenance overhead that scales linearly with agent complexity.

The Helicone Bet and Why It Matters
Helicone is an observability platform that started as a simple proxy for LLM API calls. You route your requests through Helicone, and it logs everything: prompts, completions, latencies, token counts. Basic stuff.
Then they added agent-specific features: tool call tracking, multi-step trace visualization, automatic cost attribution per agent run. They're betting that the future of LLM observability is agent-native. Instead of bolting agent support onto traditional APM tools, they're building from the ground up with agents as the primary abstraction.
The bet makes sense. Agents aren't going away. If anything, the trend is accelerating. The coding agent adoption study found that Android projects received twice as many AI-authored pull requests as iOS projects, with thousands of agent-generated contributions across the mobile ecosystem in just six months (May-November 2025). These agents aren't one-shot scripts. They're multi-step, tool-using systems that need production monitoring.
But Helicone still has the sampling problem. They can capture full traces, but evaluating those traces at scale requires either massive compute or aggressive sampling. Neither solution is satisfying.
The missing piece is continuous evaluation pipelines that automatically flag suspicious traces for human review. Think of it like fraud detection for agent behavior. You can't review every transaction, but you can build heuristics that catch 90% of fraud with 1% false positives. Agents need the same thing.
Helicone's approach to cost attribution is one area where agent-native design shows clear advantages. Traditional APM tools charge per request or per host. For agents, the relevant unit is the task, not the API call. An agent that completes a complex task in 10 steps is more efficient than one that takes 50 steps, even if both make the same number of LLM calls. Helicone tracks cost per completed task, which aligns billing with business value.
The Evaluation Infrastructure Nobody Built
The trace replay concept that the fintech team wants isn't just about debugging. It's about continuous regression testing for agent behavior. Software engineers take this for granted, every code change runs through a test suite before deployment. But agent developers don't have equivalent infrastructure.
An agent that passes all tests today might degrade tomorrow because the underlying LLM changed, the knowledge base updated, or the tool APIs shifted. Traditional CI/CD catches code regressions. It doesn't catch reasoning regressions.
What you need is a system that captures production traces, runs them through an evaluation harness, and detects drift. If Monday's agent runs scored 92% on your correctness metric and Friday's runs score 84%, something changed. Maybe the LLM provider pushed a new model version. Maybe your retrieval pipeline started returning stale documents. The system should flag the drift before users notice.
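A drift check that simple can be sketched directly; the tolerance is illustrative, and a production system would likely use a proper statistical test rather than a fixed threshold:

```python
import statistics

def detect_drift(baseline_scores, current_scores, min_drop=0.05):
    """Compare mean eval scores between a baseline and current window."""
    baseline = statistics.fmean(baseline_scores)
    current = statistics.fmean(current_scores)
    return {
        "baseline": baseline,
        "current": current,
        "drifted": baseline - current > min_drop,
    }
```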
Building this requires three components: cheap trace storage, fast replay execution, and reliable evaluation metrics. The storage problem is solvable with object stores like S3. The replay problem is harder, you need to mock external tool calls so replays don't spam real APIs. The evaluation problem is hardest, you need metrics that actually correlate with user satisfaction.
The ReplicatorBench researchers demonstrated that benchmark performance doesn't predict production performance. Their agents scored well on computational experiment execution (up to 93% success with GPT-5), but struggled badly at earlier stages like data retrieval (web search scores as low as 22%). An agent that aces structured computation can still fail at the messy, real-world parts of a task. This means your evaluation harness needs custom metrics derived from actual user feedback, not standardized benchmarks that only test one capability in isolation.
None of the observability vendors offer this. You're on your own.
What This Actually Changes
The observability gap isn't going to close on its own. The tools exist in pieces: LangSmith for traces, Helicone for API monitoring, Phoenix for evaluation, DataDog for infrastructure. But nobody has glued them together into a coherent system.
What changes when that happens? Three things.
First, agent reliability improves. Right now, production agent failures are often discovered by end users. That's embarrassing. With proper observability, you catch failures before they escape. The AgentCgroup paper showed a 29% reduction in high-priority P95 latency just from having visibility into which tool calls were consuming memory. Imagine what you could do with full semantic visibility into agent reasoning.
Second, iteration speed increases. The loan underwriting team mentioned earlier spends 20% of their sprint capacity debugging agent behavior from incomplete logs. With replay systems and automatic trace analysis, that drops to 5%. The time savings go straight into feature development.
Third, trust scales. Agents can't move into higher-stakes domains, healthcare, legal, financial trading, without solid monitoring. Regulators won't allow it. Customers won't accept it. The observability infrastructure is the prerequisite for agent adoption in risk-sensitive industries.
But here's the part that actually worries me: we're building production agent systems faster than we're building the monitoring for them. The Moltbook study showed that 46,000 agents can exhibit complex emergent behavior, generating 3 million comments in under two weeks, with zero human oversight. That was a controlled experiment. What happens when similar agent networks are running customer support, content moderation, or financial transactions?
We find out when the damage report lands on the desk.
Sources
Research Papers:
- Collective Behavior of AI Agents: the Case of Moltbook, Giordano De Marzo, David Garcia (2026)
- AgentCgroup: Understanding and Controlling OS Resources of AI Agents, Yusheng Zheng, Jiakun Fan, Quanzhi Fu et al. (2026)
- ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences, Bang Nguyen, Dominik Soós, Qian Ma et al. (2026)
- Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP, Zeynab Anbiaee, Mahdi Rabbani, Mansur Mirani et al. (2026)
- On the Adoption of AI Coding Agents in Open-source Android and iOS Development, Muhammad Ahmad Khan, Hasnain Ali, Muneeb Rana et al. (2026)
Related Swarm Signal Coverage: