Agent Observability Is Escaping the Dashboard

In June 2026, agent observability is moving from vendor dashboards into trace contracts (OpenAI Agents SDK tracing; OpenInference semantic conventions). OpenAI's Agents SDK records LLM generations, tool calls, handoffs, guardrails, and custom events in traces, while OpenInference defines separate span kinds for LLM, retriever, tool, agent, guardrail, evaluator, and prompt steps (OpenAI Agents SDK tracing; OpenInference semantic conventions). The practical signal is simple: builders are starting to expect the whole agent run, not only the final answer, to be inspectable.

Evidence base: official tracing documentation from OpenAI, OpenInference, OpenTelemetry, LangSmith, Braintrust, and Datadog.

Key takeaways

  • The useful unit of analysis is shifting from a model call to a run trace.
  • Schema choice matters because tracing conventions are still split across projects.
  • Evaluation should attach to traces, not sit in a separate spreadsheet.
  • Instrumentation without redaction is operational debt.

The signal

OpenAI says tracing is enabled by default in its Agents SDK and captures a record of events during an agent run, including model generations, tool calls, handoffs, guardrails, and custom events (OpenAI Agents SDK tracing). OpenInference's specification breaks the trace into operation types such as LLM, embedding, chain, retriever, reranker, tool, agent, guardrail, evaluator, and prompt spans (OpenInference semantic conventions).
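To make the idea of typed spans concrete, here is a minimal sketch of a run modelled as a tree of spans, each tagged with one of the operation types listed above. It uses only the standard library; the `Span` class, its field names, and the example span names are assumptions for illustration, not the OpenInference or OpenAI data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    # Operation types named in the paragraph above.
    LLM = "LLM"
    EMBEDDING = "EMBEDDING"
    CHAIN = "CHAIN"
    RETRIEVER = "RETRIEVER"
    RERANKER = "RERANKER"
    TOOL = "TOOL"
    AGENT = "AGENT"
    GUARDRAIL = "GUARDRAIL"
    EVALUATOR = "EVALUATOR"
    PROMPT = "PROMPT"


@dataclass
class Span:
    # Hypothetical, simplified span record.
    span_id: str
    kind: SpanKind
    name: str
    parent_id: Optional[str] = None  # None marks the root of the run
    attributes: dict = field(default_factory=dict)


def children_of(spans, parent_id):
    """Return the direct child spans of a parent, in recorded order."""
    return [s for s in spans if s.parent_id == parent_id]


run = [
    Span("s1", SpanKind.AGENT, "support_agent"),
    Span("s2", SpanKind.RETRIEVER, "search_docs", parent_id="s1"),
    Span("s3", SpanKind.LLM, "draft_answer", parent_id="s1"),
    Span("s4", SpanKind.TOOL, "create_ticket", parent_id="s1"),
]

step_kinds = [s.kind.value for s in children_of(run, "s1")]
print(step_kinds)
```

The point of the typed `kind` field is that a reader, or a query, can ask "what did the retriever do in this run?" without parsing free-form log lines.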

OpenTelemetry's GenAI agent conventions are marked Development, but the draft already names create-agent, invoke-agent, workflow, plan, and execute-tool spans (OpenTelemetry GenAI agent spans). That matters because the agent stack is becoming too stateful for a single "request succeeded" metric to explain much.

Evidence

LangSmith frames observability as visibility from individual traces to production-wide performance metrics, with integrations across OpenAI, Anthropic, CrewAI, Vercel AI SDK, Pydantic AI, and other stacks (LangSmith Observability). Braintrust's OpenAI Agents SDK integration says it captures root task spans, child spans for tool calls, guardrails, handoffs, nested model work, inputs and outputs, token metrics when exposed, and parent-child relationships (Braintrust OpenAI Agents SDK).

The strongest industry signal is not another dashboard. Datadog says OpenTelemetry GenAI conventions give teams one schema for prompts, model responses, token usage, tool and agent calls, and provider metadata, and let them route those spans through an existing OpenTelemetry Collector path (Datadog OTel GenAI support). In plain terms: agent telemetry is becoming something platform teams can govern before it leaves the network.

Why it matters

Agent failures usually hide in sequence, not output. The final answer can look fine while the retriever selected the wrong document, the tool call sent the wrong argument, or a handoff lost policy context.
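A small sketch shows what "hiding in sequence" looks like in practice. It walks the recorded steps of one run in order and reports the first step that fails a per-kind check, covering the three failure modes named above. The step dictionaries, field names, and the 0.5 score threshold are hypothetical; a real trace would carry whatever attributes your instrumentation records.

```python
def check_retriever(step):
    # Flag a retrieval whose best match scored below a (hypothetical) threshold.
    return step["top_score"] >= 0.5


def check_tool(step):
    # Flag a tool call that is missing any argument the tool declares as required.
    return set(step["required_args"]) <= set(step["arguments"])


def check_handoff(step):
    # Flag a handoff that did not carry policy context to the next agent.
    return "policy" in step["context_keys"]


CHECKS = {"retriever": check_retriever, "tool": check_tool, "handoff": check_handoff}


def first_suspect(steps):
    """Return (index, name) of the first step failing its check, or None."""
    for index, step in enumerate(steps):
        check = CHECKS.get(step["kind"])
        if check is not None and not check(step):
            return index, step["name"]
    return None


run = [
    {"kind": "retriever", "name": "search_policies", "top_score": 0.82},
    {"kind": "handoff", "name": "to_billing_agent", "context_keys": ["user_id"]},
    {"kind": "tool", "name": "issue_refund",
     "required_args": ["order_id", "amount"],
     "arguments": {"order_id": "A1", "amount": 20}},
    {"kind": "llm", "name": "final_answer", "output": "Your refund has been issued."},
]

suspect = first_suspect(run)
print(suspect)
```

Here the final answer reads as a success, yet the walk points at the handoff, which dropped the policy context before the refund tool ran.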

That is why this belongs next to Swarm Signal's work on agent evals that catch production failures and agent memory architecture. Evals tell you whether the behaviour was acceptable. Traces tell you where the behaviour came from.
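One way to keep the verdict and its cause together is to record each evaluation as a span on the same trace as the steps it judged. The sketch below assumes a hypothetical dictionary-based span format and helper names (`make_span`, `attach_evaluation`, `steps_behind_failures`); it is an illustration of the linkage, not any vendor's API.

```python
from itertools import count

_ids = count(1)


def make_span(trace_id, kind, name, parent_id=None, **attributes):
    # Hypothetical span record; span IDs are sequential for readability.
    return {
        "trace_id": trace_id,
        "span_id": f"span-{next(_ids)}",
        "parent_id": parent_id,
        "kind": kind,
        "name": name,
        "attributes": attributes,
    }


def attach_evaluation(spans, root, evaluator_name, score, passed):
    """Record an evaluator verdict as a child span of the run's root span."""
    verdict = make_span(root["trace_id"], "evaluator", evaluator_name,
                        parent_id=root["span_id"], score=score, passed=passed)
    spans.append(verdict)
    return verdict


def steps_behind_failures(spans):
    """List the non-evaluator steps of every run that has a failed evaluation."""
    failed = {s["trace_id"] for s in spans
              if s["kind"] == "evaluator" and not s["attributes"]["passed"]}
    return [s["name"] for s in spans
            if s["trace_id"] in failed and s["kind"] != "evaluator"]


root = make_span("run-42", "agent", "support_agent")
spans = [
    root,
    make_span("run-42", "retriever", "search_docs", parent_id=root["span_id"]),
    make_span("run-42", "tool", "create_ticket", parent_id=root["span_id"]),
]
attach_evaluation(spans, root, "groundedness", score=0.3, passed=False)

names = steps_behind_failures(spans)
print(names)
```

Because the evaluator span shares the run's trace ID, a failed score leads straight to the steps that produced it instead of to a row in a separate spreadsheet.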

The caveat is privacy. The OpenTelemetry GenAI registry warns that captured message and tool-call attributes can contain sensitive or personal data, and says content capture should be gated by explicit opt-in (OpenTelemetry GenAI attributes). A trace that records everything without filtering is not observability. It is a liability log.
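A minimal sketch of opt-in content capture follows. Content-bearing attributes are replaced with a marker unless the caller explicitly opts in, and even then a simple pattern masks email addresses. The attribute names in `CONTENT_KEYS`, the `capture_content` flag, and the single email regex are assumptions for illustration; a production redaction policy would need to cover far more than email addresses.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical names for attributes that can carry user or document content.
CONTENT_KEYS = {"prompt", "completion", "tool_arguments", "retrieved_text"}


def sanitize(attributes, capture_content=False):
    """Return a copy of span attributes that is safe to export."""
    clean = {}
    for key, value in attributes.items():
        if key not in CONTENT_KEYS:
            clean[key] = value  # metadata such as model name or token counts
        elif capture_content:
            clean[key] = EMAIL.sub("[EMAIL]", str(value))  # opted in: mask emails
        else:
            clean[key] = "[REDACTED]"  # default: content never leaves the process
    return clean


attrs = {
    "model": "example-model",
    "input_tokens": 12,
    "prompt": "Email jane@example.com about the refund",
}

default_export = sanitize(attrs)
opted_in_export = sanitize(attrs, capture_content=True)
print(default_export["prompt"])
print(opted_in_export["prompt"])
```

Making redaction the default and capture the exception means a forgotten configuration flag fails safe.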

What changes

Do not ask only whether the model answered correctly. Ask whether the run trace explains why each step happened, which inputs each step saw, which tool call changed state, which guardrail fired, and which evaluator judged the result.

For builders working through deployment patterns, the practical move is to instrument the agent boundary before optimising prompts. Start with run IDs, model spans, tool spans, retrieval spans, guardrail spans, evaluator spans, token counts, and redaction policy. Add dashboards later.
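Instrumenting the agent boundary can start very small. The sketch below uses only the standard library to record a run ID, a span kind, a duration, a status, and token counts for each step. The `RunTrace` class and `fake_model` function are hypothetical stand-ins; in a real system the same shape would be emitted through an OpenTelemetry or OpenInference-compatible tracer.

```python
import time
import uuid
from contextlib import contextmanager


class RunTrace:
    """Collects one record per step of a single agent run."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, kind, name, **attributes):
        record = {"run_id": self.run_id, "kind": kind, "name": name,
                  "attributes": dict(attributes), "status": "ok"}
        started = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so failed steps are still recorded.
            record["duration_s"] = time.perf_counter() - started
            self.spans.append(record)


def fake_model(prompt):
    # Stand-in for a real model client; returns text plus token usage.
    return "ok", {"input_tokens": len(prompt.split()), "output_tokens": 1}


trace = RunTrace()
with trace.span("model", "draft_answer") as model_span:
    text, usage = fake_model("summarise the open ticket")
    model_span["attributes"].update(usage)
with trace.span("tool", "create_ticket", ticket_type="bug"):
    pass  # the real tool call would go here

kinds = [s["kind"] for s in trace.spans]
print(kinds)
```

Every record carries the same `run_id`, which is what lets a later query reassemble the run without a dashboard.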

Operator takeaway

If you are building this system, do this:

  • One practical action: choose an OpenTelemetry or OpenInference-compatible trace shape before adding a proprietary dashboard.
  • One thing to measure: trace completeness across model, retriever, tool, guardrail, and evaluator steps.
  • One thing to avoid: storing raw prompts, retrieved documents, or tool arguments without an explicit redaction rule.
  • One decision gate: no agent reaches production until a failed run can be reconstructed from trace data.
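The measurement and the decision gate above can be sketched as one small check. Completeness here is the share of expected step kinds that appear at least once in a run's trace. The `REQUIRED_KINDS` tuple, the function names, and the default threshold of 1.0 are assumptions; the expected kinds should be set per agent, since not every agent uses a retriever or a guardrail.

```python
REQUIRED_KINDS = ("model", "retriever", "tool", "guardrail", "evaluator")


def completeness(span_kinds, required=REQUIRED_KINDS):
    """Return (score, missing) where score is the fraction of required kinds present."""
    present = set(span_kinds)
    missing = [kind for kind in required if kind not in present]
    score = (len(required) - len(missing)) / len(required)
    return score, missing


def ready_for_production(failed_run_kinds, threshold=1.0):
    """Gate: a failed run must be reconstructable from its trace."""
    score, _ = completeness(failed_run_kinds)
    return score >= threshold


score, missing = completeness(["model", "tool", "model", "guardrail"])
print(score, missing)
print(ready_for_production(["model", "tool", "model", "guardrail"]))
```

In this example the run has no retriever or evaluator spans, so the gate stays closed until those steps are instrumented or removed from the expected set.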

Related: Agent State Migration and Rollback: The Missing Reliability Layer.

Source trail

Primary specifications and docs

Industry implementation signals

Related Swarm Signal analysis