You shipped an AI agent. It passed your evals. Then it spent $400 in one afternoon because a retry loop went infinite, and you didn't find out until a user filed a support ticket. The agent wasn't broken. Your visibility into it was.
Agent observability isn't application performance monitoring with an LLM label slapped on top. Traditional APM tracks request-response cycles. Agent observability tracks decisions: which tools got called, what the model reasoned before calling them, how much each chain of thought cost, and whether the output actually helped the user. The tooling has matured fast through early 2026, but the choices have gotten harder. Eight platforms now compete in this space, and they all define "observability" differently.
We tested all eight in production agent workflows during Q1 2026. Here's what we found.
How We Ranked
Every tool was evaluated on five criteria, weighted by what matters most in production:
- Trace depth (30%): Can you inspect every LLM call, tool invocation, and agent handoff in a multi-step workflow? Nested spans and parent-child relationships matter more than flat log lines.
- Evaluation integration (25%): Does the platform let you score outputs against ground truth, run regression tests, and catch quality degradation before users do?
- Cost tracking (20%): Does it automatically calculate token costs across providers? Can you set budget alerts and attribute spend to specific features or users?
- Setup friction (15%): How many lines of code to get traces flowing? Does it require a proxy, an SDK, or just an environment variable?
- Production scalability (10%): Can it handle millions of spans without melting your wallet or your dashboard?
Tools that score well on trace depth but lack evals lose points. Tools with great evals but no cost tracking lose points. No tool is perfect across all five.
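To make the weighting concrete, here's a minimal sketch of how criterion scores combine into a composite. The per-criterion scores in the example are hypothetical placeholders, not our actual scoring data:

```python
# Weighted scoring used in this roundup: each criterion score (0-10)
# is multiplied by its weight, then summed into a 0-10 composite.
WEIGHTS = {
    "trace_depth": 0.30,
    "evaluation_integration": 0.25,
    "cost_tracking": 0.20,
    "setup_friction": 0.15,
    "production_scalability": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Return the weighted composite of per-criterion scores (0-10 scale)."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

# Hypothetical profile: strong tracing, weak cost tracking.
example = {
    "trace_depth": 9,
    "evaluation_integration": 8,
    "cost_tracking": 4,
    "setup_friction": 7,
    "production_scalability": 6,
}
print(composite_score(example))  # 7.15
```

The weights sum to 1.0, so a tool that maxes every criterion scores a clean 10 and any single weak criterion drags the composite down in proportion to its weight.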
At a Glance
| Tool | Type | Open Source | Free Tier | Standout Feature | Best For |
|---|---|---|---|---|---|
| LangSmith | Full platform | No (SDK is OSS) | 5K traces/mo | Deepest LangChain integration | LangChain/LangGraph teams |
| Langfuse | Full platform | Yes (MIT) | 50K observations/mo | Self-host with full features | Privacy-conscious teams |
| Arize Phoenix | Full platform | Yes (Elastic v2) | Unlimited (self-hosted) | OTel-native traces + evals | Framework-agnostic shops |
| Braintrust | Evals + logging | Partial (SDKs) | 1M spans + 10K scores | AI-powered eval generation | Eval-first workflows |
| Helicone | Gateway + analytics | Yes (Apache 2.0) | 10K requests/mo | One-line proxy integration | Cost-focused teams |
| W&B Weave | Full platform | Yes (Apache 2.0) | Free credits included | ML experiment lineage | Teams already on W&B |
| Datadog LLM Obs | Enterprise APM add-on | No | 14-day trial | Unified infra + LLM monitoring | Datadog-native enterprises |
| OpenTelemetry | Standard + DIY | Yes (Apache 2.0) | Free (standard only) | Zero vendor lock-in | Teams with custom backends |
Every tool handles basic LLM call logging. The differences show up in three areas: how they handle multi-agent traces, whether evals are built-in or bolted on, and what happens to your bill at 10 million spans per month.
LangSmith

LangSmith is LangChain's first-party observability platform, and it shows. If you're building with LangChain or LangGraph, tracing works out of the box with zero configuration. Set an environment variable, and every chain, tool call, and retrieval step appears as a nested span in the dashboard. No decorators. No SDK calls. Just traces.
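The zero-config setup looks roughly like this. The variable names follow recent LangSmith docs (older SDK versions used the LANGCHAIN_* prefix, e.g. LANGCHAIN_TRACING_V2), so verify against your SDK version:

```python
import os

# LangSmith tracing is enabled purely through environment variables --
# no decorators or SDK calls needed when using LangChain/LangGraph.
# Names assume a recent SDK; older versions used the LANGCHAIN_* prefix.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"
os.environ["LANGSMITH_PROJECT"] = "my-agent"  # optional: group traces by project

# From here, any LangChain/LangGraph invocation -- e.g.
# chain.invoke({"input": "..."}) -- appears as a nested span tree
# in the LangSmith dashboard.
print(os.environ["LANGSMITH_TRACING"])  # true
```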
The evaluation suite is where LangSmith earns its keep. You can define custom evaluators, run them against datasets, compare prompt versions side-by-side, and track quality metrics over time. The "Online Evaluation" feature scores production traces in real time, catching regressions before they become user-visible. The annotation queue lets non-technical reviewers label outputs for quality, feeding human judgment back into your eval pipeline.
The catch is ecosystem coupling. LangSmith works with non-LangChain code through its SDK, but the experience is notably rougher. Auto-instrumentation doesn't exist for vanilla OpenAI or Anthropic calls. You'll be wrapping functions manually. And at $39/month per seat on the Plus plan (with 10,000 base traces included), costs scale with team size fast.
Pricing: Free Developer tier (5,000 traces/month, 1 seat, 14-day retention). Plus at $39/seat/month (10K traces included). Enterprise custom. Self-hosted and BYOC options available.
GitHub: SDK at langchain-ai/langsmith-sdk (~800 stars). Platform is closed-source.
Langfuse
Langfuse has become the default open-source alternative to LangSmith, and the 23,000+ GitHub stars reflect real adoption, not hype. Khan Academy, Twilio, and Merck all run Langfuse in production. The reason is straightforward: you get a full-featured observability platform that you can self-host for free, with no feature gates on the open-source version.
The tracing model is framework-agnostic by design. Langfuse integrates with LangChain, LlamaIndex, OpenAI, Anthropic, and any custom code through Python and TypeScript SDKs. The recent OpenTelemetry-native integration means you can pipe OTel spans directly into Langfuse without a proprietary SDK. For teams that refuse to bet on a single vendor, this flexibility matters.
Prompt management is a feature most observability tools skip. Langfuse lets you version, deploy, and A/B test prompts directly from the platform, then correlate prompt versions with trace quality. Combined with the evaluation framework (custom scorers, LLM-as-judge, and human annotation), you get a tight loop between "what changed" and "did it help."
Where Langfuse falls short is real-time alerting. You can build dashboards and track trends, but there's no built-in mechanism to page your on-call engineer when latency spikes or costs exceed a threshold. You'll need to export metrics to a dedicated monitoring stack for that.
Pricing: Free Hobby (50K observations/month, 2 users). Core $29/month. Pro $199/month (SOC 2, HIPAA). Enterprise $2,499/month. Self-hosting always free.
GitHub: langfuse/langfuse (23,000+ stars).
Arize Phoenix
Arize Phoenix is the most technically ambitious open-source option in this list. Built on OpenTelemetry from the ground up, Phoenix treats AI observability as a proper telemetry problem rather than a logging problem. The distinction matters when you're debugging a multi-agent system where one agent calls another agent that calls three tools before returning a hallucinated answer.
Phoenix's OpenInference instrumentation library provides auto-tracing for the major frameworks: OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI, LlamaIndex, Vercel AI SDK, and DSPy. The traces are OTel-native, which means you can export them to any OTel-compatible backend (Jaeger, Grafana Tempo, Datadog) while still using Phoenix's purpose-built UI for the AI-specific analysis.
The evaluation engine runs locally. You can compute retrieval metrics (NDCG, precision, MRR), hallucination detection, toxicity scoring, and custom LLM-as-judge evaluations without sending data to a third-party service. For teams handling sensitive data in healthcare, finance, or government contracts, this local-first approach eliminates a compliance conversation entirely.
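To make "retrieval metrics" concrete, here's what NDCG measures, as a stdlib-only sketch. This is an illustration of the metric itself, not Phoenix's implementation:

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the ideal (sorted-descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A retriever returned 4 chunks with graded relevance per rank position.
# The most relevant chunk (grade 3) was ranked second, so NDCG < 1.
print(round(ndcg([2, 3, 0, 1]), 3))  # 0.908
```

A score of 1.0 means the retriever ranked chunks in perfect relevance order; anything lower quantifies how far the actual ordering fell short.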
With 8,900+ GitHub stars and over 2.5 million monthly downloads, Phoenix has serious traction. The trade-off is operational complexity. Phoenix is a Python/FastAPI backend with a React frontend, and running it in production means managing another service. Teams unfamiliar with OpenTelemetry concepts (collectors, exporters, semantic conventions) face a steeper learning curve than LangSmith's "set one env var" experience.
Pricing: Phoenix is free and open-source. Arize cloud platform has custom enterprise pricing.
GitHub: Arize-ai/phoenix (8,900+ stars).
Braintrust

Braintrust approaches observability from the evaluation side and works outward. While other tools start with tracing and bolt on evals, Braintrust starts with scoring and makes logging a natural byproduct. This inversion changes how you think about monitoring: instead of asking "what happened," you ask "was the output good?"
The standout feature is Loop, an AI-powered system that generates scorers, datasets, and prompt revisions from production traces using natural language instructions. Describe what "good" looks like in plain English, and Loop creates an evaluation pipeline. This dramatically lowers the barrier to setting up evals, which is the step most teams skip because defining custom scoring functions feels like a research project.
The AI Proxy is a bonus: route your LLM calls through Braintrust's proxy to get automatic logging, caching, and unified model access. The proxy is free to use even without a Braintrust account. Braintrust also claims its purpose-built database, Brainstore, delivers up to 80x faster query performance than traditional databases for trace analytics.
The free tier is generous: 1 million spans and 10,000 scores with unlimited users. The Pro plan at $249/month unlocks unlimited everything. SOC 2 Type II certification, GDPR compliance, and HIPAA options make it enterprise-ready.
The limitation is ecosystem breadth. Braintrust's integrations are narrower than Langfuse or Phoenix. If you're using a less common framework, you may need to instrument manually through the SDK rather than getting auto-tracing.
Pricing: Free (1M spans, 10K scores, unlimited users). Pro $249/month (unlimited). Enterprise custom.
GitHub: SDKs at braintrustdata (~100 stars across repos). Platform is closed-source.
Helicone
Helicone took a different architectural bet: instead of an SDK you install, it's a proxy you point your API calls through. Change your base URL from api.openai.com to oai.helicone.ai, and every request gets logged with latency, cost, token counts, and response content. One line of code. No decorators, no wrappers, no SDK initialization.
This proxy-first approach makes Helicone the fastest tool to integrate in this list. It also makes it the most natural fit for cost monitoring. The dashboard shows real-time spend across providers, broken down by model, user, and feature. You can set rate limits, budget caps, and alerts directly on the proxy. For teams whose primary concern is "how much are we spending and where," Helicone answers that question before you finish setting it up.
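The base-URL swap can be seen in a dependency-free sketch using only the standard library (the request is constructed, not sent). The header names follow Helicone's docs as of this writing, including the optional session header it uses for correlating multi-step traces; treat them as assumptions to verify:

```python
import urllib.request

# Helicone's proxy integration is just a base-URL swap: the same
# OpenAI-shaped request goes to oai.helicone.ai instead of api.openai.com.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"
HELICONE_URL = OPENAI_URL.replace("api.openai.com", "oai.helicone.ai")

req = urllib.request.Request(
    HELICONE_URL,
    headers={
        "Authorization": "Bearer <openai-key>",    # still your provider key
        "Helicone-Auth": "Bearer <helicone-key>",  # Helicone's own API key
        "Helicone-Session-Id": "session-123",      # optional: correlate multi-step traces
    },
    method="POST",
)
print(req.full_url)  # https://oai.helicone.ai/v1/chat/completions
```

Because the provider key still travels in the Authorization header, Helicone observes and logs the request without ever holding your OpenAI credentials itself.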
The architecture runs on Cloudflare Workers at the edge, adding minimal latency (typically under 50ms). Self-hosting is fully supported. With 4,800+ GitHub stars and a YC W23 pedigree, Helicone is well-funded and actively developed.
Where Helicone thins out is deep agent tracing. The proxy captures LLM calls, but it doesn't automatically reconstruct the parent-child relationships in a multi-step agent workflow. If Agent A calls Agent B which calls Tool C, Helicone logs three separate requests. Correlating them into a single trace requires manual effort with custom headers. For teams building complex multi-agent systems, this gap matters. For teams running single-agent workflows with cost anxiety, Helicone is probably all you need.
Pricing: Free (10K requests/month). Growth from ~$20/seat/month. Enterprise custom.
GitHub: Helicone/helicone (4,800+ stars).
Weights & Biases Weave

W&B Weave extends the Weights & Biases experiment-tracking philosophy into the agent era. If your team already uses W&B for model training, Weave slots in without introducing a new vendor, a new dashboard, or a new billing relationship.
The tracing model uses a Python decorator (@weave.op) that wraps any function, automatically capturing inputs, outputs, latency, token costs, and evaluation metrics. It works with OpenAI, Anthropic, Google, Hugging Face, and custom models. The decorator approach is lightweight enough that you can instrument existing code incrementally rather than rewriting it.
Where Weave differs from the competition is lineage. Because it sits within the W&B ecosystem, you can trace a production failure back through the model version, the training dataset, the hyperparameters, and the evaluation run that green-lit deployment. No other tool in this list connects agent observability to ML experiment history. For teams that fine-tune models and deploy agents from those models, this lineage is invaluable for root-cause analysis.
The weakness is community size. At roughly 1,100 GitHub stars, Weave's open-source community is significantly smaller than Langfuse or Phoenix. If you're not already a W&B user, the onboarding cost includes learning the entire platform, not just Weave.
Pricing: Free tier with included credits. Usage-based billing for data ingestion and storage. Academic license includes all Pro features free.
GitHub: wandb/weave (~1,100 stars).
Datadog LLM Observability
Datadog LLM Observability is the enterprise play. If your organization already monitors infrastructure, APM, logs, and security through Datadog, adding LLM observability means your agent traces live alongside your server metrics, error rates, and deployment events. For platform teams managing AI features within larger applications, this unified view eliminates the "two dashboards" problem.
The AI Agent Monitoring feature, launched in early 2026, automatically maps each agent's decision path (inputs, tool calls, sub-agent invocations, outputs) in an interactive graph. The LLM Experiments console lets you run evaluations and compare model versions. Auto-instrumentation covers LangChain, OpenAI, Anthropic, and AWS Bedrock through the Python SDK, with native support for OpenTelemetry GenAI semantic conventions (v1.37+).
Datadog also offers security evaluations (prompt injection detection, PII scanning) that most observability-only tools don't touch.
The cost is the elephant in the room. Datadog bills based on LLM span count, and third-party reports cite a ~$120/day base premium that activates when LLM spans are detected, translating to roughly $3,600/month before span-based charges. For enterprises already spending six figures on Datadog, this is a rounding error. For startups, it's a non-starter.
Pricing: Span-based billing plus base premium (~$3,600/month estimated). 14-day free trial. Enterprise pricing on request.
GitHub: Python SDK is open-source. Platform is closed-source.
OpenTelemetry (DIY Approach)
OpenTelemetry isn't a product. It's a standard. And as of early 2026, its GenAI semantic conventions have matured enough that building your own agent observability stack on OTel is a legitimate option, not just a science project.
The semantic conventions define standardized attributes for tracing LLM calls (model, provider, token counts, latency), tool invocations, agent handoffs, and multi-agent coordination. Draft conventions for agentic systems cover tasks, actions, teams, artifacts, and memory. These aren't theoretical. Datadog, Arize Phoenix, and Langfuse already consume OTel GenAI spans natively.
The DIY approach works like this: instrument your agent code with OTel SDKs (Python and JavaScript are best supported), configure an OTel Collector to receive spans, and export to your backend of choice. Jaeger for trace visualization. Prometheus for metrics. Grafana for dashboards. ClickHouse for long-term storage and analytics. All open-source. All free.
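Here's a sense of what those conventions standardize, as a plain-Python sketch of the attributes an instrumented LLM span would carry. The attribute names reflect the GenAI semantic conventions at the time of writing; some (especially the agent-specific ones) are still drafts, so check the current spec before hard-coding them:

```python
# Span attributes an OTel-instrumented LLM call would carry, per the
# GenAI semantic conventions. Names are assumptions based on the spec
# at time of writing -- verify against the current conventions.
llm_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "openai",          # earlier drafts used gen_ai.system
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.usage.input_tokens": 512,
    "gen_ai.usage.output_tokens": 128,
}

# Any OTel-compatible backend (Jaeger, Tempo, Phoenix, Langfuse) can
# aggregate cost and latency from spans shaped like this.
total_tokens = (llm_span_attributes["gen_ai.usage.input_tokens"]
                + llm_span_attributes["gen_ai.usage.output_tokens"])
print(total_tokens)  # 640
```

Because every vendor in this list that speaks OTel reads the same attribute names, instrumenting once against these conventions is what makes the multi-backend export pattern work.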
The advantage is total control and zero vendor lock-in. You own every span, every metric, every byte of data. Because OTel is a CNCF graduated project with backing from Google, Microsoft, and every major cloud provider, the standard isn't going away.
The disadvantage is everything else. You're building and maintaining an observability platform instead of using one. There's no built-in eval framework, no prompt management, no cost tracking. For a team with dedicated platform engineers and strong opinions about infrastructure, OTel is the foundation. For everyone else, it's the layer underneath a managed tool like Phoenix or Langfuse.
Pricing: Free (open standard). Infrastructure costs for self-hosted backends vary.
GitHub: open-telemetry (main spec repo: 4,600+ stars).
Decision Matrix

Choosing the right tool depends on three questions: where you are in your stack, what you value most, and how much you want to manage yourself.
You're building with LangChain or LangGraph:
Start with LangSmith. The auto-instrumentation is unmatched for that ecosystem. If vendor lock-in concerns you, run Langfuse in parallel as a backup.
You need open-source and self-hosting:
Langfuse for the broadest integration coverage and easiest self-hosting. Arize Phoenix if you want OTel-native traces and local evaluation without cloud dependencies.
Your primary concern is cost control:
Helicone. The proxy approach captures spend data with the least integration effort. Add a dedicated observability tool later when you need deeper trace analysis.
You want evaluation-first observability:
Braintrust. Loop's automated eval generation reduces the barrier that stops most teams from setting up quality monitoring at all.
You're already on W&B for ML experiments:
Weave. The model-to-production lineage is a capability no other tool offers.
You're an enterprise on Datadog:
Datadog LLM Observability. The unified infrastructure view is worth the premium if you're already paying for it.
You want zero vendor lock-in and have platform engineers:
Build on OpenTelemetry with GenAI semantic conventions. Use Phoenix or Langfuse as a frontend if you want to avoid building UI from scratch.
Most production teams will end up using two tools: one for deep tracing and evals (LangSmith, Langfuse, or Phoenix), and one for cost monitoring (Helicone or native provider dashboards). That's not a failure of the market. It's a reflection of the fact that agent observability requires both debugging depth and financial accountability, and no single tool dominates both.
FAQ
What's the difference between LLM observability and traditional APM?
Traditional APM tracks request-response cycles: latency, error rates, throughput. LLM observability tracks reasoning chains: which tools were called, what the model decided, how much each decision cost, and whether the output was correct. An LLM call returning a 200 status code can still be a failure if it hallucinated. APM can't catch that. Agent observability tools with evaluation integration can.
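The distinction reduces to a toy check. The faithfulness score here stands in for whatever evaluator your platform provides (LLM-as-judge, ground-truth comparison, etc.); both the function and the threshold are hypothetical placeholders:

```python
# APM sees only the transport layer; agent observability also scores content.
def apm_healthy(status_code: int) -> bool:
    """What traditional APM checks: did the request succeed?"""
    return 200 <= status_code < 300

def agent_healthy(status_code: int, faithfulness_score: float,
                  threshold: float = 0.7) -> bool:
    """Agent observability: succeeded AND the output passed evaluation.
    `faithfulness_score` is a placeholder for any real scorer."""
    return apm_healthy(status_code) and faithfulness_score >= threshold

# A hallucinated-but-delivered response: APM says fine, evals say failure.
print(apm_healthy(200))          # True
print(agent_healthy(200, 0.25))  # False
```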
Can I use multiple observability tools simultaneously?
Yes, and many teams do. OpenTelemetry makes this practical because you can instrument once and export to multiple backends. A common pattern is Langfuse or Phoenix for deep debugging and testing plus Helicone for cost dashboards. The overhead is minimal since most tools ingest traces asynchronously.
How many traces per month does a typical production agent generate?
A single user interaction with a multi-step agent might generate 5-20 LLM spans (one per model call, tool invocation, or retrieval). At 1,000 daily active users with 3 interactions each, that's 15,000-60,000 spans per day, or roughly 450,000-1.8 million per month. Free tiers cover development. Production almost always requires a paid plan or self-hosting.
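The arithmetic behind those estimates, assuming a 30-day month:

```python
# Back-of-envelope span volume for a production agent.
dau = 1_000                      # daily active users
interactions_per_user = 3
spans_low, spans_high = 5, 20    # LLM/tool/retrieval spans per interaction

daily_low = dau * interactions_per_user * spans_low     # 15,000
daily_high = dau * interactions_per_user * spans_high   # 60,000
monthly_low, monthly_high = daily_low * 30, daily_high * 30

print(f"{daily_low:,}-{daily_high:,} spans/day")        # 15,000-60,000 spans/day
print(f"{monthly_low:,}-{monthly_high:,} spans/month")  # 450,000-1,800,000 spans/month
```

Plug in your own user counts before picking a tier: at these volumes even the high end of the monthly range blows past every free tier in the comparison table except self-hosted options.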
Are the OpenTelemetry GenAI semantic conventions stable enough for production?
The core attributes (model name, provider, token counts, latency, response content) are stable and widely supported as of early 2026. Agent-specific conventions (tasks, handoffs, memory) are in draft but usable. Building on OTel now gives you a foundation that grows with the standard. The risk isn't instability. It's incomplete coverage for newer patterns like multi-agent delegation chains.
Sources
- LangSmith Observability Platform
- LangSmith Pricing
- Langfuse GitHub Repository
- Langfuse Cloud Pricing
- Arize Phoenix GitHub Repository
- Arize Phoenix Documentation
- Braintrust Platform
- Braintrust Pricing
- Helicone GitHub Repository
- Helicone Pricing
- W&B Weave GitHub Repository
- W&B Weave Documentation
- Datadog LLM Observability
- Datadog AI Agent Monitoring Announcement
- OpenTelemetry GenAI Semantic Conventions
- OpenTelemetry AI Agent Observability Blog
- OpenTelemetry Agentic Systems Conventions Proposal