Why Your Monitoring Is Blind to What Your AI Agents Actually Do

May 7, 2026

I shipped my first AI agent to production in late 2025. It was a support triage system that read incoming tickets, classified them, pulled context from our knowledge base, and routed them to the right team. We had Datadog dashboards, structured logs, health checks, the whole setup. I felt good about it.

Then one Monday morning a customer escalated because their urgent billing issue had been sitting in the "general feedback" queue for three days. I went to check the logs. The API returned 200. Latency was normal. No errors anywhere. According to every metric we tracked, the agent was healthy.

It wasn't. The classification model had started hallucinating a nonexistent category, and the downstream routing logic silently dropped anything it couldn't match. We had zero visibility into what was happening between the request arriving and the response leaving. That was the week I started rethinking everything about how we monitor AI agents.

The Gap Between "200 OK" and "Actually Working"

Traditional application performance monitoring was built for a world where a request comes in, your code does some deterministic work, and a response goes out. You measure latency, error rates, throughput. If the response code is 200 and the p99 is under 500ms, life is good.

AI agents break this model completely. A single user request might trigger three model invocations, two tool calls, a retrieval step, and a final synthesis pass. The agent might retry a failed tool call, switch models mid-chain, or decide to skip a step entirely based on intermediate reasoning. All of that happens inside what your APM sees as one successful POST request.

The metrics that matter for agents are fundamentally different:

Did the agent pick the right tool? Not just "did the tool execute," but was it the correct choice given the input.
How many model calls did it take? A task that should need one call but consistently takes four is a sign of poor prompt design or model confusion.
What did each intermediate step actually produce? The final output might look fine while an intermediate step generated garbage that got corrected downstream (burning tokens and time in the process).
How much did this single request cost? One runaway agent loop can blow through your monthly budget in an afternoon.

Your existing Grafana board won't tell you any of this. I learned that the hard way.

Why Structured Tracing Changes Everything

After the routing incident, my first instinct was to add more logging. I sprinkled console.log statements through the agent pipeline. Within a week I had thousands of log lines per request and still couldn't answer basic questions like "why did the agent choose tool X over tool Y on this specific request?"

Logs are great for counting things. They're terrible for understanding decisions. An agent's behavior is a tree of decisions, and you need something that preserves the relationships between those decisions. That something is distributed tracing.

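Specifically, I switched to OpenTelemetry with the gen_ai semantic conventions. These conventions define a standard vocabulary for AI-specific spans and attributes: operation names such as chat for model calls, execute_tool for tool invocations, and invoke_agent for agent-level orchestration, plus gen_ai.* attributes for the provider, model, and token usage. The span names I use in this post (gen_ai.request, gen_ai.execute_tool, gen_ai.invoke_agent) follow the span op naming Sentry uses for the same three concepts. Instead of a flat list of log lines, you get a trace that shows the full decision tree for every request.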
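To make the "tree of decisions" idea concrete, here is a small sketch that rebuilds that tree from a flat list of spans. It doesn't touch the OpenTelemetry SDK at all. The FlatSpan shape, the span IDs, and the helper names are made up for illustration, and a real trace export will carry far more fields.

```typescript
// Hypothetical flat span record, roughly the shape a trace export gives you.
interface FlatSpan {
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
}

interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Rebuild the parent/child tree from a flat list of spans.
function buildSpanTree(spans: FlatSpan[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) {
    nodes.set(s.spanId, { ...s, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentSpanId !== undefined ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // no known parent: treat as a root
    }
  });
  return roots;
}

// Render the tree as indented lines, one per span.
function renderTree(nodes: SpanNode[], depth = 0): string[] {
  const lines: string[] = [];
  for (const n of nodes) {
    lines.push("  ".repeat(depth) + n.name);
    lines.push(...renderTree(n.children, depth + 1));
  }
  return lines;
}

const exampleSpans: FlatSpan[] = [
  { spanId: "a", name: "POST /tickets" },
  { spanId: "b", parentSpanId: "a", name: "gen_ai.invoke_agent" },
  { spanId: "c", parentSpanId: "b", name: "gen_ai.request" },
  { spanId: "d", parentSpanId: "b", name: "gen_ai.execute_tool" },
];

console.log(renderTree(buildSpanTree(exampleSpans)).join("\n"));
// POST /tickets
//   gen_ai.invoke_agent
//     gen_ai.request
//     gen_ai.execute_tool
```

A flat log stream has the same four events but none of the indentation, and the indentation is the part that answers "which decision led to this call?"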

Here's a simplified version of how I instrument an agent step in TypeScript:

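import OpenAI from "openai";
import { trace, SpanStatusCode } from "@opentelemetry/api";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();
const tracer = trace.getTracer("support-agent");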

async function classifyTicket(ticket: string): Promise<string> {
  return tracer.startActiveSpan(
    "gen_ai.request",
    {
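      attributes: {
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-4o",
        // "chat" is the convention's well-known operation name for chat completions.
        "gen_ai.operation.name": "chat",
        "gen_ai.request.max_tokens": 256,
        // Custom attribute for the business-level step this call belongs to.
        "app.task": "classify_ticket",
      },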
    },
    async (span) => {
      try {
        const result = await openai.chat.completions.create({
          model: "gpt-4o",
          messages: [
            {
              role: "system",
              content:
                "Classify this support ticket into one of: billing, technical, feedback, urgent_escalation.",
            },
            { role: "user", content: ticket },
          ],
          max_tokens: 256,
        });

        const category = result.choices[0].message.content?.trim() ?? "unknown";

        span.setAttributes({
          "gen_ai.response.model": result.model,
          "gen_ai.usage.input_tokens": result.usage?.prompt_tokens ?? 0,
          "gen_ai.usage.output_tokens": result.usage?.completion_tokens ?? 0,
          "app.classification_result": category,
        });

        return category;
      } catch (err) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
        throw err;
      } finally {
        span.end();
      }
    },
  );
}

The key detail is those gen_ai.* attributes. They follow the OpenTelemetry semantic conventions for generative AI, which means any observability backend that supports them (Sentry, Honeycomb, Jaeger, whatever you're using) can parse and display them in a structured way. You're not inventing your own schema. You're speaking a shared language.

Auto-instrumentation Saves You From Yourself

Manually wrapping every model call gets old fast. The good news is that most AI SDKs now have OpenTelemetry auto-instrumentation packages. For example, if you're using the OpenAI SDK (I pair it with the Vercel AI SDK on a few projects, which needs its own telemetry setup), the OpenAI side looks something like this:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [new OpenAIInstrumentation()],
});

sdk.start();

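That's it. Every openai.chat.completions.create call now automatically produces a span with model name, token counts, latency, and status, as long as the SDK starts before the openai module is loaded. No manual wrapping needed. One caveat: this particular package names its attributes according to the OpenInference conventions rather than gen_ai.*, so confirm that your backend understands them or map them at export time. You still want custom spans for your business logic (like the classification result above), but the boilerplate disappears.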

The Metrics That Actually Matter

Once I had tracing in place, I went through a phase where I tracked everything. Dozens of custom metrics, dashboards with fifteen panels. Most of it was noise. After a few months of actually using this data to debug production issues, I've settled on three categories that I check daily.

Reliability Metrics

Tool failure rate per tool. Not an aggregate. If your "search knowledge base" tool fails 8% of the time but your "send email" tool never fails, an overall 4% failure rate hides the real problem. Break it down.
Agent completion rate. What percentage of requests result in the agent actually finishing its task versus hitting a timeout, an error, or a retry limit?
Latency by step. The total request latency matters less than knowing which step is slow. In my case, the retrieval step was 70% of total latency. Optimizing the model calls would have been a waste of time.
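Here is a minimal sketch of the per-tool breakdown, using the same numbers as the example above. The ToolCall record and the tool names are hypothetical. In practice you would derive these from your tool spans and their status. The 8% and 4% figures only line up because both tools are called equally often in this toy data.

```typescript
// Hypothetical record of one tool invocation, e.g. derived from a tool span.
interface ToolCall {
  tool: string;
  ok: boolean;
}

// Failure rate per tool, keyed by tool name.
function failureRateByTool(calls: ToolCall[]): Record<string, number> {
  const counts = new Map<string, { total: number; failed: number }>();
  for (const call of calls) {
    const entry = counts.get(call.tool) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!call.ok) entry.failed += 1;
    counts.set(call.tool, entry);
  }
  const rates: Record<string, number> = {};
  counts.forEach((entry, tool) => {
    rates[tool] = entry.failed / entry.total;
  });
  return rates;
}

// The aggregate number that hides the problem.
function overallFailureRate(calls: ToolCall[]): number {
  if (calls.length === 0) return 0;
  return calls.filter((c) => !c.ok).length / calls.length;
}

// 100 calls per tool: search fails 8 times, email never fails.
const sampleCalls: ToolCall[] = [];
for (let i = 0; i < 100; i++) {
  sampleCalls.push({ tool: "search_knowledge_base", ok: i >= 8 });
  sampleCalls.push({ tool: "send_email", ok: true });
}

console.log(overallFailureRate(sampleCalls)); // 0.04
console.log(failureRateByTool(sampleCalls));
// { search_knowledge_base: 0.08, send_email: 0 }
```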

Cost Metrics

This is the one that surprised me the most. I had rough estimates of what our agent would cost based on token pricing and average prompt length. I was off by 3x.

The reason: I wasn't accounting for retries, multi-step chains, or the variance between requests. Some tickets were simple one-shot classifications. Others triggered a five-step investigation with multiple retrieval calls and a long synthesis pass. The average was meaningless. What mattered was the distribution.
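A rough sketch of why the estimate goes wrong: cost has to be summed over every model call in a run, retries included, not computed from one average prompt. The model name and the per-million-token prices below are placeholders I made up for the example, not real rates, and the token counts are invented too.

```typescript
interface ModelCall {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface Price {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// HYPOTHETICAL prices for illustration only; look up your provider's current rates.
const PRICES: Record<string, Price> = {
  "example-model": { inputPerMillionUsd: 2.5, outputPerMillionUsd: 10 },
};

// Cost of a single model call in USD.
function callCostUsd(call: ModelCall): number {
  const price = PRICES[call.model];
  if (price === undefined) {
    throw new Error(`no price configured for model ${call.model}`);
  }
  return (
    (call.inputTokens * price.inputPerMillionUsd +
      call.outputTokens * price.outputPerMillionUsd) /
    1_000_000
  );
}

// Cost of a whole agent run: every call counts, including retries.
function runCostUsd(calls: ModelCall[]): number {
  return calls.reduce((sum, call) => sum + callCostUsd(call), 0);
}

// A simple one-shot classification.
const oneShot: ModelCall[] = [
  { model: "example-model", inputTokens: 800, outputTokens: 20 },
];

// A five-call investigation with larger prompts (retrieved context) and longer outputs.
const investigation: ModelCall[] = [];
for (let i = 0; i < 5; i++) {
  investigation.push({ model: "example-model", inputTokens: 3000, outputTokens: 400 });
}

console.log(runCostUsd(oneShot).toFixed(4));
console.log(runCostUsd(investigation).toFixed(4));
```

With these made-up numbers the investigation costs roughly 26 times what the one-shot request does, which is why a single average tells you so little.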

I now track cost per request at the p50, p90, and p99. The p99 is where your budget actually goes. I also break down cost by user tier, because our enterprise customers' tickets tend to be more complex and trigger longer agent chains. That breakdown changed how we think about pricing.

// After each agent run, record cost metrics
span.setAttributes({
  "app.total_cost_usd": totalCost,
  "app.user_tier": user.tier,
  "app.total_input_tokens": inputTokens,
  "app.total_output_tokens": outputTokens,
  "app.model_calls_count": modelCallCount,
  "app.tools_invoked": toolNames.join(","),
});
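Once app.total_cost_usd is on every span, the percentiles are a small computation. This sketch uses the nearest-rank method on an in-memory array. A real backend will do this for you in a query, and the cost values here are invented to show how far the p99 can sit from the median.

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of
// the data is less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Invented per-request costs in USD: mostly cheap, a few mid-sized, two runaway loops.
const requestCosts: number[] = [];
for (let i = 0; i < 85; i++) requestCosts.push(0.002);
for (let i = 0; i < 13; i++) requestCosts.push(0.02);
for (let i = 0; i < 2; i++) requestCosts.push(4.0);

console.log(percentile(requestCosts, 50)); // 0.002
console.log(percentile(requestCosts, 90)); // 0.02
console.log(percentile(requestCosts, 99)); // 4
```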

Quality Metrics (the Tricky Ones)

Reliability and cost are straightforward to measure. Quality is harder. You can't easily compute "did the agent give the right answer" from telemetry alone. But you can measure proxies:

Tool call frequency. If the agent is calling the same tool three times per request when it used to call it once, something changed. Maybe your prompts drifted, maybe the model version changed, maybe the tool's output format shifted.
Token efficiency. Output tokens per successful task. If this number is climbing, the agent is getting more verbose without getting more useful.
Cache hit rate on retrieval. If you're doing RAG, the percentage of retrievals that hit your vector cache versus requiring a fresh embedding lookup tells you a lot about whether your knowledge base is well-structured.

None of these are perfect. But when they move together, they tell a story. A sudden spike in tool call frequency alongside rising costs and falling cache hits usually means something upstream changed and the agent is compensating by working harder.
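One way to turn "they move together" into something checkable is to compare a current window against a baseline and flag the proxies that got worse by more than some relative threshold. The snapshot fields, the 25% default threshold, and the sample numbers are all assumptions for illustration. Tune them to your own traffic.

```typescript
interface ProxySnapshot {
  toolCallsPerRequest: number;
  outputTokensPerTask: number;
  cacheHitRate: number; // between 0 and 1
  costPerRequestUsd: number;
}

type ProxyName = keyof ProxySnapshot;

// For each proxy, which direction counts as "worse": +1 means higher is worse.
const WORSE_DIRECTION: Record<ProxyName, 1 | -1> = {
  toolCallsPerRequest: 1,
  outputTokensPerTask: 1,
  cacheHitRate: -1,
  costPerRequestUsd: 1,
};

// Proxies whose relative change in the "worse" direction exceeds the threshold.
function degradedProxies(
  baseline: ProxySnapshot,
  current: ProxySnapshot,
  threshold = 0.25, // assumed default: a 25% relative change
): ProxyName[] {
  const names = Object.keys(WORSE_DIRECTION) as ProxyName[];
  return names.filter((name) => {
    if (baseline[name] === 0) return false; // relative change is undefined
    const change = (current[name] - baseline[name]) / baseline[name];
    return change * WORSE_DIRECTION[name] > threshold;
  });
}

// The pattern described above: more tool calls, higher cost, fewer cache hits.
function looksLikeCompensation(degraded: ProxyName[]): boolean {
  const pattern: ProxyName[] = [
    "toolCallsPerRequest",
    "costPerRequestUsd",
    "cacheHitRate",
  ];
  return pattern.every((name) => degraded.indexOf(name) !== -1);
}

const lastMonth: ProxySnapshot = {
  toolCallsPerRequest: 1,
  outputTokensPerTask: 120,
  cacheHitRate: 0.8,
  costPerRequestUsd: 0.01,
};
const thisWeek: ProxySnapshot = {
  toolCallsPerRequest: 3,
  outputTokensPerTask: 130,
  cacheHitRate: 0.5,
  costPerRequestUsd: 0.03,
};

const degraded = degradedProxies(lastMonth, thisWeek);
console.log(degraded); // [ 'toolCallsPerRequest', 'cacheHitRate', 'costPerRequestUsd' ]
console.log(looksLikeCompensation(degraded)); // true
```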

Connecting Agent Traces to the Rest of Your Stack

Here's something I didn't appreciate until we had a production incident that involved both the agent and a downstream service: your agent traces need to connect to your application traces. Isolated AI observability is only half the picture.

When our classification agent miscategorized tickets, the downstream effect was that a different service (the routing API) processed them incorrectly. But because the agent traces and the routing service traces were in separate systems, it took us hours to connect the dots. The agent trace showed a correct 200 response. The routing service trace showed normal behavior. The bug was in the seam between them.
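A seam like that can be made visible with a very small check: validate what the classifier returned against the categories the router actually knows, and send anything else to a queue a human watches instead of dropping it. This is a sketch of the idea, not the fix we shipped. The needs_review queue is a hypothetical name, and the category list is the one from the classification prompt earlier.

```typescript
// The categories the classification prompt allows.
const KNOWN_CATEGORIES = [
  "billing",
  "technical",
  "feedback",
  "urgent_escalation",
] as const;
type Category = (typeof KNOWN_CATEGORIES)[number];

// Hypothetical fallback queue that a human watches.
const FALLBACK_QUEUE = "needs_review";

interface RoutingDecision {
  queue: Category | typeof FALLBACK_QUEUE;
  recognized: boolean; // false when the model invented a category
  rawCategory: string; // what the model actually said, kept for the trace
}

function routeClassification(rawCategory: string): RoutingDecision {
  const normalized = rawCategory.trim().toLowerCase();
  const known =
    (KNOWN_CATEGORIES as readonly string[]).indexOf(normalized) !== -1;
  return {
    queue: known ? (normalized as Category) : FALLBACK_QUEUE,
    recognized: known,
    rawCategory,
  };
}

console.log(routeClassification(" Billing "));
// { queue: 'billing', recognized: true, rawCategory: ' Billing ' }
console.log(routeClassification("billing_dispute"));
// { queue: 'needs_review', recognized: false, rawCategory: 'billing_dispute' }
```

Recording recognized and rawCategory as span attributes would give you something to alert on the moment the model starts inventing categories.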

After that incident, I made sure our agent spans are part of the same trace as the HTTP request that triggered them. OpenTelemetry's context propagation handles this naturally if you set it up correctly. The parent span is the incoming HTTP request, the agent orchestration span is a child, and each model call and tool invocation is a grandchild. When something goes wrong, you can follow the trace from the user's browser click all the way through the agent's decision chain and into the downstream services.
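It helps to know what actually travels between services. OpenTelemetry's default propagator uses the W3C Trace Context traceparent header, which carries the trace ID and the calling span's ID. The SDK reads and writes this header for you, so the sketch below is only meant to show what is on the wire. It handles version 00 of the header only, and the function names and the child span ID are mine.

```typescript
interface TraceParent {
  traceId: string; // 32 lowercase hex characters, shared by every span in the trace
  parentSpanId: string; // 16 lowercase hex characters, the calling span
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Parse a version-00 traceparent header. Returns null for anything malformed.
function parseTraceparent(header: string): TraceParent | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return null;
  const traceId = m[1];
  const parentSpanId = m[2];
  const flags = m[3];
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Header a service sends downstream: same trace ID, its own span ID as the parent.
function childTraceparent(parent: TraceParent, ownSpanId: string): string {
  return `00-${parent.traceId}-${ownSpanId}-${parent.sampled ? "01" : "00"}`;
}

// Header as received by the agent service from the HTTP layer.
const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsed = parseTraceparent(incoming);
if (parsed) {
  // The agent's own span ID (made up here) becomes the parent for the routing API call.
  console.log(childTraceparent(parsed, "b7ad6b7169203331"));
  // 00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01
}
```

The trace ID never changes along the way. That shared ID is what lets one query pull up the HTTP request, the agent's decisions, and the routing service together.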

If you're using Sentry (which we switched to for this reason), their AI module connects agent spans to error tracking, performance monitoring, and even session replays. So when a user reports that "the bot gave me the wrong answer," you can pull up their session, see exactly what the agent did, and trace it back to the specific model call that went sideways. That closed loop from user experience to agent internals is worth its weight in gold.

Sample Everything (Yes, Everything)

One last thing that goes against conventional wisdom: keep 100% of your AI traces instead of sampling them down. With traditional backend services, sampling at 10% or even 1% makes sense. You have thousands of requests per second, they're mostly identical, and you only need a statistically representative slice.

AI agent requests are different. They're expensive, they're highly variable, and the interesting failures are rare. If you sample at 10%, there's a 90% chance you'll miss the one request where the agent went into a loop and burned $4 in tokens. The same goes for the edge case where the model hallucinated a tool name that doesn't exist. These are exactly the traces you need.

The cost of storing 100% of AI traces is almost always less than the cost of one undetected agent failure in production. I've done the math for our system and it's not even close.
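The detection side of that math is easy to sketch. Assuming uniform random head sampling where each trace is kept independently, the chance that none of the failing traces survive is (1 - rate) raised to the number of failures. The function name is mine, and the storage side of the comparison depends entirely on your vendor's pricing, so I've left it out.

```typescript
// Probability that sampling at `sampleRate` keeps none of `failingTraces`
// traces, assuming each trace is kept independently with that probability.
function probabilityAllMissed(sampleRate: number, failingTraces: number): number {
  if (sampleRate < 0 || sampleRate > 1) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  return Math.pow(1 - sampleRate, failingTraces);
}

// One runaway request at 10% sampling: about a 90% chance you never see it.
console.log(probabilityAllMissed(0.1, 1).toFixed(2));
// Even five occurrences leave you blind more often than not (about 59%).
console.log(probabilityAllMissed(0.1, 5).toFixed(2));
// Keeping everything: nothing is missed.
console.log(probabilityAllMissed(1, 1).toFixed(2));
```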

What My Observability Stack Looks Like Today

For anyone starting from scratch, here's roughly what I'd recommend based on what's worked for me:

Instrumentation: OpenTelemetry SDK with the gen_ai semantic conventions. Vendor-neutral, future-proof, and the ecosystem is growing fast.
Auto-instrumentation: @arizeai/openinference-instrumentation-openai or whatever framework-specific package matches your stack. Eliminates boilerplate for common SDKs.
Trace backend: Sentry or Honeycomb. Both handle AI-specific spans well. Sentry wins if you want full-stack connection to errors and session replays.
Cost tracking: Custom attributes on spans piped into dashboards. There's no good off-the-shelf solution yet. Roll your own.
Alerting: PagerDuty or Opsgenie triggered by cost and reliability thresholds. Alert on p99 cost spikes and tool failure rates, not just HTTP errors.
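For the alerting piece, the rule evaluation itself can be tiny. The sketch below only decides which alerts should fire for a window of stats. Sending them to PagerDuty or Opsgenie is left out, and the threshold values and field names are placeholders rather than recommendations.

```typescript
// Hypothetical stats for one evaluation window, computed from span attributes.
interface WindowStats {
  p99CostUsd: number;
  toolFailureRates: Record<string, number>; // tool name -> rate between 0 and 1
}

interface AlertThresholds {
  maxP99CostUsd: number;
  maxToolFailureRate: number;
}

// Returns one human-readable message per breached threshold.
function evaluateAlerts(stats: WindowStats, thresholds: AlertThresholds): string[] {
  const alerts: string[] = [];
  if (stats.p99CostUsd > thresholds.maxP99CostUsd) {
    alerts.push(
      `p99 cost per request $${stats.p99CostUsd.toFixed(2)} exceeds $${thresholds.maxP99CostUsd.toFixed(2)}`,
    );
  }
  for (const tool of Object.keys(stats.toolFailureRates).sort()) {
    const rate = stats.toolFailureRates[tool];
    if (rate > thresholds.maxToolFailureRate) {
      alerts.push(
        `tool "${tool}" failure rate ${(rate * 100).toFixed(1)}% exceeds ${(thresholds.maxToolFailureRate * 100).toFixed(1)}%`,
      );
    }
  }
  return alerts;
}

// Placeholder thresholds; pick values from your own baseline.
const thresholds: AlertThresholds = { maxP99CostUsd: 0.5, maxToolFailureRate: 0.05 };

const lastHour: WindowStats = {
  p99CostUsd: 4.0,
  toolFailureRates: { search_knowledge_base: 0.08, send_email: 0 },
};

for (const message of evaluateAlerts(lastHour, thresholds)) {
  console.log(message);
}
// p99 cost per request $4.00 exceeds $0.50
// tool "search_knowledge_base" failure rate 8.0% exceeds 5.0%
```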

The tooling is still maturing. Six months from now this list will probably look different. But the principles won't change: trace decisions not just outcomes, track cost as a first-class metric, and connect your AI observability to your full application stack.

The Shift in Mindset

If I had to distill everything I've learned into one sentence, it would be this: monitoring AI agents is not about watching a service, it's about watching a reasoning process. The moment you internalize that distinction, the right instrumentation strategy becomes obvious.

Your agents aren't web servers. They make choices, and those choices have costs, consequences, and failure modes that traditional monitoring was never designed to capture. Invest the time to instrument them properly. Future you, the one debugging a production incident at 11pm, will be grateful.
