Metrics, Logs & Traces for an AI Agent

One agent run through all three observability lenses. Metrics (tokens, LLM latency, failed tools, and a faithfulness eval) tell you something's wrong; correlation-id logs let you filter one request's journey out of the noise; and the trace waterfall shows which span ate the time. The headline: a hallucination returns HTTP 200 with green latency — only the eval catches it. For agents, success ≠ correctness.

To try it on the contract-risk agent: inject a problem, run the agent, and read the metrics first, then the logs (with the correlation id turned on), then the trace. Note that a hallucination still returns HTTP 200; only the faithfulness metric catches it.
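To make the HTTP 200 trap concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `AgentRun` record, the `liveness_ok` check, the 2000 ms latency budget, and above all the `faithfulness` scorer, which uses a crude word-overlap heuristic as a stand-in. Real faithfulness evals are usually more sophisticated (for example an LLM judge or an entailment model); the point is only that a liveness check and a quality check can disagree on the same run.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent run (not from any real library)."""
    status_code: int
    latency_ms: float
    answer: str
    retrieved_context: str


def _tokens(text: str) -> set:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def liveness_ok(run: AgentRun, latency_budget_ms: float = 2000.0) -> bool:
    """What a conventional health check sees: status code and latency only."""
    return run.status_code == 200 and run.latency_ms <= latency_budget_ms


def faithfulness(answer: str, context: str, threshold: float = 0.7) -> float:
    """Crude stand-in eval: the fraction of answer sentences whose words
    mostly (>= threshold) appear in the retrieved context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    context_words = _tokens(context)
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & context_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


CONTEXT = (
    "The contract has a termination clause requiring 30 days written notice. "
    "Liability is capped at fees paid."
)

# Both runs look identical to a liveness check: HTTP 200, fast.
good_run = AgentRun(
    200, 850.0,
    "Termination requires 30 days written notice. Liability is capped at fees paid.",
    CONTEXT,
)
bad_run = AgentRun(
    200, 850.0,
    "The contract includes unlimited liability and an automatic five year renewal.",
    CONTEXT,
)

for name, run in (("good", good_run), ("bad", bad_run)):
    score = faithfulness(run.answer, run.retrieved_context)
    print(f"{name}: liveness_ok={liveness_ok(run)} faithfulness={score:.2f}")
```

Both runs pass `liveness_ok`, but only the grounded answer scores well on the eval, which is why the quality signal has to be instrumented separately from status code and latency.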

What just happened

▹Three lenses, one run. Metrics tell you SOMETHING is wrong, logs tell you WHERE in the request, and the trace tells you WHICH span ate the time. You drill metric → log (by correlation id) → trace.
▹Agents need signals that conventional APM ignores: tokens and cost, LLM latency, tool/MCP failures, and empty retrievals, plus evals such as faithfulness. The metric to watch first is tokens.
▹The trap: an agent returns HTTP 200 with a confidently wrong answer. Latency and error-rate stay green; only an eval (faithfulness) catches it. For agents, success ≠ correctness — you must instrument quality, not just liveness.
▹Logs from every tool interleave into noise until a shared correlation id lets you filter to one request and reconstruct its whole journey. Traces (LangSmith / Langfuse) then show the span waterfall.
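The correlation-id idea in the last point can be sketched with Python's standard `logging` and `contextvars` modules. The names here (`CorrelationIdFilter`, `ListHandler`, `log_step`, `journey`, and the `req-a` / `req-b` ids) are hypothetical and chosen for illustration. A production setup would more likely ship structured logs to a log store and filter there, but the mechanism is the same: stamp every record with the request's id, then filter on it.

```python
import contextvars
import logging

# Holds the id of the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Stamps every log record with the current correlation id."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class ListHandler(logging.Handler):
    """Keeps records in memory so the example can query them afterwards."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("agent_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
handler.addFilter(CorrelationIdFilter())
logger.addHandler(handler)


def log_step(cid: str, tool: str, message: str) -> None:
    """Log one tool step on behalf of the request identified by cid."""
    token = correlation_id.set(cid)
    try:
        logger.info("%s: %s", tool, message)
    finally:
        correlation_id.reset(token)


def journey(cid: str) -> list:
    """Filter the interleaved log down to a single request's steps."""
    return [r.getMessage() for r in handler.records if r.correlation_id == cid]


# Two requests whose tool logs interleave, as they would under concurrency.
log_step("req-a", "retriever", "fetched 3 clauses")
log_step("req-b", "retriever", "fetched 0 clauses")
log_step("req-a", "llm", "generated risk summary")
log_step("req-b", "llm", "generated risk summary")

print(journey("req-b"))
```

Filtering on `req-b` recovers just that request's two steps, in order, and makes the empty retrieval that preceded its summary easy to spot.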