Agent Debugging and Observability for Seeing Inside the Black Box

05 Nov 2025

When your AI agent fails, hallucinates, or makes unexpected decisions, how do you find out why? Unlike traditional software where you can step through code line-by-line, AI agents operate through chains of LLM calls, tool invocations, and state transitions that can be opaque and non-deterministic. This article explores the essential techniques for making agent behavior visible, debuggable, and monitorable.

Concept Introduction

Observability means instrumenting your agent so you can see what it’s thinking, what tools it’s calling, and why it made specific decisions. Unlike traditional software where you can step through code line-by-line, AI agents operate through chains of LLM calls, tool invocations, and state transitions that are opaque and non-deterministic.

Agent observability encompasses several layers:

Trace logging: Recording every step in an agent’s execution path (LLM calls, tool uses, state transitions)
Structured events: Capturing decision points with context (prompts, responses, intermediate states)
Cost tracking: Monitoring token usage and API costs per operation
Performance metrics: Measuring latency, success rates, and error patterns
Semantic monitoring: Detecting when outputs drift from expected behavior patterns

Unlike traditional application observability (metrics, logs, traces), agent observability must capture the semantic content of LLM interactions, not just HTTP status codes.

Core Observability Patterns

The Trace Hierarchy

Every agent execution forms a tree of operations:

Agent Run
├── LLM Call #1 (planning)
│   ├── Prompt construction
│   ├── API request
│   └── Response parsing
├── Tool Call: search_database
│   ├── Input validation
│   ├── Query execution
│   └── Result formatting
├── LLM Call #2 (synthesis)
│   └── ...
└── Final response

Each node should capture:

Inputs: Exact prompt, tool parameters
Outputs: Raw responses, parsed results
Metadata: Timestamps, token counts, model used
Context: Current agent state, conversation history

The Event Structure Pattern

from dataclasses import dataclass
from typing import Any, Optional
from datetime import datetime

@dataclass
class AgentEvent:
    event_id: str
    parent_id: Optional[str]  # For nesting
    event_type: str  # "llm_call", "tool_use", "decision"
    timestamp: datetime
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    metadata: dict[str, Any]

    def to_json(self) -> dict:
        """Serialize for storage/analysis"""
        return {
            "id": self.event_id,
            "parent": self.parent_id,
            "type": self.event_type,
            "time": self.timestamp.isoformat(),
            "inputs": self.inputs,
            "outputs": self.outputs,
            "meta": self.metadata
        }

Design Patterns & Architectures

The Decorator Pattern wraps agent operations with observability:

from functools import wraps
import time
import uuid

def trace_operation(operation_type: str):
    """Decorator to automatically trace agent operations"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            event_id = str(uuid.uuid4())
            start_time = time.time()

            # Capture inputs
            event = AgentEvent(
                event_id=event_id,
                parent_id=get_current_context().parent_id,
                event_type=operation_type,
                timestamp=datetime.now(),
                inputs={"args": args, "kwargs": kwargs},
                outputs={},
                metadata={}
            )

            try:
                result = func(*args, **kwargs)
                event.outputs = {"result": result}
                event.metadata["success"] = True
                return result
            except Exception as e:
                event.outputs = {"error": str(e)}
                event.metadata["success"] = False
                raise
            finally:
                event.metadata["duration_ms"] = (time.time() - start_time) * 1000
                log_event(event)

        return wrapper
    return decorator

Context Propagation uses thread-local context to maintain trace hierarchy:

from contextvars import ContextVar

# Context variable for current trace
current_trace_context: ContextVar[dict] = ContextVar('trace_context')

class TraceContext:
    def __init__(self, parent_id: Optional[str] = None):
        self.parent_id = parent_id or str(uuid.uuid4())
        self.events = []

    def __enter__(self):
        self.token = current_trace_context.set(self)
        return self

    def __exit__(self, *args):
        current_trace_context.reset(self.token)
        # Flush events to storage
        flush_events(self.events)

Practical Application

A minimal observability implementation wraps an agent with a custom callback handler that intercepts lifecycle events — on_llm_start, on_llm_end, on_tool_start, and on_tool_end — and appends structured records to an in-memory event log. LangChain is the best-fit framework here because its BaseCallbackHandler protocol makes it straightforward to inject observability without modifying agent logic: you pass a callback instance to both the ChatOpenAI model and the AgentExecutor, and every step flows through your handler automatically. The handler’s get_trace() method aggregates token counts from llm_output and computes wall-clock duration by diffing the first and last event timestamps, giving you a single dict you can log, store, or forward to a tracing backend. For production use you’d replace the in-memory list with a structured logger or an OpenTelemetry span, but the callback boundary stays the same.

Try it

Using LangChain with ChatOpenAI and AgentExecutor, build a custom BaseCallbackHandler
subclass that logs on_llm_start, on_llm_end, on_tool_start, and on_tool_end events to
a list. Each event dict should include type, ISO timestamp, and relevant payload fields
(model name, token usage, tool name, input/output). Add a get_trace() method that returns
total tokens and duration_ms. Wire the callback into both the LLM and the executor, then
run a single test query and print the event-type sequence and token total. Include inline
comments explaining each callback method. Code must be runnable end-to-end.

Latest Developments & Research

OpenTelemetry for LLMs (2024)

The OpenTelemetry project added semantic conventions for LLM observability:

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.system", "openai")
    span.set_attribute("llm.request.model", "gpt-4")
    span.set_attribute("llm.request.temperature", 0.7)

    response = llm.invoke(prompt)

    span.set_attribute("llm.response.tokens", response.usage.total_tokens)
    span.set_attribute("llm.response.finish_reason", response.choices[0].finish_reason)

Agent Replay Systems (2024)

Frameworks like LangGraph now support deterministic replay:

# Record execution
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)
result = app.invoke(input, config={"configurable": {"thread_id": "1"}})

# Replay exact same execution
replay = app.replay(thread_id="1", step=3)  # Replay from step 3

Research Directions

Causal debugging: Identifying which prompt changes caused behavior shifts
Differential testing: Comparing agent traces across model versions
Semantic similarity: Detecting when outputs are “different but equivalent”
Anomaly detection: ML models that identify unusual agent behavior patterns

Cross-Disciplinary Insight

Agent observability mirrors control theory from engineering. In a control system (thermostat, autopilot), you need:

State estimation: What’s the current state? (Agent observability)
Error detection: Is behavior deviating? (Monitoring/alerting)
Feedback loops: Adjust parameters based on observations (Prompt tuning)

Like a control engineer designing a dashboard, you’re building instrumentation to understand a complex dynamic system. The key difference is that your “system” makes decisions through learned patterns rather than explicit equations.

Engineering Notes