Engineering Notes

Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Error Recovery and Graceful Degradation in AI Agents

16 Feb 2026

Your AI agent works perfectly in development. Then it hits production: the LLM API times out, a tool returns garbage, the agent hallucinates a function that doesn’t exist, and the whole pipeline collapses. What now?

Error recovery and graceful degradation are the difference between a demo and a production system. This article covers the patterns, algorithms, and practical techniques that make AI agents resilient.

Concept Introduction

Error recovery means detecting failures and taking corrective action. Graceful degradation means falling back to simpler but still useful behavior when full capability is unavailable. AI agents face cascading failures regularly: API timeouts, hallucinated tool calls, context window overflows, and malformed outputs.

Agent error recovery operates at three levels:

The core challenge is that agents are stateful and context-dependent. Unlike stateless microservices, an agent mid-way through a reasoning chain carries accumulated context that cannot simply be “retried from scratch” without cost.

Historical & Theoretical Context

Fault tolerance in computing dates back to the 1960s with NASA’s redundant flight systems for the Apollo program. The key ideas (redundancy, failover, and graceful degradation) were formalized in dependability theory by Jean-Claude Laprie in the 1980s.

In distributed systems, these ideas evolved into patterns like the circuit breaker (Michael Nygard, Release It!, 2007) and bulkhead isolation. The Erlang programming language (1986) pioneered the “let it crash” philosophy: instead of preventing all failures, build supervision trees that detect crashes and restart components cleanly.

Modern AI agent systems face a unique combination of these challenges:

This makes traditional retry-and-restart insufficient. Agent recovery must also reason about semantic correctness, not just operational success.

Algorithms & Patterns

The Retry Hierarchy

A layered approach to recovery, from least to most disruptive:

Level 0: Retry     → Same call, same parameters (transient errors)
Level 1: Rephrase  → Same intent, different prompt/parameters
Level 2: Reroute   → Different tool or model for the same task
Level 3: Replan    → Abandon current plan, generate a new one
Level 4: Escalate  → Hand off to human or return partial results

Exponential Backoff with Jitter

For transient failures (rate limits, timeouts), the standard algorithm:

$$t_{\text{wait}} = \min\left( t_{\text{max}},\; t_{\text{base}} \cdot 2^{n} + \text{random}(0, t_{\text{jitter}}) \right)$$

Where $n$ is the retry attempt number. The jitter prevents thundering herd problems when many agents retry simultaneously.

Circuit Breaker State Machine

For persistent failures, the circuit breaker pattern prevents wasting resources on a broken dependency:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : failure_count >= threshold
    Open --> HalfOpen : timeout_elapsed
    HalfOpen --> Closed : probe_succeeds
    HalfOpen --> Open : probe_fails
  

Design Patterns & Architectures

Tiered Model Fallback

When your primary LLM fails or is too slow, fall back through a chain of increasingly simpler models:

GPT-4o (full capability)
  → Claude Haiku (faster, cheaper)
    → Local small model (no network dependency)
      → Cached/template response (deterministic)

This connects to the Chain of Responsibility pattern from software engineering, where each handler either processes the request or passes it to the next.

Checkpoint-and-Resume

For long-running agent workflows, periodically save state so that failures don’t lose all progress:

[Step 1: Research] → checkpoint → [Step 2: Analyze] → checkpoint → [Step 3: Write]
         ↑                                  ↑
     resume here                       resume here
     if step 2 fails                   if step 3 fails

This mirrors the saga pattern from distributed transactions, where each step has a defined compensation action.

Supervisor Agent

A dedicated agent monitors worker agents and takes corrective action:

graph TD
    S[Supervisor Agent] -->|monitors| W1[Worker 1]
    S -->|monitors| W2[Worker 2]
    S -->|monitors| W3[Worker 3]
    S -->|restart/replace| W1
    S -->|escalate| H[Human]
  

This is directly inspired by Erlang’s supervision trees and maps naturally to multi-agent frameworks.

Practical Application

Here is a resilient agent executor with tiered fallback and circuit breaking in Python:

import time
import random
from enum import Enum
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    recovery_timeout: float = 30.0
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: allow one probe


class ResilientAgent:
    def __init__(self, models: list[dict], max_retries: int = 3):
        self.models = models  # [{"name": ..., "client": ..., "breaker": ...}]
        self.max_retries = max_retries
        self.checkpoints: dict[str, dict] = {}

    def invoke_with_fallback(self, prompt: str, task_id: str = "") -> str:
        """Try each model in order, with retries and circuit breaking."""
        last_error = None

        for model in self.models:
            breaker: CircuitBreaker = model["breaker"]

            if not breaker.can_execute():
                continue  # Skip this model, circuit is open

            for attempt in range(self.max_retries):
                try:
                    result = model["client"].invoke(prompt)
                    breaker.record_success()

                    # Save checkpoint on success
                    if task_id:
                        self.checkpoints[task_id] = {
                            "result": result,
                            "model": model["name"],
                            "timestamp": time.time(),
                        }
                    return result

                except Exception as e:
                    last_error = e
                    breaker.record_failure()
                    wait = min(60, 2 ** attempt + random.uniform(0, 1))
                    time.sleep(wait)

        raise RuntimeError(
            f"All models exhausted after fallback chain. Last error: {last_error}"
        )

    def resume_from_checkpoint(self, task_id: str) -> dict | None:
        """Resume a task from its last successful checkpoint."""
        return self.checkpoints.get(task_id)

Using it with LangGraph

In a LangGraph workflow, you can wrap nodes with recovery logic:

from langgraph.graph import StateGraph

def resilient_research_node(state: dict) -> dict:
    agent = ResilientAgent(models=[
        {"name": "gpt-4o", "client": gpt4_client,
         "breaker": CircuitBreaker()},
        {"name": "claude-haiku", "client": haiku_client,
         "breaker": CircuitBreaker()},
    ])

    try:
        result = agent.invoke_with_fallback(
            state["query"], task_id="research"
        )
        return {**state, "research": result, "error": None}
    except RuntimeError:
        return {**state, "research": None, "error": "research_failed"}

def route_after_research(state: dict) -> str:
    if state.get("error"):
        return "fallback_node"  # Simplified response path
    return "analysis_node"      # Continue normal flow

graph = StateGraph(dict)
graph.add_node("research", resilient_research_node)
graph.add_node("analysis_node", analysis_node)
graph.add_node("fallback_node", fallback_node)
graph.add_conditional_edges("research", route_after_research)

Latest Developments & Research

AWS Agentic Resilience Patterns (2025): AWS published architecture guidance for building resilient generative AI agents, recommending circuit breakers between agent clusters rather than individual connections, improving fault containment in distributed systems.

LangGraph Checkpointing (2024–2025): LangGraph’s MemorySaver and persistent checkpointers allow workflows to resume from any node after failure, making checkpoint-and-resume a first-class framework feature rather than custom infrastructure.

Adaptive Retry Research (2025): Recent work explores using lightweight classifiers to predict whether a failure is transient or permanent before retrying, reducing wasted API calls by up to 40% in production pipelines.

Open problems:

Cross-Disciplinary Insight

Agent error recovery maps directly to biological resilience. The human immune system operates on a tiered response model strikingly similar to agent fallback chains:

  1. Skin/barriers (input validation): prevent problems from entering
  2. Innate immune response (retries, circuit breakers): fast, generic response
  3. Adaptive immune response (replanning, supervisor agents): slower but targeted
  4. Fever/inflammation (graceful degradation): sacrifice performance to contain damage

Both systems share a key principle: don’t over-react to minor issues, but escalate aggressively when something is genuinely wrong. An agent that escalates every timeout to human review is like a body that triggers a fever for a paper cut, costly and unnecessary.

Daily Challenge

Exercise: Build a Fault-Tolerant Tool Executor

Create a ResilientToolRunner class that:

  1. Accepts a list of tools with the same capability (e.g., three different search APIs)
  2. Tries the primary tool first with exponential backoff (max 3 retries)
  3. Falls back to secondary tools if the primary’s circuit breaker opens
  4. Returns a structured result that includes which tool succeeded and how many attempts were needed
  5. Logs all failures for observability

Test it by simulating random failures:

import random

def unreliable_search(query: str) -> str:
    if random.random() < 0.6:  # 60% failure rate
        raise TimeoutError("Search API timed out")
    return f"Results for: {query}"

# Build your ResilientToolRunner and run it 20 times.
# Track: success rate, average attempts, fallback frequency.

Bonus: Add a “semantic validation” step that checks if the tool’s output actually answers the query, and retries with a different tool if it doesn’t.

References & Further Reading

Architecture Guides

Papers

Tools & Frameworks


● Intelligence at Every Action

AI Native
Project Management

Stop using tools that bolt on AI as an afterthought. Jovis is built AI-first — smart routing, proactive monitoring, and intelligent workflows from the ground up.