Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Temporal Difference Learning and Q-Learning as the Engine of Agent Intelligence

04 Nov 2025

Temporal Difference (TD) Learning and its most prominent variant, Q-Learning, are among the most influential algorithms in reinforcement learning. They connect the theoretical framework of Markov Decision Processes (MDPs) to the practical problem of agents learning from experience, without requiring a model of the environment.

Concept Introduction

Temporal Difference Learning is a class of model-free reinforcement learning algorithms that learn by bootstrapping: they update estimates based on other estimates, rather than waiting for a final outcome. The key insight is the TD error: the difference between the current value estimate and a better estimate obtained after taking an action.

The TD(0) update rule for state values is:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]

Where α is the learning rate (how far each update moves toward the new estimate), γ is the discount factor weighting future rewards, r_{t+1} is the reward received on the transition out of s_t, and the bracketed term is the TD error, often written δ_t.
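
To make the update concrete, here is a minimal sketch of one TD(0) step in Python (the two-state chain and all numbers are illustrative):

```python
# One TD(0) update on a toy two-state chain (all numbers illustrative).
alpha, gamma = 0.5, 0.9

V = {"A": 0.0, "B": 1.0}        # current value estimates
s, s_next, r = "A", "B", 0.0    # observed transition and its reward

td_error = r + gamma * V[s_next] - V[s]   # 0 + 0.9*1.0 - 0.0 = 0.9
V[s] += alpha * td_error                  # move V(A) toward the TD target

print(V["A"])  # 0.45
```

Note that V(A) moved toward the bootstrapped target r + γV(B) without waiting for the episode to finish.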

Q-Learning extends this to learn action-value functions (Q-values) that tell us the quality of taking a specific action in a specific state:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

The key difference: we update using the maximum Q-value of the next state, regardless of what action we actually take next. This makes Q-Learning an off-policy algorithm. It can learn the optimal policy even while following a different, exploratory one.
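
The off-policy distinction is easiest to see next to SARSA, the on-policy analogue. A sketch on a single made-up transition (all Q-values and rewards are illustrative):

```python
# Contrast the Q-Learning target with the SARSA target on one
# illustrative transition.
alpha, gamma = 0.1, 0.9
r = 1.0                          # reward observed on the transition
q_sa = 0.0                       # current estimate Q(s, a)
Q_next = {"a1": 3.0, "a2": 1.0}  # Q-values of the next state s'

# Q-Learning (off-policy): the target uses the best next action...
target_q_learning = r + gamma * max(Q_next.values())  # 1 + 0.9*3.0 = 3.7

# ...even if exploration means the agent actually takes "a2" next.
# SARSA (on-policy) would use that actually-taken action instead:
target_sarsa = r + gamma * Q_next["a2"]               # 1 + 0.9*1.0 = 1.9

q_new = q_sa + alpha * (target_q_learning - q_sa)
print(round(q_new, 2))  # 0.37
```

The two targets diverge exactly when the exploratory action differs from the greedy one, which is why Q-Learning can learn the optimal policy while behaving suboptimally.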

graph LR
    A["State s(t)"] --> B["Take Action a(t)"]
    B --> C["Observe r(t+1), s(t+1)"]
    C --> D[Calculate TD Error]
    D --> E["Update Q(s_t, a_t)"]
    E --> F[Choose Next Action]
    F --> A
  

Historical & Theoretical Context

Temporal Difference learning was formalized by Richard Sutton in his 1988 paper, “Learning to Predict by the Methods of Temporal Differences.” The roots go back to earlier work in animal learning psychology and operations research.

Q-Learning was introduced by Christopher Watkins in his 1989 PhD thesis; the formal convergence proof followed in Watkins and Dayan (1992). It was the first algorithm proven to converge to the optimal policy under certain conditions (every state-action pair visited infinitely often, appropriately decaying learning rates), even when the agent follows a suboptimal exploration policy during learning.

TD learning sits at the intersection of two classical approaches:

  1. Dynamic Programming (DP), which bootstraps from current value estimates but requires a complete model of the environment’s dynamics.
  2. Monte Carlo methods, which learn from raw experience without a model but must wait until an episode ends before updating.

TD learning combines the best of both: it bootstraps and updates after every step (like DP) while learning directly from experience without a model of the environment (like Monte Carlo). This makes it practical for agents operating in unknown environments.

Algorithms & Math

Q-Learning Algorithm (Pseudocode)

Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
Set Q(terminal_state, ·) = 0

For each episode:
    Initialize state s

    While s is not terminal:
        Choose action a from s using policy derived from Q
        (e.g., ε-greedy: with probability ε choose random, else choose argmax_a Q(s, a))

        Take action a, observe reward r and next state s'

        Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]

        s ← s'

Key Mathematical Properties

The TD error has a useful property: if you sum the discounted TD errors along a trajectory (assuming the value estimates are not updated mid-episode), you get the Monte Carlo error.

Σ_{k=0}^{T-t-1} γ^k δ_{t+k} = G_t - V(s_t),  with δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)

Where G_t is the actual return (sum of discounted future rewards).

This means TD learning distributes the total prediction error across all the intermediate steps.
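
This identity is easy to verify numerically on a short hand-made trajectory (states, values, and rewards are illustrative; it holds exactly when the value estimates stay fixed during the episode):

```python
gamma = 0.9

# A 3-step trajectory: value estimates V(s_0..s_3) and rewards r_1..r_3.
# The final state s_3 is terminal, so V(s_3) = 0.
V = [0.5, 0.2, 0.8, 0.0]
r = [1.0, -0.5, 2.0]

# TD errors: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
deltas = [r[t] + gamma * V[t + 1] - V[t] for t in range(3)]

# Discounted sum of TD errors from t = 0
td_sum = sum(gamma**k * d for k, d in enumerate(deltas))

# Monte Carlo error: G_0 - V(s_0), with G_0 the discounted return
G0 = sum(gamma**t * r[t] for t in range(3))
mc_error = G0 - V[0]

print(abs(td_sum - mc_error) < 1e-12)  # True
```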

Design Patterns & Architectures

Q-Learning fits naturally into the Planner-Executor-Memory pattern we’ve discussed in previous articles: the Q-table plays the role of Memory, the policy derived from it (e.g., ε-greedy over the Q-values) acts as the Planner, and the action-taking loop is the Executor.

Q-Learning also exemplifies the value-based approach to reinforcement learning, as opposed to policy-based methods like policy gradients. The agent doesn’t directly learn a policy; it learns values and derives a policy from them.
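
Deriving that policy is essentially a one-liner: act greedily with respect to the learned values. A sketch with an illustrative hand-filled Q-table:

```python
# Greedy policy extraction from an (illustrative) Q-table.
Q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}
actions = ["left", "right"]

def greedy_policy(state):
    """The policy is implicit: pick the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy("s0"))  # right
print(greedy_policy("s1"))  # left
```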

Integration with Modern Agent Systems

In LLM-based agents, Q-Learning can be used to:

  1. Tool selection: Learn which tools are most valuable in which contexts.
  2. Plan refinement: Assign Q-values to different planning strategies and learn which work best.
  3. Multi-step reasoning: Learn to evaluate the quality of intermediate reasoning steps.

Practical Application

Let’s build a simple Q-Learning agent that learns to navigate a grid world.

import numpy as np
import random

# Environment: 4x4 grid, goal at (3, 3), hole at (1, 1)
GRID_SIZE = 4
GOAL = (3, 3)
HOLE = (1, 1)
ACTIONS = ['up', 'down', 'left', 'right']

# Q-table: state is (x, y), action is index in ACTIONS
Q = np.zeros((GRID_SIZE, GRID_SIZE, len(ACTIONS)))

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.1    # Exploration rate
episodes = 1000

def move(state, action):
    """Execute action and return new state and reward."""
    x, y = state

    if action == 'up':
        x = max(0, x - 1)
    elif action == 'down':
        x = min(GRID_SIZE - 1, x + 1)
    elif action == 'left':
        y = max(0, y - 1)
    elif action == 'right':
        y = min(GRID_SIZE - 1, y + 1)

    new_state = (x, y)

    if new_state == GOAL:
        return new_state, 10, True  # (next_state, reward, done)
    elif new_state == HOLE:
        return new_state, -10, True
    else:
        return new_state, -0.1, False  # Small penalty for each step

def choose_action(state, epsilon):
    """Epsilon-greedy action selection."""
    if random.random() < epsilon:
        return random.randint(0, len(ACTIONS) - 1)
    else:
        x, y = state
        return np.argmax(Q[x, y])

# Training loop
for episode in range(episodes):
    state = (0, 0)
    done = False

    while not done:
        action_idx = choose_action(state, epsilon)
        next_state, reward, done = move(state, ACTIONS[action_idx])

        # Q-Learning update
        x, y = state
        nx, ny = next_state

        old_q = Q[x, y, action_idx]
        max_next_q = np.max(Q[nx, ny])

        # TD update
        td_error = reward + gamma * max_next_q - old_q
        Q[x, y, action_idx] = old_q + alpha * td_error

        state = next_state

    if episode % 100 == 0:
        print(f"Episode {episode} complete")

# Display learned policy
print("\nLearned Policy (best action at each state):")
for x in range(GRID_SIZE):
    row = []
    for y in range(GRID_SIZE):
        if (x, y) == GOAL:
            row.append('G')
        elif (x, y) == HOLE:
            row.append('H')
        else:
            best_action = np.argmax(Q[x, y])
            symbols = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
            row.append(symbols[ACTIONS[best_action]])
    print(' '.join(row))
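
After training, it is worth checking that the greedy policy actually reaches the goal. Here is a self-contained sketch of such an evaluation, using a tiny 2x2 grid and a hand-crafted Q-table so the example runs on its own (with the trained 4x4 table above, the same rollout function applies unchanged):

```python
import numpy as np

ACTIONS = ['up', 'down', 'left', 'right']
GRID_SIZE, GOAL = 2, (1, 1)

def move(state, action):
    # Deterministic grid step, same convention as the training script.
    x, y = state
    if action == 'up':      x = max(0, x - 1)
    elif action == 'down':  x = min(GRID_SIZE - 1, x + 1)
    elif action == 'left':  y = max(0, y - 1)
    elif action == 'right': y = min(GRID_SIZE - 1, y + 1)
    return (x, y)

# Hand-crafted Q-table that prefers 'down' then 'right' toward the goal.
Q = np.zeros((GRID_SIZE, GRID_SIZE, len(ACTIONS)))
Q[0, 0, ACTIONS.index('down')] = 1.0
Q[1, 0, ACTIONS.index('right')] = 1.0

def greedy_rollout(Q, start, max_steps=10):
    """Follow argmax_a Q(s, a) until the goal or a step limit is hit."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == GOAL:
            break
        x, y = state
        state = move(state, ACTIONS[int(np.argmax(Q[x, y]))])
        path.append(state)
    return path

print(greedy_rollout(Q, (0, 0)))  # [(0, 0), (1, 0), (1, 1)]
```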

In Modern Frameworks

While classical Q-Learning uses tables, modern implementations use neural networks (Deep Q-Networks, DQN):
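
The shift from a table to a function approximator can be sketched with a linear Q-function in NumPy. This is a toy stand-in for the neural network: it omits DQN’s replay buffer and target network, and the feature vectors here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
alpha, gamma = 0.01, 0.99

# Q(s, a) ≈ w[a] · phi(s): one weight row per action, instead of a table.
w = np.zeros((n_actions, n_features))

def q_values(phi):
    """Approximate Q(s, ·) for the state with feature vector phi."""
    return w @ phi

# One semi-gradient Q-Learning step on a fabricated transition.
phi, phi_next = rng.random(n_features), rng.random(n_features)
a, r = 0, 1.0   # action taken and reward observed

td_error = r + gamma * np.max(q_values(phi_next)) - q_values(phi)[a]
w[a] += alpha * td_error * phi   # gradient of w[a]·phi w.r.t. w[a] is phi

print(float(td_error))  # 1.0 (weights start at zero, so the target is just r)
```

DQN replaces the linear map with a deep network and stabilizes this update with experience replay and a target network.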

# Conceptual example with LangGraph for tool selection
import random
from typing import TypedDict

from langgraph.graph import StateGraph

EPSILON = 0.1  # exploration rate

class AgentState(TypedDict):
    situation: str
    available_tools: list
    q_values: dict  # Maps (situation, tool) to Q-value

def select_tool_with_q_learning(state: AgentState):
    """Select tool using epsilon-greedy based on learned Q-values."""
    if random.random() < EPSILON:
        return random.choice(state["available_tools"])
    else:
        # Pick tool with highest Q-value for current situation
        return max(
            state["available_tools"],
            key=lambda t: state["q_values"].get((state["situation"], t), 0),
        )

# Build agent graph (execute_tool_function and update_q_values_based_on_outcome
# are placeholders for your own node implementations)
graph = StateGraph(AgentState)
graph.add_node("select_tool", select_tool_with_q_learning)
graph.add_node("execute_tool", execute_tool_function)
graph.add_node("update_q", update_q_values_based_on_outcome)
# ... connect nodes

Latest Developments & Research

The field has evolved dramatically since 1989:

Deep Q-Networks (DQN) - 2013

DeepMind’s DQN paper revolutionized the field by using neural networks to approximate Q-values, enabling agents to play Atari games from pixels. Key innovations:

  1. Experience replay: transitions are stored in a buffer and sampled in random mini-batches, breaking the correlation between consecutive updates.
  2. Target network: a periodically updated copy of the Q-network computes the TD targets, stabilizing training.
Rainbow DQN - 2017

Combined six extensions to DQN:

  1. Double Q-Learning (reduces overestimation)
  2. Prioritized Experience Replay
  3. Dueling Networks (separate value and advantage streams)
  4. Multi-step Learning
  5. Distributional RL
  6. Noisy Nets (better exploration)

Open Problems:

  1. Sample efficiency: deep value-based methods still need far more experience than humans.
  2. Exploration: ε-greedy struggles in sparse-reward environments.
  3. Stability: combining function approximation, bootstrapping, and off-policy learning (the “deadly triad”) can still diverge.

Cross-Disciplinary Insight

Q-Learning mirrors concepts from neuroscience, particularly dopamine-based reward learning. Neuroscientist Wolfram Schultz discovered that dopamine neurons fire in proportion to a reward prediction error, which is essentially the TD error.

When an animal expects a reward and doesn’t get it, dopamine dips (negative TD error). When it gets an unexpected reward, dopamine spikes (positive TD error).

From economics, Q-Learning relates to the concept of option value: the value of having the option to make a choice in the future. The Q-value captures not just immediate reward, but the value of the state you’ll be in, which includes all future options.

Daily Challenge / Thought Exercise

Exercise: Modify the grid world code above to include a second goal at position (0, 3) that gives a reward of +5.

  1. Run the training and observe the learned policy.
  2. Now change the goal at (3, 3) to give a reward of +3 instead of +10.
  3. Train again from scratch. How does the policy change?
  4. Challenge: Can you modify the code to use Double Q-Learning to reduce overestimation bias? (Hint: Maintain two Q-tables and randomly choose which to update and which to use for selecting the max action.)

Thought Experiment: Imagine an LLM agent choosing between different tools (search, calculator, code executor). How would you define the “state” for Q-Learning? How would you handle the fact that the state space is effectively infinite (any text prompt)?

References & Further Reading

Classic Papers

  1. Sutton, R. S. (1988). “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3, 9-44
  2. Watkins, C. J., & Dayan, P. (1992). “Q-learning.” Machine Learning 8, 279-292

Modern Breakthroughs

  1. Mnih, V., et al. (2013). “Playing Atari with Deep Reinforcement Learning.” arXiv:1312.5602 (The original DQN paper)
  2. Hessel, M., et al. (2017). “Rainbow: Combining Improvements in Deep Reinforcement Learning.” arXiv:1710.02298

Neuroscience Connection

  1. Schultz, W., Dayan, P., & Montague, P. R. (1997). “A Neural Substrate of Prediction and Reward.” Science 275(5306), 1593-1599

Tutorials & Code

  1. Sutton & Barto’s Reinforcement Learning Book (2nd Edition, 2018): The bible of RL. Free online
  2. OpenAI Spinning Up: Excellent tutorials and clean implementations. spinningup.openai.com
  3. DeepMind’s DQN Implementation: github.com/deepmind/dqn

Recent Research

  1. Levine, S., et al. (2020). “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.” arXiv:2005.01643

Next Steps: From Q-Learning, the natural extensions are Soft Actor-Critic (SAC) for continuous actions and policy-based methods like PPO (Proximal Policy Optimization).
