Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Policy Gradient Methods and Actor-Critic Architectures

24 Oct 2025

Concept Introduction

In reinforcement learning, a policy π(a|s) is a probability distribution over actions a given state s. Traditional value-based methods (like Q-learning) learn the value of states or state-action pairs and derive a policy from those values. Policy gradient methods instead directly optimize the policy parameters θ by following the gradient of expected reward.
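As a concrete sketch of a parameterized policy (assuming a discrete action space and, purely for illustration, a linear score per action), a softmax policy π_θ(a|s) might look like:

```python
import numpy as np

def softmax_policy(theta, state):
    """pi_theta(a|s): softmax over per-action linear scores theta[a] . state."""
    logits = theta @ state               # one score per action
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))   # 3 actions, 4-dim state (illustrative sizes)
state = rng.normal(size=4)
probs = softmax_policy(theta, state)   # valid distribution over the 3 actions
action = rng.choice(3, p=probs)        # sample an action from pi_theta(.|s)
```

Policy gradient methods adjust θ so that this distribution shifts probability mass toward high-reward actions.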

Actor-Critic combines policy gradients (actor) with value function approximation (critic). The actor proposes actions based on the current policy π_θ, while the critic estimates the value function V(s) or advantage function A(s,a) to provide feedback. This reduces variance in policy gradient estimates while maintaining the benefits of direct policy optimization.

Historical & Theoretical Context

Value-based methods (Q-learning, DQN) struggle with continuous or very large action spaces (the argmax over actions becomes intractable), with tasks whose optimal policy is inherently stochastic, and with instability when small changes in value estimates flip the greedy action.

Policy gradients solve these by directly parameterizing the policy, making them essential for robotics, game AI, and language model fine-tuning.

Algorithms & Mathematics

The Policy Gradient Theorem

The goal is to maximize expected cumulative reward:

J(θ) = E_τ~π_θ [R(τ)]

Where τ is a trajectory (sequence of states and actions), and R(τ) is the total reward.

The policy gradient theorem states:

∇_θ J(θ) = E_τ~π_θ [∑_t ∇_θ log π_θ(a_t|s_t) · G_t]

Where G_t is the return (cumulative reward) from time t onward.
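For reference, the return G_t with a discount factor γ (γ = 1 recovers the plain cumulative reward) can be computed for every timestep in one backward pass; a minimal sketch:

```python
def returns_from_rewards(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k for every t, in one backward pass."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # G_t = r_t + gamma * G_{t+1}
        out[t] = G
    return out

# e.g. rewards [1, 0, 1] with gamma = 0.5:
# G_2 = 1, G_1 = 0 + 0.5*1 = 0.5, G_0 = 1 + 0.5*0.5 = 1.25
```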

Intuition: Increase probability of actions that led to high rewards, decrease probability of actions that led to low rewards.

REINFORCE Algorithm (Vanilla Policy Gradient)

# Pseudocode
for episode in episodes:
    τ = generate_trajectory(π_θ)  # Run policy to collect data

    for t in range(T):
        G_t = sum of rewards from t to end
        ∇J = ∇_θ log π_θ(a_t|s_t) · G_t
        θ = θ + α · ∇J  # Gradient ascent
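The loop above can be sketched as a runnable PyTorch update (a minimal illustration using a toy policy network and dummy trajectory data; the environment interaction is omitted):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))  # toy pi_theta
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One REINFORCE step: ascend sum_t log pi_theta(a_t|s_t) * G_t."""
    # Returns-to-go G_t, computed backward over the trajectory
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    probs = policy(torch.stack(states))                      # (T, num_actions)
    idx = torch.tensor(actions).unsqueeze(1)                 # (T, 1)
    log_probs = torch.log(probs.gather(1, idx).squeeze(1))   # log pi(a_t|s_t)
    loss = -(log_probs * returns).sum()                      # negate for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with dummy data from a 3-step trajectory
states = [torch.randn(4) for _ in range(3)]
loss = reinforce_update(states, actions=[0, 1, 0], rewards=[1.0, 0.0, 1.0])
```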

Problem: High variance because G_t includes random future rewards.

Actor-Critic: Reducing Variance

Instead of using full return G_t, use a baseline (typically the value function):

Advantage: A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
           ≈ r_t + γV(s_{t+1}) - V(s_t)  (TD error)
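The TD-error form of the advantage is a one-liner; a small sketch with made-up numbers:

```python
def td_error(r, v_s, v_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s); V(s') = 0 at terminal states."""
    target = r if done else r + gamma * v_next
    return target - v_s

# If V(s) = 0.5, V(s') = 1.0, r = 0, gamma = 0.9:
# delta = 0 + 0.9*1.0 - 0.5 = 0.4 > 0, so the action taken is reinforced.
```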

Algorithm:

for episode in episodes:
    s = initial_state

    while not done:
        a = actor.select_action(s)  # Policy π_θ
        s', r, done = env.step(a)

        # Critic update (TD learning)
        td_error = r + γ·V_w(s') - V_w(s)
        w = w + α_critic · td_error · ∇_w V_w(s)

        # Actor update (policy gradient)
        θ = θ + α_actor · td_error · ∇_θ log π_θ(a|s)

        s = s'

Design Patterns & Architectures

Integration with Agent Systems

graph TD
    A[Environment State] --> B[Actor Network]
    A --> C[Critic Network]
    B --> D[Action]
    D --> E[Environment]
    E --> F[Reward + Next State]
    F --> C
    C --> G[Value Estimate / TD Error]
    G --> B
    G --> C
  

Common Architectural Patterns

  1. Shared Representations: Actor and critic share early layers (common in vision tasks)
  2. Separate Networks: Independent actor and critic (more stable, slower)
  3. Distributed Learning: Multiple actors collect data, central critic updates (A3C, IMPALA)
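Pattern 1 can be sketched as a single module with a shared trunk and two heads (layer sizes are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared early layers; separate policy (actor) and value (critic) heads."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, action_dim)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)            # critic: V(s)

    def forward(self, state):
        h = self.trunk(state)  # both heads read the same features
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

net = SharedActorCritic(state_dim=4, action_dim=2)
probs, value = net(torch.randn(4))
```

The shared trunk saves computation and lets the critic's gradients shape features the actor also uses, at the cost of the two losses competing for the same parameters.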

Connections to Agent Frameworks

Practical Application

Minimal Actor-Critic in Python

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state):
        return self.net(state)

class A2CAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        action = torch.multinomial(probs, 1).item()
        return action, probs[action]

    def update(self, state, action, reward, next_state, done):
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)

        # Compute TD error
        value = self.critic(state)
        next_value = 0 if done else self.critic(next_state).detach()
        td_target = reward + self.gamma * next_value
        td_error = td_target - value

        # Update critic (minimize value loss)
        critic_loss = td_error.pow(2)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Update actor (maximize log_prob * advantage)
        probs = self.actor(state)
        log_prob = torch.log(probs[action])
        actor_loss = -log_prob * td_error.detach()  # Detach to not backprop through critic
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

# Usage example
# env = gym.make('CartPole-v1')  # classic gym API; gymnasium >= 0.26 returns (obs, info) from reset() and a 5-tuple from step()
# agent = A2CAgent(state_dim=4, action_dim=2)
#
# for episode in range(1000):
#     state = env.reset()
#     done = False
#
#     while not done:
#         action, _ = agent.select_action(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.update(state, action, reward, next_state, done)
#         state = next_state

Using in a Tool-Using Agent

# Example: Learning to select tools optimally
# (TOOLS and get_next_task() are assumed to be defined elsewhere)
class ToolSelectionAgent(A2CAgent):
    def __init__(self, num_tools):
        super().__init__(
            state_dim=512,  # Embedding of current task
            action_dim=num_tools
        )

    def choose_tool(self, task_embedding):
        """Learn which tool to use for a given task"""
        tool_idx, prob = self.select_action(task_embedding)
        return TOOLS[tool_idx], prob

    def learn_from_outcome(self, task, tool_used, success):
        reward = 1.0 if success else -0.1
        next_task = get_next_task()  # Or terminal state
        self.update(task, tool_used, reward, next_task, done=success)

Latest Developments & Research

Recent Breakthroughs (2022-2025)

  1. RLHF for LLMs (2022-2023)

    • ChatGPT, Claude, GPT-4 use PPO to align with human preferences
    • Paper: “Training language models to follow instructions with human feedback” (OpenAI, 2022)
  2. Direct Preference Optimization (DPO) (2023)

    • Bypasses actor-critic entirely for LLM alignment
    • Simpler than PPO, comparable results
    • Paper: Rafailov et al., “Direct Preference Optimization”
  3. Offline RL + Actor-Critic (2023-2024)

    • Learn from static datasets (no environment interaction)
    • Conservative Q-Learning (CQL), Implicit Q-Learning (IQL)
  4. Multi-Agent PPO (2024)

    • Coordinate multiple agents in shared environments
    • Applications: Autonomous vehicle fleets, multiplayer games

Open Problems

Benchmarks

Cross-Disciplinary Insight

Neuroscience Parallel

The actor-critic architecture mirrors the brain’s dopaminergic system: dopamine neurons fire in proportion to a reward prediction error, much like the critic’s TD error, while downstream circuits strengthen or weaken action tendencies, much like the actor.

When reward exceeds expectation → dopamine spike → strengthen that action pathway. This biological plausibility makes actor-critic particularly elegant.

Economic Theory

Policy gradients relate to mechanism design and econometrics: the REINFORCE estimator is the likelihood-ratio (score-function) method used to estimate how changing policy parameters shifts expected outcomes.

Control Theory

Actor-Critic can be read as a form of adaptive control: the critic acts as an online evaluator of the current controller’s performance, while the actor is the controller being tuned from that feedback.

Daily Challenge

Thought Exercise (15 minutes)

Consider a customer service chatbot that needs to learn when to:

  1. Answer directly
  2. Ask clarifying questions
  3. Escalate to human

Questions:

  1. How would you define the state space? (User message embedding? Conversation history?)
  2. What should the reward signal be? (User satisfaction? Resolution time?)
  3. Would you use discrete actions (3 choices) or continuous (confidence scores)?
  4. What could go wrong if the critic is poorly calibrated?

Coding Exercise (30 minutes)

Extend the A2CAgent above to:

  1. Add entropy regularization to encourage exploration:

    entropy = -torch.sum(probs * torch.log(probs + 1e-8))  # epsilon avoids log(0)
    actor_loss = actor_loss - 0.01 * entropy  # Encourage diversity
    
  2. Implement a simple gridworld where the agent learns to navigate to a goal:

    • State: (x, y) position
    • Actions: up, down, left, right
    • Reward: +1 at goal, -0.01 per step
  3. Visualize how the value function (critic) evolves over training.

Bonus: Compare convergence speed with and without the critic (pure REINFORCE vs. A2C).

References & Further Reading

Foundational Papers

  1. Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.”

  2. Sutton, R. S., et al. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.”

  3. Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.”

Modern Implementations

Tutorials & Courses

Recent Research

