Policy Gradient Methods and Actor-Critic Architectures
Concept Introduction
In reinforcement learning, a policy π(a|s) is a probability distribution over actions a given state s. Traditional value-based methods (like Q-learning) learn the value of states or state-action pairs and derive a policy from those values. Policy gradient methods instead directly optimize the policy parameters θ by following the gradient of expected reward.
Actor-Critic combines policy gradients (actor) with value function approximation (critic). The actor proposes actions based on the current policy π_θ, while the critic estimates the value function V(s) or advantage function A(s,a) to provide feedback. This reduces variance in policy gradient estimates while maintaining the benefits of direct policy optimization.
Historical & Theoretical Context
- 1992: Ronald Williams introduced REINFORCE, the foundational policy gradient algorithm, showing that gradients of expected reward with respect to policy parameters can be computed via the likelihood-ratio (log-derivative) trick.
- 1999-2000: Sutton et al. developed Actor-Critic methods, combining the best of policy gradients and value-based learning.
- 2015: The deep learning revolution brought Deep Deterministic Policy Gradient (DDPG) and Trust Region Policy Optimization (TRPO).
- 2017: OpenAI’s Proximal Policy Optimization (PPO) became the de facto standard, powering systems like ChatGPT’s RLHF training.
Value-based methods (Q-learning, DQN) struggle with:
- Continuous action spaces (infinitely many actions to evaluate)
- Stochastic policies (sometimes randomness is optimal)
- Large action spaces (combinatorial explosion)
Policy gradients solve these by directly parameterizing the policy, making them essential for robotics, game AI, and language model fine-tuning.
Algorithms & Mathematics
The Policy Gradient Theorem
The goal is to maximize expected cumulative reward:
J(θ) = E_τ~π_θ [R(τ)]
Where τ is a trajectory (sequence of states and actions), and R(τ) is the total reward.
The policy gradient theorem states:
∇_θ J(θ) = E_τ~π_θ [∑_t ∇_θ log π_θ(a_t|s_t) · G_t]
Where G_t is the return (cumulative reward) from time t onward.
Intuition: Increase probability of actions that led to high rewards, decrease probability of actions that led to low rewards.
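The log-probability term in the theorem can be computed directly with autograd. A minimal sketch for a softmax policy over three actions (the parameter values, sampled action, and return are illustrative):

```python
import torch

# Softmax policy over 3 actions, parameterized by logits theta
theta = torch.zeros(3, requires_grad=True)
probs = torch.softmax(theta, dim=-1)

action, G_t = 1, 5.0  # a sampled action and its observed return (illustrative)
log_prob = torch.log(probs[action])

# Gradient ascent direction: grad_theta log pi(a|s) * G_t
(log_prob * G_t).backward()
print(theta.grad)  # positive for the taken action, negative for the others
```

Because the return is positive, the gradient pushes probability mass toward the action that was taken, exactly matching the intuition above.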
REINFORCE Algorithm (Vanilla Policy Gradient)
# Pseudocode
for episode in episodes:
    τ = generate_trajectory(π_θ)  # Run policy to collect data
    for t in range(T):
        G_t = sum of rewards from t to end
        ∇J ≈ ∇_θ log π_θ(a_t|s_t) · G_t
        θ = θ + α · ∇J  # Gradient ascent
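The loop above can be made runnable on a toy two-armed bandit, which strips the problem down to a single policy-gradient update per step (the arm rewards, learning rate, and episode count are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)   # policy logits
optimizer = torch.optim.Adam([theta], lr=0.1)
true_rewards = [0.0, 1.0]                    # arm 1 pays off (illustrative)

for episode in range(200):
    probs = torch.softmax(theta, dim=-1)
    action = torch.multinomial(probs, 1).item()
    G = true_rewards[action]                 # one-step "return"
    loss = -torch.log(probs[action]) * G     # negate for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(theta, dim=-1))  # probability mass shifts toward arm 1
```

Each update reinforces the sampled action in proportion to its return, so the policy concentrates on the better arm over time.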
Problem: High variance because G_t includes random future rewards.
Actor-Critic: Reducing Variance
Instead of using full return G_t, use a baseline (typically the value function):
Advantage: A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
≈ r_t + γV(s_{t+1}) - V(s_t) (TD error)
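A quick numeric check of the TD-error form of the advantage (the reward and value estimates are illustrative):

```python
gamma = 0.99
r_t = 1.0          # reward received at time t
V_s = 2.0          # critic's estimate of the current state
V_s_next = 1.5     # critic's estimate of the next state

td_error = r_t + gamma * V_s_next - V_s  # ≈ 0.485
```

A positive TD error means the outcome was better than the critic expected, so the action just taken should be made more likely.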
Algorithm:
for episode in episodes:
    s = initial_state
    while not done:
        a = actor.select_action(s)  # Policy π_θ
        s', r, done = env.step(a)
        # Critic update (TD learning)
        td_error = r + γ·V_w(s') - V_w(s)
        w = w + α_critic · td_error · ∇_w V_w(s)
        # Actor update (policy gradient)
        θ = θ + α_actor · td_error · ∇_θ log π_θ(a|s)
        s = s'
Design Patterns & Architectures
Integration with Agent Systems
graph TD
A[Environment State] --> B[Actor Network]
A --> C[Critic Network]
B --> D[Action]
D --> E[Environment]
E --> F[Reward + Next State]
F --> C
C --> G[Value Estimate / TD Error]
G --> B
G --> C
Common Architectural Patterns
- Shared Representations: Actor and critic share early layers (common in vision tasks)
- Separate Networks: Independent actor and critic (more stable, slower)
- Distributed Learning: Multiple actors collect data, central critic updates (A3C, IMPALA)
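The shared-representation pattern can be sketched in PyTorch as a single trunk feeding two heads (the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One trunk feeds both a policy head (actor) and a value head (critic)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, action_dim)  # action logits
        self.value_head = nn.Linear(hidden, 1)            # V(s)

    def forward(self, state):
        features = self.trunk(state)
        probs = torch.softmax(self.policy_head(features), dim=-1)
        return probs, self.value_head(features)

net = SharedActorCritic(state_dim=4, action_dim=2)
probs, value = net(torch.zeros(4))
```

Sharing the trunk lets both heads reuse the same learned features, at the cost of coupling their gradients, which is why the separate-network variant tends to be more stable.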
Connections to Agent Frameworks
- LangGraph: Could use actor-critic for learning routing policies in multi-agent workflows
- AutoGen: Fine-tune conversation agents using PPO (like ChatGPT)
- Robotics Controllers: Actor outputs motor commands, critic evaluates trajectory quality
Practical Application
Minimal Actor-Critic in Python
import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state):
        return self.net(state)

class A2CAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        action = torch.multinomial(probs, 1).item()
        return action, probs[action]

    def update(self, state, action, reward, next_state, done):
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)
        # Compute TD error
        value = self.critic(state)
        next_value = 0 if done else self.critic(next_state).detach()
        td_target = reward + self.gamma * next_value
        td_error = td_target - value
        # Update critic (minimize squared TD error)
        critic_loss = td_error.pow(2)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        # Update actor (maximize log_prob * advantage)
        probs = self.actor(state)
        log_prob = torch.log(probs[action])
        actor_loss = -log_prob * td_error.detach()  # Detach so gradients don't flow into the critic
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

# Usage example (classic gym API; newer gymnasium returns extra values from reset/step)
# env = gym.make('CartPole-v1')
# agent = A2CAgent(state_dim=4, action_dim=2)
#
# for episode in range(1000):
#     state = env.reset()
#     done = False
#
#     while not done:
#         action, _ = agent.select_action(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.update(state, action, reward, next_state, done)
#         state = next_state
Using in a Tool-Using Agent
# Example: Learning to select tools optimally
class ToolSelectionAgent(A2CAgent):
    def __init__(self, num_tools):
        super().__init__(
            state_dim=512,  # Embedding of the current task
            action_dim=num_tools
        )

    def choose_tool(self, task_embedding):
        """Learn which tool to use for a given task."""
        tool_idx, prob = self.select_action(task_embedding)
        return TOOLS[tool_idx], prob  # TOOLS: application-defined list of tools

    def learn_from_outcome(self, task, tool_used, success):
        reward = 1.0 if success else -0.1
        next_task = get_next_task()  # Application-defined; or a terminal state
        self.update(task, tool_used, reward, next_task, done=success)
Latest Developments & Research
Recent Breakthroughs (2022-2025)
RLHF for LLMs (2022-2023)
- ChatGPT, Claude, GPT-4 use PPO to align with human preferences
- Paper: “Training language models to follow instructions with human feedback” (OpenAI, 2022)
Direct Preference Optimization (DPO) (2023)
- Bypasses actor-critic entirely for LLM alignment
- Simpler than PPO, comparable results
- Paper: Rafailov et al., “Direct Preference Optimization”
Offline RL + Actor-Critic (2023-2024)
- Learn from static datasets (no environment interaction)
- Conservative Q-Learning (CQL), Implicit Q-Learning (IQL)
Multi-Agent PPO (2024)
- Coordinate multiple agents in shared environments
- Applications: Autonomous vehicle fleets, multiplayer games
Open Problems
- Reward specification: How to define “good” in complex domains?
- Sample efficiency: Can we learn with 10x fewer samples?
- Generalization: Transfer learned policies to new tasks
- Safety: Ensure agents don’t take catastrophic actions during training
Benchmarks
- MuJoCo: Continuous control (humanoid walking, manipulation)
- Atari: Classic game benchmark
- OpenAI Gym: Standardized RL environments
- IsaacGym: GPU-accelerated physics simulation (10,000s parallel envs)
Cross-Disciplinary Insight
Neuroscience Parallel
The actor-critic architecture mirrors the brain’s dopaminergic system:
- Actor (Basal Ganglia): Selects actions based on learned patterns
- Critic (Ventral Tegmental Area): Produces dopamine signals encoding reward prediction error (TD error!)
When reward exceeds expectation → dopamine spike → strengthen that action pathway. This biological plausibility makes actor-critic particularly elegant.
Economic Theory
Policy gradients relate to mechanism design:
- Actor = strategic agent
- Critic = market evaluator
- Training = equilibrium seeking
The REINFORCE gradient estimator is analogous to the likelihood-ratio method in econometrics for estimating policy effects.
Control Theory
Actor-Critic is a form of adaptive control:
- Critic learns a value function that can serve as a Lyapunov function (a stability certificate)
- Actor improves the control policy so that this function decreases along trajectories
- Under certain conditions, the process converges to stable controllers
Daily Challenge
Thought Exercise (15 minutes)
Consider a customer service chatbot that needs to learn when to:
- Answer directly
- Ask clarifying questions
- Escalate to human
Questions:
- How would you define the state space? (User message embedding? Conversation history?)
- What should the reward signal be? (User satisfaction? Resolution time?)
- Would you use discrete actions (3 choices) or continuous (confidence scores)?
- What could go wrong if the critic is poorly calibrated?
Coding Exercise (30 minutes)
Extend the A2CAgent above to:
Add entropy regularization to encourage exploration:
entropy = -torch.sum(probs * torch.log(probs))
actor_loss = actor_loss - 0.01 * entropy  # Encourage diversity
Implement a simple gridworld where the agent learns to navigate to a goal:
- State: (x, y) position
- Actions: up, down, left, right
- Reward: +1 at goal, -0.01 per step
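As a starting point, a minimal environment matching the spec above might look like this (the grid size, start position, and coordinate conventions are assumptions):

```python
class GridWorld:
    """Minimal gridworld: start at (0, 0), reach the opposite corner."""
    MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Clamp moves to the grid boundaries
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.01  # +1 at goal, -0.01 per step
        return self.pos, reward, done
```

The (x, y) state can be fed to the A2CAgent directly (state_dim=2) or one-hot encoded per cell, which usually trains faster for small grids.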
Visualize how the value function (critic) evolves over training.
Bonus: Compare convergence speed with and without the critic (pure REINFORCE vs. A2C).
References & Further Reading
Foundational Papers
Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.”
Sutton, R. S., et al. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.”
Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.”
Modern Implementations
Stable-Baselines3: Production-ready PPO, A2C implementations
CleanRL: Single-file implementations for learning
RLlib (Ray): Distributed RL at scale
Tutorials & Courses
Spinning Up in Deep RL (OpenAI): Best conceptual introduction
David Silver’s RL Course: Lecture 7 covers policy gradients
Hugging Face Deep RL Course: Hands-on with PPO
Recent Research
DPO Paper (2023): Alternative to PPO for LLM alignment
Sample-Efficient RL Survey (2024)