Policy Gradient Methods and Actor-Critic Architectures
Concept Introduction
In reinforcement learning, a policy π(a|s) is a probability distribution over actions a given state s. Traditional value-based methods (like Q-learning) learn the value of states or state-action pairs and derive a policy from those values. Policy gradient methods instead directly optimize the policy parameters θ by following the gradient of expected reward.
Actor-Critic combines policy gradients (actor) with value function approximation (critic). The actor proposes actions based on the current policy π_θ, while the critic estimates the value function V(s) or advantage function A(s,a) to provide feedback. This reduces variance in policy gradient estimates while maintaining the benefits of direct policy optimization.
Historical & Theoretical Context
- 1992: Ronald Williams introduced REINFORCE, the foundational policy gradient algorithm, showing that gradients of expected reward with respect to policy parameters can be computed via the likelihood-ratio (log-derivative) trick.
- 1999-2000: Sutton et al. developed Actor-Critic methods, combining the best of policy gradients and value-based learning.
- 2015: The deep learning revolution brought Deep Deterministic Policy Gradient (DDPG) and Trust Region Policy Optimization (TRPO).
- 2017: OpenAI’s Proximal Policy Optimization (PPO) became the de facto standard, powering systems like ChatGPT’s RLHF training.
Value-based methods (Q-learning, DQN) struggle with:
- Continuous action spaces (infinitely many actions to evaluate)
- Stochastic policies (sometimes randomness is optimal)
- Large action spaces (combinatorial explosion)
Policy gradients solve these by directly parameterizing the policy, making them essential for robotics, game AI, and language model fine-tuning.
Algorithms & Mathematics
The Policy Gradient Theorem
The goal is to maximize expected cumulative reward:
J(θ) = E_τ~π_θ [R(τ)]
Where τ is a trajectory (sequence of states and actions), and R(τ) is the total reward.
The policy gradient theorem states:
∇_θ J(θ) = E_τ~π_θ [∑_t ∇_θ log π_θ(a_t|s_t) · G_t]
Where G_t is the return (cumulative reward) from time t onward.
Intuition: Increase probability of actions that led to high rewards, decrease probability of actions that led to low rewards.
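The log-probability term in the theorem can be computed directly with autograd. A minimal sketch for a softmax policy over three actions (the parameter values, sampled action, and return are illustrative):

```python
import torch

# Softmax policy over 3 actions, parameterized by logits theta
theta = torch.zeros(3, requires_grad=True)
probs = torch.softmax(theta, dim=-1)

action, G_t = 1, 5.0  # a sampled action and its observed return (illustrative)
log_prob = torch.log(probs[action])

# Gradient ascent direction: grad_theta log pi(a|s) * G_t
(log_prob * G_t).backward()
print(theta.grad)  # positive for the taken action, negative for the others
```

Because the return is positive, the gradient pushes probability mass toward the action that was taken, exactly matching the intuition above.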
REINFORCE Algorithm (Vanilla Policy Gradient)
# Pseudocode
for episode in episodes:
    τ = generate_trajectory(π_θ)  # Run policy to collect data
    for t in range(T):
        G_t = sum of rewards from t to end
        ∇J ≈ ∇_θ log π_θ(a_t|s_t) · G_t
        θ = θ + α · ∇J  # Gradient ascent
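The loop above can be made runnable on a toy two-armed bandit, which strips the problem down to a single policy-gradient update per step (the arm rewards, learning rate, and episode count are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)   # policy logits
optimizer = torch.optim.Adam([theta], lr=0.1)
true_rewards = [0.0, 1.0]                    # arm 1 pays off (illustrative)

for episode in range(200):
    probs = torch.softmax(theta, dim=-1)
    action = torch.multinomial(probs, 1).item()
    G = true_rewards[action]                 # one-step "return"
    loss = -torch.log(probs[action]) * G     # negate for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(theta, dim=-1))  # probability mass shifts toward arm 1
```

Each update reinforces the sampled action in proportion to its return, so the policy concentrates on the better arm over time.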
Problem: High variance because G_t includes random future rewards.
Actor-Critic: Reducing Variance
Instead of using full return G_t, use a baseline (typically the value function):
Advantage: A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
≈ r_t + γV(s_{t+1}) - V(s_t) (TD error)
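A quick numeric check of the TD-error form of the advantage (the reward and value estimates are illustrative):

```python
gamma = 0.99
r_t = 1.0          # reward received at time t
V_s = 2.0          # critic's estimate of the current state
V_s_next = 1.5     # critic's estimate of the next state

td_error = r_t + gamma * V_s_next - V_s  # ≈ 0.485
```

A positive TD error means the outcome was better than the critic expected, so the action just taken should be made more likely.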
Algorithm:
for episode in episodes:
    s = initial_state
    while not done:
        a = actor.select_action(s)  # Policy π_θ
        s', r, done = env.step(a)
        # Critic update (TD learning)
        td_error = r + γ·V_w(s') - V_w(s)
        w = w + α_critic · td_error · ∇_w V_w(s)
        # Actor update (policy gradient)
        θ = θ + α_actor · td_error · ∇_θ log π_θ(a|s)
        s = s'
Design Patterns & Architectures
Integration with Agent Systems
graph TD
A[Environment State] --> B[Actor Network]
A --> C[Critic Network]
B --> D[Action]
D --> E[Environment]
E --> F[Reward + Next State]
F --> C
C --> G[Value Estimate / TD Error]
G --> B
G --> C
Common Architectural Patterns
- Shared Representations: Actor and critic share early layers (common in vision tasks)
- Separate Networks: Independent actor and critic (more stable, slower)
- Distributed Learning: Multiple actors collect data, central critic updates (A3C, IMPALA)
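The shared-representation pattern can be sketched in PyTorch as a single trunk feeding two heads (the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One trunk feeds both a policy head (actor) and a value head (critic)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, action_dim)  # action logits
        self.value_head = nn.Linear(hidden, 1)            # V(s)

    def forward(self, state):
        features = self.trunk(state)
        probs = torch.softmax(self.policy_head(features), dim=-1)
        return probs, self.value_head(features)

net = SharedActorCritic(state_dim=4, action_dim=2)
probs, value = net(torch.zeros(4))
```

Sharing the trunk lets both heads reuse the same learned features, at the cost of coupling their gradients, which is why the separate-network variant tends to be more stable.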
Connections to Agent Frameworks
- LangGraph: Could use actor-critic for learning routing policies in multi-agent workflows
- AutoGen: Fine-tune conversation agents using PPO (like ChatGPT)
- Robotics Controllers: Actor outputs motor commands, critic evaluates trajectory quality
Practical Application
Minimal Actor-Critic in Python
import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state):
        return self.net(state)

class A2CAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        action = torch.multinomial(probs, 1).item()
        return action, probs[action]

    def update(self, state, action, reward, next_state, done):
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)
        # Compute TD error
        value = self.critic(state)
        next_value = 0 if done else self.critic(next_state).detach()
        td_target = reward + self.gamma * next_value
        td_error = td_target - value
        # Update critic (minimize squared TD error)
        critic_loss = td_error.pow(2)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        # Update actor (maximize log_prob * advantage)
        probs = self.actor(state)
        log_prob = torch.log(probs[action])
        actor_loss = -log_prob * td_error.detach()  # Detach so gradients don't flow into the critic
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

# Usage example (classic gym API; newer gymnasium returns extra values from reset/step)
# env = gym.make('CartPole-v1')
# agent = A2CAgent(state_dim=4, action_dim=2)
#
# for episode in range(1000):
#     state = env.reset()
#     done = False
#
#     while not done:
#         action, _ = agent.select_action(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.update(state, action, reward, next_state, done)
#         state = next_state
Using in a Tool-Using Agent
# Example: Learning to select tools optimally
class ToolSelectionAgent(A2CAgent):
    def __init__(self, num_tools):
        super().__init__(
            state_dim=512,  # Embedding of the current task
            action_dim=num_tools
        )

    def choose_tool(self, task_embedding):
        """Learn which tool to use for a given task."""
        tool_idx, prob = self.select_action(task_embedding)
        return TOOLS[tool_idx], prob  # TOOLS: application-defined list of tools

    def learn_from_outcome(self, task, tool_used, success):
        reward = 1.0 if success else -0.1
        next_task = get_next_task()  # Application-defined; or a terminal state
        self.update(task, tool_used, reward, next_task, done=success)
Latest Developments & Research
Recent Breakthroughs (2022-2025)
RLHF for LLMs (2022-2023)
- ChatGPT, Claude, GPT-4 use PPO to align with human preferences
- Paper: “Training language models to follow instructions with human feedback” (OpenAI, 2022)
Direct Preference Optimization (DPO) (2023)
- Bypasses actor-critic entirely for LLM alignment
- Simpler than PPO, comparable results
- Paper: Rafailov et al., “Direct Preference Optimization”
Offline RL + Actor-Critic (2023-2024)
- Learn from static datasets (no environment interaction)
- Conservative Q-Learning (CQL), Implicit Q-Learning (IQL)
Multi-Agent PPO (2024)
- Coordinate multiple agents in shared environments
- Applications: Autonomous vehicle fleets, multiplayer games
Open Problems
- Reward specification: How to define “good” in complex domains?
- Sample efficiency: Can we learn with 10x fewer samples?
- Generalization: Transfer learned policies to new tasks
- Safety: Ensure agents don’t take catastrophic actions during training
Benchmarks
- MuJoCo: Continuous control (humanoid walking, manipulation)
- Atari: Classic game benchmark
- OpenAI Gym: Standardized RL environments
- IsaacGym: GPU-accelerated physics simulation (10,000s parallel envs)
Cross-Disciplinary Insight
Neuroscience Parallel
The actor-critic architecture mirrors the brain’s dopaminergic system:
- Actor (Basal Ganglia): Selects actions based on learned patterns
- Critic (Ventral Tegmental Area): Produces dopamine signals encoding reward prediction error (TD error!)
When reward exceeds expectation → dopamine spike → strengthen that action pathway. This biological plausibility makes actor-critic particularly elegant.
Economic Theory
Policy gradients relate to mechanism design:
- Actor = strategic agent
- Critic = market evaluator
- Training = equilibrium seeking
The REINFORCE gradient estimator is analogous to the likelihood-ratio method in econometrics for estimating policy effects.
Control Theory
Actor-Critic is a form of adaptive control:
- Critic learns a value function that can serve as a Lyapunov function (a stability certificate)
- Actor improves the control policy so that this function decreases along trajectories
- Under certain conditions, the process converges to stable controllers
Daily Challenge
Thought Exercise (15 minutes)
Consider a customer service chatbot that needs to learn when to:
- Answer directly
- Ask clarifying questions
- Escalate to human
Questions:
- How would you define the state space? (User message embedding? Conversation history?)
- What should the reward signal be? (User satisfaction? Resolution time?)
- Would you use discrete actions (3 choices) or continuous (confidence scores)?
- What could go wrong if the critic is poorly calibrated?
Coding Exercise (30 minutes)
Extend the A2CAgent above to:
Add entropy regularization to encourage exploration:
entropy = -torch.sum(probs * torch.log(probs))
actor_loss = actor_loss - 0.01 * entropy  # Encourage diversity
Implement a simple gridworld where the agent learns to navigate to a goal:
- State: (x, y) position
- Actions: up, down, left, right
- Reward: +1 at goal, -0.01 per step
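As a starting point, a minimal environment matching the spec above might look like this (the grid size, start position, and coordinate conventions are assumptions):

```python
class GridWorld:
    """Minimal gridworld: start at (0, 0), reach the opposite corner."""
    MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Clamp moves to the grid boundaries
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.01  # +1 at goal, -0.01 per step
        return self.pos, reward, done
```

The (x, y) state can be fed to the A2CAgent directly (state_dim=2) or one-hot encoded per cell, which usually trains faster for small grids.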
Visualize how the value function (critic) evolves over training.
Bonus: Compare convergence speed with and without the critic (pure REINFORCE vs. A2C).
References & Further Reading
Foundational Papers
Williams, R. J. (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning.”
Sutton, R. S., et al. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.”
Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.”
Modern Implementations
Stable-Baselines3: Production-ready PPO, A2C implementations
CleanRL: Single-file implementations for learning
RLlib (Ray): Distributed RL at scale
Tutorials & Courses
Spinning Up in Deep RL (OpenAI): Best conceptual introduction
David Silver’s RL Course: Lecture 7 covers policy gradients
Hugging Face Deep RL Course: Hands-on with PPO
Recent Research
DPO Paper (2023): Alternative to PPO for LLM alignment
Sample-Efficient RL Survey (2024)