Successor Representations a Map of Where You Are Going

03 Mar 2026

Successor Representations (SRs) give an agent a cached map of where each state leads, decoupled from what those destinations are worth. When a reward location changes, the agent only needs to update one half of its knowledge; the structural map stays fixed.

Concept Introduction

In standard Q-learning, an agent learns a single number per state-action pair: the expected cumulative reward. That number bakes together two things: the structure of the environment (where can you go from here?) and the reward function (what’s valuable?).

Successor Representations separate these two concerns. The SR answers: “If I start here and follow my current policy, which future states will I visit, and how often?” The reward function then answers: “What’s each of those states worth?” If the reward function changes (a new goal, a different task) only one half needs updating.

Formally, the SR for policy $\pi$ starting at state $s$ is:

$$M^\pi(s, s') = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{1}[s_t = s'] \;\middle|\; s_0 = s\right]$$

Here $M^\pi(s, s')$ counts the discounted expected number of future visits to state $s'$ when starting from state $s$ under policy $\pi$. Think of each row of the matrix $M^\pi$ as a “future occupancy fingerprint” of state $s$.

The key identity that makes this useful:

$$V^\pi(s) = \sum_{s'} M^\pi(s, s')\, r(s')$$

Or in matrix form: $\mathbf{v}^\pi = M^\pi \mathbf{r}$. The value function is just a linear combination of the SR rows, weighted by rewards. Change the reward vector $\mathbf{r}$ and you get the new value function instantly, with no re-training required.

Historical & Theoretical Context

The Successor Representation was introduced by Peter Dayan in 1993 (“Improving Generalization for Temporal Difference Learning: The Successor Representation”, Neural Computation). Dayan noticed that TD learning was learning both environmental structure and reward signal simultaneously, making transfer between related tasks needlessly expensive.

The idea sat relatively quiet for two decades before a 2017 renaissance:

Momennejad et al. (2017) showed that the human hippocampus appears to encode something strikingly like the SR, linking it to the neuroscience of place cells and cognitive maps.
Barreto et al. (2017) generalized SRs to Successor Features (SFs) for continuous/deep settings, and proved the Generalized Policy Improvement (GPI) theorem, enabling cross-task transfer.

The SR sits in a conceptually clean position between model-based and model-free RL, and that clarity has made it increasingly influential.

Algorithms & Math

Computing the Tabular SR

The SR satisfies a Bellman-like equation:

$$M^\pi(s, s') = \mathbf{1}[s = s'] + \gamma \sum_{a} \pi(a|s) \sum_{s''} P(s''|s,a)\, M^\pi(s'', s')$$

In matrix form: $M^\pi = (I - \gamma P^\pi)^{-1}$, where $P^\pi$ is the policy-weighted transition matrix. This is the resolvent of the transition operator, a spectral fact deeply connected to graph theory and the geometry of state spaces.

TD Learning for the SR

You can learn $M^\pi$ incrementally via a TD update that mirrors the Bellman equation above:

Initialize M(s, s') for all s, s' (e.g., M = Identity)

For each transition (s, a, s_next):
    For each s' in states:
        one_hot = 1 if s == s' else 0
        target  = one_hot + gamma * M(s_next, s')
        M(s, s') += alpha * (target - M(s, s'))

The one-hot term says “I’m in $s$ right now” and the bootstrapped remainder propagates future visitation forward, just like TD(0) for value learning.

Successor Features (Deep Extension)

For continuous state spaces, replace state-indicator features with a learned feature function $\phi(s) \in \mathbb{R}^d$. The successor feature vector becomes:

$$\psi^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \;\middle|\; s_0 = s\right]$$

Reward is then approximated as $r(s) \approx \phi(s)^\top \mathbf{w}$, and the value function becomes:

$$V^\pi(s) = \psi^\pi(s)^\top \mathbf{w}$$

Transfer across tasks with different $\mathbf{w}$ (same environment structure) is then a simple inner product.

Design Patterns & Architectures

The SR creates a natural two-module agent architecture:

graph LR
    S[State s] --> SR["SR Encoder<br/>M or ψ"]
    SR --> VP["Value Head<br/>v = M·r or ψ·w"]
    R["Reward Vector r or w"] --> VP
    VP --> POL["Policy π"]

    subgraph "Reusable across tasks"
        SR
    end
    subgraph "Task-specific"
        R
    end

This connects naturally to several patterns:

Planner-Executor loop: The SR acts as a cached forward model that the planner can query cheaply.
Memory architecture: Each state’s SR row is a retrievable “future trajectory summary”, queryable like a key-value memory.
Skill libraries: Different $\mathbf{w}$ vectors represent different goals; GPI selects the best available policy for any new task from a library.

Practical Application

import numpy as np

class TabularSR:
    """
    Learns the Successor Representation M(s, s') via TD updates.
    Works with any discrete, finite MDP.
    """
    def __init__(self, n_states: int, gamma: float = 0.9, alpha: float = 0.1):
        self.gamma = gamma
        self.alpha = alpha
        # Initialize SR as identity: currently, only I've visited myself
        self.M = np.eye(n_states)

    def update(self, s: int, s_next: int):
        """TD update for a single transition under the current policy."""
        one_hot = np.zeros(self.M.shape[0])
        one_hot[s] = 1.0
        target = one_hot + self.gamma * self.M[s_next]
        self.M[s] += self.alpha * (target - self.M[s])

    def value(self, reward_vector: np.ndarray) -> np.ndarray:
        """Compute value function for any reward vector instantly."""
        return self.M @ reward_vector

    def transfer(self, new_reward_vector: np.ndarray) -> np.ndarray:
        """Zero-shot transfer: same SR, new task rewards."""
        return self.value(new_reward_vector)


# --- Demo: GridWorld with two reward locations ---
n_states = 16  # 4x4 grid
sr = TabularSR(n_states=n_states, gamma=0.9)

# Simulate random-walk transitions (for learning M)
np.random.seed(42)
for _ in range(5000):
    s = np.random.randint(n_states)
    # Random neighbor in grid
    neighbors = [s - 1, s + 1, s - 4, s + 4]
    s_next = np.random.choice([n for n in neighbors if 0 <= n < n_states])
    sr.update(s, s_next)

# Task 1: reward at state 15 (bottom-right corner)
r1 = np.zeros(n_states); r1[15] = 1.0
v1 = sr.value(r1)
print("Task 1 values (top row):", v1[:4].round(3))

# Task 2: reward moved to state 3 (top-right) — instant transfer, no retraining
r2 = np.zeros(n_states); r2[3] = 1.0
v2 = sr.transfer(r2)
print("Task 2 values (top row):", v2[:4].round(3))

In a LangGraph setting, the SR pattern maps naturally to routing by goal: store a successor feature vector per goal, and at runtime dot-product with the current state embedding to route to the right sub-agent without re-planning from scratch.

Latest Developments & Research

Deep Successor Features (Zhang et al., 2019) extended SFs to pixel-level Atari games, training $\phi$ and $\psi$ jointly with auxiliary losses. The learned feature space factored reward from dynamics even in visually complex domains.

Eigenoptions (Machado et al., 2018, 2023) showed that the eigenvectors of the SR matrix define a set of intrinsically useful “options” (temporally extended actions) that span the state space efficiently. This gives a spectral method for automatically discovering skills; the eigenvectors literally encode the bottleneck states and long-horizon directions of the environment.

Hippocampal SRs (Stachenfeld et al., 2017; Whittington et al., 2020) showed that grid cells and place cells in mammalian brains approximate eigenvectors of the SR under different environmental geometries. This remains one of the tightest connections between modern RL theory and cognitive neuroscience.

Online SR for LLM planning (2024-2025) is an emerging direction: using learned trajectory statistics from LLM rollouts to build approximate SRs over abstract “plan states”, enabling fast policy adjustment when a sub-goal becomes unavailable. Early results show 2-3× fewer LLM calls during replanning compared to full ReAct restarts.

Open problems include:

Learning $\phi$ that remains stable as the policy improves (non-stationarity problem)
SR under non-Markovian observation histories (relevant for LLM agents)
Efficient online SR updates in high-dimensional continuous spaces

Cross-Disciplinary Insight

The SR is, at its core, the resolvent $(I - \gamma P)^{-1}$ of the transition operator, a concept from spectral graph theory and functional analysis. Its eigenvectors are the graph Laplacian eigenfunctions of the state graph, which show up in:

Spectral clustering (grouping similar future-trajectory states)
Diffusion maps (dimensionality reduction based on random-walk geometry)
PageRank (Google’s original algorithm is a resolvent of the web link matrix)

In economics, the SR mirrors the concept of a present-value discounted stream of visits, essentially the same math used to price infinite-horizon cash flows. An agent learning SRs is, in a formal sense, pricing the “visitation futures” of each state.

In cognitive science, the SR offers a computational account of why humans can rapidly generalize: our mental maps encode where paths lead, not just how much each step is worth.

Daily Challenge

Exercise: Eigenoptions from the SR

Implement a small 5×5 gridworld with walls.
Run random-walk episodes and learn the tabular SR matrix $M$.
Compute the eigenvectors of $M$ using numpy.linalg.eig.
Visualize the first 4 eigenvectors as heatmaps on the grid. You should see patterns that roughly highlight “bottleneck” states and room boundaries.
Bonus: Use those eigenvectors as intrinsic reward signals and observe whether an agent discovers useful sub-goals faster than one using pure reward.

This exercise makes the geometry of the state space visible. The SR is literally a map of the agent’s possible futures, not just a computational trick.

References & Further Reading

Foundational Papers

Dayan, P. (1993) “Improving Generalization for Temporal Difference Learning: The Successor Representation” — Neural Computation 5(4). The original.
Barreto et al. (2017) “Successor Features for Transfer in Reinforcement Learning” — NeurIPS. Introduces SF&GPI.
Stachenfeld et al. (2017) “The hippocampus as a predictive map” — Nature Neuroscience. The neuroscience connection.

Extensions

Machado et al. (2018) “Eigenoption Discovery Through the Deep Successor Representation” — ICLR.
Zhang et al. (2019) “Learning near Optimal Policies with Low Inherent Bellman Error” — deep SF paper.
Whittington et al. (2020) “The Tolman-Eichenbaum Machine” — Cell. Unifying grid cells, place cells, and SRs.

Blog Posts & Tutorials

“The Successor Representation” — Dileep George’s tutorial on the spectral view
“Successor Features & GPI: Transfer RL Made Simple” — Lilian Weng’s blog
“Eigenoptions: Skills from the SR” — Marlos Machado’s talk slides

GitHub Repositories

successor-features-A3C: PyTorch implementation of deep successor features with GPI transfer
sr-eigenoptions: Tabular SR + eigenoption discovery in simple gridworlds
gym-minigrid: Lightweight gridworld environments ideal for SR experiments

Engineering Notes