Temporal Abstraction and the Options Framework: How Agents Learn to Think in Subgoals

22 Feb 2026

Most reinforcement learning agents think one step at a time. They observe the environment, pick an action, collect a reward, and repeat. This works fine for simple tasks, but real-world problems often require sustained commitment to a plan: navigate to the kitchen, open the fridge, grab the milk, then pour it. Each of those steps involves dozens of low-level actions, and treating them all as independent decisions makes exploration and credit assignment over long horizons very hard.

The Options Framework addresses this by letting agents reason at more than one level of abstraction: they choose not just which action to take but which extended behavior to pursue.

Concept Introduction

Temporal abstraction means thinking in terms of extended behaviors that span many low-level steps rather than selecting each action independently. In agent design, an option formalizes this: a mini-policy with a clear start condition and a stopping criterion. The agent commits to an option, executes it, then reassesses at the end.

Formally, the Options Framework extends the standard Markov Decision Process (MDP) with temporally extended actions. Instead of choosing from the primitive action set $\mathcal{A}$, the agent picks from an option set $\mathcal{O}$, where each option $o \in \mathcal{O}$ is a triple:

$$o = (I_o, \pi_o, \beta_o)$$

$I_o \subseteq \mathcal{S}$: the initiation set, the states from which this option can be started
$\pi_o : \mathcal{S} \times \mathcal{A} \to [0,1]$: the option’s internal policy
$\beta_o : \mathcal{S} \to [0,1]$: the termination condition, the probability of ending the option in each state

When the agent selects option $o$ in state $s$, it executes $\pi_o$ until termination (sampled from $\beta_o$), then selects a new option. This creates a two-level loop: a high-level policy selects options, and each low-level policy $\pi_o$ handles the primitive steps.
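The execution loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the corridor environment, the `run_option` helper, and the `go_to_door` option are all hypothetical names introduced here, and the option policy is deterministic for brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation_set: Callable[[int], bool]   # I_o as a membership test
    policy: Callable[[int], int]            # pi_o, deterministic in this sketch
    termination: Callable[[int], float]     # beta_o, probability of stopping

def run_option(option, state, env_step, rng, max_steps=100):
    """Execute one option until beta_o fires; return (final_state, rewards)."""
    if not option.initiation_set(state):
        raise ValueError("state is outside the option's initiation set")
    rewards = []
    for _ in range(max_steps):
        action = option.policy(state)
        state, reward = env_step(state, action)
        rewards.append(reward)
        # Sample the termination decision from beta_o at the new state
        if rng.random() < option.termination(state):
            break
    return state, rewards

# Hypothetical toy environment: states 0..9 on a line, actions are -1 or +1,
# and every step costs a reward of -1.
def corridor_step(state, action):
    next_state = min(9, max(0, state + action))
    return next_state, -1.0

go_to_door = Option(
    initiation_set=lambda s: s < 5,
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)

final_state, rewards = run_option(go_to_door, 2, corridor_step, random.Random(0))
print(final_state, len(rewards))
```

The high-level policy only sees the start state, the final state, and the rewards collected in between, which is what makes the option behave like a single extended action.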

Historical & Theoretical Context

The Options Framework was formalized by Sutton, Precup, and Singh in 1999 in their landmark paper “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning.” But the intuition traces further back.

Semi-MDPs (SMDPs) from the 1970s modeled systems where actions take variable amounts of time, providing the mathematical skeleton that options build upon.
MAXQ (Dietterich, 2000) and HAM (Hierarchical Abstract Machines, Parr & Russell, 1997) offered competing hierarchical RL frameworks around the same era.
In cognitive science, chunking theory (Newell & Simon, 1972) described how humans compress sequences of actions into single cognitive units, a cognitive analogue of options.

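The key theoretical result is that an MDP combined with a fixed set of options forms a semi-MDP, so SMDP planning and learning methods apply to it directly. The framework is also a conservative extension of standard RL: a primitive action $a$ is just an option whose policy always picks $a$ and whose termination condition is $\beta_o(s) = 1$ everywhere.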
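To make the "primitive actions are options" point concrete, here is a small sketch that wraps each primitive action as a one-step option. The `primitive_as_option` helper is a name introduced for illustration, and the sketch assumes every action is legal in every state.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation_set: Callable[[int], bool]
    policy: Callable[[int], int]
    termination: Callable[[int], float]

def primitive_as_option(action: int) -> Option:
    """Wrap a primitive action as an option that always stops after one step."""
    return Option(
        initiation_set=lambda s: True,   # assumption: action is legal everywhere
        policy=lambda s: action,         # always choose the wrapped action
        termination=lambda s: 1.0,       # beta_o(s) = 1 in every state
    )

# Four primitive actions become four one-step options
options = [primitive_as_option(a) for a in range(4)]
```

Mixing these with longer options in one option set lets the agent fall back to single steps whenever no extended behavior fits.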

Algorithms & Math

Semi-MDP Bellman Equations

When options are in play, the standard Bellman equation generalizes. The optimal value of starting option $o$ in state $s$, written $Q(s, o)$, satisfies:

$$Q(s, o) = \mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t r_{t+1} + \gamma^\tau \max_{o'} Q(s_\tau, o') \;\middle|\; s_0 = s, o_0 = o\right]$$

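where $\tau$ is the (random) number of steps until $o$ terminates and $s_\tau$ is the state in which it does. This is the SMDP Bellman optimality equation. SMDP Q-learning uses its right-hand side as a target and applies one update each time an option finishes.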
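As a sample-based counterpart of the equation above, the following sketch applies one update after an option has run to termination (commonly called SMDP Q-learning). The function name and the toy numbers are illustrative assumptions.

```python
import numpy as np

def smdp_q_update(Q, s, o, rewards, s_end, alpha=0.1, gamma=0.99):
    """One update after option o ran from state s and terminated in s_end.

    rewards holds r_1, ..., r_tau collected while the option was executing.
    """
    tau = len(rewards)
    # Discounted reward accumulated inside the option: sum_t gamma^t * r_{t+1}
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))
    # Bootstrap from the best option at the termination state, discounted by gamma^tau
    target = discounted + gamma**tau * np.max(Q[s_end])
    Q[s, o] += alpha * (target - Q[s, o])
    return target

Q = np.zeros((10, 2))
Q[5] = [1.0, 3.0]
target = smdp_q_update(Q, s=2, o=0, rewards=[-1.0, -1.0, -1.0], s_end=5,
                       alpha=0.5, gamma=0.9)
```

Note the $\gamma^\tau$ factor: longer options discount the bootstrap value more heavily, which is the only place the option's duration enters the update.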

The TD update for intra-option learning at each primitive step becomes:

$$Q(s, o) \leftarrow Q(s, o) + \alpha\left[r + \gamma\left[(1 - \beta_o(s'))\,Q(s', o) + \beta_o(s')\,\max_{o'} Q(s', o')\right] - Q(s, o)\right]$$

When an option doesn’t terminate ($\beta_o(s') \approx 0$), the agent bootstraps off its own $Q(s', o)$ value, continuing the commitment. When it terminates, it re-evaluates globally.
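The intra-option update can be written as a small function. This is a sketch under the assumption of a tabular `Q` array indexed by `(state, option)`; the function name and the example values are made up for illustration.

```python
import numpy as np

def intra_option_update(Q, s, o, r, s_next, beta_next, alpha=0.1, gamma=0.99):
    """Apply the intra-option Q-learning update for one primitive step."""
    continue_value = Q[s_next, o]        # value of staying with option o
    switch_value = np.max(Q[s_next])     # value of re-selecting at s_next
    # Expected value at s_next, weighted by the termination probability
    u = (1.0 - beta_next) * continue_value + beta_next * switch_value
    Q[s, o] += alpha * (r + gamma * u - Q[s, o])
    return u

Q = np.zeros((3, 2))
Q[1] = [2.0, 4.0]
u = intra_option_update(Q, s=0, o=0, r=1.0, s_next=1, beta_next=0.25,
                        alpha=0.5, gamma=0.9)
```

With `beta_next=0.25` the bootstrap value is three quarters "keep going" and one quarter "pick again", so the update happens every step instead of waiting for the option to end.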

Subgoal Discovery via Bottleneck States

How do you find good options automatically? One classic approach identifies bottleneck states (states that frequently appear on successful trajectories between regions of the state space) and creates option subgoals for them.

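import networkx as nx

def discover_bottleneck_options(trajectories, train_policy_to_reach, threshold=0.2):
    # trajectories: list of state sequences; train_policy_to_reach: caller-supplied
    # function that returns a policy for a given subgoal; threshold: tunable cutoff.

    # Build a graph of observed state transitions
    # (undirected, which assumes moves are reversible)
    G = nx.Graph()
    for trajectory in trajectories:
        G.add_edges_from(zip(trajectory, trajectory[1:]))

    # Find states with high betweenness centrality
    centrality = nx.betweenness_centrality(G)
    bottlenecks = [s for s, c in centrality.items() if c > threshold]

    # For each bottleneck, create an option
    options = []
    for subgoal in bottlenecks:
        option = {
            "initiation_set": set(G.nodes),  # startable from any visited state
            "policy": train_policy_to_reach(subgoal),
            # g=subgoal binds the current subgoal; a bare closure would see only the last one
            "termination": lambda s, g=subgoal: float(s == g),
        }
        options.append(option)

    return options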
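To see what betweenness centrality picks out, here is a small runnable sketch on a hypothetical two-room layout: two 3×3 rooms joined by one doorway cell. The graph is built directly from the layout rather than from trajectories, and `two_room_graph` and `top_bottlenecks` are names introduced for this example.

```python
import networkx as nx

def two_room_graph(size=3):
    """Two size x size rooms joined by a single doorway cell."""
    left = nx.grid_2d_graph(size, size)
    # Shift the right room's columns so the two rooms do not share labels
    right = nx.relabel_nodes(
        nx.grid_2d_graph(size, size), lambda n: (n[0], n[1] + size + 1)
    )
    G = nx.compose(left, right)
    door = (size // 2, size)
    G.add_edge((size // 2, size - 1), door)   # left room -> doorway
    G.add_edge(door, (size // 2, size + 1))   # doorway -> right room
    return G, door

def top_bottlenecks(G, top_k=3):
    """Return the top_k states ranked by betweenness centrality."""
    centrality = nx.betweenness_centrality(G)
    return sorted(centrality, key=centrality.get, reverse=True)[:top_k]

G, door = two_room_graph()
bottlenecks = top_bottlenecks(G)
print(door in bottlenecks)
```

On this layout the doorway and the two cells flanking it come out on top, since every path between rooms must cross all three. In practice that means a threshold or top-k cutoff tends to return small clusters around a bottleneck rather than a single state.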

Design Patterns & Architectures

The Options Framework slots naturally into several agent architectures:

Goal-conditioned hierarchies (HAC, HIRO): Separate policies operate at different timescales. In HIRO, the high-level policy proposes a subgoal every $k$ steps and the low-level policy is rewarded for reaching it. HAC (Hierarchical Actor-Critic) stacks goal-conditioned levels in a similar way and relabels goals in hindsight. Neither needs hand-designed options.

Option-Critic Architecture: A single end-to-end network learns options and the over-option policy simultaneously. It contains:

A set of $n$ option policies $\{\pi_{o_1}, \ldots, \pi_{o_n}\}$
Termination heads $\{\beta_{o_1}, \ldots, \beta_{o_n}\}$
An over-option policy $\pi_\Omega(o|s)$

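The option policies and termination heads are trained jointly by gradient ascent on expected return, using the intra-option policy gradient and termination gradient theorems, while a critic supplies the option values. No subgoals are specified by hand.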
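As one concrete piece of this architecture, the sketch below shows a tabular version of the termination update from Bacon et al. (2017): the termination logit moves against the advantage of staying with the current option. It assumes sigmoid-parameterized terminations and a greedy over-option policy (so the state value is the maximum option value); the array shapes and numbers are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_gradient_step(theta_beta, Q_omega, s_next, o, lr=0.5):
    """Update the termination logit of option o at s_next; return the new beta."""
    beta = sigmoid(theta_beta[o, s_next])
    # Advantage of continuing with o versus re-selecting greedily (assumed V = max Q)
    advantage = Q_omega[s_next, o] - np.max(Q_omega[s_next])
    # d(beta)/d(theta) for a sigmoid is beta * (1 - beta)
    theta_beta[o, s_next] -= lr * beta * (1.0 - beta) * advantage
    return sigmoid(theta_beta[o, s_next])

theta_beta = np.zeros((2, 5))            # termination logits, (option, state)
Q_omega = np.tile([1.0, 3.0], (5, 1))    # critic's option values, (state, option)

new_beta_weak = termination_gradient_step(theta_beta, Q_omega, s_next=2, o=0)
new_beta_best = termination_gradient_step(theta_beta, Q_omega, s_next=2, o=1)
```

The weaker option becomes more likely to terminate while the best option's termination probability is left unchanged. With a greedy state value the advantage is never positive, so terminations can only grow, which is one reason practical implementations add a regularizer that discourages overly short options.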

graph TD
    S["State s_t"] --> OPT["Over-option Policy π_Ω"]
    OPT -->|select option o| LP["Low-level Policy π_o"]
    LP -->|primitive action a| ENV["Environment"]
    ENV -->|next state| TERM["Termination β_o(s')"]
    TERM -->|"terminate (prob. β)"| OPT
    TERM -->|"continue (prob. 1−β)"| LP
    ENV -->|reward r| CRITIC["Critic / Value Network"]
    CRITIC --> OPT
    CRITIC --> LP

Connection to other patterns:

The Planner-Executor loop is essentially a two-level options hierarchy
Behavior Trees encode options as subtrees with their own preconditions and postconditions
GOAP (Goal-Oriented Action Planning) assembles action sequences on demand via backward chaining, in effect composing option-like behaviors at plan time

Practical Application

Here’s a minimal Options Framework implementation with a hierarchical agent that can navigate a grid world with reusable “go-to-room” options:

import numpy as np
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Option:
    name: str
    initiation_set: Callable[[int], bool]   # may the option start in this state?
    policy: Callable[[int], int]            # state -> primitive action
    termination: Callable[[int], float]     # state -> probability of terminating

class HierarchicalAgent:
    def __init__(self, options: list[Option], n_actions: int, n_states: int):
        self.options = options
        self.n_actions = n_actions
        # Q-values over (state, option) pairs
        self.Q_options = np.zeros((n_states, len(options)))
        self.alpha = 0.1
        self.gamma = 0.99
        self.current_option: Optional[int] = None

    def available_options(self, state: int) -> list[int]:
        # Assumes at least one option can start in every reachable state
        return [i for i, o in enumerate(self.options) if o.initiation_set(state)]

    def select_option(self, state: int, epsilon: float = 0.1) -> int:
        available = self.available_options(state)
        if np.random.random() < epsilon:
            return int(np.random.choice(available))
        return available[int(np.argmax(self.Q_options[state, available]))]

    def begin(self, state: int) -> int:
        # Call once at the start of an episode, before the first step()
        self.current_option = self.select_option(state)
        return self.options[self.current_option].policy(state)

    def step(self, state: int, next_state: int, reward: float) -> int:
        if self.current_option is None:
            raise RuntimeError("call begin() before step()")
        o_idx = self.current_option
        beta = self.options[o_idx].termination(next_state)

        # Intra-option Q update: weight "continue" and "switch" values by beta
        available = self.available_options(next_state)
        switch_value = np.max(self.Q_options[next_state, available])
        continue_value = self.Q_options[next_state, o_idx]
        td_target = reward + self.gamma * (
            (1.0 - beta) * continue_value + beta * switch_value
        )
        self.Q_options[state, o_idx] += self.alpha * (
            td_target - self.Q_options[state, o_idx]
        )

        # Sample termination; on termination, hand control to a new option
        if np.random.random() < beta:
            self.current_option = self.select_option(next_state)

        # Act with whichever option is now in control
        return self.options[self.current_option].policy(next_state)


def navigate_toward(state: int, goal: int) -> int:
    # Placeholder helper (an assumption, replace with real navigation):
    # treats states as points on a line, action 0 = move down, 1 = move up
    return 1 if state < goal else 0


# Example: two "go to hallway" options for a two-room grid
hallway_left = Option(
    name="go_to_left_hallway",
    initiation_set=lambda s: s < 50,          # left room
    policy=lambda s: navigate_toward(s, goal=25),
    termination=lambda s: float(s == 25),     # reached hallway
)

hallway_right = Option(
    name="go_to_right_hallway",
    initiation_set=lambda s: s >= 50,         # right room
    policy=lambda s: navigate_toward(s, goal=75),
    termination=lambda s: float(s == 75),
)

In LangGraph, options translate naturally to reusable subgraphs. A node selects which subgraph to enter (option selection), and each subgraph has its own termination edge:

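from langgraph.graph import StateGraph, START, END

def build_option_graph(options: list):
    # AgentState, option_selector, route_to_option, and the per-option
    # attributes (name, subgraph, should_terminate) are placeholders that
    # you define for your own agent.
    graph = StateGraph(AgentState)

    # High-level option selector
    graph.add_node("select_option", option_selector)
    graph.add_edge(START, "select_option")

    for opt in options:
        # Each option is a compiled subgraph
        graph.add_node(opt.name, opt.subgraph)
        # Termination routes back to selector
        graph.add_conditional_edges(
            opt.name,
            opt.should_terminate,
            {"terminate": "select_option", "continue": opt.name}
        )

    # Route from the selector to exactly one option, or finish. A plain
    # add_edge to every option would run all of them instead of choosing one.
    graph.add_conditional_edges(
        "select_option",
        route_to_option,  # returns an option name, or "done" to stop
        {**{opt.name: opt.name for opt in options}, "done": END}
    )

    return graph.compile()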

Latest Developments & Research

Option-Critic with Interest Functions (Khetarpal et al., 2020) added interest functions $f_o(s) \in [0,1]$, a learnable generalization of initiation sets that weights how strongly each option is preferred in each state, improving diversity and specialization.
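A rough sketch of how an interest function can reshape option selection: the over-option probabilities are multiplied by each option's interest in the current state and renormalized. The function name and the numbers are illustrative assumptions, not code from the paper.

```python
import numpy as np

def interest_weighted_policy(pi_omega, interest):
    """Reweight over-option probabilities by per-option interest and renormalize."""
    weighted = np.asarray(pi_omega, dtype=float) * np.asarray(interest, dtype=float)
    total = weighted.sum()
    if total == 0.0:
        raise ValueError("no option has positive interest in this state")
    return weighted / total

# Three options: the second has zero interest here, the third only partial interest
probs = interest_weighted_policy([0.5, 0.3, 0.2], [1.0, 0.0, 0.5])
print(probs)
```

An interest of exactly 0 or 1 for every option recovers a hard initiation set, which is why interest functions are described as a soft version of $I_o$.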

LOVE (Learning Options via Compression; Jiang et al., 2022) learns options from offline trajectories with a compression objective that favors a small set of reusable skills, and reports faster learning on new tasks that use them, a practical win for multi-task agents.

Skill Discovery in Large Language Models (2023–2024): Several papers have adapted the options idea to LLM-based agents. Instead of neural policies, options are reusable prompt templates, tool chains, or code snippets that activate under certain context conditions. Voyager (Wang et al., 2023) builds a growing library of reusable code skills that mirrors option reuse, while Skill-It (Chen et al., 2023) studies how language models acquire skills in a useful order from training data.

HIRO in robotics (Nachum et al., 2018, with recent extensions): HIRO-style goal conditioning is a widely used baseline in hierarchical RL for continuous control and robotics, and newer hierarchical variants incorporate diffusion policies at the low level.

Open problems:

Principled termination learning without reward sparsity collapse
Composing options in partially observable settings (POMDPs + options)
Verifiably safe options with formal guarantees

Cross-Disciplinary Insight

The Options Framework has a close counterpart in cognitive science. Chunking theory (Newell & Simon, 1972) proposes that human expertise is built by compiling frequently used action sequences into single cognitive chunks, stored in long-term memory and triggered as units. A chess grandmaster thinks in attacks, defenses, and endgame patterns, not individual piece moves. Options play the same role for an agent: chunks expressed as policies with initiation and termination conditions.

In motor-control neuroscience, a similar hierarchy is often proposed: the prefrontal cortex selects behavioral programs while the basal ganglia and cerebellum help execute them. The intra-option Q-learning update resembles the temporal-difference account of reward prediction errors in dopaminergic neurons.

In software engineering, this maps to the Strategy Pattern: a context object selects a strategy at runtime based on state, delegates execution to it, and switches strategies when conditions change.

Daily Challenge

Exercise: Option Discovery on a Maze

Take a simple 10×10 grid maze (you can hardcode walls or use gym-maze). Run a flat Q-learning agent to collect successful trajectories. Then:

Build a state-transition graph from the trajectories
Compute betweenness centrality for each state (use networkx.betweenness_centrality)
Mark the top-5 high-centrality states as subgoals
Train one option per subgoal: a policy that navigates from any state to that subgoal
Run a hierarchical agent that uses these options instead of primitive moves

Compare the learning curves of flat Q-learning vs. the hierarchical agent on a longer maze variant. How many fewer environment steps does the option-based agent need to find a good policy?

Bonus challenge: Replace the manual betweenness centrality step with a learned termination function using Option-Critic. Does it discover the same bottleneck states?

References & Further Reading

Foundational Papers

Sutton, Precup & Singh (1999) — “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning” — the original Options paper
Dietterich (2000) — “Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition”
Bacon, Harb & Precup (2017) — “The Option-Critic Architecture” — end-to-end option learning

Hierarchical RL & Applications

Nachum et al. (2018) — “Data-Efficient Hierarchical Reinforcement Learning (HIRO)”
Levy et al. (2019) — “Learning Multi-Level Hierarchies with Hindsight (HAC)”
Khetarpal et al. (2020) — “Options of Interest: Temporal Abstraction with Interest Functions”

LLM & Skill Libraries

Chen et al. (2023) — “SKILL-IT! A Data-Driven Skills Framework for Understanding and Training Language Models”
Wang et al. (2023) — “Voyager: An Open-Ended Embodied Agent with Large Language Models” — Minecraft agent that builds a skill library as reusable JS code

Implementations

stable-baselines3: includes off-policy algorithms (TD3, DDPG, SAC) that can serve as building blocks for HIRO-style agents
RL Baselines3 Zoo: https://github.com/DLR-RM/rl-baselines3-zoo
Taxi / Four Rooms: classic option benchmark environments (Taxi ships with OpenAI Gym / Gymnasium; a Four Rooms variant is available in MiniGrid)
