Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Reward Machines and Automata-Based Task Specification for AI Agents

06 Mar 2026

Reinforcement learning agents learn by chasing rewards. But what happens when the reward depends on history, not just the current state? “First pick up the key, then open the door, then reach the exit” — this kind of structured, sequential task breaks the Markov assumption that most RL algorithms rely on. Reward machines solve this elegantly by encoding task structure as a finite-state automaton that emits rewards as the agent progresses.

1. Concept Introduction

Simple Explanation

Imagine teaching an agent to make a cup of tea. The full reward only comes at the end — a perfect cup of tea. But the agent needs intermediate guidance: boil water (+1), add teabag (+1), steep for 3 minutes (+1), then pour (+1). The tricky part: these steps must happen in order. Doing them in the wrong order earns no reward.

A reward machine is a finite-state machine (FSM) where:

  - states track how far the agent has progressed through the task,
  - transitions fire when the agent's behavior makes a task-relevant event true (e.g., "water boiled"), and
  - rewards are attached to transitions, so steps taken in the right order are rewarded and out-of-order steps are not.

Technical Detail

Formally, a reward machine is a tuple $\mathcal{R} = (U, u_0, F, \delta_u, \delta_r)$ where:

  - $U$ is a finite set of machine states, with initial state $u_0 \in U$ and terminal states $F \subseteq U$,
  - $\delta_u : U \times 2^\mathcal{P} \to U$ is the state-transition function over propositional labels drawn from a vocabulary $\mathcal{P}$, and
  - $\delta_r : U \times 2^\mathcal{P} \to \mathbb{R}$ is the reward-emission function.

At each environment step, the agent observes state $s$, takes action $a$, and receives a truth assignment $\ell \in 2^\mathcal{P}$ (e.g., “key-picked-up = true”). The machine transitions:

$$u_{t+1} = \delta_u(u_t, \ell_t), \quad r_t = \delta_r(u_t, \ell_t)$$

The full agent state becomes the augmented pair $(s_t, u_t)$, which is Markovian — even though the original reward was not.
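To see why the augmentation works, here is a minimal sketch of the key/door task (the dict encoding and state names are illustrative): the reward for "door" depends on whether "key" was seen earlier, and that piece of history is exactly what the machine state $u$ carries.

```python
# Machine transition and reward tables for a two-step task (illustrative).
delta_u = {("u0", "key"): "u1", ("u1", "door"): "u2"}   # (u, label) -> next u
delta_r = {("u0", "key"): 1.0, ("u1", "door"): 1.0}     # (u, label) -> reward

def machine_step(u, label):
    """Pure function of (u, label): once u is part of the agent's state,
    the reward no longer depends on history."""
    return delta_u.get((u, label), u), delta_r.get((u, label), 0.0)

u = "u0"
for label in ["door", "key", "door"]:   # opening the door before the key earns nothing
    u, r = machine_step(u, label)
    print(label, "->", u, r)
```

The first "door" event earns no reward because the machine is still in `u0`; after "key" fires, the same event is rewarded.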

2. Historical & Theoretical Context

The idea of using automata to specify rewards emerged from the temporal logic and formal verification community. In the 1990s, researchers explored Linear Temporal Logic (LTL) for expressing robot task specifications. However, translating LTL formulae into practical reward signals for learning agents was cumbersome.

The breakthrough came with Icarte et al. (2018) — “Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning” (ICML 2018). They showed that:

  1. Reward machines can compactly represent non-Markovian reward functions
  2. The augmented MDP $(s, u)$ restores the Markov property
  3. Counterfactual reasoning (learning from what would have happened in other machine states) dramatically accelerates learning

The connection to automata-theoretic model checking (Vardi & Wolper, 1986) runs deep: reward machines are essentially deterministic finite automata (DFAs) over propositional alphabets, and finite-trace LTL specifications can be compiled into exactly this kind of automaton.

3. Algorithms & Math

Q-Learning for Reward Machines (QRM)

The standard approach augments Q-learning with the machine state:

$$Q(s, u, a) \leftarrow Q(s, u, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', u', a') - Q(s, u, a) \right]$$

where $u' = \delta_u(u, \ell)$ and $r = \delta_r(u, \ell)$.

The key acceleration: counterfactual experience sharing. After each transition $(s, a, \ell, s')$, the agent updates all machine states simultaneously:

for each machine state u in U:
    u' = delta_u(u, l)           # where would we be if we were in state u?
    r  = delta_r(u, l)
    update Q(s, u, a) using (r, s', u')

This effectively runs $|U|$ parallel Q-learners that share environment transitions — a massive sample efficiency gain.

Reward Machine + HRL

Reward machines compose naturally with hierarchical RL. Each machine state $u$ defines a subgoal: learn a policy $\pi_u$ that achieves the propositional condition on the next edge. The machine state serves as an automatic subgoal curriculum:

$$\pi_u^*(s) = \text{policy optimized to trigger a transition out of machine state } u \text{, starting from environment state } s$$

The overall policy decomposes as: follow $\pi_u$ until a transition fires, then switch to $\pi_{u'}$.
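A minimal sketch of that switching loop, assuming a dict of per-machine-state policies `policies[u]` and an environment whose `step` returns `(next_state, label)` (both hypothetical interfaces, not from the original):

```python
def run_hierarchical(env, policies, delta_u, u0, terminal, max_steps=100):
    """Follow pi_u until an edge fires, then switch to pi_{u'}."""
    s = env.reset()
    u = u0
    for _ in range(max_steps):
        a = policies[u](s)                 # low-level policy for the current subgoal
        s, label = env.step(a)
        u = delta_u.get((u, label), u)     # a fired edge switches subgoal policies
        if u in terminal:
            break
    return u
```

The machine never chooses actions itself; it only decides which low-level policy is in charge at each moment.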

Pseudocode

Initialize: Q(s, u, a) = 0 for all s, u, a
            current_machine_state u = u_0

for each episode:
    s = env.reset()
    u = u_0
    while not done:
        a = epsilon_greedy(Q(s, u, *))
        s', l = env.step(a)          # l = propositional labels observed

        # Update ALL machine states (counterfactual)
        for each u_cf in U:
            u_cf_next = delta_u(u_cf, l)
            r_cf      = delta_r(u_cf, l)
            Q(s, u_cf, a) += alpha * (r_cf + gamma * max_a' Q(s', u_cf_next, a') - Q(s, u_cf, a))

        u = delta_u(u, l)            # actual machine transition
        s = s'

4. Design Patterns & Architectures

Reward machines slot naturally into several agent architecture patterns:

Planner-Executor Pattern: The reward machine is the plan. The current machine state tells the executor which subgoal to pursue. When a subgoal is achieved (edge condition fires), the planner advances the machine.

Event-Driven Architecture: Environment events (propositions becoming true) drive machine transitions. This maps cleanly onto message-passing systems where tools or sensors publish events.

Compositional Task Design: Complex tasks are built from simpler machines using automata operations (product, union, concatenation). A “coffee-making” machine can be composed from a “water-boiling” machine and a “cup-preparation” machine.

stateDiagram-v2
    [*] --> u0
    u0 --> u1: pick_up_key / r=+1
    u1 --> u2: open_door / r=+1
    u2 --> u3: reach_exit / r=+10
    u0 --> u0: other / r=0
    u1 --> u1: other / r=0
    u2 --> u2: other / r=0
    u3 --> [*]
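The product construction mentioned above can be sketched directly. Assuming each component machine is represented by its two transition tables (this dict encoding is illustrative, not from the original), the product runs all components synchronously on the same label stream and sums their rewards:

```python
def product_step(machines, ustates, label):
    """One synchronous step of the product machine.

    machines: list of (delta_u, delta_r) dict pairs, one per component machine.
    ustates:  tuple of current component states.
    Every component reads the same label; unmatched (u, label) pairs stay put.
    """
    next_states, total_r = [], 0.0
    for (delta_u, delta_r), u in zip(machines, ustates):
        next_states.append(delta_u.get((u, label), u))
        total_r += delta_r.get((u, label), 0.0)
    return tuple(next_states), total_r

# Compose a "water-boiling" machine and a "cup-preparation" machine.
boil = ({("b0", "boil"): "b1"}, {("b0", "boil"): 1.0})
cup = ({("c0", "cup"): "c1"}, {("c0", "cup"): 1.0})
state, _ = product_step([boil, cup], ("b0", "c0"), "boil")
```

The product state is the tuple of component states, so the learning algorithm needs no changes: it just sees a larger machine.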
  

5. Practical Application

Here’s a minimal reward machine implementation and integration with a tabular RL agent:

from dataclasses import dataclass
from typing import Dict, Set, Tuple
import numpy as np
from collections import defaultdict

@dataclass
class RewardMachine:
    """A finite-state automaton that specifies task structure and emits rewards."""
    states: Set[str]
    initial_state: str
    terminal_states: Set[str]
    # transitions[(u, frozenset(props))] -> (next_u, reward)
    transitions: Dict[Tuple[str, frozenset], Tuple[str, float]]

    def step(self, state: str, labels: Set[str]) -> Tuple[str, float]:
        key = (state, frozenset(labels))
        if key in self.transitions:
            return self.transitions[key]
        # Default: stay, zero reward
        return state, 0.0

    def is_terminal(self, state: str) -> bool:
        return state in self.terminal_states


def build_key_door_machine() -> RewardMachine:
    """
    Task: pick up key, open door, reach exit (in order).
    Labels: 'key', 'door', 'exit'
    """
    return RewardMachine(
        states={"u0", "u1", "u2", "u3"},
        initial_state="u0",
        terminal_states={"u3"},
        transitions={
            ("u0", frozenset({"key"})): ("u1", 1.0),
            ("u1", frozenset({"door"})): ("u2", 1.0),
            ("u2", frozenset({"exit"})): ("u3", 10.0),
        },
    )


class QRMAgent:
    """Q-learning for Reward Machines with counterfactual updates."""

    def __init__(self, rm: RewardMachine, n_states: int, n_actions: int,
                 alpha: float = 0.1, gamma: float = 0.99, epsilon: float = 0.1):
        self.rm = rm
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        # Q[(env_state, machine_state, action)]
        self.Q = defaultdict(float)
        self.machine_states = list(rm.states)
        self.n_actions = n_actions

    def act(self, env_state: int, machine_state: str) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        q_values = [self.Q[(env_state, machine_state, a)] for a in range(self.n_actions)]
        return int(np.argmax(q_values))

    def update(self, s: int, a: int, labels: Set[str], s_next: int):
        """Counterfactual update: one environment transition trains every machine state."""
        for u in self.machine_states:
            u_next, r = self.rm.step(u, labels)
            if self.rm.is_terminal(u_next):
                td_target = r                      # no bootstrapping past terminal states
            else:
                best_next = max(self.Q[(s_next, u_next, a2)] for a2 in range(self.n_actions))
                td_target = r + self.gamma * best_next
            self.Q[(s, u, a)] += self.alpha * (td_target - self.Q[(s, u, a)])


# --- Usage sketch with a grid-world environment ---
def run_episode(env, agent: QRMAgent, rm: RewardMachine) -> float:
    s = env.reset()
    u = rm.initial_state
    total_reward = 0.0

    for _ in range(200):
        a = agent.act(s, u)
        s_next, labels = env.step(a)    # env returns propositional labels
        agent.update(s, a, labels, s_next)

        u_next, r = rm.step(u, labels)
        total_reward += r
        s, u = s_next, u_next

        if rm.is_terminal(u):
            break

    return total_reward

This pattern integrates cleanly with LangGraph: the reward machine state maps to the graph’s current node, and edge conditions correspond to tool-call outcomes or LLM-extracted propositions.

6. Comparisons & Tradeoffs

| Approach | Handles non-Markovian rewards | Sample efficiency | Interpretability | Scalability |
|---|---|---|---|---|
| Standard RL | No (needs history) | Baseline | Low | High |
| Reward Machines (QRM) | Yes | High (counterfactual) | High | Medium |
| LSTM / recurrent policy | Yes (implicit) | Low | Very low | High |
| LTL + shield | Yes | Low | High | Low |
| HRL with manual subgoals | Partial | Medium | Medium | Medium |

Strengths: Reward machines are fully interpretable — you can read off the task structure at a glance. Counterfactual updates give dramatic sample efficiency gains (often 5–10x over vanilla Q-learning on structured tasks). They compose: product machines let you combine constraints.

Limitations: Someone must define the propositional vocabulary $\mathcal{P}$ and label each environment step. This requires domain knowledge. Automatically learning the reward machine from demonstrations is an active research problem. The number of machine states can grow exponentially with task complexity (analogous to state explosion in model checking).
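To make the labeling burden concrete, here is a hypothetical labeling function for a small grid world (the positions and proposition names are invented for illustration): it maps the raw environment state to the set of propositions that currently hold.

```python
# Hypothetical grid-world layout; in practice these come from domain knowledge.
KEY_POS, DOOR_POS, EXIT_POS = (1, 1), (2, 3), (4, 4)

def labeling_function(agent_pos, has_key):
    """Map raw environment state to the truth assignment l in 2^P."""
    labels = set()
    if agent_pos == KEY_POS:
        labels.add("key")
    if agent_pos == DOOR_POS and has_key:
        labels.add("door")      # the door event only fires if the key is held
    if agent_pos == EXIT_POS:
        labels.add("exit")
    return labels
```

Writing this function is where the domain knowledge lives; get it wrong and the machine advances on events that never actually happened.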

7. Latest Developments & Research

Learning reward machines from data (2019–2023): Several papers tackle the inverse problem — inferring the reward machine structure from agent trajectories. Icarte et al. (2019) showed this can be framed as a SAT problem. More recently, Furelos-Blanco et al. (2021, 2023) use inductive logic programming (ILP) to learn machines from minimal examples.

Reward machines + LLMs (2023–2025): A natural synthesis: use an LLM to generate the reward machine from a natural language task description, then use QRM for efficient learning. The LLM handles the symbolic abstraction; the RM handles the credit assignment. Papers from NeurIPS 2024 demonstrated this pipeline on household task benchmarks.

Probabilistic reward machines: Extensions handle noisy or stochastic propositional labels, important for real-world sensor uncertainty. The machine transitions become distributions rather than deterministic mappings.
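One way to sketch such a probabilistic transition (the probabilities and dict encoding here are illustrative, not taken from any specific paper): each (state, label) pair maps to a distribution over next machine states.

```python
import random

# (u, label) -> list of (next_state, probability); illustrative numbers.
p_trans = {("u0", "key"): [("u1", 0.9), ("u0", 0.1)]}   # label detector is 90% reliable

def stochastic_step(u, label, rng=random):
    """Sample the next machine state; unknown (u, label) pairs stay put."""
    outcomes = p_trans.get((u, label), [(u, 1.0)])
    states, probs = zip(*outcomes)
    return rng.choices(states, weights=probs, k=1)[0]
```

The learning algorithm then has to track a belief over machine states rather than a single known state, which is what makes this extension nontrivial.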

Temporal logic RL (TLTL, LTL2Action): Projects like LTL2Action (Vaezipoor et al., 2021) integrate LTL specifications directly into training loops, with reward-machine-style automata as the compilation target.

Open problems: Can we automatically discover the right propositional vocabulary? How do reward machines scale to continuous-state robotics tasks where propositions must be extracted from raw perception?

8. Cross-Disciplinary Insight

Reward machines are formally identical to Mealy machines from automata theory, and to process algebra specifications from concurrent systems design. The connection to workflow engines in software engineering is striking — BPMN (Business Process Model and Notation) diagrams, state machines in XState, and saga patterns in distributed systems all encode the same intuition: complex processes are sequences of stages, and completion of each stage triggers the next.

In neuroscience, the basal ganglia and prefrontal cortex are thought to implement something like a hierarchical state machine for goal-directed behavior. The machine state $u$ parallels what neuroscientists call the “task set” — a configuration of rules and goals currently active in working memory.

The connection to event sourcing in software architecture is direct: the propositional label stream $(\ell_0, \ell_1, \ell_2, \ldots)$ is an event log, and the reward machine is a stateful projection over that log.
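That projection view can be written down directly: replaying the label log through the transition function rebuilds the machine state (the transition table below reuses the key-door example from earlier).

```python
from functools import reduce

delta_u = {("u0", "key"): "u1", ("u1", "door"): "u2", ("u2", "exit"): "u3"}

def project(state, event):
    """Fold one logged event into the machine state: a pure, replayable projection."""
    return delta_u.get((state, event), state)

event_log = ["key", "door", "exit"]
final_state = reduce(project, event_log, "u0")   # replaying the log rebuilds u
```

Because `project` is pure, any prefix of the log can be replayed to recover the machine state at that point in time, exactly as in event-sourced systems.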

9. Daily Challenge

Build a coffee-machine reward machine.

Model the following task for a kitchen robot:

  1. Fill kettle with water
  2. Boil kettle
  3. Place coffee in cup
  4. Pour hot water into cup
  5. Wait 3 minutes (model as 3 consecutive “wait” steps)
  6. Remove grounds
  7. Optionally add milk (if “milk_requested” label is true)

Questions to answer:

# Starter: implement the machine and a simulate_episode function
# that returns total_reward and episode_length

def build_coffee_machine() -> RewardMachine:
    # Your implementation here
    pass

10. References & Further Reading

Foundational Papers

  - Toro Icarte, R., Klassen, T. Q., Valenzano, R., McIlraith, S. A. (2018). "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning." ICML 2018.
  - Toro Icarte, R., Waldie, E., Klassen, T. Q., Valenzano, R., Castro, M. P., McIlraith, S. A. (2019). "Learning Reward Machines for Partially Observable Reinforcement Learning." NeurIPS 2019.
  - Furelos-Blanco, D., Law, M., Jonsson, A., Broda, K., Russo, A. (2021). "Induction and Exploitation of Subgoal Automata for Reinforcement Learning." JAIR 70.

Temporal Logic & Formal Methods

  - Vardi, M. Y., Wolper, P. (1986). "An Automata-Theoretic Approach to Automatic Program Verification." LICS 1986.
  - Vaezipoor, P., Li, A. C., Toro Icarte, R., McIlraith, S. A. (2021). "LTL2Action: Generalizing LTL Instructions for Multi-Task RL." ICML 2021.


Key Takeaways

  1. Non-Markovian rewards are the norm in real tasks: order matters, history matters. Reward machines restore the Markov property by augmenting the state space.
  2. Counterfactual updates make QRM dramatically more sample-efficient than vanilla RL — a single trajectory trains $|U|$ Q-functions simultaneously.
  3. Interpretability is a first-class citizen: the machine diagram is the task specification. Debugging why an agent failed means reading the machine state trace.
  4. Composition is free: product automata let you combine safety constraints, progress objectives, and optional side-tasks without rewriting learning algorithms.
  5. The LLM + reward machine pipeline is emerging as a powerful pattern: LLM generates the symbolic scaffold, RL fills in the low-level policy — the best of both worlds.