Engineering Notes

Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Counterfactual Regret Minimization for Solving Imperfect Information Games

10 Mar 2026

Chess and Go have complete information — every player sees the whole board. But most real-world agent problems don’t. Negotiation, cybersecurity, auctions, medical triage: information is partial, private, or hidden. Counterfactual Regret Minimization (CFR) is the algorithm that cracked this class of problems, producing the first superhuman poker agents and reshaping how we think about strategic reasoning under uncertainty.

1. Concept Introduction

Simple Explanation

Imagine playing poker. You can’t see your opponent’s cards. Each time you fold when you should have called, you feel regret. CFR asks: what if I had played differently at every moment I can control? It computes this counterfactual regret — the regret you’d have if you could retroactively change a single decision — and then nudges your strategy toward the actions you wish you’d taken.

Do this repeatedly, and your strategy converges to a Nash equilibrium: a point where no player can improve by switching strategies unilaterally, even knowing the opponent’s strategy.

Technical Detail

CFR operates on extensive-form games — the formal representation for sequential games with hidden information. Unlike normal-form games (payoff matrices), extensive-form games encode the game tree explicitly: who acts when, what they can observe, and what the payoffs are.

The key insight is the information set: the set of all game states that a player cannot distinguish given their observations. Your strategy maps information sets to distributions over actions, not individual states to actions.

CFR’s power: it decomposes the global regret minimization problem into local, per-information-set problems that can be solved independently.

2. Historical & Theoretical Context

CFR was introduced by Zinkevich, Johanson, Bowling, and Piccione at NeurIPS 2007. It built on regret matching, an older online learning algorithm, extending it to the sequential setting.

The timeline of impact:

The game-theoretic foundation comes from von Neumann’s minimax theorem and Nash’s equilibrium concept. CFR operationalizes Nash equilibrium computation for games too large for LP solvers.

3. Algorithms and Math

Extensive-Form Game Setup

A game has:

The reach probability to node $h$ under strategy profile $\sigma$:

$$\pi^\sigma(h) = \prod_{h' \cdot a \sqsubseteq h} \sigma_{P(h')}(h', a)$$

Split into player $i$’s contribution $\pi^\sigma_i(h)$ and the rest $\pi^\sigma_{-i}(h)$.

Counterfactual Value and Regret

The counterfactual value at information set $I$ for action $a$ is the expected payoff as if player $i$ had tried to reach $I$:

$$v_i^\sigma(I, a) = \sum_{h \in I} \pi^\sigma_{-i}(h) \sum_{z \succ h \cdot a} \pi^\sigma(h \cdot a, z) \cdot u_i(z)$$

The instantaneous counterfactual regret for not playing action $a$ at $I$:

$$r^t(I, a) = v_i^{\sigma^t}(I, a) - v_i^{\sigma^t}(I)$$

Cumulative regret over $T$ iterations:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

CFR Strategy Update (Regret Matching)

At each iteration, play proportional to positive cumulative regret:

$$\sigma^{T+1}(I, a) = \frac{\max(R^T(I, a),\ 0)}{\sum_{a'} \max(R^T(I, a'),\ 0)}$$

If all regrets are non-positive, play uniformly.

Pseudocode

function CFR(history h, player i, t):
    if h is terminal:
        return u_i(h)
    if h is chance node:
        return sum over actions a: P(a) * CFR(h·a, i, t)

    I = information_set(h)
    strategy = regret_match(R[I])
    value = 0
    action_values = {}

    for a in actions(h):
        action_values[a] = CFR(h·a, i, t)
        value += strategy[a] * action_values[a]

    if P(h) == i:
        for a in actions(h):
            R[I][a] += counterfactual_reach(h) * (action_values[a] - value)

    return value

repeat for T iterations:
    for each player i:
        CFR(root, i, t)
final_strategy = average over all iterations

The average strategy (not the final strategy) converges to Nash equilibrium. The average regret shrinks as $O(1/\sqrt{T})$.

4. Design Patterns and Architectures

CFR fits into agent architectures in several ways:

Blueprint + Real-time Search: Libratus uses CFR offline to produce a coarse blueprint strategy, then refines it online during play with nested subgame solving. This pattern — offline planning, online refinement — generalizes broadly.

graph TD
    A[Offline CFR Training] --> B[Blueprint Strategy]
    B --> C[Runtime Agent]
    D[Game State / Observation] --> E[Information Set Lookup]
    E --> F{Real-time Subgame Solver}
    F --> G[Action Distribution]
    B --> F
  

Abstraction Layer: Real games like Texas Hold’em have $10^{160}$ nodes. CFR requires abstraction — grouping similar cards, bet sizes, or board states. The agent computes CFR on the abstract game and translates actions back.

Connection to MCTS: Both MCTS and CFR traverse game trees iteratively, but MCTS samples trajectories while CFR computes exact counterfactuals. For very large trees, Monte Carlo CFR (MCCFR) samples subsets of nodes, combining both ideas.

5. Practical Application

Here is a minimal CFR implementation for Kuhn Poker (a tiny 3-card game that captures the core structure of Texas Hold’em):

import numpy as np
from collections import defaultdict

CARDS = [1, 2, 3]  # J, Q, K
NUM_ACTIONS = 2    # 0=pass, 1=bet

# Cumulative regret and strategy sum per info set
regret_sum = defaultdict(lambda: np.zeros(NUM_ACTIONS))
strategy_sum = defaultdict(lambda: np.zeros(NUM_ACTIONS))

def get_strategy(info_set, realization_weight):
    regrets = regret_sum[info_set]
    strategy = np.maximum(regrets, 0)
    total = strategy.sum()
    if total > 0:
        strategy /= total
    else:
        strategy = np.ones(NUM_ACTIONS) / NUM_ACTIONS
    strategy_sum[info_set] += realization_weight * strategy
    return strategy

def cfr(cards, history, p0, p1):
    plays = len(history)
    player = plays % 2
    opponent = 1 - player

    # Terminal conditions for Kuhn Poker
    if plays >= 2:
        if history[-1] == 0:  # last action was pass
            if history[-2] == 1:  # opponent bet, I passed -> fold
                return 1 if player == 1 else -1
            else:  # both passed -> showdown
                return 1 if cards[player] > cards[opponent] else -1
        elif history[-1] == 1 and history[-2] == 1:  # both bet -> showdown
            return 1 if cards[player] > cards[opponent] else -1

    info_set = f"{cards[player]}{''.join(map(str, history))}"
    reach = p0 if player == 0 else p1
    cfr_reach = p1 if player == 0 else p0

    strategy = get_strategy(info_set, reach)
    action_values = np.zeros(NUM_ACTIONS)

    for a in range(NUM_ACTIONS):
        next_history = history + [a]
        if player == 0:
            action_values[a] = -cfr(cards, next_history, p0 * strategy[a], p1)
        else:
            action_values[a] = -cfr(cards, next_history, p0, p1 * strategy[a])

    node_value = strategy @ action_values
    regret_sum[info_set] += cfr_reach * (action_values - node_value)
    return node_value

def train(iterations=10000):
    import random
    total_util = 0
    for _ in range(iterations):
        cards = random.sample(CARDS, 2)
        total_util += cfr(cards, [], 1.0, 1.0)
    print(f"Average game value: {total_util / iterations:.4f}")
    print("\nNash equilibrium strategies:")
    for info_set, ssum in sorted(strategy_sum.items()):
        total = ssum.sum()
        if total > 0:
            norm = ssum / total
            print(f"  {info_set}: pass={norm[0]:.2f} bet={norm[1]:.2f}")

train(50000)

In a LangGraph agent, CFR-derived strategies become policy lookup tables or neural networks that map observation histories to action distributions, enabling rational behavior in information-asymmetric negotiations or adversarial tool-use scenarios.

6. Comparisons and Tradeoffs

MethodGame TypeScalabilityGuarantees
CFRImperfect info, 2+ playersMedium (with abstraction)Nash convergence
Alpha-Beta / MCTSPerfect infoHighOptimal (bounded)
Deep RL (PPO, SAC)AnyHighNo Nash guarantee
Double OracleNormal-formLowNash in small games
Neural Fictitious PlayGeneralHighApproximate

CFR’s core limitation: it requires traversing (or sampling) the full game tree, which is intractable without abstraction for large games. Deep CFR (2019) addresses this by using neural networks to approximate counterfactual values, enabling CFR to scale without explicit abstraction.

Another limitation: CFR finds Nash equilibria, which are safe strategies but not necessarily exploitative ones. Against a weak opponent, pure Nash play leaves value on the table. Opponent modeling and exploitation require additional machinery.

7. Latest Developments and Research

Deep CFR (Brown et al., 2019): Replaces the regret tables with neural networks. Achieves superhuman performance in heads-up no-limit hold’em without hand-crafted abstractions.

ReBeL (Brown et al., 2020, NeurIPS): Combines CFR with deep RL value functions, generalizing the blueprint + subgame approach into a unified framework. Works for games beyond poker.

Player of Games (Schmid et al., 2023, Science): A single algorithm combining MCTS and CFR that achieves strong performance on both perfect-information games (Chess, Go) and imperfect-information games (poker, Scotland Yard) — a step toward truly general game-playing agents.

CFR in LLM multi-agent settings (2024): Emerging research explores CFR-style regret minimization in natural language negotiation, where information sets correspond to conversation histories and actions are utterances.

Open problems: scaling CFR to continuous action spaces, handling non-stationary opponents, and combining CFR with language model priors for grounded strategic reasoning.

8. Cross-Disciplinary Insight

CFR is a form of online learning with regret guarantees, deeply connected to the no-regret learning literature in economics and operations research. Economists use similar frameworks in mechanism design and auction theory.

The information set structure mirrors signal processing: what you observe is a noisy, partial view of ground truth, and your decisions must be robust to that uncertainty. Shannon’s information theory quantifies how much is unknown; CFR operationalizes rational behavior under that uncertainty.

In neuroscience, there’s growing evidence that the prefrontal cortex maintains something like counterfactual values — simulating “what would have happened if I’d acted differently” — to update decision-making policies. CFR formalizes this as an algorithm.

9. Daily Challenge

Implement one-card poker (simpler than Kuhn): each player gets a card from 1–10, player 1 acts first (bet or check), player 2 sees player 1’s action and responds. Run CFR for 10,000 iterations. Then:

  1. Print the Nash equilibrium strategy for player 1 at each card value.
  2. Observe the bluffing threshold: at what card value does player 1 switch from mostly checking to mostly betting?
  3. Now hand-craft a “naive” strategy (always bet if card > 5). Measure how much value CFR extracts against it compared to Nash play.

Thought experiment: In a three-way negotiation between AI agents (each with private information about their true preferences), how would you define “information sets”? What would convergence to Nash look like, and is Nash equilibrium even the right solution concept here?

10. References and Further Reading

● Intelligence at Every Action

AI Native
Project Management

Stop using tools that bolt on AI as an afterthought. Jovis is built AI-first — smart routing, proactive monitoring, and intelligent workflows from the ground up.