Engineering Notes

Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Multi-Agent Reinforcement Learning Teaching Agents to Cooperate Compete and Coexist

23 Feb 2026

What happens when you put two learning agents in the same environment? They don’t just solve a problem together — they change each other’s learning problem. Every action each agent takes shifts the ground beneath the other’s feet. This feedback loop between co-adapting learners is the heart of Multi-Agent Reinforcement Learning (MARL), and it produces some of the most fascinating and counterintuitive behaviors in all of AI.

1. Concept Introduction

Simple Explanation

In standard reinforcement learning, one agent explores an environment, collects rewards, and updates its policy. The environment doesn’t fight back or adapt. MARL changes this: now multiple agents share the same world, and each agent is part of the environment for all the others.

Think of traffic: each driver is trying to reach their destination quickly (individual reward), but every choice — lane changes, braking, merging — affects every other driver. No central controller tells anyone what to do. Yet somehow, traffic flows. That emergent coordination is exactly what MARL studies and tries to engineer.

Technical Detail

Formally, MARL extends the single-agent MDP into a Markov Game (also called a stochastic game), defined as a tuple:

$$\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i \in \mathcal{N}}, \mathcal{T}, \{R^i\}_{i \in \mathcal{N}}, \gamma \rangle$$

where $\mathcal{N} = \{1, \ldots, n\}$ is the set of agents, $\mathcal{S}$ is the shared state space, $\mathcal{A}^i$ is each agent’s action space, $\mathcal{T}: \mathcal{S} \times \mathcal{A}^1 \times \cdots \times \mathcal{A}^n \to \Delta(\mathcal{S})$ is the transition function, and each agent $i$ has its own reward function $R^i$.

The key difficulty: each agent’s optimal policy $\pi^{*i}$ depends on what every other agent does. You can’t optimize them independently.

2. Historical & Theoretical Context

MARL traces to Markov games formalized by Lloyd Shapley in 1953 — the same Shapley who later received the Nobel in economics for mechanism design. The connection was intentional: game theory and RL share a deep ancestry in sequential decision-making under uncertainty.

The modern MARL era began in the 1990s with Littman’s minimax-Q algorithm (1994) for two-player zero-sum games, and Hu and Wellman’s Nash-Q (1998) for general-sum games. But these methods required full observability and exact Nash equilibrium computation — impractical for large problems.

The deep learning revolution changed everything. In 2016, DeepMind’s MADDPG (Multi-Agent Deep Deterministic Policy Gradient) brought actor-critic methods to MARL. OpenAI’s work on multi-agent hide-and-seek (2019) showed that simple competitive setups could produce astonishing emergent tool use. AlphaStar (2019) and OpenAI Five (2019) demonstrated that MARL at scale could exceed human experts in complex strategy games.

3. Algorithms & Math

Three Settings, Three Challenges

MARL problems split into three fundamental settings:

Cooperative: All agents share a single reward $R^1 = R^2 = \cdots = R^n = R$. The challenge is coordination without communication.

Competitive (Zero-sum): $\sum_i R^i = 0$. One agent’s gain is another’s loss. The goal is finding a Nash equilibrium policy.

Mixed-motive: Agents have different but coupled rewards. This is most realistic — teammates compete for resources while cooperating toward a shared goal.

The Non-Stationarity Problem

For agent $i$, its Q-function is:

$$Q^i(s, a^1, \ldots, a^n) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^i_t \,\Big|\, s_0 = s,\, a^i_0 = a^i\right]$$

But as other agents update their policies, this Q-function changes even without changes to the environment. From agent $i$’s perspective, the world is non-stationary — a fundamental violation of standard RL assumptions.

Centralized Training with Decentralized Execution (CTDE)

The dominant paradigm for cooperative MARL. During training, agents share privileged information (joint observations, other agents’ actions). At execution time, each agent acts using only its own local observation.

Training:
  Critic sees: [s, a¹, a², ..., aⁿ]   ← joint info
  Actor learns: π^i(a^i | o^i)         ← local only

Execution:
  Agent i observes o^i, acts via π^i
  No communication required

QMIX: Factoring the Joint Q-Function

For cooperative tasks, QMIX (Rashid et al., 2018) decomposes the global Q-value:

$$Q_{tot}(s, \mathbf{a}) = f_{\theta}\left(Q^1(o^1, a^1),\, Q^2(o^2, a^2),\, \ldots,\, Q^n(o^n, a^n),\, s\right)$$

where $f_\theta$ is a monotonic mixing network (weights are non-negative). The monotonicity constraint guarantees:

$$\frac{\partial Q_{tot}}{\partial Q^i} \geq 0 \quad \forall i$$

This means the action that maximizes each local $Q^i$ also maximizes the joint $Q_{tot}$ — agents can be greedy locally while still being globally optimal.

MAPPO

Multi-Agent PPO simply applies the PPO update with a centralized critic that receives the global state $s$ rather than local observations. Each agent $i$ optimizes:

$$\mathcal{L}^i(\theta) = \mathbb{E}\left[\min\left(r^i_t(\theta) \hat{A}^i_t,\, \text{clip}(r^i_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}^i_t\right)\right]$$

where $r^i_t(\theta) = \frac{\pi^i_\theta(a^i_t | o^i_t)}{\pi^i_{\theta_\text{old}}(a^i_t | o^i_t)}$ and the advantage $\hat{A}^i_t$ is estimated from the centralized critic.

4. Design Patterns & Architectures

The CTDE Pattern

graph TD
    ENV[Shared Environment]
    A1[Agent 1 Actor π¹]
    A2[Agent 2 Actor π²]
    C[Centralized Critic<br>V(s) or Q(s, a¹, a²)]

    ENV -->|o¹| A1
    ENV -->|o²| A2
    A1 -->|a¹| ENV
    A2 -->|a²| ENV
    ENV -->|s, rewards| C
    A1 --> C
    A2 --> C
    C -->|gradient| A1
    C -->|gradient| A2

    style C fill:#f9f,stroke:#333
  

The centralized critic is a training-time scaffold — it disappears at deployment. Each agent’s actor only needs local observations to act.

The Parameter Sharing Pattern

For homogeneous agents (same observation/action spaces), share weights across all agents. Agent identity is passed as input:

# Instead of n separate networks:
# policy_1 = PolicyNetwork()
# policy_2 = PolicyNetwork()
# ...

# Use one shared network with agent ID:
shared_policy = PolicyNetwork(input_dim=obs_dim + n_agents)

def act(agent_id: int, obs: np.ndarray) -> np.ndarray:
    agent_one_hot = np.zeros(n_agents)
    agent_one_hot[agent_id] = 1
    augmented_obs = np.concatenate([obs, agent_one_hot])
    return shared_policy(augmented_obs)

This dramatically reduces parameters, accelerates learning, and improves generalization to new team sizes.

Communication Protocols

Some architectures allow learned communication:

5. Practical Application

Here’s a minimal MARL cooperative setup using PettingZoo (the multi-agent equivalent of Gymnasium):

import numpy as np
from pettingzoo.mpe import simple_spread_v3
from collections import defaultdict

# Cooperative navigation: agents must cover landmarks
env = simple_spread_v3(N=3, local_ratio=0.5, render_mode=None)

# Simple parameter-sharing policy (linear for demo)
class SharedLinearPolicy:
    def __init__(self, obs_dim: int, act_dim: int, n_agents: int):
        input_dim = obs_dim + n_agents
        self.W = np.random.randn(input_dim, act_dim) * 0.1
        self.n_agents = n_agents

    def act(self, agent_id: int, obs: np.ndarray) -> np.ndarray:
        one_hot = np.zeros(self.n_agents)
        one_hot[agent_id] = 1
        x = np.concatenate([obs, one_hot])
        logits = x @ self.W
        # Softmax
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()
        return np.random.choice(len(probs), p=probs)

env.reset(seed=42)
obs_dim = env.observation_space("agent_0").shape[0]
act_dim = env.action_space("agent_0").n
policy = SharedLinearPolicy(obs_dim, act_dim, n_agents=3)

# Agent ID mapping
agent_ids = {name: i for i, name in enumerate(env.possible_agents)}

# Run one episode
episode_rewards = defaultdict(float)
observations, _ = env.reset()

for agent in env.agent_iter():
    obs, reward, terminated, truncated, info = env.last()
    episode_rewards[agent] += reward

    if terminated or truncated:
        action = None
    else:
        agent_id = agent_ids[agent]
        action = policy.act(agent_id, obs)

    env.step(action)

print("Episode rewards:", dict(episode_rewards))
print("Total team reward:", sum(episode_rewards.values()))
env.close()

For a production MARL system, consider RLlib with its built-in QMIX and MADDPG implementations:

from ray.rllib.algorithms.qmix import QMixConfig

config = (
    QMixConfig()
    .environment("your_env")
    .multi_agent(
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
    .training(mixer="qmix", mixing_embed_dim=32)
)

algo = config.build()
for i in range(100):
    result = algo.train()
    print(f"Iter {i}: reward={result['episode_reward_mean']:.2f}")

6. Comparisons & Tradeoffs

ApproachScalabilityCoordinationSample EfficiencyUse Case
Independent Q-Learning (IQL)HighPoor (ignores others)MediumWeakly coupled tasks
QMIXMediumGood (cooperative)GoodFully cooperative, discrete actions
MADDPGLow-MediumGood (any setting)LowContinuous actions, mixed-motive
MAPPOHighGoodHighLarge teams, complex tasks
Self-playHighN/A (competitive)MediumTwo-player zero-sum

Key tradeoffs:

7. Latest Developments & Research

Foundation models as MARL agents (2024–2025): LLM-based agents (like those in AutoGen or ChatDev) can be viewed as MARL systems where each agent runs a language model. Emergent coordination through natural language replaces learned communication vectors. Research by Park et al. (“Generative Agents”, 2023) showed surprisingly sophisticated social behavior from pure language-based coordination.

MARL for real-world systems (2024): Google DeepMind applied MARL to data center cooling control, reducing energy by 40%. Waymo uses MARL to model interactions between autonomous vehicles and human drivers simultaneously.

Opponent shaping (2023–2025): Instead of treating other agents as fixed, algorithms like LOLA (Learning with Opponent-Aware Learning) and MAIA actively shape other agents’ learning — a meta-game above the base game. This is both powerful and ethically complex.

Open problems:

8. Cross-Disciplinary Insight

MARL is evolutionary game theory made computational. In biology, replicator dynamics describe how strategies spread through a population based on fitness:

$$\dot{x}_i = x_i\left(f_i(\mathbf{x}) - \bar{f}(\mathbf{x})\right)$$

where $x_i$ is the fraction of population using strategy $i$, $f_i$ is its fitness, and $\bar{f}$ is the average. Strategies that outperform average grow; others shrink.

MARL policy gradient does exactly this in miniature: better-than-baseline actions increase in probability. The Nash equilibrium corresponds to an evolutionarily stable strategy — a population composition that can’t be invaded by mutants.

This connection reveals why MARL is hard: evolution doesn’t converge to optimal outcomes; it converges to stable ones, which may be locally trapped. Ant colonies, immune systems, and financial markets all exhibit this property. Your MARL agents are no different.

9. Daily Challenge

Exercise: Coordination Without Communication

Use PettingZoo’s simple_spread_v3 (or similar cooperative environment). Train two agents using independent Q-learning (each agent ignores the other’s actions). Then train with a shared reward signal and compare coordination quality.

Questions to answer:

  1. Do agents converge to divide-and-conquer or duplicate each other’s work?
  2. What happens to the coordination pattern as you add a 3rd agent?
  3. Can you implement a simple difference reward — give each agent credit only for the improvement it caused? Compare sample efficiency vs. shared reward.
# Difference reward for agent i:
# r_diff_i = R(s, a¹, ..., aⁿ) - R(s, a¹, ..., a^i_default, ..., aⁿ)
# where a^i_default is a "do nothing" action

def difference_reward(env, joint_actions, agent_id, default_action):
    """Counterfactual: what would the team reward be without agent i?"""
    counterfactual_actions = joint_actions.copy()
    counterfactual_actions[agent_id] = default_action
    # ... simulate counterfactual and return delta

Bonus: Visualize how Q-values change over training for each agent. Do they converge to stable values, oscillate, or collapse?

10. References & Further Reading

Papers

Books & Courses

Tools & Environments

Blog Posts


Key Takeaways

  1. MARL is fundamentally harder than single-agent RL: non-stationarity breaks convergence guarantees
  2. CTDE is the dominant paradigm: centralized training, decentralized execution
  3. Value decomposition (QMIX) enables scalable cooperative learning
  4. Parameter sharing accelerates learning and improves generalization for homogeneous teams
  5. Emergent behavior is a feature, not a bug — complex strategies arise from simple incentives
  6. Game theory provides the stability concepts (Nash equilibrium, ESS) that MARL aims for
  7. LLM-based agents are MARL systems — AutoGen, CrewAI, and friends are applying these principles through language

MARL is where reinforcement learning meets sociology. Get it right, and your agents don’t just solve problems — they form societies.

● Intelligence at Every Action

AI Native
Project Management

Stop using tools that bolt on AI as an afterthought. Jovis is built AI-first — smart routing, proactive monitoring, and intelligent workflows from the ground up.