Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Grounded Language Agents: Connecting Words to Actions in the Physical World

27 Feb 2026

Grounded language agents are systems that can reason in language and act meaningfully in physical or simulated environments. Building them requires solving a deep problem: how do words connect to the real world? A robot instructed to “grab the red mug near the coffee machine” must bridge language, perception, and motor control in ways that pure language models cannot.

Concept Introduction

Grounding means connecting symbols (words, tokens) to perceptual experiences and actions in the world. A grounded agent doesn’t just predict text about picking up a mug. It actually perceives the mug, plans a trajectory, and executes the grip. Language models trained only on text are in the position of someone who learned a language entirely from definitions: syntactically fluent but experientially empty.

Grounded language agents combine several subsystems:

- A perception module that turns raw observations into structured representations
- A language reasoner that interprets goals and produces high-level plans
- An action module that translates those plans into motor commands or API calls

The key challenge is the semantic gap between language tokens and continuous sensorimotor signals. Bridging this gap efficiently, without requiring massive amounts of paired (language, action) data, is the central problem in embodied AI research.

Historical & Theoretical Context

The symbol grounding problem was formally articulated by Stevan Harnad in 1990. He argued that symbols in a formal system cannot be meaningful unless they are grounded in non-symbolic experience (perception and action). This posed a direct challenge to pure symbolic AI.

The field responded with a range of approaches over the following decades.

The modern resurgence came with two forces colliding: transformer-based LLMs that can reason fluently about tasks, and deep learning perception systems (ViT, CLIP) that can interpret rich visual scenes. Combining them unlocked a new generation of grounded agents.

Algorithms & Math

Affordance-Conditioned Planning

The landmark SayCan paper (Ahn et al., Google, 2022) introduced a principled way to combine LLM reasoning with physical feasibility. Given a goal $g$ and current state $s$, the agent selects skill $a$ that maximizes:

$$\pi^*(a \mid s, g) \propto p_{\text{LLM}}(a \mid g) \cdot p_{\text{afford}}(a \mid s)$$

Where:

- $p_{\text{LLM}}(a \mid g)$ is the language model's probability that skill $a$ is semantically useful toward goal $g$
- $p_{\text{afford}}(a \mid s)$ is the affordance model's probability that skill $a$ can physically succeed from state $s$

This elegantly separates what is semantically sensible (language model) from what is physically possible (affordance model). An LLM might suggest “fly to the kitchen” but the affordance model assigns that zero probability, keeping the agent grounded in reality.

Vision-Language Action Models

RT-2 (Brohan et al., Google DeepMind, 2023) takes a different approach: fine-tune a vision-language model end-to-end to output robot actions as tokens. Robot actions (joint angles, gripper positions) are discretized and treated like text tokens:

Input:  [image_tokens] + "Pick up the apple and put it in the bowl"
Output: "move_arm 0.23 -0.15 0.40 close_gripper"

This reframes robot control as a language modeling problem, allowing the model to leverage internet-scale pretraining. The key insight: the same attention mechanisms that learn relationships between words can learn relationships between visual features and motor commands.
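To make the tokenization concrete, here is a minimal sketch of binning continuous action dimensions into discrete tokens and recovering them, in the spirit of RT-2. The bin count (256) and the action ranges are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def discretize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray,
                      n_bins: int = 256) -> list[int]:
    """Map each continuous action dimension to an integer bin index (a 'token')."""
    normalized = (action - low) / (high - low)            # scale to [0, 1]
    bins = np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)
    return bins.tolist()

def undiscretize_action(tokens: list[int], low: np.ndarray, high: np.ndarray,
                        n_bins: int = 256) -> np.ndarray:
    """Recover approximate continuous values from bin centers."""
    centers = (np.array(tokens) + 0.5) / n_bins
    return low + centers * (high - low)

# Hypothetical 3-DoF arm target plus gripper aperture (ranges are made up)
low = np.array([-1.0, -1.0, 0.0, 0.0])
high = np.array([1.0, 1.0, 1.0, 1.0])
action = np.array([0.23, -0.15, 0.40, 1.0])
tokens = discretize_action(action, low, high)
recovered = undiscretize_action(tokens, low, high)
```

Once actions are integer tokens, they can share a vocabulary with text, which is what lets a single transformer predict them alongside words.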

Pseudocode: SayCan-Style Planning Loop

def saycan_plan(goal: str, environment: Environment, llm, affordance_model):
    plan = []
    state = environment.observe()

    while not goal_achieved(state, goal):
        # Get candidate skills from LLM
        candidates = llm.propose_skills(goal, plan, state)

        # Score each skill by affordance (physical feasibility)
        scores = []
        for skill in candidates:
            p_lang = llm.score_skill(skill, goal)
            p_afford = affordance_model.score(skill, state)
            scores.append((skill, p_lang * p_afford))

        # Execute highest-scoring feasible skill
        best_skill = max(scores, key=lambda x: x[1])[0]
        plan.append(best_skill)
        state = environment.execute(best_skill)

    return plan
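
The loop above can be exercised with hand-coded stand-ins for the LLM and affordance model. All skill names and scores below are made-up illustrations, not SayCan's real components; the point is to see the product $p_{\text{lang}} \cdot p_{\text{afford}}$ override a fluent but infeasible suggestion.

```python
def saycan_step(state: dict, candidates: list[str],
                lang_scores: dict, afford) -> str:
    """Pick the skill maximizing p_lang * p_afford, SayCan-style."""
    scored = [(s, lang_scores.get(s, 0.0) * afford(s, state)) for s in candidates]
    return max(scored, key=lambda x: x[1])[0]

def affordance(skill: str, state: dict) -> float:
    # A pick-up is feasible only if the referenced object is actually present
    if skill.startswith("pick_up_"):
        return 1.0 if skill.removeprefix("pick_up_") in state["objects"] else 0.0
    return 0.5  # navigation skills assumed moderately feasible

state = {"objects": ["red_mug"], "holding": None}
lang_scores = {"pick_up_red_mug": 0.9, "pick_up_blue_cup": 0.8, "move_to_sink": 0.3}
best = saycan_step(state, list(lang_scores), lang_scores, affordance)
# pick_up_blue_cup scores 0.8 * 0.0 because no blue cup is in the scene,
# so pick_up_red_mug (0.9 * 1.0) wins even though the LLM rates both highly.
```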

Design Patterns & Architectures

The Perception-Reasoning-Action Loop

graph TD
    E[Environment] -->|raw observations| P[Perception Module]
    P -->|structured scene graph / embeddings| R[Language Reasoner]
    R -->|high-level plan| A[Action Module]
    A -->|motor commands / API calls| E
    R -->|goal progress| R
  

Key Patterns

Scene Graph Grounding: The perception module builds a structured graph of objects, their properties, and spatial relationships. This graph is serialized into language (“a red mug is to the left of the coffee machine”) and fed to the LLM. This is far more token-efficient than raw image description.
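
A minimal sketch of that serialization step, assuming a simple dict-and-tuple schema for the scene graph (the schema is an illustration, not any particular library's format):

```python
def serialize_scene_graph(objects: list[dict], relations: list[tuple]) -> str:
    """Turn a structured scene graph into a compact natural-language description."""
    lines = [f"a {o['color']} {o['name']}" for o in objects]
    for subject, relation, target in relations:
        lines.append(f"the {subject} is {relation} the {target}")
    return "; ".join(lines)

objects = [
    {"name": "mug", "color": "red"},
    {"name": "coffee machine", "color": "black"},
]
relations = [("mug", "to the left of", "coffee machine")]
description = serialize_scene_graph(objects, relations)
```

The resulting string can be dropped directly into an LLM prompt, costing a handful of tokens where a raw image caption might cost hundreds.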

Hierarchical Grounding: High-level instructions (“clean the kitchen”) are decomposed by the LLM into grounded subgoals (“pick up the dish”, “place it in the sink”), which are then executed by low-level controllers. This matches the planner-executor pattern but with physical grounding at each level.

Affordance-Aware Memory: The agent’s memory includes not just facts but affordances: “the drawer is stuck and cannot be opened”, “the robot arm cannot reach above shelf 3”. This grounds future planning in physical experience.
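
A sketch of what affordance-aware memory could look like in practice; the class and its methods are hypothetical, meant only to show failures being recorded and fed back into the planning prompt.

```python
class AffordanceMemory:
    """Remember which skills failed and why, so planning avoids repeats."""

    def __init__(self):
        self.failures: dict[str, str] = {}

    def record_failure(self, skill: str, reason: str) -> None:
        self.failures[skill] = reason

    def is_known_infeasible(self, skill: str) -> bool:
        return skill in self.failures

    def as_prompt_context(self) -> str:
        """Serialize failures so the LLM can plan around them."""
        if not self.failures:
            return ""
        lines = [f"- {skill}: {reason}" for skill, reason in self.failures.items()]
        return "Known physical constraints:\n" + "\n".join(lines)

memory = AffordanceMemory()
memory.record_failure("open_drawer", "the drawer is stuck")
```

Prepending `memory.as_prompt_context()` to each planning prompt grounds the LLM's next suggestion in what the robot has physically learned.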

Practical Application

Here’s a minimal grounded agent that uses a vision-language model to answer questions about a scene and execute actions in a simulated environment:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def grounded_agent(goal: str, image_path: str, available_actions: list[str]) -> str:
    """
    A grounded agent that reasons about a visual scene and selects an action.
    """
    image_data = encode_image(image_path)
    actions_str = "\n".join(f"- {a}" for a in available_actions)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": f"""You are a grounded robot agent.
Goal: {goal}

Available actions:
{actions_str}

Analyze the scene. Respond in this format:
OBSERVATION: [what you see]
REASONING: [why this action makes sense]
ACTION: [exactly one action from the list]"""
                }
            ]
        }]
    )

    return response.content[0].text

# Simulated environment execution
def execute_action(action: str, environment: dict) -> dict:
    """Apply an action to the environment state (simplified simulation)."""
    new_state = environment.copy()
    # dict.copy() is shallow, so copy the mutable object list too
    new_state["objects"] = list(environment.get("objects", []))
    if action.startswith("pick_up_"):
        obj = action.removeprefix("pick_up_")
        if obj in new_state["objects"]:
            new_state["holding"] = obj
            new_state["objects"].remove(obj)
    elif action.startswith("place"):
        held = new_state.pop("holding", None)
        if held is not None:  # only place something if we are holding it
            new_state["objects"].append(held)
    return new_state

# Usage
available_actions = [
    "pick_up_red_mug",
    "pick_up_blue_cup",
    "move_to_sink",
    "move_to_table",
    "open_dishwasher",
    "wait"
]

result = grounded_agent(
    goal="Clean up the red mug",
    image_path="kitchen_scene.jpg",
    available_actions=available_actions
)
print(result)

In production systems (like those built on ROS 2 or Isaac Sim), the execute_action function would interface with real motor controllers or physics simulators.

Latest Developments & Research

RT-2 and RT-X (2023): Google DeepMind trained a single robot policy across 22 different robot embodiments by pooling data, showing that language grounding helps transfer across physical platforms.

SayPlan (2023): Extended SayCan to longer-horizon planning using 3D scene graphs. The LLM reasons over a compressed graph rather than raw images, enabling room-scale manipulation planning.

Code as Policies (Liang et al., 2023): LLMs write Python code that calls a robot API, a technique sometimes called “grounding through code.” The policy is interpretable and compositional, and the agent can write loops, conditionals, and calls to perception APIs.

OpenVLA (2024): An open-source 7B-parameter vision-language-action model, making RT-2-style models accessible to academic researchers without Google-scale compute.

Embodied agents in simulation: Platforms like AI2-THOR, Habitat 3.0, and Isaac Lab provide photo-realistic environments for training grounded agents before real-world deployment. The sim-to-real gap remains a key open problem.

Open problems: How do agents ground abstract language (“be careful”, “hurry up”) in physical behavior? How do they handle novel objects never seen in training? Robust failure detection, knowing when the affordance model is wrong, remains unsolved.

Cross-Disciplinary Insight

The symbol grounding problem maps directly onto debates in cognitive linguistics and philosophy of mind. Philosophers like John Searle (the Chinese Room argument) argued that syntactic symbol manipulation can never produce genuine understanding without grounding in experience.

The SayCan architecture mirrors how the cerebellum and prefrontal cortex collaborate in humans: the prefrontal cortex handles high-level goal reasoning (analogous to the LLM), while the cerebellum handles learned motor programs encoding what movements are feasible in context (analogous to the affordance model). Neither alone produces intelligent behavior.

In control theory, this maps onto the classic separation of a reference model (what should happen) from a plant model (what can happen given physics). Grounded language agents are essentially building these models from data rather than from first principles.

Daily Challenge

Build a Text-World Grounded Agent

The TextWorld library (Microsoft) provides text-based games where an agent must navigate rooms and manipulate objects using language commands, a simplified grounding testbed without real-world complexity.

# pip install textworld
import textworld
import textworld.gym

# Create a simple cooking game (the exact make/challenge API varies across
# TextWorld versions; consult the TextWorld docs if this call fails)
options = textworld.GameOptions()
options.seeds = 42
game_file, _ = textworld.make("tw-cooking-recipe1+cut+go6", options)

env_id = textworld.gym.register_game(game_file, max_episode_steps=50)

import gym
env = gym.make(env_id)
obs, infos = env.reset()
print(obs)  # "You are in a kitchen. You see a knife and a tomato."

# Your challenge: build an agent that:
# 1. Parses the text observation into a structured state
# 2. Uses an LLM to propose an action
# 3. Executes it and observes the result
# 4. Repeats until the goal is achieved (or max steps)

# Hint: the affordance model here is implicit — invalid actions
# return "That's not something you can do" messages.
# Can you learn to avoid invalid actions without trying them?

Bonus: Add a memory module that tracks which actions failed and why, so the agent doesn’t repeat mistakes.
