Knowledge Distillation: Teaching Smaller Agents From Larger Ones
Your most capable AI agent costs $0.15 per call and takes 30 seconds to respond. Your users need sub-second answers at a fraction of the cost. Knowledge distillation is how you bridge that gap — training a smaller, cheaper “student” agent to mimic the behavior of a larger, smarter “teacher” without losing what matters.
1. Concept Introduction
Simple Explanation
Think of a master chef training an apprentice. The master doesn’t hand over every recipe and technique at once. Instead, the apprentice watches the master work, learns how the master makes decisions (not just what the final dishes look like), and gradually builds intuition. The apprentice will never replicate everything perfectly, but for the dishes that matter most, they become good enough to run the kitchen alone.
Knowledge distillation works the same way. A large, powerful model (the “teacher”) generates training signals — not just correct answers, but the distribution of its confidence across all possible answers. A smaller model (the “student”) learns from these soft signals, absorbing the teacher’s reasoning patterns in a compressed form.
Technical Detail
In the context of AI agents, distillation goes beyond single model outputs. An agent produces:
- Action selections: Which tool to call, what to say
- Reasoning traces: Chain-of-thought steps, planning sequences
- Confidence distributions: Soft probabilities over possible next actions
- State assessments: How the agent evaluates the current situation
Distilling an agent means transferring all of these behavioral patterns — not just the final answers — from a high-capability system to a leaner one that can operate in production at scale.
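Concretely, one step of a teacher trajectory can be captured in a record like the sketch below; the field names are illustrative, not a fixed schema (Section 5 builds a smaller variant focused on tool calling):

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryStep:
    """One step of teacher behavior to distill (illustrative field names)."""
    state: str                      # serialized conversation / environment state
    action: str                     # chosen tool name, or "respond"
    action_distribution: dict[str, float] = field(default_factory=dict)  # soft probabilities over actions
    reasoning_trace: str = ""       # chain-of-thought or plan text
    state_assessment: str = ""      # the teacher's read of the current situation
```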
2. Historical & Theoretical Context
Knowledge distillation was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper “Distilling the Knowledge in a Neural Network.” The core insight was elegant: the “wrong” answers a teacher model gives contain valuable information. When a digit classifier says an image is “70% a 7, 20% a 2, 10% a 1,” the relationship between those probabilities encodes structural knowledge about what makes digits similar.
The idea has roots in model compression (Buciluă et al., 2006) and connects to several foundational concepts:
- Minimum Description Length: Compressing knowledge into a simpler representation
- PAC learning: What can a simpler hypothesis class learn from a more complex one?
- Transfer learning: Reusing knowledge across tasks and architectures
In multi-agent systems, distillation connects to the principle of agent specialization — the idea that not every agent in a system needs full general capability. Some agents only need to handle a narrow slice of the problem space, and distillation is how you carve that slice efficiently.
3. Algorithms & Math
The Distillation Loss
The standard distillation objective combines two losses:
L = α · L_hard + (1 - α) · L_soft
Where:
- L_hard = Cross-entropy between student predictions and ground-truth labels
- L_soft = KL divergence between softened teacher and student distributions
- α = Weighting factor (typically 0.1–0.5)
The “softening” uses a temperature parameter T:
softmax(z_i / T) = exp(z_i / T) / Σ_j exp(z_j / T)
Higher temperature (T > 1) produces softer probability distributions, revealing more of the teacher’s internal ranking of alternatives.
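A minimal NumPy sketch of this objective (the α and T values are illustrative; the T² factor on the soft term follows Hinton et al. and is left out of the simplified formula above):

```python
import numpy as np


def soften(logits: np.ndarray, T: float) -> np.ndarray:
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()


def distillation_loss(student_logits, teacher_logits, true_label, alpha=0.3, T=2.0):
    """L = alpha * L_hard + (1 - alpha) * L_soft, as defined above."""
    p_student = soften(student_logits, 1.0)
    l_hard = -np.log(p_student[true_label] + 1e-12)  # cross-entropy vs. ground truth

    q_teacher = soften(teacher_logits, T)
    q_student = soften(student_logits, T)
    l_soft = np.sum(q_teacher * (np.log(q_teacher + 1e-12) - np.log(q_student + 1e-12)))  # KL divergence

    # Hinton et al. scale the soft term by T^2 so its gradients stay
    # comparable to the hard term as T grows.
    return alpha * l_hard + (1 - alpha) * (T ** 2) * l_soft


# Example: the teacher strongly prefers class 0; the student is still undecided
teacher = np.array([4.0, 1.5, 0.5])
student = np.array([1.0, 0.9, 0.8])
print(distillation_loss(student, teacher, true_label=0))
```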
Pseudocode for Agent Distillation
```text
Algorithm: Agent Behavior Distillation
─────────────────────────────────────
Input:  Teacher agent A_T, Student agent A_S, Task distribution D
Output: Trained student agent A_S*

1. COLLECT trajectories:
   For each task t ~ D:
       Run A_T on task t
       Record: (state, action, reasoning_trace, confidence) tuples

2. BUILD distillation dataset:
   For each trajectory:
       Extract (input_state, teacher_action_distribution, teacher_reasoning)

3. TRAIN student:
   For each epoch:
       For each (state, teacher_dist, reasoning) in dataset:
           student_dist = A_S.predict(state)
           L_action    = KL(teacher_dist || student_dist)
           L_reasoning = CrossEntropy(reasoning, A_S.generate_reasoning(state))
           L_total     = α · L_action + (1 - α) · L_reasoning
           Update A_S parameters via gradient descent on L_total

4. EVALUATE:
   Compare A_S* vs A_T on held-out tasks
   Return A_S* if performance meets threshold
```
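Step 3 of the pseudocode, written against a hypothetical student with an action-logits head, might look like the PyTorch sketch below; `student.action_logits`, the dataset tuples, and the hyperparameters are assumptions rather than a fixed API:

```python
import torch
import torch.nn.functional as F


def train_student_actions(student, dataset, epochs=3, lr=1e-4):
    """Sketch of the action term of step 3.

    Assumes `student` is an nn.Module exposing action_logits(state) and that
    `dataset` yields (state_tensor, teacher_action_probs) pairs. The reasoning
    term is handled separately by fine-tuning on the teacher's reasoning text
    (see Section 5).
    """
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for state, teacher_probs in dataset:
            student_log_probs = F.log_softmax(student.action_logits(state), dim=-1)
            # KL(teacher || student): penalize the student for missing probability
            # mass the teacher places on alternative actions.
            loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```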
4. Design Patterns & Architectures
Pattern: Teacher-Student Pipeline
```mermaid
graph LR
    A[Task Distribution] --> B[Teacher Agent<br/>Large Model]
    B --> C[Trajectory Store<br/>Actions + Reasoning]
    C --> D[Distillation<br/>Training Loop]
    D --> E[Student Agent<br/>Small Model]
    E --> F[Production<br/>Deployment]
    F -->|Hard cases| B
```
The key architectural element is the fallback escalation: when the student agent encounters inputs outside its competence, it routes them to the teacher. This creates a tiered system in which roughly 90% of requests are handled cheaply and only the hard cases incur the full teacher cost.
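A minimal sketch of that escalation, assuming the student returns a confidence score alongside its answer (the two callables and the 0.75 threshold are placeholders you would supply and tune):

```python
from typing import Callable, Tuple


def handle_request(
    query: str,
    run_student: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    run_teacher: Callable[[str], str],
    confidence_threshold: float = 0.75,               # tune on a held-out labeled set
) -> str:
    """Serve from the cheap student; escalate to the teacher when confidence is low."""
    answer, confidence = run_student(query)
    if confidence >= confidence_threshold:
        return answer
    return run_teacher(query)  # hard case: pay the full teacher cost
```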
Pattern: Cascade Distillation
Instead of a single teacher-student pair, arrange agents in a cascade:
- Tier 1 (smallest): Handles common, simple queries
- Tier 2 (medium): Handles moderate complexity
- Tier 3 (largest): Handles the long tail of difficult cases
Each tier is distilled from the one above it, specialized for the difficulty level it serves. This mirrors how organizations work — junior staff handle routine work, escalating to seniors only when needed.
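One way to express the cascade in code (a sketch in which the tier names, thresholds, and the confidence-returning runner are all assumptions):

```python
from typing import Callable, Tuple

# (model name, minimum confidence required to stop at this tier); values are illustrative
TIERS = [
    ("gpt-4o-mini-distilled-v1", 0.80),  # Tier 1: cheapest, most queries stop here
    ("gpt-4o-mini", 0.60),               # Tier 2: moderate complexity
    ("gpt-4o", 0.00),                    # Tier 3: always answers if reached
]


def cascade(query: str, run: Callable[[str, str], Tuple[str, float]]) -> str:
    """Walk the tiers from cheapest to most capable until one is confident enough."""
    answer = ""
    for model, min_confidence in TIERS:
        answer, confidence = run(query, model)  # run(query, model) -> (answer, confidence)
        if confidence >= min_confidence:
            return answer
    return answer  # fall through: return the top tier's answer
```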
Connection to Existing Patterns
- Mixture of Experts: Distillation can create specialized experts for different subdomains
- Semantic Routing: A router decides which tier handles each request (see the companion article on semantic routing)
- Planner-Executor: The teacher can serve as the planner while distilled students serve as fast executors
5. Practical Application
Here’s a concrete example: distilling a large agent’s tool-calling behavior into a smaller model using OpenAI’s API and a simple training loop.
```python
import json
import openai
from dataclasses import dataclass


@dataclass
class DistillationSample:
    user_query: str
    teacher_reasoning: str
    teacher_action: str
    teacher_tool_calls: list[dict]


def collect_teacher_trajectories(
    queries: list[str],
    teacher_model: str = "gpt-4o",
) -> list[DistillationSample]:
    """Run the teacher agent and collect its behavior."""
    client = openai.OpenAI()
    samples = []
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search_database",
                "description": "Search the product database",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "limit": {"type": "integer", "default": 10},
                    },
                    "required": ["query"],
                },
            },
        }
    ]
    for query in queries:
        response = client.chat.completions.create(
            model=teacher_model,
            messages=[
                {"role": "system", "content": "Think step-by-step before acting."},
                {"role": "user", "content": query},
            ],
            tools=tools,
            temperature=0.3,
        )
        msg = response.choices[0].message
        samples.append(
            DistillationSample(
                user_query=query,
                teacher_reasoning=msg.content or "",
                teacher_action=msg.role,
                teacher_tool_calls=[
                    # Store the "arguments" key (a JSON string) so these dicts can be
                    # reused verbatim in the fine-tuning format below.
                    {"name": tc.function.name, "arguments": tc.function.arguments}
                    for tc in (msg.tool_calls or [])
                ],
            )
        )
    return samples


def build_finetuning_dataset(
    samples: list[DistillationSample],
    output_path: str = "distillation_train.jsonl",
):
    """Convert teacher trajectories into fine-tuning format."""
    with open(output_path, "w") as f:
        for sample in samples:
            assistant_msg = {
                "role": "assistant",
                "content": sample.teacher_reasoning,
            }
            # Only emit tool_calls when the teacher actually called a tool,
            # rather than writing an explicit null into the JSONL.
            if sample.teacher_tool_calls:
                assistant_msg["tool_calls"] = [
                    {"id": f"call_{i}", "type": "function", "function": tc}
                    for i, tc in enumerate(sample.teacher_tool_calls)
                ]
            record = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a helpful assistant. Think briefly, then act.",
                    },
                    {"role": "user", "content": sample.user_query},
                    assistant_msg,
                ]
                # If your records contain tool_calls, you may also want a top-level
                # "tools" field here mirroring the teacher's tool definitions.
            }
            f.write(json.dumps(record) + "\n")
    return output_path


# Usage:
# 1. Collect teacher behavior
#    samples = collect_teacher_trajectories(your_query_list)
# 2. Build fine-tuning dataset
#    build_finetuning_dataset(samples)
# 3. Fine-tune a smaller model (e.g., gpt-4o-mini) on this dataset
# 4. Deploy the student model in production with fallback to teacher
```
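If you use OpenAI's managed fine-tuning for step 3, uploading the dataset and starting the job looks roughly like this (a minimal sketch; the student snapshot name is an assumption, so check which models are fine-tunable on your account):

```python
import openai

client = openai.OpenAI()

# Upload the JSONL produced by build_finetuning_dataset()
training_file = client.files.create(
    file=open("distillation_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off fine-tuning of the student model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed fine-tunable snapshot
)

# Poll until the job completes, then deploy the resulting model id
print(client.fine_tuning.jobs.retrieve(job.id).status)
```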
6. Comparisons & Tradeoffs
| Approach | Cost Reduction | Quality Loss | Complexity | Best For |
|---|---|---|---|---|
| Direct distillation (fine-tune small model) | 80-95% | 5-15% | Medium | High-volume production |
| Prompt distillation (optimize prompts for small model) | 60-80% | 10-25% | Low | Quick wins, prototyping |
| Cascade/routing (route easy tasks to small model) | 50-70% | Near zero | Medium | Mixed-difficulty workloads |
| No distillation (use large model everywhere) | 0% | 0% | Low | Low volume, high stakes |
Key Tradeoffs
- Coverage vs. cost: More training data improves the student but costs more to generate
- Specialization vs. generality: A student trained on narrow tasks excels there but fails on novel inputs
- Freshness: When the teacher model updates, student models may need re-distillation
- Reasoning fidelity: Students can mimic actions without truly understanding reasoning — they may fail on edge cases where the reasoning matters most
7. Latest Developments & Research
Distilling Reasoning (2024-2025)
A major trend is distilling not just outputs but chain-of-thought reasoning. OpenAI’s approach with the o1 model family demonstrated that reasoning traces themselves are valuable training signals. Papers like “Orca 2: Teaching Small Language Models How to Reason” (Microsoft, 2023) showed that carefully curated reasoning demonstrations can make 13B-parameter models competitive with much larger ones on specific tasks.
Constitutional AI Distillation (Anthropic, 2024)
Anthropic explored distilling safety behaviors from large models into smaller ones, showing that alignment properties can partially transfer through distillation — though gaps remain in edge cases.
Agent-Specific Distillation Benchmarks
- AgentBench (2024): Evaluates LLM agents, including distilled students, across web, code, and database tasks
- τ-bench (2024): Tests whether distilled agents maintain tool-use accuracy in realistic multi-turn user interactions
- The consistent finding: distilled agents retain 85-95% of teacher performance on in-distribution tasks but degrade on out-of-distribution inputs
Open Problems
- Reasoning collapse: Students sometimes learn to produce reasoning-shaped text without actual reasoning
- Calibration drift: Student confidence scores diverge from teacher calibration
- Multi-step degradation: Errors compound faster in distilled agents over long trajectories
8. Cross-Disciplinary Insight
Knowledge distillation mirrors apprenticeship and institutional knowledge transfer in organizational theory. When a senior engineer leaves a company, their knowledge doesn’t transfer through documentation alone — it transfers through working alongside juniors, showing not just what decisions to make but how to think about problems.
In distributed computing, this maps to caching hierarchies. An L1 cache (student) handles most requests with low latency. Cache misses escalate to L2 (teacher). The system works because most access patterns are predictable — just as most user queries follow common patterns that a distilled model can handle.
From biology, distillation resembles genetic assimilation (the Baldwin effect): behaviors initially learned through expensive individual experience become encoded in simpler, faster mechanisms over generations.
9. Daily Challenge
Exercise: Build a Distillation Quality Monitor
Create a system that compares teacher and student agent outputs to measure distillation quality:
- Run 50 test queries through both a large model (teacher) and a small model (student)
- Compare: Do they select the same tools? Do they produce equivalent reasoning?
- Calculate an “agreement score” and identify the query types where the student diverges most
- Design a routing rule: which queries should be escalated to the teacher?
```python
def measure_distillation_quality(
    queries: list[str],
    teacher_model: str,
    student_model: str,
) -> dict:
    """Compare teacher and student on the same queries."""
    results = {"agreement": 0, "divergences": []}
    for query in queries:
        # run_agent is your own harness: it should return the tool calls,
        # reasoning text, and final answer produced for a query.
        teacher_output = run_agent(query, teacher_model)
        student_output = run_agent(query, student_model)
        # TODO: Compare tool selections
        # TODO: Compare reasoning similarity (use embedding cosine similarity)
        # TODO: Compare final answer equivalence
        # TODO: Track which query categories diverge most
    results["agreement_rate"] = results["agreement"] / len(queries)
    return results
```
Bonus: Implement an automatic escalation threshold — if the student’s confidence on a query is below a certain level, route to the teacher instead.
10. References & Further Reading
Papers
- “Distilling the Knowledge in a Neural Network” (Hinton, Vinyals, Dean, 2015): The foundational distillation paper
- “Orca 2: Teaching Small Language Models How to Reason” (Microsoft, 2023): Reasoning distillation for smaller models
- “Lion: Adversarial Distillation of Closed-Source Large Language Model” (Jiang et al., 2023): Distilling from black-box APIs
- “AgentBench: Evaluating LLMs as Agents” (Liu et al., 2024): Benchmarking agent capabilities including distilled models
Blog Posts & Guides
- “How to Fine-Tune GPT-4o-mini” (OpenAI Cookbook): Practical fine-tuning for distillation
- “Distilling Step-by-Step” (Google Research Blog, 2023): Extracting reasoning as training signal
- “Building Tiered Agent Systems” (LangChain Blog): Cascading agents for cost optimization
GitHub Repositories
- distilabel by Argilla: https://github.com/argilla-io/distilabel — Framework for AI feedback and distillation dataset generation
- LitGPT: https://github.com/Lightning-AI/litgpt — Lightweight models suitable as distillation students
- OpenRLHF: https://github.com/OpenRLHF/OpenRLHF — Open-source RLHF and distillation framework
Key Takeaways
- Distillation is the bridge between capable-but-expensive and fast-but-cheap agents
- Soft labels matter: The teacher’s probability distribution over wrong answers carries structural knowledge
- Distill reasoning, not just answers: Chain-of-thought traces are powerful training signals
- Build fallback paths: Always route hard cases back to the teacher
- Monitor drift: Distilled models degrade on out-of-distribution inputs — track agreement rates
- Think in tiers: Production systems benefit from cascaded agents at different capability levels
- Re-distill when teachers update: Student models grow stale when their teacher evolves
Knowledge distillation transforms AI agent deployment from an all-or-nothing capability choice into an engineering optimization problem — and that’s where real systems get built.