Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

RLHF and Preference Learning: Teaching Agents What Humans Actually Want

01 Mar 2026

The gap between measurable objectives and human intent is what Reinforcement Learning from Human Feedback (RLHF) was designed to close. It is the core training technique behind ChatGPT, Claude, Gemini, and nearly every modern aligned language model. A model can be grammatically correct and factually accurate while still being too verbose, too hedged, or simply unpleasant to use. RLHF addresses that problem directly.

Concept Introduction

A human evaluator is shown two agent responses and picks which is better. A reward model learns to predict these preferences. Then the agent is trained to produce outputs that score highly on the reward model, while not drifting too far from its original behavior.

The full RLHF pipeline has three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune a base LLM on high-quality demonstrations to get a well-behaved starting point.
  2. Reward Model Training: Train a separate model $r_\phi(x, y)$ to predict which of two completions $y_w$ (winner) vs $y_l$ (loser) a human prefers, given prompt $x$.
  3. RL Fine-Tuning: Use the reward model as a proxy reward signal to optimize the policy $\pi_\theta$ with PPO, while adding a KL-divergence penalty against the SFT model to prevent reward hacking.

Historical & Theoretical Context

The idea of learning reward functions from human feedback predates LLMs by decades. Key milestones include Christiano et al. (2017), which trained deep RL agents on Atari and MuJoCo tasks from pairwise human preferences; Stiennon et al. (2020), which applied the recipe to summarization; and Ouyang et al. (2022), whose InstructGPT established the three-stage pipeline behind ChatGPT.

The theoretical underpinning comes from utility theory and the Bradley-Terry model of paired comparisons, a statistical framework developed in the 1950s.

Algorithms & Math

The Bradley-Terry Preference Model

Given a prompt $x$ and two completions $y_w$ and $y_l$, humans prefer $y_w$ with probability:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(r^*(x, y_w) - r^*(x, y_l)\right)$$

where $r^*(x, y)$ is a latent true reward and $\sigma$ is the sigmoid function. We can’t observe $r^*$, so we train a parametric reward model $r_\phi$ by minimizing the negative log-likelihood over a dataset $\mathcal{D}$ of human preferences:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
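To make the loss concrete, here is a minimal PyTorch sketch of the Bradley-Terry probability and the reward-model objective. The reward values are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical scalar rewards the reward model assigned to three pairs
r_chosen = torch.tensor([2.1, 0.5, -0.3])
r_rejected = torch.tensor([1.0, 0.9, -1.5])

# Bradley-Terry: P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
p_prefer = torch.sigmoid(r_chosen - r_rejected)

# Reward-model loss: negative log-likelihood of the observed preferences
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
```

Note that the second pair contributes the largest loss term: the model scores the rejected completion higher, so its predicted preference probability falls below 0.5.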

PPO with KL Penalty

Once we have $r_\phi$, we optimize the policy with a KL-penalized objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y)\right] - \beta \cdot D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid x) \;\|\; \pi_{\text{SFT}}(\cdot \mid x)\right)$$

The KL term (weighted by $\beta$) prevents the model from exploiting the reward model in degenerate ways, such as generating gibberish that scores high but isn’t actually good.
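A sketch of how the penalty is typically folded into the reward PPO sees, per sampled response. All numbers here are illustrative, including the β value:

```python
import torch

beta = 0.1  # KL penalty weight (illustrative value)

# Sequence log-probabilities of two sampled responses under the policy
# and under the frozen SFT reference (made-up numbers)
logp_policy = torch.tensor([-12.0, -8.0])
logp_sft = torch.tensor([-14.0, -30.0])

# Raw reward-model scores for the same two responses
rm_scores = torch.tensor([1.5, 3.0])

# Per-sample KL estimate: log pi_theta(y|x) - log pi_SFT(y|x)
kl_estimate = logp_policy - logp_sft

# Shaped reward: a high RM score is discounted when the policy has
# drifted far from the SFT reference
shaped_reward = rm_scores - beta * kl_estimate
```

The second response has the higher raw score, but it is wildly more probable under the policy than under the reference, so after the KL penalty it ends up worse than the first. That is exactly the reward-hacking pressure the penalty exists to suppress.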

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) showed that the optimal solution to the KL-penalized RL objective can be written analytically, giving a supervised loss directly over preferences:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

No reward model. No RL loop. Just fine-tuning.

Pseudocode for the RLHF Pipeline

# Stage 1: Supervised Fine-Tuning
sft_model = fine_tune(base_model, demonstrations)

# Stage 2: Reward Model Training
reward_model = RM(sft_model)  # Initialize from SFT
for (prompt, chosen, rejected) in preference_dataset:
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    loss = -log(sigmoid(r_chosen - r_rejected))
    update(reward_model, loss)

# Stage 3: PPO Fine-Tuning
policy = copy(sft_model)
ref_policy = copy(sft_model)  # Frozen reference
for batch in prompts:
    responses = policy.generate(batch)
    rewards = reward_model(batch, responses)
    kl = kl_divergence(policy, ref_policy, batch)
    ppo_objective = rewards - beta * kl
    update_via_ppo(policy, ppo_objective)

Design Patterns & Architectures

RLHF slots into agent architectures in several ways:

graph TD
    A[Human Annotators] -->|Pairwise preferences| B[Preference Dataset]
    B --> C[Reward Model Training]
    C --> D[Reward Model r_φ]
    E[SFT Model π_SFT] --> F[PPO Fine-Tuning]
    D --> F
    F --> G[Aligned Policy π_θ]
    G -->|Outputs| A
    style G fill:#2d6a4f,color:#fff
  

RLHF as a feedback loop: The aligned policy generates outputs, which enter a human evaluation loop, which improves the reward model, which in turn improves the policy.

Integration with agent frameworks: In agentic settings, RLHF typically trains the underlying LLM that powers the agent’s reasoning. Recent work applies preference learning directly to trajectories (entire sequences of agent actions) rather than single responses.

Practical Application

Here’s a minimal DPO training loop using Hugging Face’s trl library:

from datasets import Dataset
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset: each row has a prompt, chosen, and rejected response
preference_data = [
    {
        "prompt": "Explain recursion to a 10-year-old.",
        "chosen": "Imagine a Russian doll — each doll contains a smaller version of itself. Recursion works the same way: a function calls itself on a smaller problem until it's simple enough to solve directly.",
        "rejected": "Recursion is when a function calls itself. It requires a base case to terminate the recursive calls and prevent stack overflow.",
    },
    {
        "prompt": "What is the capital of France?",
        "chosen": "Paris is the capital of France.",
        "rejected": "The capital city of France, a country located in Western Europe, is Paris, which is also its largest city.",
    },
]

dataset = Dataset.from_list(preference_data)

training_args = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    beta=0.1,           # KL penalty weight
    loss_type="sigmoid", # Standard DPO loss
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()

For agentic trajectory-level preference learning, you’d build preference pairs over full action sequences:

# Trajectory-level preference dataset
trajectory_preferences = [
    {
        "prompt": "Book me a flight to Tokyo for next week.",
        "chosen": [
            {"tool": "search_flights", "args": {"destination": "Tokyo", "date": "2026-03-08"}},
            {"tool": "filter_results", "args": {"max_price": 1200, "direct_only": False}},
            {"tool": "book_flight", "args": {"flight_id": "NH106"}},
        ],
        "rejected": [
            {"tool": "search_hotels", "args": {"city": "Tokyo"}},  # Wrong order
            {"tool": "search_flights", "args": {"destination": "Tokyo"}},
        ],
    }
]

Latest Developments & Research

SimPO (Simple Preference Optimization, 2024) eliminates the reference model while using average log-probability as an implicit reward, outperforming DPO on AlpacaEval 2 and MT-Bench.
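As I understand the paper, SimPO replaces the reference-model log-ratios with length-normalized log-probabilities plus a target margin γ. A sketch, with hyperparameter defaults and toy numbers that are purely illustrative:

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO sketch: length-normalized log-probability acts as the
    implicit reward; gamma is a target reward margin. No reference model."""
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()

# Toy sequence log-probs and lengths (illustrative numbers)
loss = simpo_loss(torch.tensor([-10.0]), torch.tensor([-30.0]),
                  len_chosen=20, len_rejected=30)
```

The length normalization matters: without it, preference optimizers tend to reward sheer verbosity, since longer sequences accumulate more log-probability mass.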

Online DPO (Guo et al., 2024) uses the model’s own current generations as the source of preference pairs rather than a fixed dataset, closing the distribution shift gap that hurts offline DPO.

SPIN (Self-Play Fine-Tuning, 2024) frames alignment as a two-player game: the current policy tries to fool a discriminator that distinguishes model outputs from human demonstrations. No human preference labels required.

Process Reward Models (PRMs) (Lightman et al., OpenAI, 2023): Instead of rating full outputs, reward each reasoning step separately. Critical for math and code, where a correct final answer can follow from flawed intermediate steps.
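The outcome-vs-process distinction fits in a few lines. Taking the minimum over step scores is one of the aggregation choices explored in that line of work; the values below are made up:

```python
# Outcome reward model (ORM): one scalar for the whole solution
outcome_reward = 1.0  # the final answer happened to be correct

# Process reward model (PRM): one score per reasoning step
# (illustrative values; step 2 contains a flawed derivation)
step_rewards = [0.9, 0.2, 0.8, 0.95]

# Aggregating step scores with min surfaces the flawed step,
# even though the outcome-level score looks perfect
prm_score = min(step_rewards)
```

This is why PRMs matter for math and code: the ORM would happily reward the lucky answer, while the PRM flags the broken step.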

Trajectory-level RLHF for agents: Recent work (e.g., AgentTuning, 2024) applies preference learning to full agent trajectories collected from real environments, teaching agents better tool-use and planning behaviors rather than just response style.

Cross-Disciplinary Insight

RLHF is deeply rooted in psychometrics and social choice theory. The Bradley-Terry model was introduced by R. A. Bradley and M. E. Terry in 1952 for analyzing paired-comparison experiments; the Thurstone model (1927) preceded it with a similar structure for psychological scaling.

In economics, this mirrors revealed preference theory (Samuelson, 1938): you can’t directly observe utility, but you can infer it from choices. RLHF operationalizes this: instead of asking “what do you want?”, we observe “which of these two do you prefer?”, a much more reliable signal.

The instability of RL fine-tuning also echoes control theory: high-gain feedback loops amplify noise, and the KL penalty acts as a stabilizing damping term. Any system that maximizes a proxy of a true objective will eventually diverge. This is Goodhart’s Law, and it applies directly here.

Daily Challenge

Exercise: Build a Mini Preference Dataset

Pick any task where subjective quality matters (writing style, explanation clarity, code readability). Generate 10–20 pairs of responses from a small model (e.g., Qwen2.5-1.5B or Phi-3-mini) and manually annotate your preferences. Then:

  1. Calculate the inter-annotator agreement if you have a friend annotate the same pairs. You’ll be surprised how often humans disagree.
  2. Train a simple reward model: a small classifier that takes the concatenated [prompt + response] as input and predicts your preference score.
  3. Thought experiment: If you used your reward model to generate more training data automatically (RLAIF), what biases might compound?
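A minimal sketch of step 2, with a toy embedding model standing in for an LLM backbone. All names, sizes, and the synthetic data are made up; in practice you would initialize from a pretrained encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyRewardModel(nn.Module):
    """Toy reward model: mean-pooled token embeddings -> scalar score."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.head(pooled).squeeze(-1)  # one scalar per sequence

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Fake tokenized [prompt + response] pairs: chosen vs rejected
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))

for _ in range(50):
    # Bradley-Terry loss over the preference pairs
    loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even this toy version makes the key failure mode visible: with so few pairs, the model memorizes surface features of your annotations rather than anything like "quality."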

Bonus: Implement the DPO loss from scratch:

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """
    logp_*: log-probabilities of the chosen/rejected response under the policy
    ref_logp_*: log-probabilities under the frozen reference model
    """
    ratio_chosen = logp_chosen - ref_logp_chosen        # implicit reward, chosen
    ratio_rejected = logp_rejected - ref_logp_rejected  # implicit reward, rejected
    loss = -F.logsigmoid(beta * (ratio_chosen - ratio_rejected))
    return loss.mean()

Run it with a few synthetic examples and observe how the loss changes as you vary $\beta$.
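For instance, a quick sweep over β (the loss is restated here so the snippet runs on its own; the log-probabilities are synthetic):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    ratio_chosen = logp_chosen - ref_logp_chosen
    ratio_rejected = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

# Synthetic log-probs: the policy already favors the chosen response
# relative to the reference (made-up numbers)
lp_c, lp_r = torch.tensor([-10.0]), torch.tensor([-12.0])
ref_c, ref_r = torch.tensor([-11.0]), torch.tensor([-11.0])

losses = {b: dpo_loss(lp_c, lp_r, ref_c, ref_r, beta=b).item()
          for b in (0.01, 0.1, 1.0)}
```

When the policy's implicit rewards already agree with the preference, a larger β drives the loss toward zero faster; a small β keeps the loss near log 2 and the gradient gentle.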


