Master reinforcement learning from human feedback — the technique behind ChatGPT and modern aligned agents — from reward modeling and PPO to Direct Preference Optimization.
Learn how inverse reinforcement learning lets AI agents recover hidden reward functions by observing expert behavior, and why it matters for agent alignment and autonomous systems.