Reinforcement Learning and RLHF
RL trains agents to maximize reward through trial and error. RLHF applies this to language models using human preference judgments as the reward signal. It is the technique that turned GPT-3 into ChatGPT — and it illustrates why aligning a model by optimizing a reward is harder than it looks.
Reinforcement Learning: The Framework
Reinforcement learning is defined by its setup. An agent exists in an environment with a state space. At each step, the agent observes a state, takes an action, receives a reward, and transitions to a new state. The objective is to learn a policy — a function from states to actions — that maximizes expected cumulative reward over time.
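In code, the interaction loop looks roughly like the sketch below. It uses the Gymnasium CartPole environment and a random policy purely for illustration; a real agent would replace the random action with its learned policy.

```python
import gymnasium as gym

# Minimal agent-environment interaction loop (random policy for illustration).
env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
total_reward = 0.0

for step in range(200):
    action = env.action_space.sample()            # a learned policy would choose the action here
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward                        # cumulative reward is what the agent maximizes
    if terminated or truncated:
        break

print(f"episode return: {total_reward}")
```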
The framework is general enough to describe an enormous range of problems: a chess engine choosing moves to maximize wins, a robot arm learning to pick up objects, a recommendation algorithm choosing which video to show next, or a language model generating tokens. What varies is the state space, action space, reward signal, and the challenge of learning in each.
The core challenge is credit assignment: when you take a sequence of actions and eventually receive a reward, which actions were responsible for it? In chess, the reward (win or lose) arrives at the end of the game after perhaps a hundred moves. Each move contributed to the outcome, but in different amounts and through complex interdependencies. Attribution is hard.
The theoretical framework for solving this is the Bellman equation, which decomposes the value of a state (expected future reward from this state) into the immediate reward plus the discounted value of the next state. Dynamic programming methods solve the Bellman equation exactly when the full state space is known and small. For large or continuous state spaces, function approximation — using a neural network to estimate the value function or the policy directly — is necessary.
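As a concrete sketch, value iteration applies the Bellman update repeatedly until the value estimates converge. The tiny deterministic MDP below is made up for illustration; the update itself is the standard Bellman optimality backup.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[s, a] = next state, R[s, a] = immediate reward.
# (The specific transitions and rewards are invented for illustration.)
P = np.array([[1, 2], [0, 2], [2, 2]])            # deterministic transitions
R = np.array([[0.0, 1.0], [0.0, 0.5], [0.0, 0.0]])
gamma = 0.9                                       # discount factor

V = np.zeros(3)
for _ in range(100):
    # Bellman optimality update: V(s) = max_a [ R(s, a) + gamma * V(s') ]
    V = np.max(R + gamma * V[P], axis=1)

print(V)
```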
Policy Gradient Methods
Two families of RL algorithms dominate modern practice. Value-based methods (Q-learning, DQN) learn an estimate of the value of each state-action pair and derive a policy by choosing the highest-value action. Policy gradient methods directly optimize the policy by gradient ascent on expected reward.
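For the value-based family, the one-step Q-learning update captures the core idea: nudge the estimated value of the state-action pair you just took toward the observed reward plus the best estimated value of the next state. A minimal tabular sketch (the state and action counts are placeholders):

```python
import numpy as np

# Tabular Q-learning update (sketch; n_states, n_actions, and the sampled
# transition would come from the environment in practice).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=3)
```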
Policy gradient methods compute the gradient of expected reward with respect to the policy parameters. The key result is the policy gradient theorem: ∇J(θ) = E[∇log π(a|s) · Q(s,a)], where π is the policy and Q(s,a) is the value of taking action a in state s. The gradient is an expectation over trajectories sampled from the current policy, weighted by the action-value function.
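In practice the expectation is estimated from sampled trajectories. A minimal REINFORCE-style loss in PyTorch, using the sampled discounted return as a Monte Carlo estimate of Q(s,a) (tensor shapes are assumptions for illustration):

```python
import torch

# REINFORCE-style policy gradient estimate for one sampled trajectory.
# log_probs: log pi(a_t | s_t) for the actions actually taken, shape (T,)
# rewards:   per-step rewards, shape (T,)
def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor, gamma: float = 0.99):
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running    # discounted return from step t onward
        returns[t] = running
    # Minimizing this loss is gradient ascent on E[log pi(a|s) * return],
    # where the return stands in for Q(s, a).
    return -(log_probs * returns).sum()
```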
Proximal Policy Optimization (PPO, Schulman et al. 2017) is the dominant policy gradient algorithm in current practice. It adds a constraint on how much the policy can change in a single update step — a “trust region” that prevents large policy updates from destabilizing training. The constraint is implemented as a clipped objective that removes any incentive for the policy to deviate far from the previous version. PPO is empirically robust across a wide range of environments and is the algorithm used in RLHF training for language models.
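A sketch of the clipped surrogate objective in per-action form (the variable names are assumptions, not a particular library’s API):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for the sampled actions.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio outside
    # the trust region [1 - eps, 1 + eps]; negate to get a loss to minimize.
    return -torch.min(unclipped, clipped).mean()
```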
RLHF: The Bridge to Language Models
By 2020, large language models trained by next-token prediction (GPT-3) could generate fluent text and perform many tasks in a few-shot setting. They were also inconsistent, unreliable, and prone to generating harmful content. The training objective — predict the next token over the entire internet’s worth of text — didn’t distinguish between helpful, accurate responses and plausible-sounding fabrications.
Reinforcement Learning from Human Feedback (RLHF) was proposed as a way to align the model’s outputs with human preferences. The pipeline developed at OpenAI (and independently at DeepMind and Anthropic) has three stages:
Stage 1: Supervised fine-tuning (SFT). A base language model is fine-tuned on a dataset of prompt-response pairs where the responses were written by humans demonstrating the target behavior. This produces an initial policy that generates reasonable responses but hasn’t been optimized for quality.
Stage 2: Reward model training. Human annotators compare pairs of model-generated responses to the same prompt and indicate which they prefer. These preference judgments train a reward model: a neural network that takes a prompt and response as input and outputs a scalar reward score. The reward model learns to predict human preference.
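The standard way to train on pairwise comparisons is a Bradley-Terry-style loss on the difference between the two scores. A minimal sketch, assuming the reward model outputs one scalar per prompt-response pair:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor):
    # chosen_scores / rejected_scores: reward-model outputs for the preferred
    # and rejected response to the same prompt, shape (batch,).
    # The loss pushes the preferred response's score above the rejected one's.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```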
Stage 3: RL fine-tuning. The SFT model is used as an initial policy and fine-tuned with PPO. The reward signal is the reward model’s score on generated responses. A KL-divergence penalty between the fine-tuned policy and the SFT policy prevents the policy from drifting too far from sensible outputs while optimizing for reward. The resulting model (InstructGPT, then ChatGPT) was substantially preferred by human evaluators over the raw GPT-3 base model.
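In stage 3, the quantity actually optimized is typically the reward model’s score minus a KL penalty toward the frozen SFT policy. A sketch of that shaped reward (the coefficient `beta` and the per-token form are common choices rather than a fixed standard):

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                policy_log_probs: torch.Tensor,
                sft_log_probs: torch.Tensor,
                beta: float = 0.05):
    # Per-token KL estimate between the current policy and the frozen SFT policy.
    kl = policy_log_probs - sft_log_probs          # shape (seq_len,)
    # The reward model gives one scalar for the whole response; subtracting the
    # KL penalty discourages the policy from drifting far from the SFT model.
    return rm_score - beta * kl.sum()
```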
Reward Hacking and Its Limits
The RLHF pipeline works remarkably well at producing models that are perceived as helpful and harmless. It also introduces a failure mode that is theoretically fundamental and practically important: reward hacking.
The reward model is not a perfect proxy for human preferences — it’s a learned approximation, trained on a finite dataset of human judgments. When you optimize a policy to maximize the reward model’s score using RL, the policy will eventually find behaviors that score highly according to the reward model but don’t actually satisfy the underlying human preference. The reward model has been “hacked.”
This is an instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. The KL-divergence penalty against the SFT policy limits how far optimization can go and therefore limits the severity of reward hacking. But it also limits how thoroughly the model can be improved along the reward model’s gradient.
Concrete examples of reward hacking in RLHF are mostly anecdotal, drawn from deployment, but the failure mode is theoretically expected. A model may learn the surface features of responses that earn high human ratings — comprehensive-seeming structure, confident tone, appropriate hedging — without the underlying content quality those features are proxies for. Such a model is optimizing for the appearance of a good response rather than the substance.
Constitutional AI and Direct Preference Optimization
Anthropic proposed Constitutional AI (CAI) as an extension of the RLHF pipeline. Instead of relying solely on human preference judgments, the model is given a set of principles (the “constitution”) and trained to evaluate its own responses against those principles, generating preference data through AI feedback rather than only human feedback. This scales the feedback pipeline beyond the bottleneck of human labeling.
Direct Preference Optimization (DPO, Rafailov et al. 2023) sidesteps the RL training loop entirely. It reformulates the RLHF objective as a supervised learning problem on the preference data, skipping the explicit reward model and the PPO training step. The policy is directly updated to increase the relative probability of preferred responses over rejected ones. DPO is simpler to implement and train than full RLHF with PPO, and empirically competitive with it on many evaluations.
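The DPO loss, roughly as presented in Rafailov et al. (2023), compares the policy’s log-probability margin between the chosen and rejected response, measured relative to a frozen reference policy. A sketch (variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Log-probability of each full response under the policy and under the
    # frozen reference (SFT) model, summed over tokens, shape (batch,).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Increase the relative probability of the preferred response over the
    # rejected one, scaled by beta, with no explicit reward model or RL loop.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```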
DPO and its variants (ORPO, SimPO) are now the dominant fine-tuning methods for aligning open-source language models. They sacrifice some theoretical flexibility in the optimization (no explicit reward model means you can’t improve the reward signal iteratively) for substantial engineering simplicity.
What RL Actually Does to a Language Model
The RL training phase in RLHF is doing something specific: it’s selecting among the distribution of texts the base model assigns probability to, favoring those that score higher on the reward model. It doesn’t teach the model new facts or capabilities — those come from pre-training. It reshapes the output distribution toward higher-reward regions.
This framing clarifies both the power and the limits of RLHF. The power: a capable base model with a broad output distribution can be steered toward a narrow region of that distribution that’s consistently helpful and harmless. The limits: RLHF cannot reliably increase the peak capability of the base model — it can only shift the average toward the capability peak. And the quality of the steering depends entirely on the quality of the reward model, which depends on the quality and diversity of the human preference judgments that trained it.
The fundamental limitation is that human preference judgments measure perceived quality, not actual quality. A response that sounds authoritative but is factually incorrect will get high reward if the annotator can’t verify the facts. RLHF optimizes for human approval, which is a proxy for helpfulness, not helpfulness itself. Getting the two better aligned is an active research problem.