
DPO — Paper-to-Code Mock Interview

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., 2023. arXiv: 2305.18290

Format: Read (~15 min) → explain the real benefit → implement the core idea in Colab → sanity-check it.

Companion notebook: dpo_mock.ipynb (download) — a preference-alignment toy task + a dpo_loss stub to fill in, plus verification cells. Open in Google Colab via File → Upload notebook. A reference solution is included at the bottom of this page.

Difficulty: 🟡🔴 Medium-hard. The loss is short, but the conceptual leap (why this replaces a whole RL pipeline) is what’s tested.


How to run this as a timed drill (~60 min)

| Time | Block | What you produce |
| --- | --- | --- |
| 0:00–0:15 | Read (use the three-pass method) | The DPO loss + why it removes the reward model and the RL loop |
| 0:15–0:20 | Explain the benefit out loud (cover Part 2) | The “policy is secretly a reward model” reparameterization |
| 0:20–0:50 | Implement from the stub (Part 3) | A working dpo_loss + a policy whose implicit reward orders preferences |
| last 10 min | Sanity-check (Part 4) | All 6 checks passing, narrated out loud |

Self-grading rubric — “what good looks like”

  • ✅ Explained DPO as RLHF without a separate reward model and without RL/PPO — a single classification-style loss on preference pairs.
  • ✅ Knew the loss is on log-prob differences relative to a frozen reference, not on raw log-probs.
  • ✅ Could state the implicit reward r(y) = β·log(πθ(y|x)/πref(y|x)) and why ordering by it recovers the preferences.
  • ✅ Knew the reference policy is frozen and why (it anchors the KL constraint).
  • ⚠️ Red flags: describing a PPO loop, training a reward model, forgetting the reference term, claiming DPO needs online sampling (it’s offline on a fixed preference dataset).

Standard RLHF aligns a language model in three stages: supervised fine-tuning (SFT), training a separate reward model on human preference pairs, and optimizing the policy against that reward with RL (PPO) plus a KL penalty to a reference model. That pipeline is fiddly: reward-model overfitting, unstable on-policy sampling, and many hyperparameters. DPO collapses the last two stages into one supervised loss:

  • No reward model — the policy’s own log-probs (relative to a frozen reference) are the implicit reward.
  • No RL loop — no PPO, no online rollouts, no value network. It’s a plain, stable, offline classification-style objective over preferred-vs-dispreferred pairs.
  • Same optimum as the KL-regularized RLHF objective it replaces (assuming the Bradley–Terry preference model), but far simpler and cheaper to run.

The core idea (Method — you implement this)


RLHF maximizes expected reward minus a KL penalty to the reference policy. That constrained problem has a closed-form optimal policy: $\pi^*(y\mid x) \propto \pi_{\text{ref}}(y\mid x)\,\exp\!\big(\tfrac{1}{\beta} r(x,y)\big)$. Invert it and the reward is expressible through the policy itself:

$$r(x, y) = \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
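
The inversion is a short piece of algebra. Writing the normalizer explicitly as $Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\exp\!\big(\tfrac{1}{\beta} r(x,y)\big)$ (the standard definition, assumed here since the text only names it), take logs of the closed form and solve for $r$:

$$
\begin{aligned}
\pi^*(y\mid x) &= \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\big(\tfrac{1}{\beta} r(x,y)\big) \\
\log \pi^*(y\mid x) &= \log \pi_{\text{ref}}(y\mid x) + \tfrac{1}{\beta}\, r(x,y) - \log Z(x) \\
r(x,y) &= \beta \log \frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x)
\end{aligned}
$$

DPO then parameterizes $\pi^*$ by the trainable policy $\pi_\theta$, which gives the same expression with $\pi_\theta$ in place of $\pi^*$.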

Plug that into the Bradley–Terry preference model $P(y_w \succ y_l) = \sigma\big(r(x,y_w) - r(x,y_l)\big)$. The partition term $Z(x)$ cancels (it’s the same for both responses), leaving a loss with no reward model and no RL. For a preferred response $y_w$ and dispreferred $y_l$:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\Big( \beta \big[ \big(\log \pi_\theta(y_w\mid x) - \log \pi_{\text{ref}}(y_w\mid x)\big) - \big(\log \pi_\theta(y_l\mid x) - \log \pi_{\text{ref}}(y_l\mid x)\big) \big] \Big)$$
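
To see the cancellation term by term, substitute the reparameterized reward for both responses. The two $\beta \log Z(x)$ terms enter with opposite signs and drop out:

$$
\begin{aligned}
r(x,y_w) - r(x,y_l)
&= \Big[\beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} + \beta \log Z(x)\Big]
 - \Big[\beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} + \beta \log Z(x)\Big] \\
&= \beta \Big[\big(\log \pi_\theta(y_w\mid x) - \log \pi_{\text{ref}}(y_w\mid x)\big)
 - \big(\log \pi_\theta(y_l\mid x) - \log \pi_{\text{ref}}(y_l\mid x)\big)\Big]
\end{aligned}
$$

Taking $-\log \sigma(\cdot)$ of this difference is the negative log-likelihood of the observed preference under Bradley–Terry, which is exactly the DPO loss.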

Key details (the things an interviewer probes):

  • It’s a logistic-regression-style loss on the difference of log-ratios. You push the policy to raise the log-prob of $y_w$ and lower that of $y_l$, relative to the frozen reference.
  • β controls how far the policy may drift from the reference (the KL strength). Larger β = stay closer to ref.
  • The reference πref is frozen (usually the SFT model). It anchors the KL constraint; if you let it move, you lose the regularizer.
  • The implicit reward is $r(x,y) = \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ (the $\beta \log Z(x)$ term is dropped because it is constant per prompt and never affects a comparison). After training, ranking responses by this implicit reward should match the human preferences: “your language model is secretly a reward model.”
  • Offline & stable: trains on a fixed dataset of $(x, y_w, y_l)$ triples, with no sampling from the current policy.
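
A small numeric sketch can make the first and fourth points concrete. It uses plain Python floats and made-up log-probs; the helper names `sigmoid`, `bt_prob`, and `implicit_reward_scalar` are illustrative and not part of the notebook. It computes the implicit rewards for one pair, the Bradley–Terry preference probability, and shows that adding the same per-prompt constant to both rewards (standing in for the β·log Z(x) term) leaves the probability unchanged.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_prob(reward_w, reward_l):
    """Bradley-Terry probability that y_w is preferred over y_l."""
    return sigmoid(reward_w - reward_l)


def implicit_reward_scalar(logp_pol, logp_ref, beta):
    """beta * log(pi_theta / pi_ref) for a single response."""
    return beta * (logp_pol - logp_ref)


beta = 0.1
# Hypothetical per-sequence log-probs for one prompt (illustrative numbers only).
logp_pol_w, logp_ref_w = -12.0, -14.0  # policy raised y_w relative to the reference
logp_pol_l, logp_ref_l = -15.0, -13.0  # policy lowered y_l relative to the reference

r_w = implicit_reward_scalar(logp_pol_w, logp_ref_w, beta)  # 0.1 * 2.0
r_l = implicit_reward_scalar(logp_pol_l, logp_ref_l, beta)  # 0.1 * -2.0
p = bt_prob(r_w, r_l)                                       # sigma(0.4)

# Add the same per-prompt constant to both rewards: the difference is unchanged.
c = 3.7
p_shifted = bt_prob(r_w + c, r_l + c)

print(f"P(y_w > y_l) = {p:.4f}, after shift = {p_shifted:.4f}, loss = {-math.log(p):.4f}")
```

Because only the difference of rewards enters the sigmoid, any quantity that depends on the prompt alone cannot change the loss.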

Where the evidence lives (tables/figures that matter)


(Hedge on exact numbers — quote the shape of the result, not memorized digits.)

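  • Sentiment / summarization / dialogue experiments: DPO matches or beats PPO-based RLHF on all three (on the reward-vs-KL frontier for sentiment generation, and on GPT-4-judged win rates for summarization and dialogue). Same alignment, simpler method; this is the core benefit claim.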
  • Reward-vs-KL frontier figure: DPO reaches higher reward at the same KL divergence from the reference → it’s not just simpler, it’s competitive on the actual tradeoff.
  • Stability/ablations: DPO is less sensitive to the sampling temperature and hyperparameters that make PPO finicky.

Limitations (what the method does not fix):

  • Needs a good reference & a preference dataset: it’s offline, so it can only exploit the pairs you give it; no exploration of new responses.
  • Can over-optimize / degenerate if β is too small (policy drifts far from ref) — you still need the KL anchor.
  • Distribution shift: because it never samples on-policy, the preference data must cover responses near where the policy ends up; off-distribution pairs help less than in online RLHF.
  • Length / reward-hacking biases in the preference data get baked in directly, just as they would into a reward model.

Part 2 — The interview dialogue (interviewer ⇄ interviewee)


🧑‍💼 Interviewer: One paragraph — what does DPO actually buy me over RLHF?

🧑‍💻 Interviewee: It collapses two of the three RLHF stages into one. Classic RLHF trains a separate reward model on preference pairs, then runs PPO to optimize the policy against it with a KL penalty. DPO shows that the optimal RLHF policy has a closed form, which lets you express the reward through the policy’s own log-ratio against a frozen reference. Substituting that into the Bradley–Terry preference likelihood gives a single supervised loss, basically logistic regression on preferred-vs-dispreferred pairs. No reward model, no RL loop, no online sampling. Same objective, much simpler and more stable.

🧑‍💼 Interviewer: Write the loss. What’s actually being compared?

🧑‍💻 Interviewee: L = -log σ(β·[(logπθ(y_w|x) − logπref(y_w|x)) − (logπθ(y_l|x) − logπref(y_l|x))]). For each response I take the log-ratio of policy to reference — that’s the implicit reward up to a constant. I take the chosen response’s log-ratio minus the rejected one’s, scale by β, and push it through a log-sigmoid. So I’m maximizing the margin by which the policy prefers y_w over y_l more than the reference does.

🧑‍💼 Interviewer: Where did the partition function / reward model go?

🧑‍💻 Interviewee: The closed-form optimal policy has a per-prompt normalizer Z(x). In the preference model you only ever take a difference of rewards for two responses to the same prompt, so Z(x) is identical in both terms and cancels. That’s the trick — it’s why you never have to compute or train the reward; the policy’s relative log-probs carry it.

🧑‍💼 Interviewer: Why is the reference policy frozen, and what does β do?

🧑‍💻 Interviewee: The reference is the KL anchor — the original RLHF objective penalizes drift from it, and that penalty becomes the −logπref terms. If I let the reference move, I lose the regularizer and the policy can collapse. β is the KL strength: large β keeps the policy close to the reference; small β lets it drift further and chase the preferences harder, at the risk of degenerating.

🧑‍💼 Interviewer: Implement it and show the policy’s implicit reward recovers a known preference ranking — with no reward model trained.


The whole method is a few lines: take per-sequence log-probs under the policy and the frozen reference for the chosen and rejected responses, form the difference of log-ratios, and pass it through -logsigmoid.

import torch
import torch.nn.functional as F


def dpo_loss(logp_pol_chosen, logp_pol_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) preference pairs.

    Each argument is a per-sequence log-prob log pi(y|x): the chosen/rejected
    response under the policy and under the FROZEN reference. No reward model.
    """
    pol_logratio = logp_pol_chosen - logp_pol_rejected   # policy's relative pref
    ref_logratio = logp_ref_chosen - logp_ref_rejected   # reference's relative pref
    logits = beta * (pol_logratio - ref_logratio)        # margin over the reference
    return -F.logsigmoid(logits).mean()


def implicit_reward(logp_pol, logp_ref, beta=0.1):
    """The reward DPO implicitly optimizes: r(y) = beta * log(pi_theta / pi_ref)."""
    return beta * (logp_pol - logp_ref)

  • logp_pol_chosen - logp_pol_rejected: the policy’s log-odds of preferring chosen over rejected. This is what we push up.
  • logp_ref_chosen - logp_ref_rejected: the same quantity under the frozen reference. We optimize the margin over the reference, not the raw policy preference; that is the KL anchor showing up.
  • beta * (...): scales the margin by the KL strength β. At init (policy == reference) the bracket is 0, so logits == 0 and loss == -log σ(0) == log 2.
  • -F.logsigmoid(logits): the Bradley–Terry NLL, computed with a numerically stable log-sigmoid (don’t write -log(sigmoid(...))).
  • implicit_reward: never used in the loss directly; it’s how we read off the learned reward afterward to check the ordering. No separate reward model is ever trained.
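
The warning about -log(sigmoid(...)) is easy to demonstrate. The sketch below uses plain Python floats rather than torch tensors, and the two function names are illustrative, not from the notebook. The naive form overflows for a badly mis-ordered pair, while the rearranged form (the softplus identity, which is the kind of rewrite a stable log-sigmoid relies on) stays finite.

```python
import math


def neg_log_sigmoid_naive(z):
    """-log(sigmoid(z)) written literally; breaks for large negative z."""
    return -math.log(1.0 / (1.0 + math.exp(-z)))


def neg_log_sigmoid_stable(z):
    """Same quantity as softplus(-z) = max(-z, 0) + log1p(exp(-|z|))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Moderate logits: both forms agree closely.
for z in (0.0, 5.0, -5.0):
    print(z, neg_log_sigmoid_naive(z), neg_log_sigmoid_stable(z))

# Extreme logit (a pair the policy gets badly wrong): the naive form overflows.
try:
    neg_log_sigmoid_naive(-800.0)
except OverflowError:
    print("naive form overflows at z = -800")
print("stable form at z = -800:", neg_log_sigmoid_stable(-800.0))
```

The same failure shows up in lower precision at much smaller magnitudes, which is why the loss should go through a dedicated log-sigmoid routine.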

Demonstrating the benefit (preference-alignment toy task)


A real DPO run needs an LLM. To isolate the method, we use a tiny discrete policy: a categorical over a small set of “responses,” with learnable logits, conditioned on a couple of prompts. The reference is a frozen copy of the initial (uniform) policy. We feed it synthetic preference pairs from a known ranking and train only with dpo_loss.

class ToyPolicy(torch.nn.Module):
    """pi(y|x): a categorical over n_responses, per prompt. Logits ARE the policy."""

    def __init__(self, n_prompts, n_responses):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_prompts, n_responses))  # uniform

    def logprobs(self, prompt_idx):
        return F.log_softmax(self.logits[prompt_idx], dim=-1)


def run_toy_task(seed=0, beta=0.1, steps=400):
    torch.manual_seed(seed)
    n_prompts, n_responses = 2, 6
    policy = ToyPolicy(n_prompts, n_responses)

    # Frozen reference = a copy of the INITIAL policy.
    ref = ToyPolicy(n_prompts, n_responses)
    ref.load_state_dict(policy.state_dict())
    for p in ref.parameters():
        p.requires_grad_(False)

    # Known rankings (best -> worst); build all chosen>rejected pairs from them.
    rankings = {0: [0, 1, 2, 3, 4, 5], 1: [5, 4, 3, 2, 1, 0]}
    pairs = [(pr, o[a], o[b]) for pr, o in rankings.items()
             for a in range(len(o)) for b in range(a + 1, len(o))]
    prompt_t = torch.tensor([p for p, _, _ in pairs])
    chosen_t = torch.tensor([c for _, c, _ in pairs])
    reject_t = torch.tensor([r for _, _, r in pairs])

    before = policy.logprobs(torch.arange(n_prompts)).exp().detach()
    opt = torch.optim.Adam(policy.parameters(), lr=0.05)
    for _ in range(steps):
        lp_pol = policy.logprobs(prompt_t)
        with torch.no_grad():
            lp_ref = ref.logprobs(prompt_t)
        loss = dpo_loss(lp_pol.gather(1, chosen_t[:, None]).squeeze(1),
                        lp_pol.gather(1, reject_t[:, None]).squeeze(1),
                        lp_ref.gather(1, chosen_t[:, None]).squeeze(1),
                        lp_ref.gather(1, reject_t[:, None]).squeeze(1), beta=beta)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        after = policy.logprobs(torch.arange(n_prompts)).exp()
        rewards = implicit_reward(policy.logprobs(torch.arange(n_prompts)),
                                  ref.logprobs(torch.arange(n_prompts)), beta=beta)
    return dict(before=before, after=after, rewards=rewards,
                rankings=rankings, ref=ref, policy=policy)


out = run_toy_task(seed=0)
for pr, order in out["rankings"].items():
    ranked = sorted(range(6), key=lambda y: out["rewards"][pr, y].item(), reverse=True)
    print(f"prompt {pr}: top response prob {out['before'][pr, order[0]]:.3f} -> "
          f"{out['after'][pr, order[0]]:.3f} reward ranking {ranked} "
          f"(target {order}, match={ranked == order})")

You should see the most-preferred response’s probability climb well above its uniform 1/6 ≈ 0.167 start, and the implicit-reward ranking exactly match the target preference order — all without ever training a reward model or running an RL loop. (Exact probabilities are seed/step dependent; the ordering is the point.)
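
In the toy task the per-sequence log-probs come straight out of a small table. With a real language model they are typically obtained by summing token log-probs over the response tokens only. The sketch below shows one way to do that; the function name `sequence_logprob`, the tensor shapes, and the assumption that the logits are already aligned with the labels (position t predicts label t) are illustrative choices, not part of the notebook.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, labels, mask):
    """Sum of token log-probs over response tokens, one value per sequence.

    logits: (batch, seq_len, vocab), assumed already aligned so that
            logits[:, t] scores labels[:, t].
    labels: (batch, seq_len) integer token ids.
    mask:   (batch, seq_len) float, 1.0 on response tokens, 0.0 on prompt/padding.
    """
    token_logp = F.log_softmax(logits, dim=-1)                         # (B, T, V)
    picked = token_logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (B, T)
    return (picked * mask).sum(dim=-1)                                 # (B,)


# Random stand-in for model outputs (no real model involved).
torch.manual_seed(0)
demo_logits = torch.randn(2, 5, 11)
demo_labels = torch.randint(0, 11, (2, 5))
demo_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0, 0.0],
                          [1.0, 1.0, 1.0, 1.0, 1.0]])
demo_seq_logp = sequence_logprob(demo_logits, demo_labels, demo_mask)
print(demo_seq_logp.shape, demo_seq_logp)
```

Running this once under the policy and once under the frozen reference, for both the chosen and the rejected response, yields the four arguments that dpo_loss expects. Summing rather than averaging matches the log-probability of the whole sequence that appears in the loss.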


Check 1 — At initialization the DPO loss is log 2


When πθ == πref the bracket is 0, so loss = -log σ(0) = log 2 ≈ 0.693.

import math
z = torch.zeros(8)
loss0 = dpo_loss(z, z, z, z, beta=0.1)
print("loss at init:", loss0.item(), "(expected log2 =", math.log(2), ")")
assert abs(loss0.item() - math.log(2)) < 1e-6

Check 2 — The reference policy is frozen

ref = out["ref"]
assert all(not p.requires_grad for p in ref.parameters()) # no grad
assert torch.equal(ref.logits, torch.zeros_like(ref.logits)) # never moved
print("OK: reference has no grad and is unchanged after training")

# Check 3: one DPO step yields a nonzero gradient on the policy logits.
pol = ToyPolicy(2, 6)
with torch.no_grad(): pol.logits[0, 0] += 0.3 # nudge off uniform
lp, lpr = pol.logprobs(torch.tensor([0])), torch.zeros(1, 6).log_softmax(-1)
dpo_loss(lp[:, 0], lp[:, 1], lpr[:, 0], lpr[:, 1], beta=0.1).backward()
assert pol.logits.grad is not None and pol.logits.grad.abs().sum() > 0
print("OK: |grad| =", pol.logits.grad.abs().sum().item())

Check 4 — Bigger margin lowers the loss; swapping chosen/rejected raises it

def L(pc, pr):
    return dpo_loss(torch.tensor([pc]), torch.tensor([pr]),
                    torch.tensor([0.0]), torch.tensor([0.0]), beta=1.0).item()

base, big, swap = L(0.0, 0.0), L(3.0, -3.0), L(-3.0, 3.0)
print(f"base={base:.4f} big_margin={big:.4f} swapped={swap:.4f}")
assert big < base < swap and big < 0.01
print("OK: favoring chosen -> loss ~0; favoring rejected -> loss grows")

Check 5 — After training, the implicit-reward ranking matches the preferences

for pr, order in out["rankings"].items():
    ranked = sorted(range(6), key=lambda y: out["rewards"][pr, y].item(), reverse=True)
    assert ranked == order
    assert out["after"][pr, order[0]] > out["before"][pr, order[0]]  # mass moved up
print("OK: implicit reward orders responses exactly like the preferences")
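
An exact ranking match is a strict test. A softer diagnostic, sketched below under the assumption that rewards are available as a plain list per prompt, is pairwise accuracy: the fraction of (chosen, rejected) pairs whose implicit rewards are ordered correctly. The helper name `pairwise_accuracy` and the reward values are made up for illustration.

```python
def pairwise_accuracy(rewards, pairs):
    """Fraction of (chosen, rejected) index pairs with rewards[chosen] > rewards[rejected]."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for chosen, rejected in pairs if rewards[chosen] > rewards[rejected])
    return correct / len(pairs)


# Hypothetical implicit rewards for 4 responses to one prompt (illustrative numbers).
toy_rewards = [0.30, 0.10, 0.12, -0.40]
target_order = [0, 1, 2, 3]  # best -> worst
toy_pairs = [(target_order[a], target_order[b])
             for a in range(len(target_order))
             for b in range(a + 1, len(target_order))]

acc = pairwise_accuracy(toy_rewards, toy_pairs)
print(f"pairwise accuracy: {acc:.3f}")  # 5 of 6 pairs ordered correctly
```

Here responses 1 and 2 are swapped, so the exact-ranking assertion would fail while the pairwise accuracy still reports that most comparisons are right. That distinction is useful when narrating a partially trained policy.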

Check 6 — The implicit reward scales linearly with β


r(y) = β·log(πθ/πref), so doubling β doubles the reward.

lp, lr = torch.tensor([math.log(0.5)]), torch.tensor([math.log(0.25)])
r1, r2 = implicit_reward(lp, lr, beta=0.1), implicit_reward(lp, lr, beta=0.2)
assert torch.allclose(r2, 2 * r1, atol=1e-6)
print("OK: beta 0.1 ->", r1.item(), " beta 0.2 ->", r2.item())

  • “How does DPO relate to PPO-based RLHF?” Same underlying KL-regularized reward objective; DPO uses the closed-form optimal policy to turn it into a supervised loss, skipping both the reward model and the on-policy RL optimization.
  • “Why does the partition function Z(x) cancel?” Preferences compare two responses to the same prompt, so the per-prompt normalizer is identical in both reward terms and drops out of the difference.
  • “What if you don’t have a reference / use a bad one?” The reference is the KL anchor; a weak SFT reference means the policy can drift into degenerate text. The usual choice is the SFT model, kept fixed throughout training.
  • “What breaks if β is too small or too large?” Too small: policy over-optimizes the preferences and drifts far from the reference (degeneration, reward hacking). Too large: it barely moves from the reference and under-fits the preferences.
  • “Online vs offline?” Vanilla DPO is offline on a fixed preference dataset. Variants (e.g., iterative/online DPO) sample fresh responses and re-label to combat distribution shift.
  • “Other variants?” IPO (fixes an over-optimization issue), KTO (works from unpaired good/bad labels), and length-regularized DPO all build on this loss.
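
The β answer can be checked directly on the closed-form optimum π*(y) ∝ πref(y)·exp(r(y)/β). The sketch below uses plain Python, a uniform reference, and made-up rewards; the names `optimal_policy`, `kl_divergence`, `ref_probs`, and `toy_r` are illustrative. It prints the optimal policy and its KL divergence from the reference for a few values of β.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized over y."""
    scores = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(scores)                          # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


ref_probs = [0.25, 0.25, 0.25, 0.25]   # uniform reference
toy_r = [1.0, 0.5, 0.0, -0.5]          # hypothetical rewards, best -> worst

for beta in (0.1, 1.0, 10.0):
    pi_star = optimal_policy(ref_probs, toy_r, beta)
    print(f"beta={beta:>4}: pi*={[round(x, 3) for x in pi_star]} "
          f"KL(pi*||ref)={kl_divergence(pi_star, ref_probs):.4f}")
```

Small β concentrates almost all mass on the top-reward response (large KL from the reference), while large β leaves the policy close to uniform (small KL). This is the same trade-off the loss inherits.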

| Thing | Answer |
| --- | --- |
| Core idea | RLHF without a reward model or RL: one supervised loss on preference pairs |
| Loss | `-log σ(β·[(logπθ(y_w)−logπref(y_w)) − (logπθ(y_l)−logπref(y_l))])` |
| Implicit reward | `r(y) = β·log(πθ(y∣x)/πref(y∣x))` |
| Why no reward model | The policy’s log-ratio is the reward; Z(x) cancels in the difference |
| Reference policy | Frozen (the KL anchor, usually the SFT model) |
| β | KL strength: large = stay near ref, small = drift / over-optimize |
| Benefit | Simpler, cheaper, more stable than PPO RLHF; offline |
| Limitation | Offline (no exploration); sensitive to ref/β; bakes in data biases |