RoPE — Paper-to-Code Mock Interview
Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al., 2021. arXiv: 2104.09864
Format: Read (~15 min) → explain the real benefit → implement the core idea in Colab → sanity-check it.
Companion notebook: rope_mock.ipynb (download), containing a relative-position-invariance demo and an `apply_rope` stub to fill in, plus verification cells. Open in Google Colab via File → Upload notebook. A reference solution is included at the bottom of this page.
Difficulty: 🟡 Medium. You need to be comfortable with attention (see the attention mock) and a little trig.
How to run this as a timed drill (~55 min)
| Time | Block | What you produce |
|---|---|---|
| 0:00–0:15 | Read (use the three-pass method) | Why rotate q/k + how relative position falls out |
| 0:15–0:20 | Explain the benefit out loud (cover Part 2) | The “relative for free, no params” pitch + extrapolation |
| 0:20–0:50 | Implement from the stub (Part 3) | A working apply_rope + a score that depends only on m−n |
| last 5 min | Sanity-check (Part 4) | All checks passing, narrated out loud |
Self-grading rubric — “what good looks like”
- ✅ Explained RoPE as rotating q/k by an angle ∝ absolute position, so the dot product depends only on the relative offset, rather than as “it adds position vectors.”
- ✅ Knew why the dot product becomes relative: composing rotations subtracts angles (`R_m^T R_n = R_{n−m}`).
- ✅ Implemented it as a per-pair 2-D rotation, with frequencies `θ_i = base^(−2i/d)`, applied to q and k only (not V).
- ✅ Demonstrated the benefit with the shift-invariance property (max diff ≈ 0), not just “it runs.”
- ⚠️ Red flags: applying RoPE to V, treating it as a learned/added embedding, forgetting it has zero parameters, claiming it changes vector norms.
Part 1 — Structured read of THIS paper
The 30-second summary (the “benefit”)
Transformers have no built-in notion of order, so you must inject position. Learned absolute embeddings add a position vector to each token; they cost parameters and extrapolate poorly past the trained context length. RoPE instead rotates each query and key vector by an angle proportional to its absolute position. Because of how rotations compose, the attention score q_m · k_n ends up depending only on the relative offset (m − n) and the content. The payoff:
- Relative position “for free” — you encode absolute position per token, but the dot product sees only the relative offset.
- Zero extra parameters — it’s a fixed, deterministic rotation, not a learned table.
- Better length extrapolation and a clean way to integrate with standard scaled-dot-product attention (it’s now the default in LLaMA, GPT-NeoX, PaLM, etc.).
The core idea (Method — you implement this)
Split the d-dim vector into d/2 consecutive pairs (x_{2i}, x_{2i+1}). For a token at position m, rotate pair i by angle m·θ_i, where the per-pair frequency is

$$\theta_i = \text{base}^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1.$$

Each pair is rotated by the standard 2-D rotation matrix:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

Write the whole rotation as a block-diagonal orthogonal matrix R_m. Apply it to query and key, then take the dot product. The magic is that rotations compose by adding angles, so

$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k,$$

which depends on positions only through the offset n − m. That is relative position, obtained by encoding absolute position on each side.
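To make the composition identity concrete, here is a minimal sketch that builds the block-diagonal matrix densely for a tiny `d` and checks `R_mᵀ R_n = R_{n−m}` numerically. It uses only the Python standard library so it runs without PyTorch. The helper names (`rope_matrix`, `transpose`, `matmul`, `max_abs_diff`) are illustrative and not part of the notebook, and a practical implementation would never materialize the dense matrix.

```python
import math

def rope_matrix(pos, dim, base=10000.0):
    """Dense block-diagonal rotation matrix R_pos as nested lists (dim x dim)."""
    assert dim % 2 == 0, "dims are rotated in pairs"
    R = [[0.0] * dim for _ in range(dim)]
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)          # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = 2 * i, 2 * i + 1                   # the (2i, 2i+1) pair
        R[a][a], R[a][b] = c, -s
        R[b][a], R[b][b] = s, c
    return R

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

m, n, dim = 5, 2, 4
lhs = matmul(transpose(rope_matrix(m, dim)), rope_matrix(n, dim))  # R_m^T R_n
rhs = rope_matrix(n - m, dim)                                      # R_{n-m}
err = max_abs_diff(lhs, rhs)
print(f"max |R_m^T R_n - R_(n-m)| = {err:.2e}")  # should be at floating-point noise level
```

Negative positions are fine here: `rope_matrix(n - m, dim)` with `n < m` is simply a rotation in the opposite direction.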
Key details (the things an interviewer probes):
- Applied to q and k only, not to V. RoPE shapes the score, not the value mix.
- No parameters. `R_m` is fixed; the only knob is `base` (10000), which sets the wavelength spectrum.
- Different frequencies per pair. Low `i` rotates fast (short wavelength, local), high `i` rotates slowly (long wavelength, global), like the sinusoidal-embedding spectrum, but multiplicative.
- Norm-preserving. Rotation is orthogonal, so `‖R_m x‖ = ‖x‖`; RoPE never changes a vector’s magnitude.
- Position 0 is the identity (angle 0 ⇒ no rotation).
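The "different frequencies per pair" point is easier to feel with numbers. The sketch below lists `θ_i` and the matching wavelength (positions per full turn, `2π/θ_i`) for a toy `d = 8`. The function name `rope_frequencies` is an illustrative choice, and `d = 8` is picked only because it gives round values.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    assert dim % 2 == 0
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

thetas = rope_frequencies(8)
wavelengths = [2 * math.pi / t for t in thetas]  # positions needed for one full rotation

for i, (t, w) in enumerate(zip(thetas, wavelengths)):
    print(f"pair {i}: theta = {t:.4f}, wavelength = {w:.1f} positions")
# pair 0: theta = 1.0000, wavelength = 6.3 positions
# pair 1: theta = 0.1000, wavelength = 62.8 positions
# pair 2: theta = 0.0100, wavelength = 628.3 positions
# pair 3: theta = 0.0010, wavelength = 6283.2 positions
```

Pair 0 completes a turn in about six tokens, while the last pair needs thousands, which is why the low pairs resolve local offsets and the high pairs carry long-range structure.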
Where the evidence lives (tables that matter)
(Hedge: figure/table numbers below are from memory of the RoFormer paper; verify against the PDF.)
- Machine-translation / GLUE-style language-modeling tables: RoPE matches or beats sinusoidal and learned absolute embeddings → the quality claim.
- Faster/lower training-loss curves vs the BERT-style baseline → the convergence claim.
- Long-sequence experiments: stable behavior as context grows → the extrapolation claim that made RoPE ubiquitous in modern LLMs.
The honest limitations (have an opinion)
- Vanilla RoPE still degrades far beyond the trained context length. The “for free” extrapolation is better, not unlimited, hence the whole follow-up family (NTK-aware scaling, Position Interpolation, YaRN) that rescales `base`/frequencies.
- Relative, not arbitrary. It encodes a smooth function of `(m − n)`; it can’t represent arbitrary learned position-pair interactions the way a full relative-attention bias table could.
- Pairing/interleaving convention matters. The `(2i, 2i+1)` interleaved layout vs the “rotate-half” (split-in-two) layout differ; mixing conventions between training and inference silently breaks a model.
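As a rough illustration of the simplest rescaling idea named above, Position Interpolation can be sketched as squashing positions linearly so that a longer sequence reuses the angle range seen in training. The context lengths below are hypothetical, and `interpolate_positions` is an illustrative helper name, not an API from any library.

```python
def interpolate_positions(positions, trained_len, target_len):
    """Linearly squash positions from [0, target_len) into [0, trained_len)."""
    scale = trained_len / target_len   # < 1 when extending the context
    return [p * scale for p in positions]

trained_len, target_len = 2048, 8192   # hypothetical context lengths
raw = [0, 2047, 4096, 8191]
squashed = interpolate_positions(raw, trained_len, target_len)
print(squashed)  # [0.0, 511.75, 1024.0, 2047.75]
```

The squashed positions would then be passed to the rotation in place of the raw ones. Relative offsets shrink by the same factor, so neighboring tokens end up closer in angle than during training, which is one reason such methods are usually paired with some fine-tuning.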
Part 2 — The interview dialogue (interviewer ⇄ interviewee)
🧑💼 Interviewer: In one paragraph, what does RoPE actually buy me over absolute position embeddings?
🧑💻 Interviewee: It gives you relative position essentially for free and with zero parameters. Instead of adding a learned position vector, I rotate each query and key by an angle proportional to its absolute position. Because rotations compose by adding angles, when I dot a rotated query at position `m` with a rotated key at position `n`, the position dependence collapses to a function of the offset `m − n`. So I encode absolute position on each side but the attention score only sees relative position, and it extrapolates to longer contexts better than a learned absolute table, which is why modern LLMs use it.
🧑💼 Interviewer: Walk me through why the dot product becomes relative.
🧑💻 Interviewee: RoPE multiplies the query by an orthogonal rotation
`R_m` and the key by `R_n`. The score is `(R_m q)ᵀ(R_n k) = qᵀ R_mᵀ R_n k`. For rotations, `R_mᵀ = R_{−m}`, and they compose additively, so `R_mᵀ R_n = R_{n−m}`. The whole thing is `qᵀ R_{n−m} k`: positions enter only through `n − m`. Per 2-D pair it’s just the angle-subtraction identity for `cos`/`sin`.
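For a single pair, that angle-subtraction step can be written out by hand. The shorthand here is introduced only for this derivation: α = mθ_i and β = nθ_i are the two rotation angles, and (q_0, q_1), (k_0, k_1) are the components of one query pair and one key pair.

$$
\begin{aligned}
(R_\alpha q)\cdot(R_\beta k)
&= (q_0\cos\alpha - q_1\sin\alpha)(k_0\cos\beta - k_1\sin\beta) \\
&\quad + (q_0\sin\alpha + q_1\cos\alpha)(k_0\sin\beta + k_1\cos\beta) \\
&= (q_0 k_0 + q_1 k_1)\cos(\alpha-\beta) + (q_0 k_1 - q_1 k_0)\sin(\alpha-\beta)
\end{aligned}
$$

Since α − β = (m − n)θ_i, each pair contributes a term that depends on the two positions only through m − n, and the full score is the sum of these terms over all d/2 pairs.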
🧑💼 Interviewer: Why apply it to q and k but not V?
🧑💻 Interviewee: RoPE’s job is to make the attention score position-aware. The score is the only place where q and k meet, and that’s where the rotation cancels into a relative offset. V carries the content you actually aggregate; rotating it would inject position into the output values for no benefit and would break the clean relative property. So RoPE touches q and k, attention proceeds normally, V is untouched.
🧑💼 Interviewer: It has no parameters — so what’s the one knob, and what does it do?
🧑💻 Interviewee: The
`base` (default 10000). It sets the geometric spread of per-pair frequencies `θ_i = base^(−2i/d)`: low dimensions rotate fast (capture local, fine-grained offsets), high dimensions rotate slowly (capture long-range structure). Increasing `base` lengthens wavelengths, which is exactly the lever the extrapolation methods like NTK-aware scaling and YaRN tune to stretch a model to longer contexts without retraining from scratch.
🧑💼 Interviewer: Implement it and show the score depends only on the relative offset.
Part 3 — Implementation
The whole method is a per-pair 2-D rotation applied to q and k. No parameters, no learned state.
```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive dim pairs (2i, 2i+1) by angle = pos * base^(-2i/dim).

    x:         (..., seq, dim) with EVEN dim.
    positions: (seq,) absolute positions for each token.
    returns:   same shape as x, rotated.
    """
    *_, seq, dim = x.shape
    assert dim % 2 == 0, "dim must be even: dims are rotated in pairs"
    half = dim // 2

    i = torch.arange(half, device=x.device, dtype=x.dtype)
    theta = base ** (-2.0 * i / dim)                          # (half,) per-pair frequencies
    angles = positions.to(x.dtype)[:, None] * theta[None, :]  # (seq, half) angle per (pos, pair)
    cos, sin = torch.cos(angles), torch.sin(angles)

    x_even, x_odd = x[..., 0::2], x[..., 1::2]            # the two halves of each pair
    rot_even = x_even * cos - x_odd * sin                 # standard 2-D rotation
    rot_odd = x_even * sin + x_odd * cos
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rot_even, rot_odd    # re-interleave
    return out
```

Why each line matters (talk through it)
- `assert dim % 2 == 0`: RoPE rotates pairs of dimensions; an odd `dim` has a leftover scalar with no partner to rotate against.
- `theta = base ** (-2.0 * i / dim)`: the frequency spectrum. Pair 0 is the fastest, the last pair the slowest. This is the only design knob.
- `positions[:, None] * theta[None, :]`: broadcasts to one angle per (position, pair). The angle is proportional to absolute position, the heart of RoPE.
- `x_even * cos - x_odd * sin` / `x_even * sin + x_odd * cos`: the literal 2-D rotation matrix applied to each pair. Orthogonal, so it preserves norms.
- `0::2` / `1::2`: the interleaved `(2i, 2i+1)` convention. (Some implementations, such as GPT-NeoX and the Hugging Face LLaMA port, use a “rotate-half” split layout instead; same idea, different bookkeeping. Pick one and be consistent.)
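The two layouts mentioned in the last bullet can be compared directly. The sketch below is a plain-Python illustration on flat lists, not code from any particular codebase; `rope_interleaved`, `rope_rotate_half`, and `to_split_layout` are illustrative names. It shows that the two conventions give different vectors for the same input, and that they agree once the input is permuted into the split order.

```python
import math

def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (2i, 2i+1) of a flat list."""
    dim, out = len(x), [0.0] * len(x)
    for i in range(dim // 2):
        ang = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(ang), math.sin(ang)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def rope_rotate_half(x, pos, base=10000.0):
    """Rotate split pairs (i, i + dim/2) of a flat list."""
    dim, half, out = len(x), len(x) // 2, [0.0] * len(x)
    for i in range(half):
        ang = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(ang), math.sin(ang)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out

def to_split_layout(x):
    """Permute [a0, b0, a1, b1, ...] into [a0, a1, ..., b0, b1, ...]."""
    return x[0::2] + x[1::2]

x, pos = [0.5, -1.0, 2.0, 0.25], 3
a = rope_interleaved(x, pos)
b = rope_rotate_half(x, pos)                   # same input, different pairing
c = rope_rotate_half(to_split_layout(x), pos)  # permute first, then rotate

naive_gap = max(abs(u - v) for u, v in zip(a, b))
permuted_gap = max(abs(u - v) for u, v in zip(to_split_layout(a), c))
print(f"same input, different layout: max diff = {naive_gap:.3f}")
print(f"after permuting dims:         max diff = {permuted_gap:.1e}")
```

The nonzero first gap is the "silent breakage" case: nothing crashes, the numbers are simply wrong for a model trained under the other convention.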
Demonstrating the property (relative-position invariance)
This is the headline correctness demo, not a benchmark. Take fixed content q and k. The attention score between q at position m and k at position n must be unchanged when you shift both positions by any s, because it depends only on m − n.
```python
torch.manual_seed(0)
dim = 8
q = torch.randn(dim)   # fixed CONTENT for the query
k = torch.randn(dim)   # fixed CONTENT for the key
m, n = 5, 2            # absolute positions; relative offset m - n = 3

def score(content_q, content_k, pm, pn):
    qr = apply_rope(content_q[None, :], torch.tensor([pm]))[0]
    kr = apply_rope(content_k[None, :], torch.tensor([pn]))[0]
    return torch.dot(qr, kr)

base_score = score(q, k, m, n)
print(f"score(q@{m}, k@{n}) = {base_score.item():.6f} (offset {m-n})")
diffs = []
for s in (1, 3, 7, 50, 123):
    sc = score(q, k, m + s, n + s)
    diffs.append((sc - base_score).abs().item())
    print(f"score(q@{m+s}, k@{n+s}) = {sc.item():.6f} shift s={s}")
print(f"max abs difference across shifts = {max(diffs):.2e} (~0 => depends only on m-n)")
```

Expected output (numbers are seed-dependent; the invariance is the point):

```
score(q@5, k@2) = 1.178293 (offset 3)
score(q@6, k@3) = 1.178293 shift s=1
score(q@8, k@5) = 1.178293 shift s=3
score(q@12, k@9) = 1.178293 shift s=7
score(q@55, k@52) = 1.178293 shift s=50
score(q@128, k@125) = 1.178293 shift s=123
max abs difference across shifts = 2.38e-07 (~0 => depends only on m-n)
```

The score is identical to six decimal places for every equal shift: absolute positions changed by up to 123, but because the offset stayed 3, the attention score moved only by float32 rounding noise. That is relative position, encoded for free.
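The same property can be viewed across a whole sequence. If every position carries the same content (an artificial setup chosen here purely for illustration), the full score matrix should be constant along each diagonal, since each diagonal corresponds to one value of m − n. The sketch below checks that in plain Python; the vectors and the helper names `rope` and `dot` are made up for the example.

```python
import math

def rope(x, pos, base=10000.0):
    """Interleaved (2i, 2i+1) rotation of a flat list."""
    dim, out = len(x), []
    for i in range(dim // 2):
        ang = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(ang), math.sin(ang)
        out += [x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # same query content at every position
k = [1.1, 0.4, -0.6, 0.9]   # same key content at every position
seq = 5

# scores[m][n] = (R_m q) . (R_n k)
scores = [[dot(rope(q, m), rope(k, n)) for n in range(seq)] for m in range(seq)]

# Constant-diagonal check: each entry equals its up-left neighbour.
max_dev = max(abs(scores[m][n] - scores[m - 1][n - 1])
              for m in range(1, seq) for n in range(1, seq))
print(f"max deviation along diagonals = {max_dev:.1e}")
```

In a real model the content differs per token, so the matrix is not constant along diagonals; the point is only that the positional contribution has this structure.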
Part 4 — Sanity checks (don’t skip)
Check 1: RoPE preserves the L2 norm (rotation is orthogonal)

```python
dim = 8
x = torch.randn(4, dim)
xr = apply_rope(x, torch.arange(4))
assert torch.allclose(x.norm(dim=-1), xr.norm(dim=-1), atol=1e-5)
print("OK: per-vector norm unchanged")
```

Check 2: Position 0 is the identity

```python
x0 = torch.randn(1, dim)
assert torch.allclose(apply_rope(x0, torch.tensor([0])), x0, atol=1e-6)
print("OK: position 0 == identity")
```

Check 3: Relative-offset invariance of the score (the core property)

```python
m, n, s = 5, 2, 17
assert torch.allclose(score(q, k, m, n), score(q, k, m + s, n + s), atol=1e-4)
print("OK: q.k score invariant under equal shift (depends only on m-n)")
```

Check 4: Shape is preserved

```python
big = torch.randn(2, 6, 10)  # (batch, seq, dim)
assert apply_rope(big, torch.arange(6)).shape == big.shape
print("OK: output shape == input shape")
```

Check 5: Different relative offsets give DIFFERENT scores (it actually encodes position)

```python
s_off3 = score(q, k, 5, 2)  # offset 3
s_off5 = score(q, k, 5, 0)  # offset 5
assert not torch.allclose(s_off3, s_off5, atol=1e-3)
print(f"OK: offsets differ => scores differ ({s_off3.item():.4f} vs {s_off5.item():.4f})")
```

Check 6: Composition: rotate(a) then rotate(b) == rotate(a+b)

```python
a, b = 3.0, 4.0
twostep = apply_rope(apply_rope(x0, torch.tensor([a])), torch.tensor([b]))
onestep = apply_rope(x0, torch.tensor([a + b]))
assert torch.allclose(twostep, onestep, atol=1e-5)
print("OK: rotations compose additively")
```

All six should print OK. Check 3 is the one that matters most: it’s the property the whole paper is built on. Checks 1, 2, and 6 confirm it’s a genuine rotation; check 4 confirms the shape contract, and check 5 confirms the encoding is non-trivial.
Part 5 — Likely follow-up questions
- “Interleaved `(2i,2i+1)` vs the rotate-half layout (as in the Hugging Face LLaMA port)?” Same rotation, different dimension pairing. Interleaved rotates adjacent dims; rotate-half pairs dim `i` with dim `i + d/2`. Both are valid as long as train and inference agree; weights are not portable across conventions unless the q/k projection outputs are permuted to match.
- “How does RoPE extrapolate to longer contexts, and where does it break?” Better than learned absolute embeddings because it’s a smooth function of relative offset, but vanilla RoPE still degrades well past the trained length. Fixes rescale frequencies: Position Interpolation squashes positions into the trained range, while NTK-aware scaling and YaRN adjust `base` or the individual frequencies to stretch the context with little or no retraining.
- “Why not just add sinusoidal embeddings (Vaswani et al.)?” Sinusoidal/absolute embeddings are added to inputs and bias toward absolute position; RoPE is multiplicative and makes the score depend on relative offset, which generalizes across positions better and composes cleanly with attention.
- “Does RoPE cost FLOPs or memory?” Negligible: two elementwise mul-adds per element on q and k, no parameters, no extra activations to store. The `cos`/`sin` tables can be precomputed and cached per position.
- “Why apply it inside each head rather than once on the embedding?” Position must enter at the q·k interaction per head, after the q/k projections, so each head sees rotated queries/keys. Rotating the shared embedding once wouldn’t survive the per-head linear projections.
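The caching remark in the FLOPs answer can be sketched as follows: build the `cos`/`sin` tables once for every position up to some maximum, then reuse them for each q and k. This is a plain-Python illustration under assumed names (`build_rope_cache`, `apply_cached`, and `max_pos` are not from the notebook); a tensor implementation would store the same tables as arrays of shape `(max_pos, dim // 2)`.

```python
import math

def build_rope_cache(max_pos, dim, base=10000.0):
    """Precompute cos/sin tables, each of shape (max_pos, dim // 2), as nested lists."""
    thetas = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * t) for t in thetas] for p in range(max_pos)]
    sin = [[math.sin(p * t) for t in thetas] for p in range(max_pos)]
    return cos, sin

def apply_cached(x, pos, cos, sin):
    """Rotate interleaved pairs of x using the cached row for `pos` (no trig calls)."""
    out = []
    for i in range(len(x) // 2):
        c, s = cos[pos][i], sin[pos][i]
        out += [x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c]
    return out

cos_table, sin_table = build_rope_cache(max_pos=16, dim=4)
x = [0.5, -1.0, 2.0, 0.25]
rotated = apply_cached(x, 7, cos_table, sin_table)

norm_before = math.sqrt(sum(v * v for v in x))
norm_after = math.sqrt(sum(v * v for v in rotated))
print(f"norm before = {norm_before:.6f}, after = {norm_after:.6f}")
```

The tables depend only on position, `dim`, and `base`, so one cache can be shared across layers and heads that use the same head dimension.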
TL;DR cheat sheet
| Thing | Answer |
|---|---|
| Core idea | Rotate q/k by an angle ∝ absolute position; dot product then depends only on offset m−n |
| Formula | rotate pair i by m·θ_i, θ_i = base^(−2i/d), base=10000 |
| Why relative | R_mᵀ R_n = R_{n−m} — composing rotations subtracts angles |
| Applied to | q and k only, not V |
| Parameters | None (fixed rotation); only knob is base |
| Benefit | Relative position for free + better length extrapolation than learned absolute |
| Norm | Preserved (rotation is orthogonal) |
| #1 gotcha | Mixing interleaved vs rotate-half conventions across train/inference |
| Limitation | Still degrades far beyond trained length → PI / NTK / YaRN rescaling |