Paper Mocks: Overview & Study Plan

A library of paper-to-code mock interviews for the increasingly common ML/research interview format: you’re handed a paper, asked to explain its actual benefit, then implement the core idea in Google Colab.

Each page is a self-contained, timed drill with the same shape:

How to read the paper in 15 minutes: the three-pass method (map → method+evidence → pressure-test).
Explain the real benefit: a structured read + an interviewer⇄interviewee dialogue with model answers.
Implement it: a clean PyTorch reference, a “why each line matters” walkthrough, and a companion Colab notebook (a fill-in-the-blank stub + reference solution).
Sanity-check it: six checks that test the code for correctness, plus a toy task that demonstrates the claim. Every toy task and check in this library was executed before being written down.

How to use a page: set a timer, read the real PDF, talk your answer out loud, fill in the notebook stub without peeking, then run the sanity checks. The honest goal isn’t a perfect implementation; it’s reading critically, narrating your reasoning, and verifying as you go.

The 20 papers, by difficulty

🟢 = warm-up · 🟢🟡 = easy–medium · 🟡 = medium · 🟡🔴 = medium–hard · 🔴 = hard

#	Paper	Difficulty	The one-line benefit / what you implement
1	Dropout	🟢	Randomly zero units to break co-adaptation; train/eval + inverted-dropout scaling
2	RMSNorm	🟢	LayerNorm minus the mean-subtraction; scale-invariance
3	ResNet	🟢🟡	Identity skip connections make very deep nets trainable
4	SwiGLU	🟢🟡	Gated FFN; the 2/3 param-matching + an honest “no toy win”
5	AdamW	🟡	Decouple weight decay from the adaptive step (L2 ≠ weight decay)
6	BatchNorm	🟡	Normalize per batch; train/eval running stats
7	LoRA	🟡	Freeze the weights, train a tiny low-rank update
8	Knowledge Distillation	🟡	A student learns the teacher’s soft “dark knowledge”
9	word2vec	🟡	Word embeddings via skip-gram + negative sampling (a cheap softmax)
10	Focal Loss	🟡	Down-weight easy examples to fight class imbalance
11	RoPE	🟡	Rotary positions — relative position for free
12	GQA / MQA	🟡	Share KV heads to shrink the inference KV cache
13	ViT	🟡	An image as a sequence of patches through a plain Transformer
14	SimCLR	🟡🔴	Contrastive representations with no labels (NT-Xent)
15	VAE	🟡🔴	Reparameterization trick + ELBO → a sample-able latent space
16	DPO	🟡🔴	Preference alignment with no reward model and no RL
17	MoE	🟡🔴	Route tokens to top-k experts: capacity ≫ FLOPs/token
18	LSTM	🟡🔴	A gated cell state beats vanishing gradients (long-range memory)
19	Attention	🔴	Scaled dot-product + multi-head self-attention
20	DDPM	🔴	Denoising diffusion: learn to reverse a noising process
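
The “2/3 param-matching” in the SwiGLU row is simple arithmetic, and it is worth being able to reproduce on the spot. A standard FFN has two weight matrices while a SwiGLU FFN has three (gate, up, down), so shrinking the hidden width to about 2/3 of the baseline keeps the parameter count roughly equal. The sketch below assumes no biases and a baseline hidden width of 4 × d_model; the function names and the value d_model = 512 are illustrative, not taken from the SwiGLU page.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Standard two-matrix FFN, no biases: W_in (d_model x d_ff) + W_out (d_ff x d_model)."""
    return 2 * d_model * d_ff


def swiglu_params(d_model: int, d_hidden: int) -> int:
    """SwiGLU FFN, no biases: gate, up, and down projections (three matrices)."""
    return 3 * d_model * d_hidden


d_model = 512                 # illustrative width (assumption)
d_ff = 4 * d_model            # baseline hidden width (assumed 4x convention)
d_hidden = 2 * d_ff // 3      # the 2/3 rule, rounded down to an integer

baseline = ffn_params(d_model, d_ff)
gated = swiglu_params(d_model, d_hidden)

print(baseline)                      # 2097152
print(gated)                         # 2096640
print(f"{gated / baseline:.4f}")     # 0.9998
```

Real implementations often round the hidden width up to a hardware-friendly multiple, so the match is approximate rather than exact.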

⚠️ One dependency to know: Attention sorts near the end because it’s the hardest to implement fully — but RoPE, GQA, and ViT all build on it. If you’re heading for those three, do Attention first regardless of its difficulty rating.

Study plans

The recommended ladder (1 paper/day)

A prerequisite-aware path — close to the difficulty order, but with Attention pulled earlier so the things that depend on it come after.

Week 1: fundamentals (training & building blocks)

1. Dropout → 2. RMSNorm → 3. ResNet → 4. SwiGLU → 5. AdamW → 6. BatchNorm → 7. LoRA

Week 2: sequences & attention

8. Attention (the keystone) → 9. RoPE → 10. GQA/MQA → 11. ViT → 12. word2vec → 13. LSTM

Week 3: objectives, alignment & generative

14. Focal Loss → 15. Knowledge Distillation → 16. SimCLR → 17. VAE → 18. DPO → 19. MoE → 20. DDPM

Twenty papers at one a day is about three weeks. Two rest/review days fit naturally at the week boundaries.

Short on time?

Have ~1 week: do the seven keystones (Attention, LoRA, RMSNorm, AdamW, ResNet, BatchNorm, Dropout). These are the highest-probability “implement this” papers and cover the most reusable mechanics.
The single best rep: do one full timed mock end-to-end (read → explain → implement → sanity-check) rather than skimming five. The combined drill is the closest rehearsal of the real interview.
Day before: re-read the cheat sheet table at the bottom of each page you’ve done; re-run one notebook from scratch to confirm muscle memory.

Themed tracks (pick by the role)

Modern LLM: Attention → RoPE → GQA/MQA → RMSNorm → SwiGLU → MoE → LoRA → DPO
Generative modeling: VAE → DDPM (+ how they relate to GANs / score matching)
Computer vision: ResNet → BatchNorm → ViT → SimCLR
Training & optimization: Dropout → BatchNorm → AdamW → ResNet → Focal Loss → Knowledge Distillation
NLP foundations: word2vec → LSTM → Attention → RoPE

What “good” looks like (the cross-cutting rubric)

Every page has its own rubric, but these habits transfer to all of them:

✅ Anchor a benefit claim to a specific table/figure (“the ablation shows…”), not the abstract.
✅ State the tradeoff / limitation unprompted, not only the upside.
✅ Implement without copy-pasting — and narrate decisions while coding.
✅ Write ≥2 sanity checks before being asked (shapes + “only X trains” are free wins).
✅ Be honest when a toy can’t show the headline benefit (e.g. SwiGLU, AdamW, DDPM at scale) — show the mechanism instead of overselling a number.
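
To make the “write ≥2 sanity checks” habit concrete, here is a minimal sketch using Dropout, the first paper in the table. It is plain Python rather than the PyTorch used on the individual pages, so it runs anywhere; `inverted_dropout` and its signature are hypothetical names chosen for illustration. The checks cover shape, train/eval behavior, and the inverted-dropout scaling. An “only X trains” check needs a framework with parameters and gradients, so it is not shown here.

```python
import random


def inverted_dropout(xs, p, training, rng):
    """Zero each element with probability p and scale survivors by 1/(1-p).

    In eval mode (training=False) the input is returned unchanged.
    """
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]


rng = random.Random(0)        # fixed seed so the checks are repeatable
xs = [1.0] * 100_000
p = 0.3

out = inverted_dropout(xs, p, training=True, rng=rng)

# Check 1 (shape): dropout never changes the number of elements.
assert len(out) == len(xs)

# Check 2 (eval mode): with training=False the layer is the identity.
assert inverted_dropout(xs, p, training=False, rng=rng) == xs

# Check 3 (scaling): the 1/(1-p) factor keeps the mean roughly unchanged.
mean_out = sum(out) / len(out)
assert abs(mean_out - 1.0) < 0.02

# Check 4 (drop rate): about a fraction p of the elements are zeroed.
frac_zero = sum(1 for v in out if v == 0.0) / len(out)
assert abs(frac_zero - p) < 0.02

print("all sanity checks passed")
```

The tolerances are loose on purpose: with 100,000 samples the statistical checks sit far inside the 0.02 margin, so a failure points to a real bug (for example, forgetting the 1/(1-p) scaling) rather than to noise.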