Paper Mocks: Overview & Study Plan
A library of paper-to-code mock interviews for the increasingly common ML/research interview format: you’re handed a paper, asked to explain its actual benefit, then implement the core idea in Google Colab.
Each page is a self-contained, timed drill with the same shape:
- How to read the paper in 15 minutes — the three-pass method (map → method+evidence → pressure-test).
- Explain the real benefit — a structured read + an interviewer⇄interviewee dialogue with model answers.
- Implement it — a clean PyTorch reference, a “why each line matters” walkthrough, and a companion Colab notebook (a fill-in-the-blank stub + reference solution).
- Sanity-check it — 6 checks that prove the code is correct, plus a toy task that demonstrates the claim. Every toy task and check in this library was executed before being written down.
How to use a page: set a timer, read on the real PDF, talk your answer out loud, fill in the notebook stub without peeking, then run the sanity checks. The honest goal isn’t a perfect implementation — it’s reading critically, narrating your reasoning, and verifying as you go.
The 20 papers, by difficulty
🟢 = warm-up · 🟢🟡 = easy–medium · 🟡 = medium · 🟡🔴 = medium–hard · 🔴 = hard

| # | Paper | Difficulty | The one-line benefit / what you implement |
|---|---|---|---|
| 1 | Dropout | 🟢 | Randomly zero units to break co-adaptation; train/eval + inverted-dropout scaling |
| 2 | RMSNorm | 🟢 | LayerNorm minus the mean-subtraction; scale-invariance |
| 3 | ResNet | 🟢🟡 | Identity skip connections make very deep nets trainable |
| 4 | SwiGLU | 🟢🟡 | Gated FFN; the 2/3 param-matching + an honest “no toy win” |
| 5 | AdamW | 🟡 | Decouple weight decay from the adaptive step (L2 ≠ weight decay) |
| 6 | BatchNorm | 🟡 | Normalize per batch; train/eval running stats |
| 7 | LoRA | 🟡 | Freeze the weights, train a tiny low-rank update |
| 8 | Knowledge Distillation | 🟡 | A student learns the teacher’s soft “dark knowledge” |
| 9 | word2vec | 🟡 | Word embeddings via skip-gram + negative sampling (a cheap softmax) |
| 10 | Focal Loss | 🟡 | Down-weight easy examples to fight class imbalance |
| 11 | RoPE | 🟡 | Rotary positions — relative position for free |
| 12 | GQA / MQA | 🟡 | Share KV heads to shrink the inference KV cache |
| 13 | ViT | 🟡 | An image as a sequence of patches through a plain Transformer |
| 14 | SimCLR | 🟡🔴 | Contrastive representations with no labels (NT-Xent) |
| 15 | VAE | 🟡🔴 | Reparameterization trick + ELBO → a sample-able latent space |
| 16 | DPO | 🟡🔴 | Preference alignment with no reward model and no RL |
| 17 | MoE | 🟡🔴 | Route tokens to top-k experts: capacity ≫ FLOPs/token |
| 18 | LSTM | 🟡🔴 | A gated cell state beats vanishing gradients (long-range memory) |
| 19 | Attention | 🔴 | Scaled dot-product + multi-head self-attention |
| 20 | DDPM | 🔴 | Denoising diffusion: learn to reverse a noising process |
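
The "2/3 param-matching" note in the SwiGLU row comes down to simple arithmetic: a standard FFN has two weight matrices, a SwiGLU FFN has three, so the hidden width shrinks to about two thirds to keep the parameter count comparable. Here is a minimal sketch of that calculation. The function names and the sizes `512` and `2048` are illustrative assumptions, not values taken from the SwiGLU page, and biases are ignored.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a standard two-matrix FFN (biases ignored)."""
    return 2 * d_model * d_ff


def swiglu_params(d_model: int, d_hidden: int) -> int:
    """Weights in a SwiGLU FFN: gate, up, and down projections (biases ignored)."""
    return 3 * d_model * d_hidden


def matched_swiglu_hidden(d_ff: int) -> int:
    """Hidden width that keeps the SwiGLU weight count close to the standard FFN."""
    return round(2 * d_ff / 3)


# Illustrative sizes only (assumed for this sketch).
d_model, d_ff = 512, 2048
d_hidden = matched_swiglu_hidden(d_ff)
baseline = ffn_params(d_model, d_ff)
gated = swiglu_params(d_model, d_hidden)

print(f"SwiGLU hidden width: {d_hidden}")
print(f"standard FFN weights: {baseline}, SwiGLU weights: {gated}")
print(f"ratio: {gated / baseline:.4f}")
```

Real implementations often round the hidden width up to a hardware-friendly multiple, so the match is approximate rather than exact.
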
⚠️ One dependency to know: Attention sorts near the end because it’s the hardest to implement fully — but RoPE, GQA, and ViT all build on it. If you’re heading for those three, do Attention first regardless of its difficulty rating.
Study plans
The recommended ladder (1 paper/day)
A prerequisite-aware path: close to the difficulty order, but with Attention pulled earlier so the papers that depend on it come after.
Week 1: fundamentals (training & building blocks)
1. Dropout → 2. RMSNorm → 3. ResNet → 4. SwiGLU → 5. AdamW → 6. BatchNorm → 7. LoRA

Week 2: sequences & attention
8. Attention (the keystone) → 9. RoPE → 10. GQA/MQA → 11. ViT → 12. word2vec → 13. LSTM

Week 3: objectives, alignment & generative
14. Focal Loss → 15. Knowledge Distillation → 16. SimCLR → 17. VAE → 18. DPO → 19. MoE → 20. DDPM

At one a day that’s ~3 weeks. Two rest/review days fit naturally at the week boundaries.
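
If you reshuffle the ladder to suit your own schedule, a quick check that Attention still comes before the papers that build on it can save a wasted day. The sketch below encodes only the three dependencies this page states explicitly (RoPE, GQA/MQA, and ViT on Attention); the helper name `violations` is a hypothetical choice for illustration, and other prerequisites may exist.

```python
# Ladder order as listed in the three weeks above.
LADDER = [
    "Dropout", "RMSNorm", "ResNet", "SwiGLU", "AdamW", "BatchNorm", "LoRA",
    "Attention", "RoPE", "GQA/MQA", "ViT", "word2vec", "LSTM",
    "Focal Loss", "Knowledge Distillation", "SimCLR", "VAE", "DPO", "MoE", "DDPM",
]

# Only the dependencies stated on this page; treat the map as incomplete.
PREREQS = {"RoPE": ["Attention"], "GQA/MQA": ["Attention"], "ViT": ["Attention"]}


def violations(order, prereqs):
    """Return (paper, prerequisite) pairs where the prerequisite comes later."""
    position = {paper: i for i, paper in enumerate(order)}
    return [
        (paper, dep)
        for paper, deps in prereqs.items()
        for dep in deps
        if position[dep] > position[paper]
    ]


# The recommended ladder has no ordering problems.
print(violations(LADDER, PREREQS))

# Moving Attention to the end, as in the pure difficulty order, breaks all three.
by_difficulty = [p for p in LADDER if p != "Attention"] + ["Attention"]
print(violations(by_difficulty, PREREQS))
```
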
Short on time?
- Have ~1 week: do the seven keystones (Attention, LoRA, RMSNorm, AdamW, ResNet, BatchNorm, Dropout). These are the highest-probability “implement this” papers and cover the most reusable mechanics.
- The single best rep: do one full timed mock end-to-end (read → explain → implement → sanity-check) rather than skimming five. The combined drill is what actually predicts interview performance.
- Day before: re-read the cheat sheet table at the bottom of each page you’ve done; re-run one notebook from scratch to confirm muscle memory.
Themed tracks (pick by the role)
- Modern LLM: Attention → RoPE → GQA/MQA → RMSNorm → SwiGLU → MoE → LoRA → DPO
- Generative modeling: VAE → DDPM (+ how they relate to GANs / score matching)
- Computer vision: ResNet → BatchNorm → ViT → SimCLR
- Training & optimization: Dropout → BatchNorm → AdamW → ResNet → Focal Loss → Knowledge Distillation
- NLP foundations: word2vec → LSTM → Attention → RoPE
What “good” looks like (the cross-cutting rubric)
Every page has its own rubric, but these habits transfer to all of them:
- ✅ Anchor a benefit claim to a specific table/figure (“the ablation shows…”), not the abstract.
- ✅ State the tradeoff / limitation unprompted, not only the upside.
- ✅ Implement without copy-pasting — and narrate decisions while coding.
- ✅ Write ≥2 sanity checks before being asked (shapes + “only X trains” are free wins).
- ✅ Be honest when a toy can’t show the headline benefit (e.g. SwiGLU, AdamW, DDPM at scale) — show the mechanism instead of overselling a number.
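
To make the "write sanity checks before being asked" habit concrete, here is a minimal sketch using the Dropout row as the example. It is plain Python rather than the PyTorch used on the individual pages, so it runs without any dependencies; the function `inverted_dropout` and the chosen sizes are assumptions made for illustration, not the library's reference code.

```python
import random


def inverted_dropout(xs, p, training, rng):
    """Zero each value with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]


rng = random.Random(0)  # fixed seed so the checks are repeatable
xs = [1.0] * 100_000
p = 0.5

train_out = inverted_dropout(xs, p, training=True, rng=rng)
eval_out = inverted_dropout(xs, p, training=False, rng=rng)

# Check 1: the output length (the "shape") is unchanged in both modes.
assert len(train_out) == len(eval_out) == len(xs)

# Check 2: eval mode is the identity.
assert eval_out == xs

# Check 3: every training output is either zero or scaled by 1 / (1 - p).
assert set(train_out) <= {0.0, 1.0 / (1.0 - p)}

# Check 4: the mean is roughly preserved, which is the point of the rescaling.
mean = sum(train_out) / len(train_out)
assert abs(mean - 1.0) < 0.05

print("all dropout sanity checks passed")
```

The same pattern carries over to the other pages: state the invariant out loud, then write the one-line assertion that would fail if it were violated.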