
DeepSeek: Advanced Reinforcement Learning Approaches
Presented at UWO Seminar, Western University.
DeepSeek-R1 demonstrated that strong reasoning capabilities can emerge from reinforcement learning alone, without supervised fine-tuning on human-curated chain-of-thought data. This talk dissects the techniques that made this possible.
Topics Covered
- Group Relative Policy Optimization (GRPO) — a simpler RL algorithm that avoids value function estimation
- Outcome-supervised reward modeling (ORM) — rewarding correct final answers rather than per-step supervision
- Emergent chain-of-thought — how extended thinking and self-verification arise from RL without explicit SFT
- Cold-start problem — why pure RL from scratch is unstable and how DeepSeek addresses it
- DeepSeek-V3 architecture — mixture of experts, multi-head latent attention, and FP8 training
- Comparison with OpenAI o1 — similarities, differences, and open-source implications
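To make the first topic concrete, here is a minimal sketch of GRPO's core idea: sample a group of completions per prompt, then normalize each completion's reward against the group's mean and standard deviation to get an advantage, with no learned value function. This is an illustrative simplification, not DeepSeek's implementation; the function name and defaults are assumptions.

```python
# Hypothetical sketch of GRPO's group-relative advantage estimate.
# Assumes one scalar reward per sampled completion (e.g. an outcome
# reward of 1.0 for a correct final answer, 0.0 otherwise).
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std.

    GRPO samples a group of G completions for the same prompt and uses
    this within-group normalized score as the advantage, avoiding the
    separate value-function (critic) network that PPO requires.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: outcome rewards for a group of 4 sampled completions,
# two of which reached the correct final answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average receive a positive advantage and are reinforced; the advantages within a group sum to zero, which is what lets GRPO dispense with a baseline from a critic.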