
DeepSeek: Advanced Reinforcement Learning Approaches
Presented at UWO Seminar, Western University.
DeepSeek-R1 demonstrated that strong reasoning capabilities can emerge from reinforcement learning alone, without supervised fine-tuning on human-curated chain-of-thought data. This talk dissects the techniques that made this possible.
Topics Covered
- Group Relative Policy Optimization (GRPO) — a simpler RL algorithm that avoids value function estimation
- Outcome-supervised reward modeling (ORM) — rewarding correct final answers rather than per-step supervision
- Emergent chain-of-thought — how extended thinking and self-verification arise from RL without explicit SFT
- Cold-start problem — why pure RL from scratch is unstable and how DeepSeek addresses it
- DeepSeek-V3 architecture — mixture of experts, multi-head latent attention, and FP8 training
- Comparison with OpenAI o1 — similarities, differences, and open-source implications
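To make the first topic concrete, here is a minimal sketch of GRPO's core idea: sample a group of completions per prompt, then normalize each completion's reward against the group's mean and standard deviation to get an advantage, with no learned value function. This is an illustrative simplification, not DeepSeek's implementation; the function name and defaults are assumptions.

```python
# Hypothetical sketch of GRPO's group-relative advantage estimate.
# Assumes one scalar reward per sampled completion (e.g. an outcome
# reward of 1.0 for a correct final answer, 0.0 otherwise).
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std.

    GRPO samples a group of G completions for the same prompt and uses
    this within-group normalized score as the advantage, avoiding the
    separate value-function (critic) network that PPO requires.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: outcome rewards for a group of 4 sampled completions,
# two of which reached the correct final answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average receive a positive advantage and are reinforced; the advantages within a group sum to zero, which is what lets GRPO dispense with a baseline from a critic.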