- Cold start with SFT on synthetic reasoning data
    - very carefully prepared dataset (SFT objective sketched below)
    - readies the loss landscape for emergent behaviors
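A minimal sketch of what the cold-start SFT objective could look like, assuming a Hugging Face-style causal LM and tokenized (prompt, reasoning trace) pairs; the function name and data layout are placeholders, not the actual pipeline:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids: torch.Tensor, trace_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on the reasoning trace only; prompt tokens are
    masked out of the loss. prompt_ids / trace_ids: 1-D token id tensors."""
    input_ids = torch.cat([prompt_ids, trace_ids]).unsqueeze(0)  # (1, T)
    labels = input_ids.clone()
    labels[:, : prompt_ids.numel()] = -100                       # ignore prompt positions
    logits = model(input_ids).logits                             # (1, T, vocab)
    # Standard causal-LM shift: position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```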
- Large-scale RL until convergence
    - data = prompts; the model generates completions
    - verifiable reward on completions
    - rewards (scorer sketch below)
        - accuracy = bonus for verifiably correct responses
        - format = follows the expected output format, for stable inference
        - language consistency = respond in the same language as the question, which makes the model easier to use
    - Group Relative Policy Optimization (GRPO) (advantage sketch below)
    - building scorers is important
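A minimal sketch of the three reward terms listed above. The `\boxed{}` answer convention, the `<think>`/`<answer>` tags, the CJK heuristic, and the term weights are illustrative assumptions, not the actual scorers:

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Assumed convention: the final answer is wrapped in \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    # Reward completions that keep reasoning and answer inside the expected tags.
    ok = re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>\s*", completion)
    return 0.5 if ok else 0.0

def language_consistency_reward(completion: str, question_language: str) -> float:
    # Placeholder: a real scorer would use a language-ID model; this sketch only
    # penalizes CJK characters when the question is in English.
    has_cjk = re.search(r"[\u4e00-\u9fff]", completion) is not None
    if question_language == "en":
        return 0.0 if has_cjk else 0.2
    return 0.2  # treated as consistent for non-English questions in this sketch

def total_reward(completion: str, reference_answer: str, question_language: str) -> float:
    return (accuracy_reward(completion, reference_answer)
            + format_reward(completion)
            + language_consistency_reward(completion, question_language))
```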
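A sketch of the core of GRPO: sample a group of completions per prompt, use the group's reward mean/std as the baseline instead of a learned value network, and plug the resulting advantages into a PPO-style clipped objective. GRPO broadcasts each advantage over the completion's tokens and adds a KL penalty to a reference policy; both are simplified away here, and the clip value is an assumption:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per sampled completion.
    Group-relative advantage = (r - mean(group)) / std(group)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective at the completion level (per-token broadcasting
    and the KL-to-reference term are omitted in this sketch)."""
    ratio = torch.exp(logp_new - logp_old)                       # (num_prompts, group_size)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```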
- Rejection sampling on 3/4 reasoning problems, 1/4 general queries (sketch below)
    - Q. what does the reward model look like / where does it come from?
    - 800k completions: 600k reasoning + 200k general chat
    - relies on LLM judges, shared post-training data, and chat data augmented with CoT
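A minimal sketch of how the rejection-sampling pass could assemble the SFT set: sample several completions per prompt and keep the best one only if it clears a score threshold. `generate` and `score` are placeholders for the policy model and the verifier / LLM judge:

```python
def rejection_sample(prompts, generate, score, num_samples: int = 8, threshold: float = 1.0):
    """Keep the highest-scoring completion per prompt if it clears `threshold`."""
    kept = []
    for prompt in prompts:
        completions = generate(prompt, num_samples)               # sample a group per prompt
        scored = [(score(prompt, c), c) for c in completions]     # verifier or LLM-judge score
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": prompt, "completion": best})
    return kept
```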
- RL mixing reasoning problems with preference-tuned reward models (reward routing sketched below)
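A sketch of how this mixed stage could route prompts to different reward sources: rule-based verifiable rewards for reasoning prompts, a learned preference reward model for general prompts. The routing key and function names are assumptions for illustration:

```python
def mixed_reward(prompt: str, completion: str, prompt_type: str, verifier, preference_rm) -> float:
    """Route each prompt to the appropriate reward source.
    `verifier` = rule-based checker for reasoning prompts;
    `preference_rm` = reward model trained on human preference data."""
    if prompt_type == "reasoning":
        return verifier(prompt, completion)       # verifiable, typically 0/1-style reward
    return preference_rm(prompt, completion)      # scalar preference score
```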
- Needs a strong base model with long context
    - the specific traits required are unknown
- Visualize responses over the course of training (length plot sketched below)
    - generation length was observed to increase as training progressed
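A minimal sketch of one such visualization: log completion lengths at each RL step and plot the mean over training. The data layout is an assumption:

```python
import matplotlib.pyplot as plt

def plot_generation_length(step_lengths):
    """step_lengths: list of (training_step, [completion_token_lengths]) tuples
    collected during RL training."""
    steps = [step for step, _ in step_lengths]
    mean_lengths = [sum(lengths) / len(lengths) for _, lengths in step_lengths]
    plt.plot(steps, mean_lengths)
    plt.xlabel("RL training step")
    plt.ylabel("Mean completion length (tokens)")
    plt.title("Generation length over training")
    plt.show()
```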