- Cold start with SFT on synthetic reasoning data (sketch below)
    - very carefully prepared dataset
    - readies the loss landscape for emergent behaviors
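A minimal sketch of what this cold-start SFT step could look like, assuming a Hugging Face-style causal LM and a small curated file of prompt/response pairs containing reasoning traces; the model name and the `cold_start.jsonl` path are placeholders, not from the source.

```python
# Minimal cold-start SFT sketch; model name and data path are placeholders.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/strong-long-context-base-model"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each record holds a prompt plus a carefully curated reasoning trace as the response.
records = [json.loads(line) for line in open("cold_start.jsonl")]

def collate(batch):
    texts = [r["prompt"] + r["response"] for r in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100        # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(records, batch_size=2, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    loss = model(**batch).loss                        # next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```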
 
- Large scale RL until convergence
    - data = prompts; the model generates completions
    - verifiable reward on completions
    - rewards (see the reward sketch after this list)
        - accuracy = bonus for verifiably correct responses
        - format = follow the expected formatting for stable inference
        - language consistency = match the language of the question, which makes the model easier to use
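A hedged sketch of how these three rule-based reward terms could be combined into one scalar; the weights, the `<think>`/`<answer>` tag format, and the language heuristic are assumptions for illustration.

```python
# Illustrative rule-based reward: accuracy + format + language consistency.
# Weights, tag names, and the language heuristic are assumptions.
import re

def detect_language(text: str) -> str:
    # Crude heuristic: treat text containing CJK characters as Chinese, else English.
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    score = 0.0

    # Format reward: reasoning and answer wrapped in the expected tags.
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S):
        score += 0.5

    # Accuracy reward: bonus only when the extracted answer matches the verifiable reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0

    # Language-consistency reward: respond in the language the question was asked in.
    if detect_language(completion) == detect_language(prompt):
        score += 0.2

    return score
```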
 
- Group Relative Policy Optimization (GRPO)
    - building scorers is important (see the advantage sketch below)
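A minimal sketch of the group-relative advantage at the core of GRPO: sample a group of completions per prompt, score each with the reward function, and normalize rewards within the group instead of training a separate value network. The clipped policy-gradient and KL terms are omitted here.

```python
# Group-relative advantage: normalize rewards within the group of completions
# sampled for one prompt, so no separate value network (critic) is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per sampled completion
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions for one prompt, scored by the rule-based reward above.
rewards = torch.tensor([1.7, 0.5, 1.7, 0.0])
advantages = group_relative_advantages(rewards)
# Completions scoring above the group mean get positive advantage; these advantages
# then weight GRPO's clipped policy-gradient objective (clipping/KL omitted here).
```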
 
- Rejection sampling on 3/4 reasoning problems, 1/4 general queries (sketch below)
    - Q. what does the reward model look like / where does it come from?
    - 800k completions: 600k reasoning + 200k general chat
    - relies on LLM judges, shared post-training data, and chat data augmented with CoT
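A sketch of the rejection-sampling step under stated assumptions: sample several completions per prompt, keep only those that pass a checker (a rule-based verifier for reasoning problems, an LLM judge for general chat), and reuse the survivors as SFT data. The `generate`, `is_reasoning`, `verify_answer`, and `llm_judge_ok` helpers are hypothetical.

```python
# Rejection-sampling sketch: keep only completions that pass a verifier or judge,
# then reuse the surviving (prompt, completion) pairs as SFT data.
# `generate`, `is_reasoning`, `verify_answer`, and `llm_judge_ok` are hypothetical helpers.

def rejection_sample(prompts, generate, is_reasoning, verify_answer, llm_judge_ok, n_samples=16):
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, n=n_samples):
            if is_reasoning(prompt):
                ok = verify_answer(prompt, completion)    # rule-based check on the final answer
            else:
                ok = llm_judge_ok(prompt, completion)     # LLM judge scores general chat quality
            if ok:
                kept.append({"prompt": prompt, "completion": completion})
    return kept   # in the source: ~800k kept completions (600k reasoning + 200k general chat)
```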
 
- RL mixing reasoning problems & preference-tuned reward models (sketch below)
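A sketch of how this stage might route rewards: verifiable rule-based reward for reasoning prompts, a learned preference reward model for general prompts. The routing logic and the `preference_rm` interface are assumptions.

```python
# Mixed-reward sketch for the final RL stage: route by prompt type.
# `is_reasoning`, `rule_based_reward`, and `preference_rm` are hypothetical callables.

def mixed_reward(prompt, completion, reference_answer,
                 is_reasoning, rule_based_reward, preference_rm):
    if is_reasoning(prompt):
        # Verifiable reasoning prompts: rule-based reward (accuracy/format/language).
        return rule_based_reward(prompt, completion, reference_answer)
    # General prompts: scalar score from a preference-trained reward model.
    return preference_rm(prompt, completion)
```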
- Needs a strong base model with long context
    - the specific traits required are unknown
 
- Visualize training-time responses (plotting sketch below)
    - observed generation length increasing over training as well
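A small sketch of tracking mean completion length over training steps, since the notes observe generation length growing during RL; the in-memory log format is assumed.

```python
# Plot mean completion length per training step; the log format is assumed.
import matplotlib.pyplot as plt

def plot_generation_length(length_log: dict[int, list[int]]) -> None:
    # length_log maps training step -> token counts of completions sampled at that step.
    steps = sorted(length_log)
    mean_lengths = [sum(length_log[s]) / len(length_log[s]) for s in steps]
    plt.plot(steps, mean_lengths)
    plt.xlabel("training step")
    plt.ylabel("mean completion length (tokens)")
    plt.title("Generation length during RL training")
    plt.show()
```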