- Cold start with SFT on synthetic reasoning data
    - very carefully prepared dataset (SFT objective sketched below)
    - readies the loss landscape for emergent behaviors
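A minimal sketch of what the cold-start SFT objective could look like, assuming a Hugging Face-style causal LM and tokenized (prompt, reasoning trace) pairs; the function name and data layout are placeholders, not the actual pipeline:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids: torch.Tensor, trace_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on the reasoning trace only; prompt tokens are
    masked out of the loss. prompt_ids / trace_ids: 1-D token id tensors."""
    input_ids = torch.cat([prompt_ids, trace_ids]).unsqueeze(0)  # (1, T)
    labels = input_ids.clone()
    labels[:, : prompt_ids.numel()] = -100                       # ignore prompt positions
    logits = model(input_ids).logits                             # (1, T, vocab)
    # Standard causal-LM shift: position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```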
- Large-scale RL until convergence
    - data = prompts; the model generates completions
    - verifiable reward on completions
    - rewards (scorer sketch below)
        - accuracy = bonus for verifiably correct responses
        - format = follows the expected output format, for stable inference
        - language consistency = respond in the same language as the question, which makes the model easier to use
    - Group Relative Policy Optimization (GRPO) (advantage sketch below)
    - building scorers is important
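A minimal sketch of the three reward terms listed above. The `\boxed{}` answer convention, the `<think>`/`<answer>` tags, the CJK heuristic, and the term weights are illustrative assumptions, not the actual scorers:

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Assumed convention: the final answer is wrapped in \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    # Reward completions that keep reasoning and answer inside the expected tags.
    ok = re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>\s*", completion)
    return 0.5 if ok else 0.0

def language_consistency_reward(completion: str, question_language: str) -> float:
    # Placeholder: a real scorer would use a language-ID model; this sketch only
    # penalizes CJK characters when the question is in English.
    has_cjk = re.search(r"[\u4e00-\u9fff]", completion) is not None
    if question_language == "en":
        return 0.0 if has_cjk else 0.2
    return 0.2  # treated as consistent for non-English questions in this sketch

def total_reward(completion: str, reference_answer: str, question_language: str) -> float:
    return (accuracy_reward(completion, reference_answer)
            + format_reward(completion)
            + language_consistency_reward(completion, question_language))
```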
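A sketch of the core of GRPO: sample a group of completions per prompt, use the group's reward mean/std as the baseline instead of a learned value network, and plug the resulting advantages into a PPO-style clipped objective. GRPO broadcasts each advantage over the completion's tokens and adds a KL penalty to a reference policy; both are simplified away here, and the clip value is an assumption:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per sampled completion.
    Group-relative advantage = (r - mean(group)) / std(group)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective at the completion level (per-token broadcasting
    and the KL-to-reference term are omitted in this sketch)."""
    ratio = torch.exp(logp_new - logp_old)                       # (num_prompts, group_size)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```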
- Rejection sampling on 3/4 reasoning problems, 1/4 general queries (sketch below)
    - Q. what does the reward model look like / where does it come from?
    - 800k completions: 600k reasoning + 200k general chat
    - relies on LLM judges, shared post-training data, and chat data augmented with CoT
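A minimal sketch of how the rejection-sampling pass could assemble the SFT set: sample several completions per prompt and keep the best one only if it clears a score threshold. `generate` and `score` are placeholders for the policy model and the verifier / LLM judge:

```python
def rejection_sample(prompts, generate, score, num_samples: int = 8, threshold: float = 1.0):
    """Keep the highest-scoring completion per prompt if it clears `threshold`."""
    kept = []
    for prompt in prompts:
        completions = generate(prompt, num_samples)               # sample a group per prompt
        scored = [(score(prompt, c), c) for c in completions]     # verifier or LLM-judge score
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": prompt, "completion": best})
    return kept
```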
- RL mixing reasoning problems with preference-tuned reward models (reward routing sketched below)
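A sketch of how this mixed stage could route prompts to different reward sources: rule-based verifiable rewards for reasoning prompts, a learned preference reward model for general prompts. The routing key and function names are assumptions for illustration:

```python
def mixed_reward(prompt: str, completion: str, prompt_type: str, verifier, preference_rm) -> float:
    """Route each prompt to the appropriate reward source.
    `verifier` = rule-based checker for reasoning prompts;
    `preference_rm` = reward model trained on human preference data."""
    if prompt_type == "reasoning":
        return verifier(prompt, completion)       # verifiable, typically 0/1-style reward
    return preference_rm(prompt, completion)      # scalar preference score
```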
- Needs a strong base model with long context
    - the specific traits required are unknown
- Visualize responses over the course of training (length plot sketched below)
    - generation length was observed to increase as training progressed
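A minimal sketch of one such visualization: log completion lengths at each RL step and plot the mean over training. The data layout is an assumption:

```python
import matplotlib.pyplot as plt

def plot_generation_length(step_lengths):
    """step_lengths: list of (training_step, [completion_token_lengths]) tuples
    collected during RL training."""
    steps = [step for step, _ in step_lengths]
    mean_lengths = [sum(lengths) / len(lengths) for _, lengths in step_lengths]
    plt.plot(steps, mean_lengths)
    plt.xlabel("RL training step")
    plt.ylabel("Mean completion length (tokens)")
    plt.title("Generation length over training")
    plt.show()
```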