Working Notes: a commonplace notebook for recording & exploring ideas.

DeepSeek R1 [2025-01-26]

tags: transformers
when: 2024-12-27
publish: 2025

Large-scale reinforcement learning

Group Relative Policy Optimization

for each question q, sample a group of outputs from the old policy, then optimize the policy model by maximizing ...?
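A minimal sketch of one GRPO update target for a single question, assuming the formulation in the DeepSeekMath and R1 papers. Each output's advantage is its reward normalized by the mean and standard deviation of its own group, so no separate value model is needed. The policy then maximizes a PPO-style clipped surrogate minus a KL penalty to a frozen reference policy. Sequence-level log-probabilities, the population std, and the `clip_eps`/`beta` defaults are illustrative assumptions; real implementations often work per token and batch across many questions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sampled output relative to its own group: A_i = (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G outputs (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_objective(
    logp_new: list[float],
    logp_old: list[float],
    logp_ref: list[float],
    rewards: list[float],
    clip_eps: float = 0.2,  # illustrative value
    beta: float = 0.04,     # illustrative KL weight
) -> float:
    """Objective to maximize for one question's group of G outputs.

    Each logp_* entry is log pi(o_i | q) for output o_i under the current,
    old (sampling), and frozen reference policies respectively.
    """
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)  # PPO-style pessimistic bound
        # KL estimate pi_ref/pi - log(pi_ref/pi) - 1, which is always >= 0.
        log_r = lp_ref - lp_new
        kl = math.exp(log_r) - log_r - 1.0
        total += surrogate - beta * kl
    return total / len(rewards)
```

Because advantages are normalized within the group, an output is only pushed up if it beats its siblings; a group where every output gets the same reward contributes nothing to the surrogate term.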

Reward modeling
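For R1-Zero the R1 paper describes rule-based rewards rather than a neural reward model: an accuracy reward (e.g. checking a math answer given in a required format such as a box) and a format reward for putting the reasoning between `<think>` and `</think>` tags. A sketch of those two checks; the function names, regexes, exact-string comparison, and equal weighting are all hypothetical choices, not DeepSeek's code.

```python
import re

# Hypothetical patterns: a leading <think>...</think> block and a \boxed{...} answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """1.0 if the completion opens with a <think>...</think> block, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last \\boxed{...} answer matches the reference string exactly."""
    answers = BOXED_PATTERN.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str) -> float:
    # Equal weighting of the two signals is an illustrative assumption.
    return accuracy_reward(completion, reference) + format_reward(completion)
```

Rules like these are cheap and hard to game compared with a learned reward model, at the cost of only covering tasks with checkable answers.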

Cold start data

collect good examples from humans (details on how to make it work)

Misc: Rejection sampling
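The R1 paper uses rejection sampling on the RL checkpoint to collect SFT data: sample several responses per prompt and keep only those judged correct (with some extra readability filtering). A generic sketch of that loop; `generate`, `accept`, the dict layout, and the default sample count are hypothetical stand-ins.

```python
from typing import Callable


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    accept: Callable[[str, str], bool],
    num_samples: int = 16,
) -> list[str]:
    """Draw num_samples completions for one prompt and keep only the accepted ones.

    `generate` stands in for sampling from the RL checkpoint and `accept` for
    whatever correctness/readability filter is applied (both hypothetical).
    """
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [c for c in candidates if accept(prompt, c)]


def build_sft_dataset(prompts, generate, accept, num_samples=16):
    """Pair each prompt with its surviving completions as SFT examples."""
    dataset = []
    for prompt in prompts:
        for completion in rejection_sample(prompt, generate, accept, num_samples):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```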

DeepSeek V3 [2024-12-27]

Precis

To learn

Numbers

Notes

architecture

Follow-ups

Kunal