Working Notes: my commonplace notebook for recording & exploring ideas.

DeepSeek R1 [2025-01-26]

Large scale reinforcement learning

Group Relative Policy Optimization

for each question q, sample a group of outputs from the old policy, then optimize the policy model by maximizing …?
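A minimal sketch of the "group relative" part, as I understand it: each sampled output's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed as a baseline. This only covers the advantage computation and does not fill in the full objective above. The function name, the `eps` guard, and the choice of population standard deviation are my assumptions for illustration, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per output sampled for the same
    question q. `eps` (an assumption) avoids division by zero when
    every output in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 outputs for one question, scored 1.0 (correct)
# or 0.0 (incorrect) by a rule-based reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat their group's average get a positive advantage and the rest get a negative one; if every output in the group scores the same, all advantages are zero and that group contributes no learning signal.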

Reward modeling

Cold start data

collect good examples from humans (details on how to make it work)

Misc: Rejection sampling

DeepSeek V3 [2024-12-27]

Precis

To learn

Numbers

Notes

architecture

Follow ups

Kunal