Working Notes: my commonplace notebook for recording & exploring ideas.
DeepSeek R1 [2025-01-26]
Large scale reinforcement learning
- directly apply RL without SFT
- explore CoT – chain of thought
- the full R1 training is defined as a pipeline (cold start → RL → rejection sampling + SFT → RL)
Group Relative Policy Optimization
for each question q, sample a group of outputs from the old policy
optimize the policy model by maximizing …? (sketch below)
- expectation value of ()
- follow up with second paper
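A minimal sketch of the group-relative objective as I understand it (function names and the clipping constant are mine, not the paper's code): rewards for the G sampled outputs are normalized within the group to give advantages, which then weight a PPO-style clipped ratio.

```python
# Minimal sketch, assuming one scalar reward per sampled output for a single question.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,). Normalize within the group -- no learned critic needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate using the group-normalized advantages.
    logp_new / logp_old: summed log-probs of each output under the new / old policy."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Maximize this (the paper also subtracts a KL penalty against a reference policy).
    return torch.min(ratio * advantages, clipped * advantages).mean()
```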
Reward modeling
- rule-based rewards (sketch after this list)
  - accuracy – checks for a correct response in the required format
  - format rewards – require the thinking process to be wrapped in explicit tags
- no neural reward model (to follow up)
- sophisticated behaviors emerge as test-time computation increases
- based on interaction with the RL environment
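A hypothetical sketch of how the two rule-based rewards could be combined; the tag names and the boxed-answer convention are my assumptions for illustration, not the paper's exact rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that put the thinking process inside explicit tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Check the final answer against a reference (assumes a \\boxed{...} answer)."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def rule_based_reward(response: str, reference_answer: str) -> float:
    return accuracy_reward(response, reference_answer) + format_reward(response)
```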
Cold start data
collect good examples from humans
(details on how to make it work)
Misc: Rejection sampling
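Roughly what I understand rejection sampling to mean here: sample several completions per prompt, keep only the ones that pass the reward check, and reuse them as SFT data. `generate` and `reward_fn` are placeholders, not the paper's API.

```python
def rejection_sample(prompts, generate, reward_fn, samples_per_prompt=16, threshold=1.0):
    """Keep (prompt, completion) pairs whose reward clears the threshold."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if reward_fn(completion) >= threshold:
                kept.append((prompt, completion))
    return kept  # becomes the next round of SFT data
```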
DeepSeek V3 [2024-12-27]
Precis
- Showing amazing results with very little investment compared to Llama models.
- Try to implement the model / run inference on it if I can.
- Lots of tricks to make the most of the money invested; I’m a bit jealous
To learn
- Get better at inference
- Understand the model architecture
- Look for opportunities
- Second model to test numbers/understanding against, test simulations against
(potentially try and run it locally too)
Numbers
- 14.8T tokens
- 671B params
- 37B activated per token
- 2k GPU cluster (2048 H800s)…~256 hosts of 8 GPUs?
- 14.8T tokens * (3.7 days / trillion tokens)
- 119k GPU hours for context length extension
- 5k GPU hours for post-training
- 2.788M GPU hours total ~= $5.576M training cost at the assumed $2/GPU-hour (excluding experiments); arithmetic sketched after this list
- 16-way pipeline parallelism (PP)
- 64-way expert parallelism (EP) spanning 8 nodes
- ZeRO-1 data parallelism (DP)
- 61 layers
- 7168 hidden dimension
- attention head 128
- per head dimension 128
- kv compression dimension 512
- query compression dimension 1536
- all but the first 3 FFN layers are MoE
- each layer is 1 shared expert and 256 routed experts
- intermediate hidden dimension of expert 2048
- 8 experts per token for routed expert
- each token goes to at most 4 nodes
- prediction depth is set to 1 (one extra token)
- AdamW: β1 = 0.9, β2 = 0.95, weight decay 0.1
- start with 4k seq length
- learning rate increased slowly, then held constant, then decayed after 10T tokens
- annealing at the end
- gradient clipping
- batch size is gradually increased and then kept constant
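Sanity-checking the cost numbers: the paper quotes ~180K H800 GPU hours per trillion pre-training tokens and assumes a $2/GPU-hour rental rate.

```python
# Back-of-the-envelope check on the GPU-hour and dollar figures above.
gpu_hours_per_trillion = 180_000              # pre-training, per trillion tokens
pretraining = 14.8 * gpu_hours_per_trillion   # ~2.664M GPU hours
context_extension = 119_000                   # two-stage context length extension
post_training = 5_000                         # SFT + RL
total = pretraining + context_extension + post_training  # ~2.788M GPU hours
print(f"{total / 1e6:.3f}M GPU hours, ~${total * 2 / 1e6:.3f}M at $2/GPU-hour")
# Also: 180_000 / 2048 GPUs ≈ 88 hours ≈ 3.7 days per trillion tokens, matching above.
```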
Notes
- large moe model with 671B parameters, 37B activated per token (~5%)
- trying for performance vs cost; both efficient inference and training
- auxiliary-loss-free strategy for load balancing
- multi-token prediction (MTP) training objective
- FP8 training
- DualPipe algorithm for PP (reduces pipeline bubbles)
- efficient cross-node all-to-all kernels to use IB/NVLink bandwidth
- optimized memory footprint to avoid tensor parallelism
- 14.8T tokens
- 2 stage context length extension
- 32k to 128k
- SFT then RL
- distill reasoning from deepseek R1
- chain of thought model – r1 series
Architecture
- MoE inside the FFN layers, with a router
- multi-head latent attention (MLA) inside the attention block
- auxiliary-loss-free load balancing
  - measure how heavily loaded each expert is
  - adjust where tokens go by adding a per-expert bias to the gating score (sketch after this list)
- MTP
  - predict multiple tokens at each step
  - connect the output for the first token into the next step's module
  - related to speculative decoding
  - compute a cross-entropy loss as an additional training objective
  - discarded during inference, or can be repurposed to reduce latency
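A sketch of how I read the auxiliary-loss-free balancing: each routed expert carries a bias that is added to its affinity score only when picking the top-k experts, and the bias is nudged after each step so overloaded experts get picked less. Names and the exact update rule are my paraphrase, not the paper's code.

```python
import torch

def route_tokens(affinity: torch.Tensor, bias: torch.Tensor, top_k: int = 8) -> torch.Tensor:
    """affinity: (num_tokens, num_experts) gating scores; bias: (num_experts,).
    The bias only changes which experts get selected, not the gate weights used later."""
    return torch.topk(affinity + bias, top_k, dim=-1).indices

def update_bias(bias: torch.Tensor, expert_load: torch.Tensor, gamma: float = 1e-3) -> torch.Tensor:
    """Push down the bias of overloaded experts and pull up underloaded ones."""
    overloaded = expert_load > expert_load.mean()
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()
```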
Infrastructure
- misc: nice visualization showing compute and communication
- HAI-LLM framework
- DualPipe
- custom PTX instructions + tuning chunk sizes
- recompute RMSNorm during the backward pass (saves activation memory)
- EMA of model params kept in CPU memory for performance estimates
- FP8 (quantization sketch after this list)
  - fine-grained quantization
  - tile-wise or block-wise
  - cache and dispatch activations in FP8, optimizer states in BF16
  - validated on a run of ~1T tokens
  - maintain precision for the embedding module, output head, MoE gating, normalization, and attention operators
  - store master weights, weight gradients, and optimizer states in higher precision
  - increasing accumulation precision
  - E4M3 throughout by using smaller tiles
  - online quantization, deriving max values dynamically
  - even further customizations
  - qq: how did they choose which ones are worth handling more carefully?
- prefilling
  - redundant experts to load balance
  - based on live statistics, adjusted periodically
  - then rearrange the GPUs within a node
  - dynamic redundancy strategy
- inference / decoding
- data construction
  - multilingual, with more math and programming
  - document packing
- tokenizer
  - byte-level BPE
  - modified to optimize compression efficiency
  - pretokenizer introduces tokens that combine punctuation and line breaks
  - a fraction of these are randomly split further during training
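For the FP8 notes above, a hedged sketch of the fine-grained (per-tile / per-block) scaling idea with scales derived online from the max value; the block size of 128 matches my reading, but the rest (plain float tensors instead of a real FP8 dtype, evenly divisible shapes) is simplification for illustration.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile; assumes dimensions divide evenly by `block`."""
    rows, cols = x.shape
    q = torch.empty_like(x)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = x[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX  # derive max dynamically
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = tile / scale  # real code would cast to float8_e4m3 here
    return q, scales
```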
Follow ups
- context: DeepSeek the company: maybe this
- R1 series of DeepSeek models
- read history of papers
- understand performance implications of model decisions
- low rank compression
- HAI-LLM code?
- implement dualpipe
- warp specialization – IB sending, IB-to-NVLink forwarding, and NVLink receiving use different warps
- model performance estimates?
- follow up on GPU architecture information
- SwiGLU operator
- document packing
— Kunal