Working Notes: my commonplace notebook for recording & exploring ideas.
DeepSeek R1 [2025-01-26]
Large scale reinforcement learning
- directly apply RL without SFT
- explore CoT – chain of thought
- the full R1 training is defined as a pipeline (cold start → RL → rejection sampling + SFT → RL)
Group Relative Policy Optimization
for each question q, sample a group of outputs from the old policy
optimize the policy model by maximizing …? (sketch below)
- expectation value of ()
- follow up with second paper
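A minimal sketch of the group-relative objective as I understand it (function names and the clipping constant are mine, not the paper's code): rewards for the G sampled outputs are normalized within the group to give advantages, which then weight a PPO-style clipped ratio.

```python
# Minimal sketch, assuming one scalar reward per sampled output for a single question.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,). Normalize within the group -- no learned critic needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate using the group-normalized advantages.
    logp_new / logp_old: summed log-probs of each output under the new / old policy."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Maximize this (the paper also subtracts a KL penalty against a reference policy).
    return torch.min(ratio * advantages, clipped * advantages).mean()
```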
Reward modeling
- rule-based rewards (sketch after this list)
  - accuracy – checks for a correct response in the required format
  - format rewards – require the thinking process to be wrapped in explicit tags
- no neural reward model (to follow up)
- sophisticated behaviors emerge as test-time computation increases
- based on interaction with the RL environment
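A hypothetical sketch of how the two rule-based rewards could be combined; the tag names and the boxed-answer convention are my assumptions for illustration, not the paper's exact rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that put the thinking process inside explicit tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Check the final answer against a reference (assumes a \\boxed{...} answer)."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def rule_based_reward(response: str, reference_answer: str) -> float:
    return accuracy_reward(response, reference_answer) + format_reward(response)
```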
Cold start data
collect good examples from humans
(details on how to make it work)
Misc: Rejection sampling
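Roughly what I understand rejection sampling to mean here: sample several completions per prompt, keep only the ones that pass the reward check, and reuse them as SFT data. `generate` and `reward_fn` are placeholders, not the paper's API.

```python
def rejection_sample(prompts, generate, reward_fn, samples_per_prompt=16, threshold=1.0):
    """Keep (prompt, completion) pairs whose reward clears the threshold."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if reward_fn(completion) >= threshold:
                kept.append((prompt, completion))
    return kept  # becomes the next round of SFT data
```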
DeepSeek V3 [2024-12-27]
Precis
- Showing amazing results with very little investment compared to Llama models.
- Try to implement the model / run inference on it if I can.
- Lots of tricks to make the most of the money invested; I’m a bit jealous
To learn
- Get better at inference
- Understand the model architecture
- Look for opportunities
- Second model to test numbers/understanding against, test simulations against
(potentially try and run it locally too)
Numbers
- 14.8T tokens
- 671B params
- 37B activated per token
- 2k GPU cluster (2048 H800s)…~256 hosts of 8 GPUs?
- 14.8T tokens * (3.7 days / trillion tokens)
- 119k GPU hours for context length extension
- 5k GPU hours for post-training
- 2.788M GPU hours total ~= $5.576M training cost at the assumed $2/GPU-hour (excluding experiments); arithmetic sketched after this list
- 16-way pipeline parallelism (PP)
- 64-way expert parallelism (EP) spanning 8 nodes
- ZeRO-1 data parallelism (DP)
- 61 layers
- 7168 hidden dimension
- attention head 128
- per head dimension 128
- kv compression dimension 512
- query compression dimension 1536
- all but the first 3 FFN layers are MoE
- each layer is 1 shared expert and 256 routed experts
- intermediate hidden dimension of expert 2048
- 8 experts per token for routed expert
- each token goes to at most 4 nodes
- prediction depth is set to 1 (one extra token)
- AdamW: β1 = 0.9, β2 = 0.95, weight decay 0.1
- start with 4k seq length
- learning rate increased slowly, then held constant, then decayed after 10T tokens
- annealing at the end
- gradient clipping
- batch size is gradually increased and then kept constant
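Sanity-checking the cost numbers: the paper quotes ~180K H800 GPU hours per trillion pre-training tokens and assumes a $2/GPU-hour rental rate.

```python
# Back-of-the-envelope check on the GPU-hour and dollar figures above.
gpu_hours_per_trillion = 180_000              # pre-training, per trillion tokens
pretraining = 14.8 * gpu_hours_per_trillion   # ~2.664M GPU hours
context_extension = 119_000                   # two-stage context length extension
post_training = 5_000                         # SFT + RL
total = pretraining + context_extension + post_training  # ~2.788M GPU hours
print(f"{total / 1e6:.3f}M GPU hours, ~${total * 2 / 1e6:.3f}M at $2/GPU-hour")
# Also: 180_000 / 2048 GPUs ≈ 88 hours ≈ 3.7 days per trillion tokens, matching above.
```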
Notes
- large moe model with 671B parameters, 37B activated per token (~5%)
- trying for performance vs cost; both efficient inference and training
- auxiliary-loss-free strategy for load balancing
- multi-token prediction (MTP) training objective
- FP8 training
- DualPipe algorithm for PP (reduces pipeline bubbles)
- efficient cross-node all-to-all kernels to use IB/NVLink bandwidth
- optimized memory footprint to avoid tensor parallelism
- 14.8T tokens
- 2 stage context length extension
- 32k to 128k
- SFT then RL
- distill reasoning from deepseek R1
- chain of thought model – r1 series
Architecture
- MoE inside the FFN layers, with a router
- multi-head latent attention (MLA) inside the attention block
- auxiliary-loss-free load balancing
  - measure how heavily loaded each expert is
  - adjust where tokens go by adding a per-expert bias to the gating score (sketch after this list)
- MTP
  - predict multiple tokens at each step
  - connect the output for the first token into the next step's module
  - related to speculative decoding
  - compute a cross-entropy loss as an additional training objective
  - discarded during inference, or can be repurposed to reduce latency
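A sketch of how I read the auxiliary-loss-free balancing: each routed expert carries a bias that is added to its affinity score only when picking the top-k experts, and the bias is nudged after each step so overloaded experts get picked less. Names and the exact update rule are my paraphrase, not the paper's code.

```python
import torch

def route_tokens(affinity: torch.Tensor, bias: torch.Tensor, top_k: int = 8) -> torch.Tensor:
    """affinity: (num_tokens, num_experts) gating scores; bias: (num_experts,).
    The bias only changes which experts get selected, not the gate weights used later."""
    return torch.topk(affinity + bias, top_k, dim=-1).indices

def update_bias(bias: torch.Tensor, expert_load: torch.Tensor, gamma: float = 1e-3) -> torch.Tensor:
    """Push down the bias of overloaded experts and pull up underloaded ones."""
    overloaded = expert_load > expert_load.mean()
    return bias - gamma * overloaded.float() + gamma * (~overloaded).float()
```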
Infrastructure
- misc: nice visualization showing compute and communication
- HAI-LLM framework
- DualPipe
- custom PTX instructions + tuning chunk sizes
- recompute RMSNorm during the backward pass (saves activation memory)
- EMA of model params kept in CPU memory for performance estimates
- FP8 (quantization sketch after this list)
  - fine-grained quantization
  - tile-wise or block-wise
  - cache and dispatch activations in FP8, optimizer states in BF16
  - validated on a run of ~1T tokens
  - maintain precision for the embedding module, output head, MoE gating, normalization, and attention operators
  - store master weights, weight gradients, and optimizer states in higher precision
  - increasing accumulation precision
  - E4M3 throughout by using smaller tiles
  - online quantization, deriving max values dynamically
  - even further customizations
  - qq: how did they choose which ones are worth handling more carefully?
- prefilling
  - redundant experts to load balance
  - based on live statistics, adjusted periodically
  - then rearrange the GPUs within a node
  - dynamic redundancy strategy
- inference / decoding
- data construction
  - multilingual, with more math and programming
  - document packing
- tokenizer
  - byte-level BPE
  - modified to optimize compression efficiency
  - pretokenizer introduces tokens that combine punctuation and line breaks
  - a fraction of these are randomly split further during training
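For the FP8 notes above, a hedged sketch of the fine-grained (per-tile / per-block) scaling idea with scales derived online from the max value; the block size of 128 matches my reading, but the rest (plain float tensors instead of a real FP8 dtype, evenly divisible shapes) is simplification for illustration.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile; assumes dimensions divide evenly by `block`."""
    rows, cols = x.shape
    q = torch.empty_like(x)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = x[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX  # derive max dynamically
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = tile / scale  # real code would cast to float8_e4m3 here
    return q, scales
```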
Follow ups
- context: DeepSeek the company: maybe this
- R1 series of DeepSeek models
- read history of papers
- understand performance implications of model decisions
- low rank compression
- HAI-LLM code?
- implement dualpipe
- warp specialization – IB sending, IB-to-NVLink forwarding, and NVLink receiving use different warps
- model performance estimates?
- follow up on GPU architecture information
- SwiGLU operator
- document packing
— Kunal