-
MoE inside the FFN, with a router
-
multi-head attention inside the attention block
-
auxiliary load balancing
- track how much load each expert is handling
- adjust where tokens are routed using a gating multiplier (sketch below)
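toy sketch of the router + load-based gating adjustment (my own simplification, not the paper's exact rule; the sigmoid scores, top-k of 2, and the fixed bias step are assumptions):

```python
# Toy top-k MoE router with a load-aware gating adjustment.
import torch

class ToyRouter(torch.nn.Module):
    def __init__(self, dim: int, n_experts: int, top_k: int = 2, step: float = 0.01):
        super().__init__()
        self.proj = torch.nn.Linear(dim, n_experts, bias=False)
        # Per-expert gating bias used only for expert *selection*.
        self.register_buffer("gate_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.step = step

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim) -> affinity score per expert
        scores = torch.sigmoid(self.proj(x))            # (tokens, n_experts)
        adjusted = scores + self.gate_bias              # bias steers selection
        _, topk_idx = adjusted.topk(self.top_k, dim=-1)
        # Gating weights come from the unbiased scores of the selected experts.
        gates = torch.gather(scores, -1, topk_idx)
        return topk_idx, gates

    @torch.no_grad()
    def update_bias(self, topk_idx: torch.Tensor):
        # Measure how many tokens each expert just received...
        load = torch.bincount(topk_idx.flatten(), minlength=self.gate_bias.numel()).float()
        # ...then push overloaded experts down and underloaded ones up.
        overloaded = load > load.mean()
        self.gate_bias[overloaded] -= self.step
        self.gate_bias[~overloaded] += self.step

router = ToyRouter(dim=64, n_experts=8)
tokens = torch.randn(32, 64)
idx, gates = router(tokens)
router.update_bias(idx)
```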
-
MTP (multi-token prediction)
- predict multiple tokens at each step
- connect the output of the first token's prediction to the next steps
- speculative decoding
- compute a cross-entropy loss on the extra predictions as an additional training objective (sketch below)
- discarded during inference; can be repurposed to reduce latency
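rough sketch of the extra objective (simplified to a single extra linear head predicting t+2, whereas the paper's MTP modules chain through additional blocks; the 0.3 weight is arbitrary):

```python
# Toy multi-token-prediction loss: main head predicts t+1, an extra head predicts t+2.
import torch
import torch.nn.functional as F

vocab, dim, seq = 100, 32, 16
hidden = torch.randn(2, seq, dim)            # stand-in for transformer outputs
targets = torch.randint(0, vocab, (2, seq))  # token ids

main_head = torch.nn.Linear(dim, vocab)
mtp_head = torch.nn.Linear(dim, vocab)

# Main objective: predict the token at position t+1 from the hidden state at t.
main_logits = main_head(hidden[:, :-1])
main_loss = F.cross_entropy(main_logits.reshape(-1, vocab), targets[:, 1:].reshape(-1))

# Additional objective: predict the token at position t+2 from the same hidden state.
mtp_logits = mtp_head(hidden[:, :-2])
mtp_loss = F.cross_entropy(mtp_logits.reshape(-1, vocab), targets[:, 2:].reshape(-1))

loss = main_loss + 0.3 * mtp_loss            # arbitrary weighting for the sketch
loss.backward()
```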
-
infrastructure
-
misc: nice visualization showing compute and communication
-
HAI-LLM framework
-
DualPipe
-
custom PTX instructions + tuning chunk size
-
recompute RMSNorm during the backward pass
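this reads like standard activation checkpointing; a minimal PyTorch sketch of that pattern (my mapping, not their kernel-level implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

norm = RMSNorm(1024)
x = torch.randn(8, 1024, requires_grad=True)
# Recompute the norm in the backward pass instead of storing its activations.
y = checkpoint(norm, x, use_reentrant=False)
y.sum().backward()
```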
-
EMA of model parameters kept in CPU memory, for performance estimates
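small sketch of keeping an EMA of parameters in CPU memory while training runs elsewhere (decay value and update cadence are assumptions):

```python
import torch

class CpuEMA:
    """Keep an exponential moving average of model params in CPU memory."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().cpu().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach().cpu(), alpha=1 - self.decay)

    @torch.no_grad()
    def copy_into(self, model: torch.nn.Module):
        # Load the averaged weights into (a copy of) the model for evaluation.
        for n, p in model.named_parameters():
            p.copy_(self.shadow[n].to(p.device))

model = torch.nn.Linear(16, 16)
ema = CpuEMA(model)
# ... after each optimizer step:
ema.update(model)
```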
-
FP8
- fine-grained quantization
- tile-wise or block-wise scaling (sketch after this list)
- cache and dispatch activations in FP8, optimizer states in BF16
- validated over ~1T tokens
- keep higher precision for the embedding module, output head, MoE gating, normalization, and attention operators
- store master weights, weight gradients, and optimizer states in higher precision
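toy version of the fine-grained quantization, with per-tile scales derived online from the current max; 448 is the E4M3 max normal value, float16 is only a stand-in container for FP8 storage, and the 1x128 tile shape in the usage line is an assumption:

```python
# Toy tile-/block-wise quantization with per-tile scales derived from the live max.
import numpy as np

E4M3_MAX = 448.0

def quantize_blockwise(x: np.ndarray, block: tuple[int, int]):
    """Quantize a 2-D array with one scale per (block[0] x block[1]) tile."""
    rows, cols = x.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    scales = np.empty((rows // br, cols // bc), dtype=np.float32)
    q = np.empty_like(x, dtype=np.float16)
    for i in range(0, rows, br):
        for j in range(0, cols, bc):
            tile = x[i:i + br, j:j + bc]
            # "Online" quantization: the scale comes from this tile's current max.
            scale = max(np.abs(tile).max(), 1e-12) / E4M3_MAX
            scales[i // br, j // bc] = scale
            q[i:i + br, j:j + bc] = np.clip(tile / scale, -E4M3_MAX, E4M3_MAX)
    return q, scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, block: tuple[int, int]):
    br, bc = block
    return q.astype(np.float32) * np.kron(scales, np.ones((br, bc), dtype=np.float32))

x = np.random.randn(128, 256).astype(np.float32)
q, s = quantize_blockwise(x, block=(1, 128))      # activation-style tiles
x_hat = dequantize_blockwise(q, s, block=(1, 128))
print("max abs error:", np.abs(x - x_hat).max())
```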
-
increasing accumulation precision
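tiny demo of why promoting partial sums to a wider format matters: a low-precision accumulator stalls once the running sum gets large (float16 stands in for the low-precision format; the chunk size of 128 is an assumption):

```python
import numpy as np

vals = np.full(4096, 0.01, dtype=np.float16)   # true sum = 40.96

# Naive low-precision accumulation: once the sum reaches ~32, adding 0.01
# falls below half an ulp of float16 and the accumulator stops growing.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Chunked accumulation: sum short chunks in low precision, then promote the
# partial results to float32 before combining them.
chunk = 128
acc32 = np.float32(0.0)
for i in range(0, len(vals), chunk):
    partial = np.float16(0.0)
    for v in vals[i:i + chunk]:
        partial = np.float16(partial + v)
    acc32 += np.float32(partial)

print("naive float16 :", float(acc16))   # noticeably below 40.96
print("chunked + fp32:", float(acc32))   # close to 40.96
```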
-
E4M3 throughout by using smaller tiles
-
online quantization by deriving max values dynamically
-
even further customizations
-
qq: how did they choose which ones are worth handling more carefully?
-
prefilling
- redundant experts to balance load (sketch after this list)
- chosen based on live statistics and adjusted periodically
- then rearrange GPUs within a node
- dynamic redundancy strategy
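toy sketch of planning redundant experts from live load statistics (the data structures, threshold, and round-robin placement are made up for illustration):

```python
from collections import Counter

def plan_redundant_experts(load_counts: dict[int, int], n_gpus: int, n_redundant: int):
    """Pick the most-loaded experts to duplicate and assign replicas round-robin to GPUs.

    load_counts: expert_id -> number of tokens routed there (live statistics).
    Returns a list of (expert_id, gpu_id) placements for the extra replicas.
    """
    hottest = [eid for eid, _ in Counter(load_counts).most_common(n_redundant)]
    return [(eid, i % n_gpus) for i, eid in enumerate(hottest)]

# Example: experts 3 and 7 are overloaded, so each gets an extra replica.
stats = {0: 120, 1: 90, 2: 80, 3: 900, 4: 110, 5: 95, 6: 70, 7: 640}
print(plan_redundant_experts(stats, n_gpus=8, n_redundant=2))
# -> [(3, 0), (7, 1)]
```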
-
inference / decoding
-
data construction
- multilingual, with more maths and programming
- document packing (sketch below)
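minimal greedy packing sketch: concatenate tokenized documents into fixed-length training sequences so little is wasted on padding (the sequence length and separator token are assumptions):

```python
def pack_documents(docs: list[list[int]], seq_len: int, sep_id: int = 0) -> list[list[int]]:
    """Greedily pack tokenized documents into sequences of at most seq_len tokens."""
    sequences, current = [], []
    for doc in docs:
        piece = doc + [sep_id]                    # separator between documents
        while piece:
            space = seq_len - len(current)
            current.extend(piece[:space])
            piece = piece[space:]
            if len(current) == seq_len:           # sequence full -> start a new one
                sequences.append(current)
                current = []
    if current:
        sequences.append(current)                 # last (possibly short) sequence
    return sequences

docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
print(pack_documents(docs, seq_len=6))
# [[1, 2, 3, 0, 4, 5], [6, 7, 8, 0, 9, 10], [0]]
```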
-
tokenizer
- byte-level BPE
- modified to optimize compression efficiency
- pretokenizer introduces tokens that combine punctuation and line breaks
- a fraction of these combined tokens is randomly split during training (sketch below)
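small sketch of randomly splitting combined punctuation+line-break tokens during training (the split probability and token pattern are assumptions):

```python
import random
import re

# Tokens that fuse punctuation with a following line break, e.g. ".\n" or "!\n\n".
COMBINED = re.compile(r"^([.!?,;:])(\n+)$")

def maybe_split(tokens: list[str], p: float = 0.1) -> list[str]:
    """With probability p, split a combined punctuation+newline token back into two tokens."""
    out = []
    for tok in tokens:
        m = COMBINED.match(tok)
        if m and random.random() < p:
            out.extend([m.group(1), m.group(2)])   # e.g. ".\n" -> ".", "\n"
        else:
            out.append(tok)
    return out

print(maybe_split(["Hello", ".\n", "world", "!\n"], p=1.0))
# ['Hello', '.', '\n', 'world', '!', '\n']
```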