Llama2
Llama 2 [[paper]] and notes: arXiv
First Pass
- I work with some of the authors, which is pretty cool.
- Fine tuning is expensive, actually creates the product, and is not reproducible
- 5-10% violation rate on ~2k adversarial prompts
- 7B, 13B, 34B & 70B variants trained; Llama 2-Chat as well
- steps:
    - pre-training on public data, 2 trillion tokens
    - supervised fine-tuning
    - RLHF, with iterative reward-modeling data collection
- vs Llama 1:
    - longer context: 4k tokens vs 2k
    - grouped-query attention on the larger models (see the attention sketch after this list)
- final training loss of ~1.5 vs ~1.75 for 70B vs 7B
- tokenizer: numbers are split into individual digits; unknown UTF-8 characters fall back to raw bytes (see the tokenizer sketch after this list)
- 32k-token vocabulary
- trained on A100s
- RSC: NVIDIA Quantum InfiniBand interconnect, 200 Gbps, 400W per GPU
- Internal production cluster: RoCE over commodity Ethernet switches, 200 Gbps, 350W per GPU
- 31 tCO2eq vs 291 tCO2eq for 7B vs 70B -- almost exactly linear, which makes sense: emissions are proportional to GPU-hours at a fixed power cap
- 3.3M GPU-hours total, on A100-80GB
- SFT: only tens of thousands of high-quality annotations needed; used ~30k
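
To make the grouped-query attention note concrete, here's a minimal shape-level sketch in PyTorch. The head counts and dims are made up (Llama 2's actual values differ), and projections, masking, and RoPE are all omitted; the point is just how a few KV heads get shared across many query heads:

```python
import torch

# Grouped-query attention, shapes only: n_q query heads share n_kv < n_q
# key/value heads, which shrinks the KV cache. Dims here are illustrative,
# not Llama 2's; projections, masking, and RoPE are omitted.
B, T, n_q, n_kv, d = 2, 16, 8, 2, 64

q = torch.randn(B, n_q, T, d)
k = torch.randn(B, n_kv, T, d)
v = torch.randn(B, n_kv, T, d)

# Each KV head serves a group of n_q // n_kv = 4 query heads.
k = k.repeat_interleave(n_q // n_kv, dim=1)
v = v.repeat_interleave(n_q // n_kv, dim=1)

att = (q @ k.transpose(-2, -1)) / d ** 0.5
out = att.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([2, 8, 16, 64])
```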
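And a toy pre-tokenizer to illustrate the digit-splitting and byte-fallback notes above. This is a hypothetical sketch, not the actual SentencePiece BPE implementation; the vocabulary stand-in and byte-token naming are made up:

```python
# Toy illustration: numbers split into individual digits, characters
# outside the vocabulary decomposed into raw UTF-8 bytes. Hypothetical;
# the real tokenizer is SentencePiece BPE with a 32k vocab.
KNOWN = set("abcdefghijklmnopqrstuvwxyz ")  # stand-in for the vocabulary

def pre_tokenize(text: str) -> list[str]:
    tokens = []
    for ch in text:
        if ch.isdigit():
            tokens.append(ch)          # every digit is its own token
        elif ch in KNOWN:
            tokens.append(ch)          # normal vocabulary hit
        else:
            # byte fallback: emit one token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

print(pre_tokenize("pi is 314"))  # digits '3', '1', '4' stay separate
print(pre_tokenize("café"))       # 'é' -> ['<0xC3>', '<0xA9>']
```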
Fine-tuning
- concatenate all prompts and answers, with a special token to separate them
- autoregressive objective; loss is zeroed on prompt tokens, so backprop only happens on answer tokens (see the sketch below)
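
A minimal sketch of that objective in PyTorch. The token ids, separator id, and the random tensor standing in for the model's logits are all made up; the point is the label masking, not the exact setup:

```python
import torch
import torch.nn.functional as F

# SFT sketch: prompt and answer concatenated with a separator token; the
# autoregressive loss is masked on the prompt so only answer tokens
# contribute gradients. Token ids and vocab size are made up.
SEP, IGNORE = 2, -100                      # -100 is ignored by cross_entropy

prompt = torch.tensor([5, 17, 42])         # hypothetical prompt token ids
answer = torch.tensor([8, 31, 9])          # hypothetical answer token ids

inputs = torch.cat([prompt, torch.tensor([SEP]), answer])
labels = inputs.clone()
labels[: len(prompt) + 1] = IGNORE         # no loss on prompt or separator

# next-token shift: position i predicts token i + 1
logits = torch.randn(len(inputs) - 1, 32_000)  # stand-in for model(inputs[:-1])
loss = F.cross_entropy(logits, labels[1:], ignore_index=IGNORE)
print(loss.item())
```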
Training the reward model
- binary ranking loss, with a margin term that scales with how distinct the annotators' preference rating is (formula below)
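
The loss itself, as given in the paper: a binary ranking loss over a chosen/rejected response pair, with a margin m(r) that is larger the more clearly annotators preferred one response:

```latex
\mathcal{L}_{\text{ranking}}
  = -\log\,\sigma\bigl(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\bigr)
```

where r_theta(x, y) is the scalar reward for prompt x and response y, y_c is the chosen and y_r the rejected response, and m(r) is a discrete margin derived from the preference rating.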
Read more about
- [[RLHF]] Reinforcement learning from human feedback
- [[SFT]] Supervised fine-tuning
- [[Binary ranking loss]] (with margin)
- [[Autoregressive Objective]]
Papers to read
- Transformer architecture: Vaswani et al., 2017 ("Attention Is All You Need")
- RMSNorm: Zhang & Sennrich
- SwiGLU: Shazeer
- Rotary positional embeddings (RoPE): Su et al.
- AdamW: Loshchilov & Hutter
- [[BPE]] Byte-pair encoding: Sennrich et al.
- SentencePiece: Kudo & Richardson
- RSC paper: Lee & Sengupta
- HumanEval: Chen et al.
- Binary ranking loss: Ouyang et al. (InstructGPT)
— Kunal