Llama2
Llama 2 [[paper]] and notes: arXiv
First Pass
- I work with some of the authors, which is pretty cool.
- Fine tuning is expensive, actually creates the product, and is not reproducible
- 5-10% violation rate on ~2k adversarial prompts
- 7B, 13B, 34B & 70B variants trained; Llama 2-Chat as well
- steps:
    - pre-training on public data, 2 trillion tokens
    - supervised fine-tuning
    - RLHF, with iterative reward-modeling data collection
- vs Llama 1:
    - longer context: 4k tokens vs 2k
    - grouped-query attention on the larger models (see the attention sketch after this list)
- final training loss of ~1.5 vs ~1.75 for 70B vs 7B
- tokenizer: numbers are split into individual digits; unknown UTF-8 characters fall back to raw bytes (see the tokenizer sketch after this list)
- 32k-token vocabulary
- trained on A100s
- RSC: NVIDIA Quantum InfiniBand interconnect, 200 Gbps, 400W per GPU
- Internal production cluster: RoCE over commodity Ethernet switches, 200 Gbps, 350W per GPU
- 31 tCO2eq vs 291 tCO2eq for 7B vs 70B -- almost exactly linear, which makes sense: emissions are proportional to GPU-hours at a fixed power cap
- 3.3M GPU-hours total, on A100-80GB
- SFT: only tens of thousands of high-quality annotations needed; used ~30k
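
To make the grouped-query attention note concrete, here's a minimal shape-level sketch in PyTorch. The head counts and dims are made up (Llama 2's actual values differ), and projections, masking, and RoPE are all omitted; the point is just how a few KV heads get shared across many query heads:

```python
import torch

# Grouped-query attention, shapes only: n_q query heads share n_kv < n_q
# key/value heads, which shrinks the KV cache. Dims here are illustrative,
# not Llama 2's; projections, masking, and RoPE are omitted.
B, T, n_q, n_kv, d = 2, 16, 8, 2, 64

q = torch.randn(B, n_q, T, d)
k = torch.randn(B, n_kv, T, d)
v = torch.randn(B, n_kv, T, d)

# Each KV head serves a group of n_q // n_kv = 4 query heads.
k = k.repeat_interleave(n_q // n_kv, dim=1)
v = v.repeat_interleave(n_q // n_kv, dim=1)

att = (q @ k.transpose(-2, -1)) / d ** 0.5
out = att.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([2, 8, 16, 64])
```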
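And a toy pre-tokenizer to illustrate the digit-splitting and byte-fallback notes above. This is a hypothetical sketch, not the actual SentencePiece BPE implementation; the vocabulary stand-in and byte-token naming are made up:

```python
# Toy illustration: numbers split into individual digits, characters
# outside the vocabulary decomposed into raw UTF-8 bytes. Hypothetical;
# the real tokenizer is SentencePiece BPE with a 32k vocab.
KNOWN = set("abcdefghijklmnopqrstuvwxyz ")  # stand-in for the vocabulary

def pre_tokenize(text: str) -> list[str]:
    tokens = []
    for ch in text:
        if ch.isdigit():
            tokens.append(ch)          # every digit is its own token
        elif ch in KNOWN:
            tokens.append(ch)          # normal vocabulary hit
        else:
            # byte fallback: emit one token per UTF-8 byte
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

print(pre_tokenize("pi is 314"))  # digits '3', '1', '4' stay separate
print(pre_tokenize("café"))       # 'é' -> ['<0xC3>', '<0xA9>']
```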
Fine-tuning
- concatenate all prompts and answers, with a special token to separate them
- autoregressive objective; loss is zeroed on prompt tokens, so backprop only happens on answer tokens (see the sketch below)
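
A minimal sketch of that objective in PyTorch. The token ids, separator id, and the random tensor standing in for the model's logits are all made up; the point is the label masking, not the exact setup:

```python
import torch
import torch.nn.functional as F

# SFT sketch: prompt and answer concatenated with a separator token; the
# autoregressive loss is masked on the prompt so only answer tokens
# contribute gradients. Token ids and vocab size are made up.
SEP, IGNORE = 2, -100                      # -100 is ignored by cross_entropy

prompt = torch.tensor([5, 17, 42])         # hypothetical prompt token ids
answer = torch.tensor([8, 31, 9])          # hypothetical answer token ids

inputs = torch.cat([prompt, torch.tensor([SEP]), answer])
labels = inputs.clone()
labels[: len(prompt) + 1] = IGNORE         # no loss on prompt or separator

# next-token shift: position i predicts token i + 1
logits = torch.randn(len(inputs) - 1, 32_000)  # stand-in for model(inputs[:-1])
loss = F.cross_entropy(logits, labels[1:], ignore_index=IGNORE)
print(loss.item())
```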
Training the reward model
- binary ranking loss, with a margin term that scales with how distinct the annotators' preference rating is (formula below)
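
The loss itself, as given in the paper: a binary ranking loss over a chosen/rejected response pair, with a margin m(r) that is larger the more clearly annotators preferred one response:

```latex
\mathcal{L}_{\text{ranking}}
  = -\log\,\sigma\bigl(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\bigr)
```

where r_theta(x, y) is the scalar reward for prompt x and response y, y_c is the chosen and y_r the rejected response, and m(r) is a discrete margin derived from the preference rating.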
Read more about
- [[RLHF]] Reinforcement learning from human feedback
- [[SFT]] Supervised fine-tuning
- [[Binary ranking loss]] (with margin)
- [[Autoregressive Objective]]
Papers to read
- Transformer architecture: Vaswani et al., 2017 ("Attention Is All You Need")
- RMSNorm: Zhang & Sennrich
- SwiGLU: Shazeer
- Rotary positional embeddings (RoPE): Su et al.
- AdamW: Loshchilov & Hutter
- [[BPE]] Byte-pair encoding: Sennrich et al.
- SentencePiece: Kudo & Richardson
- RSC paper: Lee & Sengupta
- HumanEval: Chen et al.
- Binary ranking loss: Ouyang et al. (InstructGPT)
— Kunal