extended notes from building LLMs from scratch
Tokenizers
- special tokens to give the model more context: <|begin|>, <|unk|>, etc. (bos, eos, pad, and so on)
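
A quick sketch of how these behave in practice (assuming tiktoken is installed; its GPT-2 encoding only defines <|endoftext|>, so that stands in for the tokens above): special tokens are rejected by encode() unless explicitly allowed.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the ~50k GPT-2/GPT-3 vocabulary
# special tokens raise an error by default; allow them explicitly
ids = enc.encode("Hello<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)          # <|endoftext|> maps to id 50256
print(enc.n_vocab)  # 50257
```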
- byte pair encoding
  - tiktoken up to GPT-3 used a ~50k token vocabulary
  - merges the most frequent adjacent pair of tokens, repeatedly (see the sketch after this list)
  - from Wikipedia
  - byte-level encoding works: the base vocabulary is raw bytes, so any string is representable without <|unk|>
  - of course Karpathy has his own implementation (minbpe)
- byte latent transformer: no fixed vocabulary at all; raw bytes get grouped into dynamically sized patches
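A minimal sketch of the merge loop (a toy, not tiktoken's or Karpathy's actual code), run on "aaabdaaabac", the example string from Wikipedia's BPE article:

```python
from collections import Counter

def bpe_train(ids, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of ids."""
    merges = {}    # (id, id) -> new merged id
    next_id = 256  # byte values 0..255 form the base vocabulary
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        merges[pair] = next_id
        # rewrite the sequence, replacing every occurrence of the pair
        out, i = [], 0
        while i < len(ids):
            if ids[i:i + 2] == list(pair):
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

ids, merges = bpe_train(list("aaabdaaabac".encode("utf-8")), num_merges=3)
print(ids)     # the compressed sequence
print(merges)  # learned merge rules, in order
```

Decoding reverses the merges in the opposite order; real implementations also apply the learned rules to new text instead of retraining.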
— Kunal