Working Notes: a commonplace notebook for recording & exploring ideas.
  Home. Site Map. Subscribe. More at expLog.
Building Llms From Scratch
- publish
- 2025
- tags
- ['ai', 'llms', 'machine-learning']
extended notes from building LLMs from scratch
Tokenizers
- special tokens to give the model more context
- <|begin|>,- <|unk|>etc.
- bos,- eos,- pad, etc
 
- byte pair encoding
- tiktoken up to gpt3 had a 50k vocab
- merges tokens based on frequency
 
- from wikipedia
- byte level encoding works
 
- of course karpathy has his own implementation
- byte latent transformer
- huggingface
— Kunal