extended notes from building LLMs from scratch
Tokenizers
- special tokens to give the model more context: <|begin|>, <|unk|>, etc. (bos, eos, pad, and so on)
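
A quick sketch of how these behave in practice (assuming tiktoken is installed; its GPT-2 encoding only defines <|endoftext|>, so that stands in for the tokens above): special tokens are rejected by encode() unless explicitly allowed.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the ~50k GPT-2/GPT-3 vocabulary
# special tokens raise an error by default; allow them explicitly
ids = enc.encode("Hello<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)          # <|endoftext|> maps to id 50256
print(enc.n_vocab)  # 50257
```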
- byte pair encoding
  - tiktoken up to GPT-3 used a ~50k token vocabulary
  - merges the most frequent adjacent pair of tokens, repeatedly (see the sketch after this list)
  - from Wikipedia
  - byte-level encoding works: the base vocabulary is raw bytes, so any string is representable without <|unk|>
  - of course Karpathy has his own implementation (minbpe)
- byte latent transformer: no fixed vocabulary at all; raw bytes get grouped into dynamically sized patches
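A minimal sketch of the merge loop (a toy, not tiktoken's or Karpathy's actual code), run on "aaabdaaabac", the example string from Wikipedia's BPE article:

```python
from collections import Counter

def bpe_train(ids, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of ids."""
    merges = {}    # (id, id) -> new merged id
    next_id = 256  # byte values 0..255 form the base vocabulary
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        merges[pair] = next_id
        # rewrite the sequence, replacing every occurrence of the pair
        out, i = [], 0
        while i < len(ids):
            if ids[i:i + 2] == list(pair):
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

ids, merges = bpe_train(list("aaabdaaabac".encode("utf-8")), num_merges=3)
print(ids)     # the compressed sequence
print(merges)  # learned merge rules, in order
```

Decoding reverses the merges in the opposite order; real implementations also apply the learned rules to new text instead of retraining.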
— Kunal