Working Notes: a commonplace notebook for recording & exploring ideas.
2024-02-25
Becoming a bit more intentional about spending time learning again.
GPT Tokenizer
Of course I had to watch Andrej Karpathy's latest video on building the GPT tokenizer.
Notes:
- Several issues with LLMs trace back to tokenization (e.g. spelling, string processing)
- Stick with UTF-8 for turning text into bytes
- "Byte Pair Encoding" to reduce the stream of bytes to something more manageable
- Repeatedly take the most common pair and add it to the vocabulary, reducing the size of the stream
- Tokenization could be avoided entirely with a hierarchical model structure that consumes raw bytes
- BPE is just this merge step repeated many times, compressing the length of the sequence a bit (see the minimal sketch after these notes)
- The amount of compression interacts with the model's context length: fewer tokens per character means more text fits into the context window
- Decoding is fairly straightforward: look up the bytes for each token and decode the concatenation as UTF-8
- Looking at the encoding implementation, I feel like I'd much rather make a state machine to do this.
- GPT tokenization
  - Uses regexes to split the text into chunks before any merging (see the splitting sketch after these notes)
  - Each chunk is then tokenized separately, which forces merges to happen within words
  - The regexes enforce a lot of behavior: keeping letters, digits, punctuation, and whitespace apart, capping the length of numeric runs (in GPT-4's pattern), etc.
  - Saving the merges and the vocab is enough to describe the tokenizer
- Special tokens:
  - <|endoftext|> delimits documents in the training set; it's special-cased in the code rather than handled by the BPE merges (see the tiktoken example after these notes)
  - <|im_start|> / <|im_end|> etc., used to delimit turns in chat fine-tuning
- SentencePiece
  - Much older and much more configurable, with a huge number of options
  - Falls back to byte tokens for rare characters not seen in training (see the training sketch after these notes)
- A larger vocabulary also increases computational cost, since the embedding table and the final softmax grow with it
- Tokenization also makes auto-complete much harder, because completing a partial word may mean replacing the tokens already emitted with a different, longer token rather than just appending new ones
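To make the merge loop concrete, here's a minimal BPE sketch in plain Python; the function names (get_stats, merge, train, encode, decode) are mine and this is a toy, not the implementation from the video.

```python
def get_stats(ids):
    """Count how often each adjacent pair of ids occurs."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size (>= 256)."""
    ids = list(text.encode("utf-8"))      # start from raw bytes: ids 0..255
    merges = {}                           # (id, id) -> new id, in learned order
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most common adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Apply learned merges greedily, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Look up each token's bytes and decode the concatenation as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train("low lower lowest newer newest", vocab_size=270)
ids = encode("lowest newer", merges)
assert decode(ids, merges) == "lowest newer"
```

The encode loop always applies the earliest-learned merge that's still possible, which keeps encoding consistent with how the merges were learned.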
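The pre-splitting step uses GPT-2's published regex (from OpenAI's encoder.py), which needs the third-party regex module for the \p{...} Unicode categories; the sample sentence is just an illustration.

```python
import regex as re  # pip install regex

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = re.findall(GPT2_SPLIT_PATTERN, "Hello world!! How's it going in 2024?")
print(chunks)
# ['Hello', ' world', '!!', ' How', "'s", ' it', ' going', ' in', ' 2024', '?']
```

BPE merges then run independently inside each chunk, so a merge can never jump across whitespace or glue a word to the punctuation after it.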
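A quick way to see how special tokens are handled outside the normal BPE path is tiktoken; the choice of the GPT-2 encoding and the sample strings here are mine.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "first document<|endoftext|>second document"

# By default, special tokens appearing in user text raise an error;
# they have to be explicitly allowed.
ids = enc.encode(text, allowed_special={"<|endoftext|>"})
print(ids)              # <|endoftext|> shows up as a single id (50256 in GPT-2)
print(enc.decode(ids))  # round-trips back to the original string
```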
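And a rough sketch of training a SentencePiece BPE model with byte fallback; the corpus path, vocab size, and option values are assumptions rather than anything prescribed in these notes.

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a small BPE model on a local text file (one sentence per line, assumed path).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tok400",
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,       # unseen characters get encoded as raw <0x..> byte tokens
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))  # rare characters fall back to byte tokens
```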
Trends in Machine Learning
Binged on a talk by Jeff Dean.
Things that stood out:
- The value of data quality
- Evals help make real decisions
— Kunal