Working Notes: a commonplace notebook for recording & exploring ideas.
2024-02-25
Becoming a bit more intentional about spending time learning again.
GPT Tokenizer
Of course I had to watch Andrej Karpathy's latest video on building the GPT tokenizer.
Notes:
- Several issues with LLMs trace back to tokenization (e.g. spelling, string processing)
- Stick with UTF-8 for turning text into bytes
- "Byte Pair Encoding" to reduce the stream of bytes to something more manageable
- Repeatedly take the most common pair and add it to the vocabulary, reducing the size of the stream
- Tokenization could be avoided entirely with a hierarchical model structure that consumes raw bytes
- BPE is just this merge step repeated many times, compressing the length of the sequence a bit (see the minimal sketch after these notes)
- The amount of compression interacts with the model's context length: fewer tokens per character means more text fits into the context window
- Decoding is fairly straightforward: look up the bytes for each token and decode the concatenation as UTF-8
- Looking at the encoding implementation, I feel like I'd much rather make a state machine to do this.
- GPT tokenization
  - Uses regexes to split the text into chunks before any merging (see the splitting sketch after these notes)
  - Each chunk is then tokenized separately, which forces merges to happen within words
  - The regexes enforce a lot of behavior: keeping letters, digits, punctuation, and whitespace apart, capping the length of numeric runs (in GPT-4's pattern), etc.
  - Saving the merges and the vocab is enough to describe the tokenizer
- Special tokens:
  - <|endoftext|> delimits documents in the training set; it's special-cased in the code rather than handled by the BPE merges (see the tiktoken example after these notes)
  - <|im_start|> / <|im_end|> etc., used to delimit turns in chat fine-tuning
- SentencePiece
  - Much older and much more configurable, with a huge number of options
  - Falls back to byte tokens for rare characters not seen in training (see the training sketch after these notes)
- A larger vocabulary also increases computational cost, since the embedding table and the final softmax grow with it
- Tokenization also makes auto-complete much harder, because completing a partial word may mean replacing the tokens already emitted with a different, longer token rather than just appending new ones
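To make the merge loop concrete, here's a minimal BPE sketch in plain Python; the function names (get_stats, merge, train, encode, decode) are mine and this is a toy, not the implementation from the video.

```python
def get_stats(ids):
    """Count how often each adjacent pair of ids occurs."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size (>= 256)."""
    ids = list(text.encode("utf-8"))      # start from raw bytes: ids 0..255
    merges = {}                           # (id, id) -> new id, in learned order
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most common adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Apply learned merges greedily, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Look up each token's bytes and decode the concatenation as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train("low lower lowest newer newest", vocab_size=270)
ids = encode("lowest newer", merges)
assert decode(ids, merges) == "lowest newer"
```

The encode loop always applies the earliest-learned merge that's still possible, which keeps encoding consistent with how the merges were learned.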
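The pre-splitting step uses GPT-2's published regex (from OpenAI's encoder.py), which needs the third-party regex module for the \p{...} Unicode categories; the sample sentence is just an illustration.

```python
import regex as re  # pip install regex

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = re.findall(GPT2_SPLIT_PATTERN, "Hello world!! How's it going in 2024?")
print(chunks)
# ['Hello', ' world', '!!', ' How', "'s", ' it', ' going', ' in', ' 2024', '?']
```

BPE merges then run independently inside each chunk, so a merge can never jump across whitespace or glue a word to the punctuation after it.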
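A quick way to see how special tokens are handled outside the normal BPE path is tiktoken; the choice of the GPT-2 encoding and the sample strings here are mine.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "first document<|endoftext|>second document"

# By default, special tokens appearing in user text raise an error;
# they have to be explicitly allowed.
ids = enc.encode(text, allowed_special={"<|endoftext|>"})
print(ids)              # <|endoftext|> shows up as a single id (50256 in GPT-2)
print(enc.decode(ids))  # round-trips back to the original string
```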
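And a rough sketch of training a SentencePiece BPE model with byte fallback; the corpus path, vocab size, and option values are assumptions rather than anything prescribed in these notes.

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a small BPE model on a local text file (one sentence per line, assumed path).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tok400",
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,       # unseen characters get encoded as raw <0x..> byte tokens
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))  # rare characters fall back to byte tokens
```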
Trends in Machine Learning
Binged on a talk by Jeff Dean.
Things that stood out:
- The value of data quality
- Evals help make real decisions
— Kunal