Working Notes: a commonplace notebook for recording & exploring ideas.
(using hy)
read in the text file
extract list of characters in the file
encoder: character -> number
decoder: number -> character
several other schemas
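A minimal sketch of the character-level codec in plain Python (the actual notes are in Hy; names like stoi / itos and the filename are mine):

    text = open("input.txt").read()                    # placeholder filename
    chars = sorted(set(text))
    vocab_size = len(chars)
    stoi = {ch: i for i, ch in enumerate(chars)}       # encoder table: character -> number
    itos = {i: ch for i, ch in enumerate(chars)}       # decoder table: number -> character
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda nums: "".join(itos[n] for n in nums)
    # round-trips: decode(encode(s)) == s for characters in the vocabulary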
wrap the full text into a single data tensor
split the data into train and validation
90% train, 10% validation
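Continuing the sketch, the data tensor and the 90/10 split:

    import torch

    data = torch.tensor(encode(text), dtype=torch.long)   # the full text as one long tensor
    n = int(0.9 * len(data))                               # 90% train, 10% validation
    train_data, val_data = data[:n], data[n:]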
train transformer on chunks of dataset at a time
maximum length: block size, or context length
each block of block_size contains multiple examples, one per prefix length
this also makes sure the transformer sees contexts of every length during training
beyond block_size, the context gets truncated
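One chunk of block_size tokens yields block_size training examples, one per prefix length:

    block_size = 8
    x = train_data[:block_size]        # inputs
    y = train_data[1:block_size + 1]   # targets, shifted by one
    for t in range(block_size):
        context = x[:t + 1]            # everything up to and including position t
        target = y[t]                  # the character that should come next
        # the transformer learns to predict `target` from `context`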
also choose a batch size
train multiple batches at the same time
done for GPU efficiency, to parallelize computation
generating batches
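A sketch of the batch generator (batch_size = 4 here is arbitrary):

    batch_size = 4   # independent sequences processed in parallel

    def get_batch(split):
        data = train_data if split == "train" else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))       # random offsets into the data
        x = torch.stack([data[i:i + block_size] for i in ix])           # (batch_size, block_size)
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])   # targets, shifted by one
        return x, y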
bigram language model
inputs -> token embedding table
vocab size by vocab size
nn.Embedding
each input token picks out the corresponding row of the token embedding table
treat those values as the logits / scores for the next character
predict the next character at each position
measure the loss with F.cross_entropy
build a generate function to sample new characters
AdamW optimizer (3e-4 is a good general learning rate; 1e-3 used here since the network is tiny)
loss
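Putting the bigram model, loss, generation and optimizer together; a PyTorch sketch rather than the exact Hy code:

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class BigramLanguageModel(nn.Module):
        def __init__(self, vocab_size):
            super().__init__()
            # each token directly reads off the logits for the next token from its row
            self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        def forward(self, idx, targets=None):
            logits = self.token_embedding_table(idx)             # (B, T, vocab_size)
            if targets is None:
                return logits, None
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
            return logits, loss

        def generate(self, idx, max_new_tokens):
            for _ in range(max_new_tokens):
                logits, _ = self(idx)
                probs = F.softmax(logits[:, -1, :], dim=-1)      # only the last time step matters
                idx_next = torch.multinomial(probs, num_samples=1)
                idx = torch.cat((idx, idx_next), dim=1)
            return idx

    model = BigramLanguageModel(vocab_size)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for step in range(10_000):                                   # step count is arbitrary here
        xb, yb = get_batch("train")
        _, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()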
coupling tokens, while only looking backwards
previous context -> current timestep, with no information from the future
the simplest way for tokens to communicate is to average the embeddings of the previous tokens
use a lower triangular matrix to get those sums / averages efficiently
torch.tril
torch.ones
another way to do this is to use softmax
can do weighted aggregations of past elements by matrix-multiplying with a lower triangular matrix
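The trick in miniature, with toy shapes:

    # x is a toy (B, T, C) batch; wei averages each position over itself and the past
    B, T, C = 4, 8, 2
    x = torch.randn(B, T, C)
    tril = torch.tril(torch.ones(T, T))              # lower triangular mask
    wei = torch.zeros(T, T)
    wei = wei.masked_fill(tril == 0, float("-inf"))  # block out the future
    wei = F.softmax(wei, dim=-1)                     # each row becomes a uniform average over the past
    out = wei @ x                                    # (T, T) @ (B, T, C) -> (B, T, C)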
updating the bigram model
n_embed (32)
make token embeddings of size n_embed instead of indexing logits directly
lm_head: a linear layer from the embeddings to vocab_size logits
make a position embedding table
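A sketch of the upgraded model skeleton (the class name is mine; loss and generation stay the same as the bigram version):

    n_embed = 32

    class TinyLanguageModel(nn.Module):               # the bigram model, upgraded
        def __init__(self):
            super().__init__()
            self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
            self.position_embedding_table = nn.Embedding(block_size, n_embed)
            self.lm_head = nn.Linear(n_embed, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape
            tok_emb = self.token_embedding_table(idx)                 # (B, T, n_embed)
            pos_emb = self.position_embedding_table(torch.arange(T))  # (T, n_embed)
            x = tok_emb + pos_emb                                     # broadcasts to (B, T, n_embed)
            logits = self.lm_head(x)                                  # (B, T, vocab_size)
            # ... loss / generation handled exactly as before
            return logits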
self attention
32 dimensions for token embeddings
start with actual affinities instead of uniform affinities
each token emits 2 vectors
query: what I'm looking for
key: what I contain
do a dot product between the key and the query, which becomes the weight
this helps align the tokens
this is a head of self attention
head_size: a hyper-parameter
weights = q @ k.transpose(-2, -1)
add in value
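A standalone sketch of one head (head_size = 16 and the random x are stand-ins for the real embeddings):

    head_size = 16
    B, T = 4, 8
    x = torch.randn(B, T, n_embed)                    # stand-in for token + position embeddings
    tril = torch.tril(torch.ones(T, T))

    key = nn.Linear(n_embed, head_size, bias=False)
    query = nn.Linear(n_embed, head_size, bias=False)
    value = nn.Linear(n_embed, head_size, bias=False)

    k = key(x)                                        # (B, T, head_size): what each token contains
    q = query(x)                                      # (B, T, head_size): what each token looks for
    wei = q @ k.transpose(-2, -1)                     # (B, T, T) affinities between tokens
    wei = wei.masked_fill(tril == 0, float("-inf"))   # still only look backwards
    wei = F.softmax(wei, dim=-1)
    v = value(x)                                      # what each token offers once attended to
    out = wei @ v                                     # (B, T, head_size)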
Attention
communication mechanism
a number of nodes in a directed graph
every node has some vector of information
the mechanism can be applied to any such graph
no notion of space
convolution runs on a specific layout in space
an encoder block lets all nodes talk to each other
a decoder block masks out future tokens with the triangular matrix, keeping generation autoregressive
self attention: keys, queries and values are all derived from the same source
in the paper: scaled attention divides the weights by the square root of head size, keeping their variance near 1 so softmax doesn't saturate
plug the self attention head into the model and feed its output to the lm_head
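The same head wrapped as a module with the paper's scaling, so it can be plugged in before lm_head (a sketch):

    class Head(nn.Module):
        """One head of self attention."""
        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embed, head_size, bias=False)
            self.query = nn.Linear(n_embed, head_size, bias=False)
            self.value = nn.Linear(n_embed, head_size, bias=False)
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            k, q = self.key(x), self.query(x)
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # divide by sqrt(head_size)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = F.softmax(wei, dim=-1)
            return wei @ self.value(x)

    # inside the model: x = tok_emb + pos_emb; x = self.sa_head(x); logits = self.lm_head(x)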
multi-head attention
feed forward layer: a simple per-token MLP
intersperse communication with computation
adding skip/residual connections
layer norm: instead of normalizing columns (across the batch), normalize the rows (each token's features)
also adds dropout: to randomly prevent nodes from communicating
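Multi-head attention, the feed forward layer, and a block tying them together with residuals, layer norm and dropout; a sketch building on the Head module above, with pre-norm placement assumed:

    dropout = 0.1

    class MultiHeadAttention(nn.Module):
        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embed, n_embed)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)   # concatenate the heads
            return self.dropout(self.proj(out))

    class FeedForward(nn.Module):
        def __init__(self, n_embed):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embed, 4 * n_embed),
                nn.ReLU(),
                nn.Linear(4 * n_embed, n_embed),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        def __init__(self, n_embed, n_head):
            super().__init__()
            head_size = n_embed // n_head
            self.sa = MultiHeadAttention(n_head, head_size)      # communication
            self.ffwd = FeedForward(n_embed)                     # computation
            self.ln1 = nn.LayerNorm(n_embed)
            self.ln2 = nn.LayerNorm(n_embed)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))     # residual / skip connection around attention
            x = x + self.ffwd(self.ln2(x))   # and around the feed forward layer
            return x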
final tweaks
comparison with paper
pretraining: learn on the internet, generate babble
fine tuning: the aligning stage
To keep things really simple, I'm working with a/b data strings: something that people also use for manually working with transformers.
The data is generated with a->a, a->b and b->b transitions, with equal weights for a & b. I'll play with this as I understand more.
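My reading of that process as code (an assumption: 'a' moves to 'a' or 'b' with equal probability, and 'b' only ever produces 'b', since those are the only transitions listed):

    import random

    def make_ab_string(length=16, seed=None):
        """Toy a/b string following a->a, a->b, b->b transitions."""
        rng = random.Random(seed)
        out = ["a"]
        for _ in range(length - 1):
            if out[-1] == "a":
                out.append(rng.choice("ab"))   # equal weights for a & b
            else:
                out.append("b")                # b only transitions to b
        return "".join(out)

    print(make_ab_string(16, seed=0))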
— Kunal