Working Notes: a commonplace notebook for recording & exploring ideas.
(using hy)
read in the text file
extract list of characters in the file
encoder: character -> number
decoder: number -> character
several other schemas
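A minimal sketch of the character-level codec in plain Python (the actual notes are in Hy; names like stoi / itos and the filename are mine):

    text = open("input.txt").read()                    # placeholder filename
    chars = sorted(set(text))
    vocab_size = len(chars)
    stoi = {ch: i for i, ch in enumerate(chars)}       # encoder table: character -> number
    itos = {i: ch for i, ch in enumerate(chars)}       # decoder table: number -> character
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda nums: "".join(itos[n] for n in nums)
    # round-trips: decode(encode(s)) == s for characters in the vocabulary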
wrap the full text into a single data tensor
split the data into train and validation
90% train, 10% validation
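Continuing the sketch, the data tensor and the 90/10 split:

    import torch

    data = torch.tensor(encode(text), dtype=torch.long)   # the full text as one long tensor
    n = int(0.9 * len(data))                               # 90% train, 10% validation
    train_data, val_data = data[:n], data[n:]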
train transformer on chunks of dataset at a time
maximum length: block size, or context length
each block of block_size contains multiple examples, one per prefix length
this also makes sure the transformer sees contexts of every length during training
beyond block_size, the context gets truncated
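One chunk of block_size tokens yields block_size training examples, one per prefix length:

    block_size = 8
    x = train_data[:block_size]        # inputs
    y = train_data[1:block_size + 1]   # targets, shifted by one
    for t in range(block_size):
        context = x[:t + 1]            # everything up to and including position t
        target = y[t]                  # the character that should come next
        # the transformer learns to predict `target` from `context`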
also choose a batch size
train multiple batches at the same time
done for GPU efficiency, to parallelize computation
generating batches
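A sketch of the batch generator (batch_size = 4 here is arbitrary):

    batch_size = 4   # independent sequences processed in parallel

    def get_batch(split):
        data = train_data if split == "train" else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))       # random offsets into the data
        x = torch.stack([data[i:i + block_size] for i in ix])           # (batch_size, block_size)
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])   # targets, shifted by one
        return x, y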
bigram language model
inputs -> token embedding table
vocab size by vocab size
nn.Embedding
each input token picks out the corresponding row of the token embedding table
treat those values as the logits / scores for the next character
predict the next character at each position
measure the loss with F.cross_entropy
build a generate function to sample new characters
AdamW optimizer (3e-4 is a good general learning rate; 1e-3 used here since the network is tiny)
loss
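Putting the bigram model, loss, generation and optimizer together; a PyTorch sketch rather than the exact Hy code:

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class BigramLanguageModel(nn.Module):
        def __init__(self, vocab_size):
            super().__init__()
            # each token directly reads off the logits for the next token from its row
            self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        def forward(self, idx, targets=None):
            logits = self.token_embedding_table(idx)             # (B, T, vocab_size)
            if targets is None:
                return logits, None
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
            return logits, loss

        def generate(self, idx, max_new_tokens):
            for _ in range(max_new_tokens):
                logits, _ = self(idx)
                probs = F.softmax(logits[:, -1, :], dim=-1)      # only the last time step matters
                idx_next = torch.multinomial(probs, num_samples=1)
                idx = torch.cat((idx, idx_next), dim=1)
            return idx

    model = BigramLanguageModel(vocab_size)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for step in range(10_000):                                   # step count is arbitrary here
        xb, yb = get_batch("train")
        _, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()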
coupling tokens, while only looking backwards
previous context -> current timestep, with no information from the future
the simplest way for tokens to communicate is to average the embeddings of the previous tokens
use a lower triangular matrix to get those sums / averages efficiently
torch.tril
torch.ones
another way to do this is to use softmax
can do weighted aggregations of past elements by matrix-multiplying with a lower triangular matrix
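The trick in miniature, with toy shapes:

    # x is a toy (B, T, C) batch; wei averages each position over itself and the past
    B, T, C = 4, 8, 2
    x = torch.randn(B, T, C)
    tril = torch.tril(torch.ones(T, T))              # lower triangular mask
    wei = torch.zeros(T, T)
    wei = wei.masked_fill(tril == 0, float("-inf"))  # block out the future
    wei = F.softmax(wei, dim=-1)                     # each row becomes a uniform average over the past
    out = wei @ x                                    # (T, T) @ (B, T, C) -> (B, T, C)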
updating the bigram model
n_embed (32)
make token embeddings of size n_embed instead of indexing logits directly
lm_head: a linear layer from the embeddings to vocab_size logits
make a position embedding table
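A sketch of the upgraded model skeleton (the class name is mine; loss and generation stay the same as the bigram version):

    n_embed = 32

    class TinyLanguageModel(nn.Module):               # the bigram model, upgraded
        def __init__(self):
            super().__init__()
            self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
            self.position_embedding_table = nn.Embedding(block_size, n_embed)
            self.lm_head = nn.Linear(n_embed, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape
            tok_emb = self.token_embedding_table(idx)                 # (B, T, n_embed)
            pos_emb = self.position_embedding_table(torch.arange(T))  # (T, n_embed)
            x = tok_emb + pos_emb                                     # broadcasts to (B, T, n_embed)
            logits = self.lm_head(x)                                  # (B, T, vocab_size)
            # ... loss / generation handled exactly as before
            return logits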
self attention
32 dimensions for token embeddings
start with actual affinities instead of uniform affinities
each token emits 2 vectors
query: what I'm looking for
key: what I contain
do a dot product between the key and the query, which becomes the weight
this helps align the tokens
this is a head of self attention
head_size: a hyper-parameter
weights = q @ k.transpose(-2, -1)
add in value
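A standalone sketch of one head (head_size = 16 and the random x are stand-ins for the real embeddings):

    head_size = 16
    B, T = 4, 8
    x = torch.randn(B, T, n_embed)                    # stand-in for token + position embeddings
    tril = torch.tril(torch.ones(T, T))

    key = nn.Linear(n_embed, head_size, bias=False)
    query = nn.Linear(n_embed, head_size, bias=False)
    value = nn.Linear(n_embed, head_size, bias=False)

    k = key(x)                                        # (B, T, head_size): what each token contains
    q = query(x)                                      # (B, T, head_size): what each token looks for
    wei = q @ k.transpose(-2, -1)                     # (B, T, T) affinities between tokens
    wei = wei.masked_fill(tril == 0, float("-inf"))   # still only look backwards
    wei = F.softmax(wei, dim=-1)
    v = value(x)                                      # what each token offers once attended to
    out = wei @ v                                     # (B, T, head_size)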
Attention
communication mechanism
a number of nodes in a directed graph
every node has some vector of information
the mechanism can be applied to any such graph
no notion of space
convolution runs on a specific layout in space
an encoder block lets all nodes talk to each other
a decoder block masks out future tokens with the triangular matrix, keeping generation autoregressive
self attention: keys, queries and values are all derived from the same source
in the paper: scaled attention divides the weights by the square root of head size, keeping their variance near 1 so softmax doesn't saturate
plug the self attention head into the model and feed its output to the lm_head
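The same head wrapped as a module with the paper's scaling, so it can be plugged in before lm_head (a sketch):

    class Head(nn.Module):
        """One head of self attention."""
        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embed, head_size, bias=False)
            self.query = nn.Linear(n_embed, head_size, bias=False)
            self.value = nn.Linear(n_embed, head_size, bias=False)
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            k, q = self.key(x), self.query(x)
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # divide by sqrt(head_size)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = F.softmax(wei, dim=-1)
            return wei @ self.value(x)

    # inside the model: x = tok_emb + pos_emb; x = self.sa_head(x); logits = self.lm_head(x)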
multi-head attention
feed forward layer: a simple per-token MLP
intersperse communication with computation
adding skip/residual connections
layer norm: instead of normalizing columns (across the batch), normalize the rows (each token's features)
also adds dropout: to randomly prevent nodes from communicating
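Multi-head attention, the feed forward layer, and a block tying them together with residuals, layer norm and dropout; a sketch building on the Head module above, with pre-norm placement assumed:

    dropout = 0.1

    class MultiHeadAttention(nn.Module):
        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embed, n_embed)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)   # concatenate the heads
            return self.dropout(self.proj(out))

    class FeedForward(nn.Module):
        def __init__(self, n_embed):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embed, 4 * n_embed),
                nn.ReLU(),
                nn.Linear(4 * n_embed, n_embed),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        def __init__(self, n_embed, n_head):
            super().__init__()
            head_size = n_embed // n_head
            self.sa = MultiHeadAttention(n_head, head_size)      # communication
            self.ffwd = FeedForward(n_embed)                     # computation
            self.ln1 = nn.LayerNorm(n_embed)
            self.ln2 = nn.LayerNorm(n_embed)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))     # residual / skip connection around attention
            x = x + self.ffwd(self.ln2(x))   # and around the feed forward layer
            return x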
final tweaks
comparison with paper
pretraining: learn on the internet, generate babble
fine tuning: the aligning stage
To keep things really simple, I'm working with a/b data strings: something that people also use for manually working with transformers.
The data is generated with a->a, a->b and b->b transitions, with equal weights for a & b. I'll play with this as I understand more.
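My reading of that process as code (an assumption: 'a' moves to 'a' or 'b' with equal probability, and 'b' only ever produces 'b', since those are the only transitions listed):

    import random

    def make_ab_string(length=16, seed=None):
        """Toy a/b string following a->a, a->b, b->b transitions."""
        rng = random.Random(seed)
        out = ["a"]
        for _ in range(length - 1):
            if out[-1] == "a":
                out.append(rng.choice("ab"))   # equal weights for a & b
            else:
                out.append("b")                # b only transitions to b
        return "".join(out)

    print(make_ab_string(16, seed=0))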
— Kunal