Working Notes: a commonplace notebook for recording & exploring ideas.

Transformers

publish: 2025
history:
2025-01-12: start collecting notes
2025-02-26: revisit some old notes

Having spent a lot of time over the past few years on building infrastructure for Transformer models, I'm still not crystal clear on the actual calculations that happen within them. This work log is for experimenting with and building my own transformers and looking at the values inside them.

July 2025

Followed Sebastian Raschka's notebook to build a minimal Qwen implementation from scratch. Now that I have a model I can run, I think I'm going to dissect it and restructure the code so that I can explain each part, tentatively documenting it here and trying out all the experiments I've always wanted to do. Generally:

Adding tests and validating dimensions on the implementation; I've been playing with jaxtyping, and introducing beartype into the mix should help
Visualizing the transformations by each module, particularly attention
Building out a full training loop, and training it on Tiny Shakespeare
Building an even tinier Qwen with much smaller dimensions, and training it on smaller datasets
Testing out the different design decisions
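A rough sketch of the kind of dimension validation the first item is after, written with plain nested lists and the standard library only. This is not the jaxtyping or beartype API: `shape_of`, `check_shape`, and the dimension names are hypothetical stand-ins. jaxtyping expresses the same idea through annotations such as `Float[Array, "seq d_model"]`, where a named dimension has to agree everywhere it appears.

```python
def shape_of(x):
    """Shape of a non-empty rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


def check_shape(x, spec, dims):
    """Check x against a spec like "seq d_model".

    `dims` remembers the size first seen for each name, so a later tensor
    that disagrees on the same name raises instead of silently passing.
    """
    names = spec.split()
    shape = shape_of(x)
    if len(shape) != len(names):
        raise ValueError(f"expected {len(names)} dims for '{spec}', got shape {shape}")
    for name, size in zip(names, shape):
        if dims.setdefault(name, size) != size:
            raise ValueError(f"'{name}' was {dims[name]}, now {size} in '{spec}'")
    return x


# Hypothetical toy sizes: 3 tokens, model width 2, head width 4.
x = [[0.0] * 2 for _ in range(3)]
w_q = [[0.0] * 4 for _ in range(2)]

dims = {}
check_shape(x, "seq d_model", dims)
check_shape(w_q, "d_model d_head", dims)
```

Passing a weight matrix whose first dimension is not `d_model` would raise a `ValueError` at the second call, which is the failure I'd want to catch early in a test.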

May 2025

Been busy hacking on my own programming language that transpiles to C (and CUDA) and ideally makes it more convenient to play with transformers. This is probably the most I've ever procrastinated on actually solving the problem I'd set out to solve, but I expect it to pay off over a lifetime of programming.

Some interesting new websites I need to go devour:

bactra.org -- More tractable explanations of LLMs

March 2025

05

Kicking off building a transformer in C from the basics by following Sebastian Raschka's book; I'll keep documenting my progress and observations here. Things I hope to achieve with this project:

building intuition around the transformer architecture and behavior
building a stack that involves a trainer, inference, fine-tuning, RL, evals, etc.
implementing a training stack from scratch, adding in the observability and structures I've always wanted
deepening my comfort with and understanding of C and CUDA
having some software I can easily customize to run LLMs of my choice

I'll revisit this when I'm done with this project.

February 2025

Curriculum

My day job has been keeping me almost entirely occupied, and I haven't had much time to do the experiments or programming I would like to. There are a couple of projects I'd like to complete before I feel confident about transformers:

llm.c implementations for different architectures; explore doing RL on a tiny model
explorations around minimal transformers with language datasets of different complexity
implement an advanced model from scratch
optimize llm.c myself and see how far I can push it
implement FlashAttention, RoPE, etc.
write about what I learn
simulate hardware requirements for different model types / build spreadsheets (e.g. the JAX roofline analysis documents)
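For the hardware-requirements item, a back-of-the-envelope roofline estimate for a single matrix multiply is a reasonable starting point. This is only a sketch: `matmul_roofline` is a hypothetical helper, the peak FLOP/s and bandwidth figures are made-up round numbers rather than any real accelerator, and it assumes each matrix is read or written exactly once.

```python
def matmul_roofline(m, k, n, bytes_per_elem, peak_flops, mem_bandwidth):
    """Roofline estimate for an (m x k) @ (k x n) matmul.

    Assumes each operand is read once and the result written once,
    which ignores caching and tiling effects.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    intensity = flops / bytes_moved          # FLOPs per byte
    ridge = peak_flops / mem_bandwidth       # intensity where the two limits meet
    attainable = min(peak_flops, intensity * mem_bandwidth)
    return {
        "intensity": intensity,
        "attainable_flops": attainable,
        "bound": "compute" if intensity >= ridge else "memory",
    }


# Hypothetical hardware: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

# One token through a 4096 x 4096 projection (decode-like), 2-byte elements.
decode = matmul_roofline(1, 4096, 4096, 2, PEAK_FLOPS, MEM_BANDWIDTH)
# 4096 tokens through the same projection (prefill-like).
prefill = matmul_roofline(4096, 4096, 4096, 2, PEAK_FLOPS, MEM_BANDWIDTH)

print(decode["bound"], prefill["bound"])
```

With these assumed numbers the single-token case lands well under the ridge point and the batched case well over it, which is the usual intuition for why decoding tends to be bandwidth-limited.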

and at the same time actually build and apply a small fine-tuned LLM to daily tasks:

shell completion based on history
log reader / anomaly detector (at which point I'll have automated myself out of a job)

with tools that help along the way:

better Chrome trace visualizers

January 2025

Building minimal transformers

As a first attempt, trying to build simple transformers: I have vague memories of doing something similar while working through the videos by Andrej Karpathy, but this time around I'll poke around a little bit myself. Reading about circuits was also helpful in getting ready for this.

Things I'd generally like to work on here:

Writing out the KVQ multiplications and seeing how they update
Breaking out the transformer architecture
Understanding how different languages are trained
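To make the first item concrete, here is a minimal sketch of the query, key, and value multiplications for a single causal attention head, using plain lists so every intermediate value can be printed. The helper names (`matmul`, `softmax`, `causal_attention`) and the toy inputs are my own illustration; real implementations add batching, multiple heads, and an output projection.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention; returns (attention weights, output)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # scores[i][j] = q_i . k_j
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions 0..i.
        visible = [s / math.sqrt(d_head) for s in row[: i + 1]]
        weights.append(softmax(visible) + [0.0] * (len(row) - i - 1))
    return weights, matmul(weights, v)


# Toy input: 3 tokens with 2 features each, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, out = causal_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Swapping `identity` for other small matrices and re-printing `weights` is a cheap way to see how each projection changes which positions a token attends to.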

0 layers 2025-01-20

Based on what I understand from the circuits videos and paper, and what I vaguely remember from the Zero-to-Hero series, a zero-layer transformer (a token embedding followed directly by an unembedding) should end up with weights that are simply bigram statistics. Surprisingly, I'm finding myself struggling a little bit to structure the code in a way that's flexible and satisfying; I've read too much code.
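With no attention layers, the logits for the next token depend only on the current token, so the best such a model can do under cross-entropy is match log P(next | current). A count-based table gives the target those weights should approach. This is a sketch on a made-up string at the character level; `bigram_log_probs` is a hypothetical helper, not code from the notebook.

```python
import math
from collections import Counter, defaultdict


def bigram_log_probs(text):
    """Return {prev_char: {next_char: log P(next | prev)}} from raw counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    table = {}
    for prev, following in counts.items():
        total = sum(following.values())
        table[prev] = {ch: math.log(n / total) for ch, n in following.items()}
    return table


# Toy corpus: 'a' is followed by 'b' three times and by 'c' once.
table = bigram_log_probs("abababac")

for prev, row in sorted(table.items()):
    print(prev, {ch: round(math.exp(lp), 3) for ch, lp in sorted(row.items())})
```

Comparing a softmax over the trained model's logits for each input token against this table is one way to check whether training has actually recovered the bigram statistics.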

— Kunal