Working notebook: a commonplace blog for collecting notes & exploring ideas.
There’s a lot of history/research behind this
Claims
The idea of reading algorithms from the weights is fascinating
Examples: curve detection, high-low frequency detectors
so many pretty pictures that they maybe take away from the details
this works really well for understanding because it’s so visual
definition of circuit
superposition: lets the model represent more features than it has neurons, by sharing neurons across features
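As I read it, superposition works because nearly-orthogonal directions in neuron space interfere only weakly when features are sparse. A toy numpy sketch (all names and sizes mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 5  # more features than neurons

# Random unit directions in the smaller neuron space.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
features = np.zeros(n_features)
features[3] = 1.0

# Neuron activations are a superposition of feature directions.
acts = features @ directions

# Read each feature back by projecting onto its direction:
# index 3 reads back as ~1, every other entry is interference.
readout = directions @ acts
```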
“circuit motif”: recurring pattern in complex graphs
Gabor filter: linear filter for texture analysis – more
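A minimal numpy sketch of a Gabor kernel (a Gaussian envelope times a sinusoid), with parameter names chosen by me:

```python
import numpy as np

def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0, psi=0.0):
    """Real part of a Gabor filter: Gaussian envelope * cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter's orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + y_t**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

k = gabor_kernel()
# Convolving an image with k responds strongly to edges/stripes
# at the filter's orientation and spatial frequency.
```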
circuits are more tractable than the full network: we can understand them, modify them, and control them with rigor, at the cost of only covering a tiny part of the model
This is always fun, sometimes so much so I miss the actual project I wanted to do
This is extremely relevant to Intermediate Logging
Different things to visualize
“one-sided NMF”: non-negative matrix factorization
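For my own reference: vanilla NMF with Lee–Seung multiplicative updates, in numpy. (The article's "one-sided" variant, as I understand it, relaxes the non-negativity constraint on one factor; this sketch is the standard two-sided version, toy data mine.)

```python
import numpy as np

def nmf(A, rank, steps=500, eps=1e-9):
    """Factor a non-negative matrix A ~= W @ H with multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = A.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(steps):
        # Updates multiply by non-negative ratios, so W, H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "activations": 50 samples x 20 neurons, built from 3 factors.
rng = np.random.default_rng(1)
A = rng.random((50, 3)) @ rng.random((3, 20))
W, H = nmf(A, rank=3)
err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```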
contextualizing weights
small multiples to show lots of details
dealing with indirect interactions [don’t understand this much]
which weights to look at
(basics for visualization) (also very relevant for intermediate logging)
also very tightly coupled to vision networks
pair every neuron activation with a visualization of the neuron, sorted by activation magnitude
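The pairing itself is simple bookkeeping, something like (names mine):

```python
import numpy as np

acts = np.array([0.1, -2.3, 0.8, 1.5])  # toy activations for one input

# Sort neuron indices by activation magnitude, strongest first;
# each index would then be shown next to its feature visualization.
order = np.argsort(-np.abs(acts))
for i in order:
    print(f"neuron {i}: activation {acts[i]:+.2f}")
```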
“saliency map”: heatmap highlighting the pixels of the input image that most influenced the output classification
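A gradient saliency map takes the gradient of the class score with respect to the input pixels. For a linear model that gradient is just the class's weight row, which keeps the sketch dependency-free (model and sizes are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear "classifier" on 8x8 images: logits = W @ x.ravel()
W = rng.standard_normal((10, 64))
image = rng.random((8, 8))

c = int(np.argmax(W @ image.ravel()))  # predicted class

# d(logit_c)/d(pixel) = W[c] for a linear model; the saliency map
# is its absolute value reshaped back to image space.
saliency = np.abs(W[c]).reshape(8, 8)
```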
rely on matrix factorization to reduce dimensions and complexity
do have to reduce the numbers to human scale
we can take action on neural networks if the parameters are broken out like this
(very rich article, but there’s a lot of work needed to make this more common; combines too many things)
Reversing tiny transformers
Takeaways
Toy Transformers
attention heads move information
revisit this in my own implementation
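One way to see "attention heads move information": the head reads value vectors from source positions and writes a weighted mixture into each destination position. A minimal single-head sketch in numpy (shapes and names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 5, 8                      # sequence length, model width
x = rng.standard_normal((seq, d))  # residual stream, one row per position

W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
W_o = rng.standard_normal((d, d))

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)
# Causal mask: a destination can only attend to earlier positions.
mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
scores = np.where(mask, -1e9, scores)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Each destination receives a mixture of value vectors from earlier
# positions: information moves along the attention pattern.
out = attn @ v @ W_o
```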
Aside: Across the circuits papers, I never see a discussion on simplifying the data the model is training on to more easily validate the behavior of the model. I’m curious why.
Aside: anomaly detection from simplified decoder models that can run fast seems very possible
Next: Induction Heads
T = W_U W_E
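This is the direct path of a zero-layer transformer: logits depend only on the current token, so the model can at best learn bigram statistics. A tiny numpy sketch (vocabulary and width are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 6, 4

W_E = rng.standard_normal((d, vocab))  # embedding: token -> residual stream
W_U = rng.standard_normal((vocab, d))  # unembedding: residual stream -> logits

T = W_U @ W_E  # (vocab, vocab): next-token logits given the current token

token = 2
logits = T[:, token]            # reading a column of T ...
direct = W_U @ W_E[:, token]    # ... equals embedding then unembedding
```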
— Kunal