Working notebook: a commonplace blog for collecting notes & exploring ideas.
There’s a lot of history/research behind this
Claims
The idea of reading algorithms from the weights is fascinating
Examples: curve detection, high-low frequency detectors
so many pretty pictures that they maybe take away from the details
this works really well for understanding because it’s so visual
definition of circuit
superposition: lets the model represent more features than it has neurons, by sharing neurons across features
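As I read it, superposition works because nearly-orthogonal directions in neuron space interfere only weakly when features are sparse. A toy numpy sketch (all names and sizes mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 5  # more features than neurons

# Random unit directions in the smaller neuron space.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
features = np.zeros(n_features)
features[3] = 1.0

# Neuron activations are a superposition of feature directions.
acts = features @ directions

# Read each feature back by projecting onto its direction:
# index 3 reads back as ~1, every other entry is interference.
readout = directions @ acts
```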
“circuit motif”: recurring pattern in complex graphs
Gabor filter: linear filter for texture analysis – more
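A minimal numpy sketch of a Gabor kernel (a Gaussian envelope times a sinusoid), with parameter names chosen by me:

```python
import numpy as np

def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0, psi=0.0):
    """Real part of a Gabor filter: Gaussian envelope * cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter's orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + y_t**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

k = gabor_kernel()
# Convolving an image with k responds strongly to edges/stripes
# at the filter's orientation and spatial frequency.
```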
circuits are more tractable than the full network: we can understand them, modify them, and control them with rigor, at the cost of only covering a tiny part of the model
This is always fun, sometimes so much so I miss the actual project I wanted to do
This is extremely relevant to Intermediate Logging
Different things to visualize
“one-sided NMF”: non-negative matrix factorization
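For my own reference: vanilla NMF with Lee–Seung multiplicative updates, in numpy. (The article's "one-sided" variant, as I understand it, relaxes the non-negativity constraint on one factor; this sketch is the standard two-sided version, toy data mine.)

```python
import numpy as np

def nmf(A, rank, steps=500, eps=1e-9):
    """Factor a non-negative matrix A ~= W @ H with multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = A.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(steps):
        # Updates multiply by non-negative ratios, so W, H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "activations": 50 samples x 20 neurons, built from 3 factors.
rng = np.random.default_rng(1)
A = rng.random((50, 3)) @ rng.random((3, 20))
W, H = nmf(A, rank=3)
err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```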
contextualizing weights
small multiples to show lots of details
dealing with indirect interactions [don’t understand this much]
which weights to look at
(basics for visualization) (also very relevant for intermediate logging)
also very tightly coupled to vision networks
pair every neuron activation with a visualization of the neuron, sorted by activation magnitude
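The pairing itself is simple bookkeeping, something like (names mine):

```python
import numpy as np

acts = np.array([0.1, -2.3, 0.8, 1.5])  # toy activations for one input

# Sort neuron indices by activation magnitude, strongest first;
# each index would then be shown next to its feature visualization.
order = np.argsort(-np.abs(acts))
for i in order:
    print(f"neuron {i}: activation {acts[i]:+.2f}")
```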
“saliency map”: heatmap highlighting the pixels of the input image that most influenced the output classification
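A gradient saliency map takes the gradient of the class score with respect to the input pixels. For a linear model that gradient is just the class's weight row, which keeps the sketch dependency-free (model and sizes are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear "classifier" on 8x8 images: logits = W @ x.ravel()
W = rng.standard_normal((10, 64))
image = rng.random((8, 8))

c = int(np.argmax(W @ image.ravel()))  # predicted class

# d(logit_c)/d(pixel) = W[c] for a linear model; the saliency map
# is its absolute value reshaped back to image space.
saliency = np.abs(W[c]).reshape(8, 8)
```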
rely on matrix factorization to reduce dimensions and complexity
do have to reduce the numbers to human scale
we can take action on neural networks if the parameters are broken out like this
(very rich article, but there’s a lot of work needed to make this more common; combines too many things)
Reversing tiny transformers
Takeaways
Toy Transformers
attention heads move information
revisit this in my own implementation
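One way to see "attention heads move information": the head reads value vectors from source positions and writes a weighted mixture into each destination position. A minimal single-head sketch in numpy (shapes and names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 5, 8                      # sequence length, model width
x = rng.standard_normal((seq, d))  # residual stream, one row per position

W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
W_o = rng.standard_normal((d, d))

q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)
# Causal mask: a destination can only attend to earlier positions.
mask = np.triu(np.ones((seq, seq), dtype=bool), 1)
scores = np.where(mask, -1e9, scores)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Each destination receives a mixture of value vectors from earlier
# positions: information moves along the attention pattern.
out = attn @ v @ W_o
```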
Aside: Across the circuits papers, I never see a discussion on simplifying the data the model is training on to more easily validate the behavior of the model. I’m curious why.
Aside: anomaly detection from simplified decoder models that can run fast seems very possible
Next: Induction Heads
T = W_U W_E
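This is the direct path of a zero-layer transformer: logits depend only on the current token, so the model can at best learn bigram statistics. A tiny numpy sketch (vocabulary and width are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 6, 4

W_E = rng.standard_normal((d, vocab))  # embedding: token -> residual stream
W_U = rng.standard_normal((vocab, d))  # unembedding: residual stream -> logits

T = W_U @ W_E  # (vocab, vocab): next-token logits given the current token

token = 2
logits = T[:, token]            # reading a column of T ...
direct = W_U @ W_E[:, token]    # ... equals embedding then unembedding
```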
— Kunal