Working Notes: a commonplace notebook for recording & exploring ideas.
Home. Site Map. Subscribe. More at expLog.
type:: #StrangeLoop2023
Anthropic
Lens: strange loop
cross layers of abstraction
when will an AI system have some self hood
hofstadter: gpt4 has something there
inside-out
simple feed forward network --
convnet -- convolution shrinks the image
analyze with features
weights
circuits -- how they're connected
curve detectors hold up to rigorous interpretation
lot of pictures of curves
can even replace some parts of the network with curve detection
can explore contents to see what the weights do with context
not all neurons are semantically well behaved
superposition hpythosesis: nns represent more features by using linear combinations
https://transformer-circuits.pub/
could embed several fueatures carefully
superposition has a great mathematical structure
superposition for representation
interpretability of convnns
transformers change mechanistic interpretability
residual stream: information flow
attention heads: query information
attention patterns are a primitive like features and weights
induction heads - search for previous places where something happened and see what happened after
take 2 layers to form
ai system that learns induction
outside-in
behavioral things
emergent behavior from models from the outside in
RLHF
what alternative model would have behaved slightly better on this task
create a preference model to score the responses, and that should approximate the human
input / output space is any text of up to 75k words
use models to generate questions and answers
plot of stated desire to not shut down
more rlhf makes it much more coherent
models shoot up in bias with parameters
models can recognize what they're doing / self awareness
keep asking the model
ask questions about predicted accuracy
RLAIF
feedback loop in the model
start with a helpful RLHF model (helpful, not harmless)
— Kunal