Working Notes: a commonplace notebook for recording & exploring ideas.
Home. Site Map. Subscribe. More at expLog.

Llms Are A Strange Loop?

type: Meeting
description: None
tags: []
peopleInvolved: Zac Hatfield-Dodds
date: 2023-09-21

type:: #StrangeLoop2023

Notes

Anthropic

Lens: strange loop

cross layers of abstraction

when will an AI system have some self hood

hofstadter: gpt4 has something there

inside-out

simple feed forward network --

convnet -- convolution shrinks the image

analyze with features

interpretable semantic thing

weights

actual programming

circuits -- how they're connected

curve detectors hold up to rigorous interpretation

lot of pictures of curves

can even replace some parts of the network with curve detection

can explore contents to see what the weights do with context

not all neurons are semantically well behaved

superposition hpythosesis: nns represent more features by using linear combinations

https://transformer-circuits.pub/

could embed several fueatures carefully

superposition has a great mathematical structure

superposition for representation

interpretability of convnns

transformers change mechanistic interpretability

residual stream: information flow
attention heads: query information
attention patterns are a primitive like features and weights

induction heads - search for previous places where something happened and see what happened after

take 2 layers to form

ai system that learns induction

outside-in

behavioral things

emergent behavior from models from the outside in

RLHF

what alternative model would have behaved slightly better on this task

create a preference model to score the responses, and that should approximate the human

input / output space is any text of up to 75k words

use models to generate questions and answers

plot of stated desire to not shut down

more rlhf makes it much more coherent

models shoot up in bias with parameters

models can recognize what they're doing / self awareness

keep asking the model

ask questions about predicted accuracy

RLAIF

feedback loop in the model

start with a helpful RLHF model (helpful, not harmless)

anthropic.com/research

claude.ai

— Kunal