Working Notes: a commonplace notebook for recording & exploring ideas.
Recurrent architecture
Sequential reasoning tasks in a single forward pass, without supervision of the intermediate steps
27M parameters + 1k training samples -- exceptional performance (!)
Transformer complexity is AC^0 or TC^0
rely on breaking the task into simple intermediate steps
this paper uses latent reasoning
latent reasoning is constrained by the effective computational depth
brains use
one step gradient approximation for HRM
eliminates requirement for backpropagation through time (BPTT)
HRM
HRM model
f_i(.; theta_i)
f_l(.; theta_l)
f_h(.; theta_h)
f_o(.; theta_o)
model dynamics
f_L & f_H each keep a hidden state
z^i_L for f_L, starting from z^0_L
z^i_H for f_H, starting from z^0_H
project input x into a working representation using f_i
at each timestep i, L updates its state based on its previous state, the H module's state and the input rep
H updates every T steps using the L module's final state
z^i_L = f_L(z^{i-1}_L, z^{i-1}_H, x'; theta_L), with x' the input rep
z^i_H = { f_H(z^{i-1}_H, z^{i-1}_L; theta_H) if i mod T == 0; z^{i-1}_H otherwise }
finally, y = f_o(z_H; theta_o), read off the final H state
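A toy scalar rollout of the dynamics above, mostly to make the update schedule concrete. Everything here is an assumption for illustration: the tanh stand-ins for f_i, f_L, f_H and f_o, the zero initial states, and the names hrm_forward, n_cycles and t_steps. It is a sketch of the schedule, not the paper's model.

```python
import math

def hrm_forward(x, n_cycles=2, t_steps=3):
    """Toy scalar HRM-style rollout. Returns (y, trace of (i, z_l, z_h))."""
    x_rep = 0.5 * x          # stand-in for f_i: input -> working representation
    z_l, z_h = 0.0, 0.0      # assumed initial states z^0_L and z^0_H
    trace = []
    for i in range(1, n_cycles * t_steps + 1):
        # L updates every step from its previous state, H's state and the input rep.
        z_l_next = math.tanh(z_l + z_h + x_rep)        # stand-in for f_L
        # H updates only every T steps, from the previous H and L states.
        if i % t_steps == 0:
            z_h_next = math.tanh(z_h + z_l)            # stand-in for f_H
        else:
            z_h_next = z_h
        z_l, z_h = z_l_next, z_h_next
        trace.append((i, z_l, z_h))
    y = 2.0 * z_h            # stand-in for f_o, applied to the final H state
    return y, trace

if __name__ == "__main__":
    y, trace = hrm_forward(1.0)
    for i, z_l, z_h in trace:
        print(f"i={i} z_L={z_l:.3f} z_H={z_h:.3f}")
```

In the printed trace z_H stays flat inside a cycle and only moves at i = T, 2T, ..., while z_L moves at every step.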
need to find a balance to converge correctly
H resets the L in cycles
graphs are amazing around how the inner and outer modules behave
forward residual == the main stream that everything adds to
choice of PCA and residual to identify behavior of the model
simplify the gradient across multiple steps to depend only on the last state of the H & L modules
implement by disabling grad for the intermediate steps in time
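A scalar toy to see what the one-step approximation drops. The recurrence z_t = tanh(w * z_{t-1} + x) and the name rollout are assumptions for illustration, not the paper's model. The full gradient carries the dz_{t-1}/dw term back through every step (what BPTT computes); the one-step version treats z_{T-1} as a constant, which is roughly what disabling grad on the earlier steps amounts to.

```python
import math

def rollout(w, x, steps, z0=0.0):
    """Return (z_T, full dz_T/dw, one-step dz_T/dw) for z_t = tanh(w*z_{t-1} + x)."""
    z_prev, z = z0, z0
    dz_dw = 0.0                      # running full derivative dz_t/dw
    for _ in range(steps):
        z_prev = z
        z = math.tanh(w * z_prev + x)
        # Chain rule through time: both the direct term and the carried term.
        dz_dw = (1.0 - z * z) * (z_prev + w * dz_dw)
    # One-step approximation: z_{T-1} is treated as a constant input.
    one_step = (1.0 - z * z) * z_prev
    return z, dz_dw, one_step

if __name__ == "__main__":
    z, full, approx = rollout(w=0.5, x=0.3, steps=8)
    print(f"z_T={z:.4f} full={full:.4f} one_step={approx:.4f}")
```

The one-step value needs only the last two states, so memory stays constant in the number of steps; the price is the dropped carried term.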
deep supervision
adaptive computational time
z^m
overall loss combines q-head loss and seq to seq loss
this helps train stopping decisions
inference time scaling is simply increasing M_max
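A minimal sketch of the halting loop, under my reading of the notes: each segment m produces a state z^m, a Q-head scores halt vs. continue, and the loop stops when halt wins or M_max is hit. The names run_with_halting, step, q_head and m_min are hypothetical, and the toy step and Q-head in the usage lines are made up.

```python
def run_with_halting(step, q_head, z0, m_max, m_min=1):
    """Run up to m_max segments; stop early once the Q-head prefers halting.

    step:   hypothetical function z -> next z (one full forward segment)
    q_head: hypothetical function z -> (q_halt, q_continue)
    Returns (final state, number of segments used).
    """
    z = z0
    used = 0
    for m in range(1, m_max + 1):
        z = step(z)                       # produces z^m
        used = m
        q_halt, q_continue = q_head(z)
        if m >= m_min and q_halt > q_continue:
            break
    return z, used

if __name__ == "__main__":
    halve = lambda z: z / 2.0                                   # toy segment
    toy_q = lambda z: (1.0, 0.0) if z < 0.1 else (0.0, 1.0)     # toy Q-head
    print(run_with_halting(halve, toy_q, 1.0, m_max=2))
    print(run_with_halting(halve, toy_q, 1.0, m_max=8))
```

Raising m_max at inference only changes the result when the Q-head has not already chosen to halt, which is the sense in which it acts as a compute budget.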
use RMSNorm with post-norm and AdamW, keeping params bounded by 1/lambda (=?)
stablemax for small sample experiments
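For reference, a sketch of stablemax as I understand it: a softmax replacement that swaps exp for a piecewise function growing linearly for non-negative inputs and decaying as 1/(1 - x) for negative ones. The piecewise form below is my recollection of the definition, so treat it as an assumption; the function names are mine.

```python
def s(x):
    """Piecewise stand-in for exp: linear for x >= 0, 1/(1 - x) for x < 0."""
    return x + 1.0 if x >= 0 else 1.0 / (1.0 - x)

def stablemax(logits):
    """Normalize s(logit) values so they sum to 1, like softmax does with exp."""
    vals = [s(v) for v in logits]
    total = sum(vals)
    return [v / total for v in vals]

if __name__ == "__main__":
    print(stablemax([1.0, -1.0]))
    print(stablemax([0.0, 0.0, 0.0]))
```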
f_L / f_H are implemented as identical encoder-only transformer blocks
using RoPE, gated linear units, RMSNorm, and no bias in linear layers
LeCun normal initialization
use Adam-atan2, with a constant learning rate after a linear warmup
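The schedule is simple enough to write down. A sketch, with base_lr and warmup_steps as placeholder values I picked, not the paper's settings:

```python
def lr_at(step, base_lr=1e-4, warmup_steps=1000):
    """Linear warmup from near 0 up to base_lr, then constant.

    step is 0-indexed; base_lr and warmup_steps are placeholder values.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

if __name__ == "__main__":
    for step in (0, 499, 999, 5000):
        print(step, lr_at(step))
```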
inspect intermediate state by printing outputs at each step
from the brain
also shows up in the test
rest
— Kunal