Working Notes: a commonplace notebook for recording & exploring ideas.

Hierarchical Reasoning Model

Paper

First pass on paper

Recurrent architecture
Sequential reasoning tasks in a single forward pass without supervision
- 2 interdependent recurrent modules
- high level module for slow, abstract planning
- low level module for rapid, detailed computations
27M parameters + 1k training samples -- exceptional performance (!)
Transformer complexity is AC^0 or TC^0
chain-of-thought relies on breaking the task into simple intermediate steps
this paper uses latent reasoning instead
latent reasoning is constrained by the effective computational depth
brains use
- hierarchical computation
- across different timescales
- recurrent loops refine internal representations
one step gradient approximation for HRM
eliminates requirement for backpropagation through time (BPTT)
HRM
- hierarchical processing: higher areas over longer timescales, more abstract
- lower level: more immediate
- temporal separation
- recurrent connectivity
HRM model
- input network f_I(.; theta_I)
- low level recurrent module f_L(.; theta_L)
- high level recurrent module f_H(.; theta_H)
- output network f_O(.; theta_O)
model dynamics
- N high level cycles
  - T low level timesteps
- indexed by 1, ..., N x T
- f_L & f_H each keep a hidden state: z^i_L for f_L, z^i_H for f_H
- initialized with z^0_L and z^0_H
project input x into a working representation x~ using f_I: x~ = f_I(x; theta_I)
at each timestep i, L updates its state based on its previous state, the H module's state and the input rep
H updates only every T steps, using the L module's final state of the cycle
z^i_L = f_L(z^{i-1}_L, z^{i-1}_H, x~; theta_L)
z^i_H = f_H(z^{i-1}_H, z^{i-1}_L; theta_H) if i mod T == 0, else z^{i-1}_H
finally, y = f_O(z^{NT}_H; theta_O)
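A minimal sketch of the update schedule above, to check my reading of the indexing. The four networks are replaced by hypothetical scalar lambdas (my stand-ins, not the paper's modules), and the H update reads the L state from the previous step, following the equation as written; reading the freshly updated L state would be the other natural choice.

```python
def hrm_forward(x, N, T, f_I, f_L, f_H, f_O, z_L0, z_H0):
    """Toy HRM schedule: N high-level cycles of T low-level steps each."""
    x_rep = f_I(x)                       # working representation of the input
    z_L, z_H = z_L0, z_H0
    l_steps = h_steps = 0
    for i in range(1, N * T + 1):        # i = 1, ..., N x T
        z_L_next = f_L(z_L, z_H, x_rep)  # L updates on every step
        l_steps += 1
        if i % T == 0:                   # H updates once per cycle
            z_H = f_H(z_H, z_L)          # uses z^{i-1}_L, as in the update rule
            h_steps += 1
        z_L = z_L_next
    return f_O(z_H), l_steps, h_steps


# Hypothetical scalar stand-ins for the four networks (assumptions for illustration).
result = hrm_forward(
    x=1, N=2, T=2,
    f_I=lambda x: x,
    f_L=lambda z_l, z_h, x_rep: z_l + z_h + x_rep,
    f_H=lambda z_h, z_l: z_h + z_l,
    f_O=lambda z_h: z_h,
    z_L0=0, z_H0=0,
)
print(result)  # (output, number of L updates, number of H updates)
```

With N=2 and T=2 the L module runs 4 times and the H module only twice, which is the temporal separation the notes describe.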
need to find a balance to converge correctly
- want convergence to proceed slowly
H resets the L in cycles
graphs are amazing around how the inner and outer modules behave
forward residual == main stream everything adds to
choice of PCA and residual to identify behavior of the model
simplify gradient across multiple steps just based on the last state of the H & L modules
implement by disabling grad for the intermediate timesteps
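A toy scalar illustration of what "only the last step carries gradient" means, assuming a linear recurrence z <- a*z + x that converges to a fixed point. This is my own example, not the paper's derivation: the one-step gradient treats the previous state as a constant, while the exact gradient differentiates through the whole fixed point z* = x / (1 - a).

```python
def fixed_point(a, x, steps=200):
    """Iterate z <- a*z + x from z = 0 (converges when |a| < 1)."""
    z = 0.0
    for _ in range(steps):
        z = a * z + x
    return z


def one_step_grad(a, x):
    """dz/da of the last update only, with the previous state held constant."""
    z_prev = fixed_point(a, x)   # "detached": no gradient flows into it
    return z_prev                # d(a * z_prev + x)/da = z_prev


def exact_grad(a, x):
    """dz*/da through the full fixed point z* = x / (1 - a)."""
    return x / (1.0 - a) ** 2


for a in (0.1, 0.5):
    print(a, one_step_grad(a, 1.0), exact_grad(a, 1.0))
```

In this toy the two gradients always share a sign and differ by the factor 1 / (1 - a), so the approximation is tighter when the recurrence contracts quickly (small a).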
deep supervision
- based on neural oscillations regulating when learning happens in the brain
- run multiple forward passes for a data sample - each forward pass is a segment
- M segments executed before termination
- collect state of h, l at conclusion of each segment m
- supervision state:
  - compute next state given previous state + forward pass
  - compute loss for current segment
  - update parameters
- hidden state is detached before being used as input state, creating 1 step approximation
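The segment loop above can be sketched on a one-parameter toy model, assuming a scalar state, a squared-error loss, and a hand-written gradient (all hypothetical choices of mine, nothing from the paper's code). The point is only the order of operations: forward pass, loss, parameter update, then carry the state forward as a constant.

```python
def deep_supervision(x, target, w=0.0, lr=0.1, M=4):
    """Run M supervised segments on the toy model z_next = w * z + x."""
    z = 0.0                        # hidden state carried across segments
    losses = []
    for _ in range(M):
        z_in = z                   # detached: treated as a constant input
        z = w * z_in + x           # one forward pass (one segment)
        loss = (z - target) ** 2   # loss for the current segment
        grad_w = 2.0 * (z - target) * z_in   # gradient stops at z_in
        w -= lr * grad_w           # update parameters after every segment
        losses.append(loss)
    return w, losses


w, losses = deep_supervision(x=1.0, target=2.0)
print(w, losses)
```

Each segment gets its own loss and its own update, so the model receives M learning signals per sample instead of one.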
adaptive computational time
- Q-head uses final state of H to predict Q values of halt, continue actions
- choose M_min randomly: occasionally sampled uniformly from 2...M_max, otherwise 1
- halt action is selected when we cross M_max (fixed hyperparam)
- or when the halt value exceeds the estimated continue value and we've reached at least M_min
- Q learning algorithm
  - episodic markov decision process
  - state at segment m is z^m
  - use this to calculate loss if stopping or continuing
overall loss combines q-head loss and seq to seq loss
this helps train stopping decisions
inference time scaling is simply increasing M_max
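A small sketch of the halting rule as I read it. The names `eps`, `sample_m_min` and `should_halt` are hypothetical, and treating "cross M_max" as `m >= m_max` is my assumption about the boundary.

```python
import random


def sample_m_min(m_max, eps, rng):
    """With probability eps pick M_min uniformly from 2..M_max, otherwise 1."""
    if rng.random() < eps:
        return rng.randint(2, m_max)   # randint is inclusive on both ends
    return 1


def should_halt(m, q_halt, q_continue, m_min, m_max):
    """Decide whether to stop after segment m (1-indexed)."""
    if m >= m_max:                     # hard cap on the number of segments
        return True
    # Otherwise halt only if halting looks better and M_min has been reached.
    return q_halt > q_continue and m >= m_min


rng = random.Random(0)
m_min = sample_m_min(m_max=8, eps=0.1, rng=rng)
print(m_min, should_halt(m=3, q_halt=0.9, q_continue=0.2, m_min=m_min, m_max=8))
```

Raising `m_max` at inference time lets the same rule run for more segments, which is the inference-time scaling knob mentioned above.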
use RMSNorm in a Post-Norm arrangement and AdamW, keeping params bounded by 1/lambda (=?)
stablemax for small sample experiments
f_L / f_H are implemented as encoder-only transformer blocks with identical architecture
using RoPE, gated linear units, RMSNorm, and no bias in linear layers
LeCun normal initialization
use Adam-atan2, with a constant learning rate after a linear warmup
inspect intermediate state by printing outputs at each step
from the brain
- in dynamical systems for reasoning & decision making
- higher dimensional state space trajectories, for more computations
- dimensionality hierarchy
- Participation Ratio (PR) -- effective dimensionality of a higher dimension representation
  - PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the eigenvalues of the covariance matrix of neural trajectories
  - higher PR == variance evenly distributed across more dimensions; lower PR == lower dimensional structure
also shows up in the test
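The participation ratio can be computed without an eigendecomposition, since for a symmetric covariance matrix C the sum of eigenvalues is trace(C) and the sum of squared eigenvalues is the squared Frobenius norm of C. A minimal pure-Python sketch, assuming trajectories are given as a list of equal-length state vectors (the helper names are mine):

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
        for j in range(d)
    ]


def participation_ratio(cov):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    trace = sum(cov[j][j] for j in range(len(cov)))    # sum of eigenvalues
    frob_sq = sum(v * v for row in cov for v in row)   # sum of eigenvalues^2
    return trace ** 2 / frob_sq


spread = [(1, 0), (-1, 0), (0, 1), (0, -1)]    # variance in both dimensions
line = [(1, 1), (-1, -1), (2, 2), (-2, -2)]    # all variance along one direction
print(participation_ratio(covariance(spread)))
print(participation_ratio(covariance(line)))
```

The evenly spread points give a PR equal to the full dimension (2), while the points on a line give a PR of 1.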
rest
- hrm is also turing complete given enough time & memory
- takes feedback from dense gradients instead of sparse reward
- recurrence can be more sequential / linear attention

Follow up references

computational complexity of transformers
deep equilibrium models, implicit function theorem
adamw
q learning
stablemax

— Kunal