NOAH SYRKIS

Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers
Noah Syrkis
January 23, 2025

1 | Mechanistic Interpretability (MI)
2 | Modular Arithmetic
3 | Grokking on $\mathcal{T}_{\text{miiii}}$
4 | Embeddings
5 | Neurons
6 | The $\omega$-Spike

Figure 1: Integers $< p^2$ that are multiples of 13 or 27 (left), multiples of 11 (mid.), or primes (right)

“This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm”
Neel Nanda¹
¹ Not verbatim, but the gist of it

1 | Mechanistic Interpretability (MI)
The sub-symbolic nature of deep learning obscures model mechanisms
There is no obvious mapping from the weights of a trained model to math notation
MI is about reverse engineering these models and looking closely at them
Many low-hanging fruits: MI is in the practical botany phase of the science
How does a given model work? How can we train it faster? Is it safe?

1 | Grokking
Grokking [1] is “sudden generalization”
MI (often) needs a mechanism
Grokking is thus convenient for MI
Lee et al. (2024) speed up grokking by boosting slow gradients, as per Eq. 1
For more, see Appendix A

$h(t) = h(t-1)\,\alpha + g(t)\,(1-\alpha)$  (1.1)
$\hat{g}(t) = g(t) + \lambda\, h(t)$  (1.2)

1 | Visualizing
MI needs creativity … but there are tricks:
For two-token samples, plot them by varying one token on each axis (Figure 2)
When a matrix is periodic, use the Fourier transform
Use singular value decomposition (Appendix C)
Takeaway: get comfy with esch-plots

Figure 2: Top singular vectors of $\mathbf{U}$ for $W_E$ on $\mathcal{T}_{\text{nanda}}$ (top); varying $x_0$ and $x_1$ in sample (left) and freq. (right) space in $W_{\text{out}}$ on $\mathcal{T}_{\text{miiii}}$
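
The gradient filter in Eq. 1.1 and 1.2 on the Grokking slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of Lee et al. (2024): the function name `grokfast_ema`, the symbol `h` for the running average of gradients, the flat-list gradient representation, and the values of `alpha` and `lam` are all assumptions made for the example.

```python
def grokfast_ema(grads, h_prev, alpha=0.9, lam=2.0):
    """One step of the slow-gradient boost from Eq. 1.1 and 1.2.

    grads  : current gradients g(t), as a flat list of floats
    h_prev : previous running average h(t-1), same length as grads
    alpha  : EMA momentum (illustrative value, an assumption)
    lam    : amplification factor lambda (illustrative value, an assumption)
    """
    # Eq. 1.1: exponential moving average keeps the slow gradient component
    h = [alpha * hp + (1.0 - alpha) * g for hp, g in zip(h_prev, grads)]
    # Eq. 1.2: add the amplified slow component back onto the raw gradient
    g_hat = [g + lam * hi for g, hi in zip(grads, h)]
    return g_hat, h


# Toy gradient stream: a constant slow part (1.0) plus a fast part that
# flips sign every step. The EMA should settle close to the slow part.
h_state = [0.0]
for t in range(200):
    fast = 1.0 if t % 2 == 0 else -1.0
    g_hat, h_state = grokfast_ema([1.0 + fast], h_state)

print("running average h:", h_state[0])
print("filtered gradient g_hat:", g_hat[0])
```

In the toy stream the running average ends up near the constant component while the alternating component is mostly averaged out, which is the sense in which the filter "boosts slow gradients". In a real training loop the filtered gradient would be handed to the optimizer in place of the raw one.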