NOAH SYRKIS

Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers

Noah Syrkis
February 24, 2025

Outline

1 | Mechanistic Interpretability (MI)
2 | Modular Arithmetic
3 | Grokking on 𝒯miiii
4 | Embeddings
5 | Neurons
6 | The 𝜔-Spike

Figure 1: Numbers < 𝑝² that are multiples of 13 or 27 (left), of 11 (mid.), or prime (right)

“This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm” — Neel Nanda¹

¹Not verbatim, but the gist of it

1 | Mechanistic Interpretability (MI)

- The sub-symbolic nature of deep learning obscures model mechanisms
- There is no obvious mapping from the weights of a trained model to math notation
- MI is about reverse engineering these models and looking closely at them
- Many low-hanging fruits / the practical-botany phase of the science
- How does a given model work? How can we train it faster? Is it safe?

1 | Grokking

- Grokking [1] is “sudden generalization”
- MI (often) needs a mechanism
- Grokking is thus convenient for MI
- Lee et al. (2024) speed up grokking by boosting slow gradients as per Eq. 1
- For more see Appendix A

ℎ(𝑡) = ℎ(𝑡−1) 𝛼 + 𝑔(𝑡) (1−𝛼)   (1.1)
ĝ(𝑡) = 𝑔(𝑡) + 𝜆 ℎ(𝑡)   (1.2)

1 | Visualizing

- MI needs creativity … but there are tricks:
- For two-token samples, plot them varying one token on each axis (Figure 2)
- When a matrix is periodic, use the Fourier transform
- Singular value decomposition (Appendix C)
- Take away: get comfy with esch-plots

Figure 2: Top singular vectors of 𝐔𝑊_𝐸 for 𝒯nanda (top); varying 𝑥0 and 𝑥1 in sample (left) and freq. (right) space in 𝑊out for 𝒯miiii

Figure 3: Shameless plug: visit github.com/syrkis/esch for more esch plots

2 | Modular Arithmetic

- The “seminal” MI paper by Nanda et al. (2023) focuses on modular addition (Eq. 2)
- Their final setup trains on 𝑝 = 113
- They train a one-layer transformer
- We call their task 𝒯nanda
- And ours, seen in Eq. 3, we call 𝒯miiii

(𝑥0 + 𝑥1) mod 𝑝,  𝑥0, 𝑥1 < 𝑝   (2)
(𝑥0 𝑝⁰ + 𝑥1 𝑝¹) mod 𝑞,  𝑞 < 𝑝   (3)

2 | Modular Arithmetic

- 𝒯miiii is non-commutative …
- … and multi-task: 𝑞 ranges from 2 to 109¹
- 𝒯nanda uses a single-layer transformer
- Note that these tasks are synthetic and trivial to solve with conventional programming
- They are used in the MI literature to turn black boxes transparent

¹The largest prime less than 𝑝 = 113

3 | Grokking on 𝒯miiii

- The model groks on 𝒯miiii (Figure 4)
- Needed GrokFast [2] given the compute budget
- Final hyperparams are seen in Table 1

Table 1: Hyperparams for 𝒯miiii

rate   𝜆     wd    𝑑     lr       heads
1/10   1/2   1/3   256   3·10⁻⁴   4

Figure 4: Training (top) and validation (bottom) accuracy during training on 𝒯miiii

4 | Embeddings

- The positional embeddings in Figure 5 reflect that 𝒯nanda is commutative and 𝒯miiii is not
- Maybe: this corrects for the non-commutativity of 𝒯miiii?
- Corr. is 0.95 for 𝒯nanda and 0.64 for 𝒯miiii

Figure 5: Positional embeddings for 𝒯nanda (top) and 𝒯miiii (bottom)

4 | Embeddings

- For 𝒯nanda, token embeddings are essentially linear combinations of 5 frequencies (𝜔)
- For 𝒯miiii, more frequencies are in play
- Each 𝒯miiii subtask targets a unique prime
- Possibility: one basis per prime task
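The gradient boost of Eq. 1 amounts to keeping an exponential moving average of past gradients and adding it, amplified, back onto the current gradient. A minimal sketch in plain Python (the function name and default values are mine, not from the talk; the run reported in Table 1 used 𝜆 = 1/2):

```python
def grokfast_filter(grads, ema, alpha=0.98, lam=0.5):
    """One step of the slow-gradient boost of Eq. 1 (sketch).

    grads, ema: dicts mapping parameter names to gradient values.
    Returns the boosted gradients and the updated moving average.
    """
    # Eq. 1.1: exponential moving average of the gradient signal.
    new_ema = {k: ema[k] * alpha + grads[k] * (1 - alpha) for k in grads}
    # Eq. 1.2: add the amplified slow component back onto the gradient.
    boosted = {k: grads[k] + lam * new_ema[k] for k in grads}
    return boosted, new_ema
```

The boosted gradients are then fed to the optimizer in place of the raw ones; the idea is that the slow gradient component is the one that drives generalization, so amplifying it shortens the delay before grokking.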
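The two tasks of Eq. 2 and Eq. 3 are easy to state as label functions. A sketch in plain Python (function names are illustrative); note how 𝒯miiii treats 𝑥0 and 𝑥1 as base-𝑝 digits, which is what breaks commutativity:

```python
def t_nanda(x0: int, x1: int, p: int = 113) -> int:
    # Eq. 2: modular addition, commutative in x0 and x1.
    return (x0 + x1) % p

def t_miiii(x0: int, x1: int, q: int, p: int = 113) -> int:
    # Eq. 3: x0 and x1 are the base-p digits of one number, q < p.
    # Swapping x0 and x1 changes the encoded number, so the task is
    # non-commutative (except when p ≡ 1 mod q).
    return (x0 * p**0 + x1 * p**1) % q
```

For example, `t_miiii(3, 5, 11)` and `t_miiii(5, 3, 11)` differ, while `t_nanda(3, 5)` and `t_nanda(5, 3)` agree.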
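The claim that 𝒯nanda's token embeddings are linear combinations of a handful of frequencies 𝜔 can be checked with a Fourier transform over the token axis of the embedding matrix. A sketch assuming a NumPy array of shape (𝑝, 𝑑); the function name, threshold, and toy matrix are mine, not from the talk:

```python
import numpy as np

def active_frequencies(W_E, thresh=0.5):
    """Frequencies omega whose Fourier power stands out (sketch).
    W_E: embedding matrix of shape (p, d), one row per token."""
    coeffs = np.fft.rfft(W_E, axis=0)    # FFT over the token axis
    power = np.abs(coeffs).sum(axis=1)   # total power per frequency
    return np.nonzero(power / power.max() > thresh)[0]

# Toy check: an embedding built from exactly two frequencies.
p, d = 113, 64
k = np.arange(p)[:, None]
W = np.sin(2 * np.pi * 5 * k / p) + np.cos(2 * np.pi * 17 * k / p)
W = np.repeat(W, d, axis=1)              # shape (p, d)
# active_frequencies(W) flags omega = 5 and omega = 17.
```

On a trained 𝑊_𝐸 this kind of plot is what distinguishes the sparse frequency basis of 𝒯nanda from the richer spectrum of 𝒯miiii.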