Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers
Noah Syrkis
January 8, 2025

Outline:
1 | Mechanistic Interpretability
2 | Modular Arithmetic
3 | Grokking on 𝒯_miiii
4 | Embeddings
5 | Neurons
6 | The 𝜔-Spike

"This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm" — Neel Nanda¹

¹ Not verbatim, but the gist of it.

1 | Mechanistic Interpretability

- The sub-symbolic nature of deep learning obscures model mechanisms
- There is no obvious mapping from the weights of a trained model to math notation
- MI is about reverse engineering these models and looking closely at them
- Many low-hanging fruits; we are in the practical-botany phase of the science
- How does a given model work? How can we train it faster? Is it safe?

1.1 | Grokking

- Grokking [1] is "sudden generalization"
- MI (often) needs a mechanism to study
- Grokking is thus convenient for MI

Figure 1: Grokking

2 | Modular Arithmetic

- The "seminal" MI paper by Nanda et al. (2023) [2] focuses on modular addition (𝒯_nanda)
- Their final setup trains on 𝑝 = 113
- They train a one-layer transformer
- We call their task 𝒯_nanda, and ours 𝒯_miiii

𝒯_nanda = (𝑥₀ + 𝑥₁) mod 𝑝,   𝑥₀, 𝑥₁ < 𝑝    (1.1)
𝒯_miiii = (𝑥₀𝑝⁰ + 𝑥₁𝑝¹) mod 𝑞,   𝑞 < 𝑝    (1.2)

- 𝒯_miiii is non-commutative …
- … and multi-task: 𝑞 ranges over the primes from 2 to 109¹
- Note that these tasks are synthetic and trivial to solve with conventional programming
- They are used in the MI literature to open up black boxes

¹ 109 is the largest prime less than 𝑝 = 113.

Figure 2: Multiples (less than 𝑝²) of 13 or 27 (left), of 11 (middle), or primes (right)

3 | Grokking on 𝒯_miiii

- For two-token samples, plot them varying one token on each axis (Figure 3)
- When a matrix is periodic, use a Fourier basis
- Singular value decomposition

Figure 3: Top singular vectors of 𝐔 of 𝑊_E for 𝒯_nanda (top), varying 𝑥₀ and 𝑥₁, in sample (left) and frequency (right) space, and in 𝑊_out for 𝒯_miiii

- The model groks on 𝒯_miiii (Figure 4)
- GrokFast [3] was needed to fit the compute budget
- The final hyperparameters are shown in Table 1

Table 1: Hyperparameters for 𝒯_miiii

rate   𝜆     wd    𝑑     lr       heads
1/10   1/2   1/3   256   3·10⁻⁴   4

Figure 4: Training (top) and validation (bottom) accuracy during training on 𝒯_miiii

4 | Embeddings

How the embedding layer deals with the difference between 𝒯_nanda and 𝒯_miiii.

4.1 | Correcting for non-commutativity

- The positional embeddings in Figure 5 reflect that 𝒯_nanda is commutative and 𝒯_miiii is not
- Maybe this corrects for the non-commutativity of 𝒯_miiii?
- The correlation between the two positional embeddings is 0.95 for 𝒯_nanda and 0.64 for 𝒯_miiii

Figure 5: Positional embeddings for 𝒯_nanda (top) and 𝒯_miiii (bottom)

4.2 | Correcting for multi-tasking

- For 𝒯_nanda, token embeddings are essentially linear combinations of 5 frequencies (𝜔)
- For 𝒯_miiii, more frequencies are in play
- Each 𝒯_miiii subtask targets a unique prime
- Possibility: one basis per prime task

Figure 6: 𝒯_nanda (top) and 𝒯_miiii (bottom) token embeddings in the Fourier basis

4.3 | Sanity check and task mask

- Masking out 𝑞 ∈ {2, 3, 5, 7} yields a slight decrease in token-embedding frequencies
- Sanity check: 𝒯_baseline has no periodicity
- Do the token embeddings encode a basis per subtask?

Figure 7: 𝒯_baseline (top), 𝒯_miiii (middle) and 𝒯_masked (bottom) token embeddings in the Fourier basis

5 | Neurons

- Despite the dense Fourier basis of 𝑊_E for 𝒯_miiii, the periodicity of the neurons is clear

Figure 8: Activations of the first three neurons for 𝒯_nanda (top) and 𝒯_miiii (bottom)
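Before turning to the FFT view of these activations (Figure 9), here is a minimal sketch of how such a periodicity check can be done. This is not the project's code: the grid setup, the array names, and the toy cosine "neuron" are assumptions; only the 𝜇_𝜔 + 2𝜎_𝜔 threshold comes from the talk.

```python
# Minimal sketch (assumed setup, not the project's code): check whether a
# neuron is periodic in the two input tokens by looking at its spectrum over
# the full (x0, x1) grid, as in Figures 8 and 9.
import numpy as np

p = 113
x0, x1 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
acts = np.cos(2 * np.pi * 17 * (x0 + x1) / p)  # toy "neuron" with frequency omega = 17

power = np.abs(np.fft.fft2(acts))              # 2D spectrum over the input grid
power[0, 0] = 0.0                              # ignore the constant component
spectrum = power.max(axis=1)[: p // 2]         # one value per frequency omega

# The thresholding used below: omega is "active" if it exceeds mu + 2 sigma.
active = np.nonzero(spectrum > spectrum.mean() + 2 * spectrum.std())[0]
print("active frequencies:", active)           # -> [17] for the toy neuron
```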
- A (probably redundant) sanity check: Figure 9 confirms the neurons are periodic
- We see some frequencies 𝜔 rise into significance
- Let us log the number of frequencies 𝜔 whose magnitude exceeds 𝜇_𝜔 + 2𝜎_𝜔 while training

Figure 9: FFT of the activations of the first three neurons for 𝒯_nanda (top) and 𝒯_miiii (bottom)

Figure 10: Neurons as archive for 𝒯_baseline
Figure 11: Neurons as algorithm for 𝒯_miiii
Figure 12: Number of neurons with frequency 𝜔 above the threshold 𝜇_𝜔 + 2𝜎_𝜔

6 | The 𝜔-Spike

- Neurons become periodic on solving 𝑞 ∈ {2, 3, 5, 7}
- When the model generalizes to the remaining tasks, many frequencies activate (64-sample)
- Those 𝜔's are useful neither for memorization nor after generalization

Table 2: Active 𝜔's through training

time   256   1024   4096   16384   65536
|𝜔|    0     0      10     18      10

Figure 13: Figure 12 (top) and the validation accuracy from Figure 4 (bottom)

- GrokFast [3] treats the gradient sequence over time as (arguably) a stochastic signal with:
  - a fast-varying overfitting component
  - a slow-varying generalizing component
- My work confirms this to be true for 𝒯_miiii …
- … and observes a structure that seems to fit neither of the two

Future work:
- Modify GrokFast to assume a third stochastic component
- Relate to the signal-processing literature
- Can more depth make the token embeddings sparse?

Thank you.

References

[1] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177, Jan. 2022. doi: 10.48550/arXiv.2201.02177.
[2] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability," arXiv:2301.05217, Oct. 2023.
[3] J. Lee, B. G. Kang, K. Kim, and K. M. Lee, "Grokfast: Accelerated Grokking by Amplifying Slow Gradients," arXiv:2405.20233, Jun. 2024.

A | Stochastic Signal Processing

We denote the weights of a model as 𝜃. The gradient of 𝜃 with respect to our loss function at time 𝑡 we denote 𝑔(𝑡). As we train the model, 𝑔(𝑡) varies, going up and down. This can be thought of as a stochastic signal, and we can represent it in a Fourier basis. GrokFast posits that the slow-varying frequencies contribute to grokking. Higher frequencies are then muted, and grokking is indeed accelerated.

B | Discrete Fourier Transform

A function can be expressed as a linear combination of cosine and sine waves. A similar thing can be done for data / vectors.

C | Singular Value Decomposition

An 𝑚 × 𝑛 matrix 𝑀 can be written as 𝑈Σ𝑉*, where 𝑈 is an 𝑚 × 𝑚 complex unitary matrix, Σ an 𝑚 × 𝑛 rectangular diagonal matrix (padded with zeros), and 𝑉 an 𝑛 × 𝑛 complex unitary matrix. Multiplying by 𝑀 can thus be viewed as first rotating in 𝑛-space with 𝑉*, then scaling by Σ, and then rotating in 𝑚-space with 𝑈.
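To make Appendices B and C concrete, here is a minimal NumPy sketch in the spirit of Figures 3 and 6: take a token-embedding matrix, compute its SVD, and look at the top singular vectors in the Fourier basis. The shapes and the random stand-in matrix are assumptions, not the trained model.

```python
# Minimal sketch (shapes and the random stand-in matrix are assumptions, not
# the trained model): combine the SVD (Appendix C) and the DFT (Appendix B)
# to look for periodic structure in a token-embedding matrix.
import numpy as np

p, d = 113, 256                                  # vocabulary size and model width
rng = np.random.default_rng(0)
W_E = rng.normal(size=(p, d))                    # stand-in for a trained W_E

U, S, Vt = np.linalg.svd(W_E, full_matrices=False)
top = U[:, :4]                                   # top singular vectors over tokens

# DFT along the token axis: a periodic direction shows up as a sharp peak at
# its frequency omega; for this random stand-in the spectrum is roughly flat.
spectrum = np.abs(np.fft.rfft(top, axis=0))      # shape (p // 2 + 1, 4)
print("dominant omega per singular vector:", spectrum[1:].argmax(axis=0) + 1)
```

For a trained 𝑊_E the spectra of the top singular vectors would be expected to be sparse, with a few dominant 𝜔's, whereas the random stand-in gives an essentially flat spectrum.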