Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers
Noah Syrkis
January 8, 2025

Outline: 1 | Mechanistic Interpretability (MI) · 2 | Modular Arithmetic · 3 | Grokking on 𝒯miiii · 4 | Embeddings · 5 | Neurons · 6 | The 𝜔-Spike

"This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm" — Neel Nanda¹
¹Not verbatim, but the gist of it

1 | Mechanistic Interpretability (MI)

• The sub-symbolic nature of deep learning obscures model mechanisms
• There is no obvious mapping from the weights of a trained model to math notation
• MI is about reverse engineering these models and looking closely at them
• How does a given model work? How can we train it faster? Is it safe?

1 | Mechanistic Interpretability (MI)

• Open areas of research in MI include training dynamics, algorithms, and much more
• Early MI work reverse engineered modular-arithmetic-related tasks [1]
• Figure 1 shows 𝑦 for each (𝑥0, 𝑥1), with the mapping given by Eq. 1
• This task is denoted 𝒯nanda

𝑦 = (𝑥0 + 𝑥1) mod 𝑝,  𝑝 = 113   (1)

Figure 1: esch diagram of 𝑦 (remainders) from 𝒯nanda, with 𝑥0 and 𝑥1 on the axes

1 | Mechanistic Interpretability (MI)

On 𝑦 from 𝒯nanda:

Table 1: (𝑥0, 𝑥1)-tuples for 𝑝 = 3
(0,0)  (1,0)  (2,0)
(0,1)  (1,1)  (2,1)
(0,2)  (1,2)  (2,2)

Figure 2: esch of (𝑥0, 𝑥1)-tuples for 𝑝 = 5

1 | Mechanistic Interpretability (MI)

(This slide shows the full array of remainders 𝑦 for every (𝑥0, 𝑥1) pair; the individual values are figure residue and are omitted here.)

1 | Mechanistic Interpretability (MI)

• Patterns in 𝑦 are hard to see as (𝑥0, 𝑥1) move through [0..𝑝−1]

Figure 3: Visualizing 𝑦 as 𝑥0 and 𝑥1 move through [0..𝑝−1]

1 | Mechanistic Interpretability (MI)

1. Make a task
2. Solve the task
3. Inspect the solution

Think artificial neuroscience.

Figure 4: Target 𝑦 as 𝑥0 and 𝑥1 move from 0 to 𝑝−1 for the task (𝑥0 + 𝑥1) mod 𝑝 = 𝑦

1.1 | Grokking [2]

• Sudden generalization long after overfitting
• MI (by definition) needs a mechanism
• Grokking is thus convenient for MI

Figure 5: Example of grokking

2 | Modular Arithmetic

• The "seminal" MI paper by Nanda et al. (2023) focuses on modular addition (𝒯nanda)
• Their final setup trains on 𝑝 = 113
• They train a one-layer transformer
• We call their task 𝒯nanda, and ours we call 𝒯miiii

𝒯nanda: 𝑦 = (𝑥0 + 𝑥1) mod 𝑝,  𝑥0, 𝑥1 ∈ [0..𝑝−1]   (2.1)
𝒯miiii: 𝑦 = (𝑥0·𝑝⁰ + 𝑥1·𝑝¹) mod 𝑞,  𝑞 < 𝑝   (2.2)

2 | Modular Arithmetic

• 𝒯miiii is non-commutative …
• … and multi-task: 𝑞 ranges from 2 to 109¹
• 𝒯nanda uses a single-layer transformer
• Note that these tasks are synthetic and trivial to solve with conventional programming
• They are used in the MI literature to turn black boxes transparent

¹The largest prime less than 𝑝 = 113
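To make Eqs. (2.1) and (2.2) concrete, here is a minimal sketch of how the two tasks' labels could be generated. This is an assumed NumPy illustration, not the talk's actual code; the names (p, x0, x1, qs) are chosen for exposition.

```python
# Minimal sketch (assumed, not the author's implementation) of the label
# construction for T_nanda (Eq. 2.1) and T_miiii (Eq. 2.2).
import numpy as np

p = 113                                          # prime used by both tasks
x0, x1 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
x0, x1 = x0.ravel(), x1.ravel()                  # all p*p two-token samples

# T_nanda: single-task modular addition
y_nanda = (x0 + x1) % p

# T_miiii: remainder of the base-p number x0*p^0 + x1*p^1 w.r.t. every prime q < p
qs = [q for q in range(2, p) if all(q % d for d in range(2, q))]   # 2, 3, ..., 109
y_miiii = np.stack([(x0 * p**0 + x1 * p**1) % q for q in qs], axis=1)

print(y_nanda.shape)    # (12769,)    -- one label per sample
print(y_miiii.shape)    # (12769, 29) -- one label per sample per prime subtask
```

Note how the 𝑝¹ weighting on 𝑥1 is what makes 𝒯miiii non-commutative, and the loop over 𝑞 is what makes it multi-task.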
Figure 6: < 𝑝² multiples of 13 or 27 (left), 11 (mid.), or primes (right)

3 | Grokking on 𝒯miiii

• For two-token samples, plot them varying one token on each axis (Figure 7)
• When a matrix is periodic, use a Fourier basis
• Singular value decomposition

Figure 7: Top singular vectors 𝐔 of 𝑊_E for 𝒯nanda (top), varying 𝑥0 and 𝑥1, in sample (left) and frequency (right) space, and of 𝑊out for 𝒯miiii

3 | Grokking on 𝒯miiii

• The model groks on 𝒯miiii (Figure 8)
• GrokFast [3] was needed on the available compute budget
• The final hyperparameters are given in Table 3

Table 3: Hyperparameters for 𝒯miiii
rate   𝜆     wd    𝑑     lr       heads
1/10   1/2   1/3   256   3·10⁻⁴   4

Figure 8: Training (top) and validation (bottom) accuracy during training on 𝒯miiii

4 | Embeddings

How the embedding layer deals with the difference between 𝒯nanda and 𝒯miiii.

4.1 | Correcting for non-commutativity

• The positional embeddings of Figure 9 reflect that 𝒯nanda is commutative and 𝒯miiii is not
• Maybe: this corrects for the non-commutativity of 𝒯miiii?
• The correlation is 0.95 for 𝒯nanda and 0.64 for 𝒯miiii

Figure 9: Positional embeddings for 𝒯nanda (top) and 𝒯miiii (bottom)

4.2 | Correcting for multi-tasking

• For 𝒯nanda, token embeddings are essentially linear combinations of 5 frequencies (𝜔)
• For 𝒯miiii, more frequencies are in play
• Each 𝒯miiii subtask targets a unique prime
• Possibility: one basis per prime task

Figure 10: 𝒯nanda (top) and 𝒯miiii (bottom) token embeddings in Fourier basis

4.3 | Sanity-check and task-mask

• Masking out 𝑞 ∈ {2, 3, 5, 7} yields a slight decrease in token embedding frequencies
• Sanity check: 𝒯baseline has no periodicity
• Do the token embeddings encode a basis per subtask?

Figure 11: 𝒯baseline (top), 𝒯miiii (middle) and 𝒯masked (bottom) token embeddings in Fourier basis

5 | Neurons

In spite of the dense Fourier basis of 𝑊_E for 𝒯miiii, the periodicity is clear.

Figure 12: Activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom)

5 | Neurons

• (Probably redundant) sanity check: Figure 13 confirms the neurons are periodic
• We see some frequencies 𝜔 rise into significance
• Let's log the number of frequencies above 𝜇_𝜔 + 2𝜎_𝜔 while training

Figure 13: FFT of the activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom)

Figure 14: Neurons as archive for 𝒯baseline
Figure 15: Neurons as algorithm for 𝒯miiii
Figure 16: Number of neurons with frequency 𝜔 above the threshold 𝜇_𝜔 + 2𝜎_𝜔

6 | The 𝜔-Spike

• Neurons become periodic on solving 𝑞 ∈ {2, 3, 5, 7}
• When the model generalizes to the remaining tasks, many frequencies activate (64-sample)
• Those 𝜔's are not useful for memorization and not useful after generalization

Table 4: Active 𝜔's through training
time   256   1024   4096   16384   65536
|𝝎|    0     0      10     18      10

Figure 17: Figure 16 (top) and validation accuracy from Figure 8 (bottom)

6 | The 𝜔-Spike

GrokFast [3] shows that the gradient sequence through time is (arguably) a stochastic signal with:
• a fast-varying overfitting component
• a slow-varying generalizing component

My work confirms this to be true for 𝒯miiii … and observes a structure that seems to fit neither of the two.

6 | The 𝜔-Spike

Future work:
• Modify GrokFast to assume a third stochastic component
• Relate to the signal processing literature
• Can more depth make the token embeddings sparse?

Thanks!

References

[1] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability," arXiv:2301.05217, Oct. 2023.
[2] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177, Jan. 2022. doi: 10.48550/arXiv.2201.02177.
[3] J. Lee, B. G. Kang, K. Kim, and K. M. Lee, "Grokfast: Accelerated Grokking by Amplifying Slow Gradients," arXiv:2405.20233, Jun. 2024.

A | Stochastic Signal Processing

We denote the weights of a model as 𝜃, and the gradient of the loss with respect to 𝜃 at time 𝑡 as 𝑔(𝑡). As we train the model, 𝑔(𝑡) varies, going up and down; this can be thought of as a stochastic signal, which we can represent in a Fourier basis. GrokFast posits that the slow-varying frequencies contribute to grokking. Higher frequencies are then muted, and grokking is indeed accelerated.
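As a rough illustration of that low-pass idea, here is a sketch of an EMA-based gradient filter in the spirit of GrokFast. It is a simplified assumption of mine, not the reference implementation from [3], and the helper names in the commented usage are hypothetical.

```python
# Simplified GrokFast-style filtering (assumed sketch; see [3] for the real
# method): keep an exponential moving average (EMA) of each parameter's
# gradient and amplify that slow-varying component before the optimizer step.
import numpy as np

def grokfast_filter(grads, ema, alpha=0.98, lamb=2.0):
    """grads, ema: dicts mapping parameter name -> np.ndarray (illustrative defaults)."""
    filtered = {}
    for name, g in grads.items():
        ema[name] = alpha * ema.get(name, np.zeros_like(g)) + (1 - alpha) * g
        filtered[name] = g + lamb * ema[name]    # boost the slow component
    return filtered, ema

# Hypothetical usage inside a training loop:
# ema = {}
# for step in range(num_steps):
#     grads = compute_gradients(params, batch)   # hypothetical helper
#     grads, ema = grokfast_filter(grads, ema)
#     params = optimizer_step(params, grads)     # hypothetical helper
```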
B | Discrete Fourier Transform

A function can be expressed as a linear combination of cosine and sine waves. A similar thing can be done for data / vectors.

C | Singular Value Decomposition

An 𝑚×𝑛 matrix 𝑀 can be represented as 𝑈Σ𝑉*, where 𝑈 is an 𝑚×𝑚 complex unitary matrix, Σ a rectangular 𝑚×𝑛 diagonal matrix (padded with zeros), and 𝑉 an 𝑛×𝑛 complex unitary matrix. Multiplying by 𝑀 can thus be viewed as first rotating in 𝑛-space with 𝑉*, then scaling by Σ, and then rotating in 𝑚-space with 𝑈.
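A tiny worked example of this view, assuming NumPy's built-in SVD (illustrative values, not taken from the talk):

```python
# Tiny SVD illustration: decompose a 2x3 matrix and verify M = U @ Sigma @ Vh,
# i.e. rotate in n-space, scale, then rotate in m-space.
import numpy as np

M = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 0.0]])            # m = 2, n = 3

U, S, Vh = np.linalg.svd(M)                # U: 2x2 unitary, Vh: 3x3 unitary

Sigma = np.zeros_like(M)                   # rectangular diagonal, padded with zeros
Sigma[:len(S), :len(S)] = np.diag(S)

print(np.allclose(M, U @ Sigma @ Vh))      # True
print(S)                                   # singular values: [4. 2.]
```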