MIIII

Noah Syrkis

January 8, 2025
Rio de Janeiro, Brazil
Contents
1 | Mechanistic Interpretability (MI)
2 | Modular Arithmetic
3 | Grokking on 𝒯miiii
4 | Embeddings
5 | Neurons
6 | The 𝜔-Spike

“This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm” — Neel Nanda¹
¹ Not verbatim, but the gist of it.

1 | Mechanistic Interpretability (MI)

The sub-symbolic nature of deep learning obscures model mechanisms
There is no obvious mapping from the weights of a trained model to mathematical notation
MI is about reverse engineering these models and looking closely at them
How does a given model work? How can we train it faster? Is it safe?

Early MI work focuses on modular addition [1]
𝒯nanda concerns a model mapping (𝑥0, 𝑥1) ↦ 𝑦
The true mapping is given by 𝑦 = (𝑥0 + 𝑥1) mod 𝑝

Table 1: (𝑥0, 𝑥1)-tuples for 𝑝 = 3
(0,0)  (1,0)  (2,0)
(0,1)  (1,1)  (2,1)
(0,2)  (1,2)  (2,2)

Figure 1: esch plot of (𝑥0, 𝑥1)-tuples for 𝑝 = 5
Figure 2: esch diagram of 𝑦 from 𝒯nanda, with 𝑥0 and 𝑥1 on the axes and colour showing the remainder

[Figure: the same remainders printed as a raw 𝑝 × 𝑝 array of numbers]
Raw arrays of numbers like this are hard to see

1. Make a task
2. Solve the task
3. Inspect the solution
Think artificial neuroscience

Figure 3: Target 𝑦 as 𝑥0 and 𝑥1 move from 0 to 𝑝 − 1 for the task (𝑥0 + 𝑥1) mod 𝑝 = 𝑦

1.1 | Grokking [1]

Sudden generalization long after overfitting
MI (by definition) needs a mechanism to reverse engineer
Grokking is thus convenient for MI

Figure 4: Example of grokking

2 | Modular Arithmetic

The “seminal” MI paper by Nanda et al. (2023) [1] focuses on modular addition
Their final setup trains a one-layer transformer on 𝑝 = 113
We call their task 𝒯nanda, and ours 𝒯miiii

𝒯nanda = (𝑥0 + 𝑥1) mod 𝑝, ∀ 𝑥0, 𝑥1  (1.1)
𝒯miiii = (𝑥0 𝑝⁰ + 𝑥1 𝑝¹) mod 𝑞, ∀ 𝑞 < 𝑝  (1.2)

𝒯miiii is non-commutative …
… and multi-task: 𝑞 ranges from 2 to 109²
𝒯nanda uses a single-layer transformer
Note that these tasks are synthetic and trivial to solve with conventional programming
They are used in the MI literature to make black boxes transparent
² The largest prime less than 𝑝 = 113.

Figure 5: Multiples (< 𝑝²) of 13 or 27 (left), of 11 (middle), or of primes (right)

3 | Grokking on 𝒯miiii

For two-token samples, plot them varying one token on each axis (Figure 6)
When a matrix is periodic, use the Fourier transform (Appendix B)
Use singular value decomposition (Appendix C)

Figure 6: Top singular vectors 𝐔 of 𝑊_E for 𝒯nanda (top); varying 𝑥0 and 𝑥1 in sample (left) and frequency (right) space in 𝑊_out for 𝒯miiii

The model groks on 𝒯miiii (Figure 7)
GrokFast [2] was needed to stay within the compute budget
Final hyperparameters are listed in Table 3

Table 3: Hyperparameters for 𝒯miiii
rate   𝜆     wd    𝑑     lr       heads
1/10   1/2   1/3   256   3·10⁻⁴   4

Figure 7: Training (top) and validation (bottom) accuracy during training on 𝒯miiii

4 | Embeddings

How the embedding layer deals with the difference between 𝒯nanda and 𝒯miiii

4.1 | Correcting for non-commutativity

The positional embeddings in Figure 8 reflect that 𝒯nanda is commutative and 𝒯miiii is not
Maybe: this corrects for the non-commutativity of 𝒯miiii?
The correlation between the two positional embeddings is 0.95 for 𝒯nanda and 0.64 for 𝒯miiii

Figure 8: Positional embeddings for 𝒯nanda (top) and 𝒯miiii (bottom)
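The correlation quoted above can be read as a plain Pearson correlation between the two learned position vectors. Below is a minimal NumPy sketch, assuming a positional embedding matrix with one row per token position; the array names, shapes, and random values are illustrative stand-ins, not the MIIII code.

```python
import numpy as np

# Illustrative stand-in for a learned positional embedding matrix
# of shape (2, d): one row per token position.
rng = np.random.default_rng(0)
w_pos = rng.normal(size=(2, 256))

def position_correlation(w_pos: np.ndarray) -> float:
    """Pearson correlation between the embeddings of positions 0 and 1."""
    return float(np.corrcoef(w_pos[0], w_pos[1])[0, 1])

# A commutative task can treat the two positions almost identically
# (correlation near 1); a non-commutative task must keep them apart.
print(position_correlation(w_pos))
```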
4.2 | Correcting for multi-tasking

For 𝒯nanda, the token embeddings are essentially linear combinations of 5 frequencies (𝜔)
For 𝒯miiii, more frequencies are in play
Each 𝒯miiii subtask targets a unique prime
Possibility: one basis per prime task

Figure 9: 𝒯nanda (top) and 𝒯miiii (bottom) token embeddings in the Fourier basis

4.3 | Sanity-check and task-mask

Masking out 𝑞 ∈ {2, 3, 5, 7} yields a slight decrease in token embedding frequencies
Sanity check: 𝒯baseline has no periodicity
Do the token embeddings encode a basis per subtask?

Figure 10: 𝒯baseline (top), 𝒯miiii (middle) and 𝒯masked (bottom) token embeddings in the Fourier basis

5 | Neurons

In spite of the dense Fourier basis of 𝑊_E for 𝒯miiii, the periodicity of the neurons is clear

Figure 11: Activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom)

(Probably redundant) sanity check: Figure 12 confirms the neurons are periodic
Some frequencies 𝜔 rise into significance
Let us log |{𝜔 : 𝜔 > 𝜇_𝜔 + 2𝜎_𝜔}| while training

Figure 12: FFT of the activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom)
Figure 13: Neurons as archive for 𝒯baseline
Figure 14: Neurons as algorithm for 𝒯miiii
Figure 15: Number of neurons with frequency 𝜔 above the threshold 𝜇_𝜔 + 2𝜎_𝜔

6 | The 𝜔-Spike

Neurons become periodic upon solving 𝑞 ∈ {2, 3, 5, 7}
When the model generalizes to the remaining tasks, many frequencies activate (64-sample)
Those 𝜔's are useful neither for memorization nor after generalization

Table 4: Active 𝜔's through training
time   256   1024   4096   16384   65536
|𝝎|    0     0      10     18      10

Figure 16: Figure 15 (top) and the validation accuracy from Figure 7 (bottom)

GrokFast [2] treats the sequence of gradients over time as (arguably) a stochastic signal with:
A fast-varying overfitting component
A slow-varying generalizing component
My work confirms this to be true for 𝒯miiii …
… and observes a structure that seems to fit neither of the two

Future work:
Modify GrokFast to assume a third stochastic component
Relate this to the signal processing literature
Can more depth make the token embeddings sparse?

References
[1] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, “Progress Measures for Grokking via Mechanistic Interpretability,” arXiv:2301.05217, Oct. 2023.
[2] J. Lee, B. G. Kang, K. Kim, and K. M. Lee, “Grokfast: Accelerated Grokking by Amplifying Slow Gradients,” arXiv:2405.20233, Jun. 2024.

A | Stochastic Signal Processing
We denote the weights of a model as 𝜃, and the gradient of the loss with respect to 𝜃 at time 𝑡 as 𝑔(𝑡). As training proceeds, 𝑔(𝑡) varies up and down; it can be thought of as a stochastic signal and represented in a Fourier basis. GrokFast posits that the slowly varying frequencies contribute to grokking. Higher frequencies are then muted, and grokking is indeed accelerated.

B | Discrete Fourier Transform
A function can be expressed as a linear combination of cosine and sine waves. A similar thing can be done for data / vectors.

C | Singular Value Decomposition
An 𝑚 × 𝑛 matrix 𝑀 can be written as 𝑈Σ𝑉*, where 𝑈 is an 𝑚 × 𝑚 complex unitary matrix, Σ a rectangular 𝑚 × 𝑛 diagonal matrix (padded with zeros), and 𝑉 an 𝑛 × 𝑛 complex unitary matrix. Multiplying a vector by 𝑀 can thus be viewed as first rotating it in 𝑛-space with 𝑉*, then scaling by Σ, and then rotating with 𝑈 in 𝑚-space.
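A minimal NumPy sketch of this decomposition and of the rotate, scale, rotate reading; the matrix here is random and its shape arbitrary, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(113, 256))  # an arbitrary m x n matrix

# Full SVD: U is (113, 113), S holds the singular values, Vt is (256, 256).
U, S, Vt = np.linalg.svd(M, full_matrices=True)

# Rebuild the rectangular diagonal Sigma and check that M = U @ Sigma @ Vt.
Sigma = np.zeros_like(M)
Sigma[:S.size, :S.size] = np.diag(S)
assert np.allclose(U @ Sigma @ Vt, M)

# Applying M to a vector x: rotate with Vt, scale with Sigma, rotate with U.
x = rng.normal(size=256)
assert np.allclose(M @ x, U @ (Sigma @ (Vt @ x)))

# The leading columns of U are "top singular vectors" in the sense of Figure 6.
top_u = U[:, :3]
```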
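To make Appendix A concrete, here is a sketch of the kind of low-pass gradient filtering GrokFast describes, in its exponential moving average form. The function name and the values of `alpha` and `lam` are illustrative choices for this sketch, not the hyperparameters used in this work.

```python
import numpy as np

def grokfast_ema_step(grads, ema, alpha=0.98, lam=2.0):
    """Amplify the slow-varying component of the gradient signal.

    grads and ema are dicts mapping parameter names to arrays. The EMA
    tracks the low-frequency part of g(t); the filtered gradient adds it
    back, scaled by lam, on top of the raw gradient.
    """
    new_ema = {k: alpha * ema[k] + (1 - alpha) * g for k, g in grads.items()}
    filtered = {k: g + lam * new_ema[k] for k, g in grads.items()}
    return filtered, new_ema

# Toy usage with a single parameter tensor.
grads = {"w": np.array([0.10, -0.20])}
ema = {"w": np.zeros(2)}
filtered, ema = grokfast_ema_step(grads, ema)
```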