Noah Syrkis
noah[at]syrkis.com
MIIII

January 8, 2026
Copenhagen, Denmark
Talk on Mechanistic Interpretability of Deep Learning Models presented at the University of Copenhagen
Contents
1 | Mechanistic Interpretability (MI)
2 | Grokking and Generalization
3 | Modular Arithmetic
4 | Grokking on 𝒯miiii
5 | Embeddings
6 | Neurons / the Ω-spike

"This disgusting pile of matrices is just a poorly written elegant algorithm" — Neel Nanda¹
¹ Not verbatim, but the gist of it.

1 | Mechanistic Interpretability (MI)
- Deep learning (DL) is sub-symbolic: there is no clear map from parameters to mathematical notation.
- MI is about finding that map.
- Step 1: train a model on a task. Step 2: reverse-engineer it.
- Turning black boxes … less opaque?

Figure 1: Activations of an MLP neuron trained on modular addition ((x₀ + x₁) mod p = y).

1.1 | MI-Style Questions
- When does the model learn what?
- Are the learned mechanisms static?
- How are the mechanisms learned?
- How do we write a learnt algorithm in math, e.g. f(x) = sin(wₑx) + cos(wₑx)?

Figure 2: Mapping model parameters to math.

2 | Grokking and Generalization
- Grokking [1] is generalization after overfitting.
- Mechanistic interpretability needs a mechanism to interpret.
- The model's parameters move from archive to algorithm.
- Figure 3 shows an example of train and eval curves.

Figure 3: Example of the grokking phenomenon.

3 | Modular Arithmetic
- In the following, assume p and q are prime.
- The seminal MI work [2] uses Eq. 1.1 as its task.
- We created the strictly harder task of Eq. 1.2.
- Eq. 1.2 is multitask and non-commutative.

y = (x₀ + x₁) mod p            (1.1)
y = (x₀ + x₁p) mod q,  q < p   (1.2)

- Figure 4 shows a visualization of a subset of the data.
- On top we see all (x₀, x₁)-pairs for p = 7.
- Below: (x₀ + x₁p) mod q for p = 13, q = 11.

Figure 4: Visualizing X for p = 7 (top) and Y for q = 11, p = 13 (bottom).

4 | Grokking on 𝒯miiii
- The model groks on 𝒯miiii (Figure 5).
- The final hyperparameters are listed in Table 2.
- GrokFast [3] posits that the gradient series is made of:
  1. a fast-varying overfitting component,
  2. a slow-varying generalizing component.
- Grokking is sped up¹ by boosting the latter.
¹ Our model did not converge without GrokFast.

Figure 5: Training (top) and validation (bottom) accuracy during training on 𝒯miiii.

5 | Embeddings
- The positional embeddings in Figure 6 show commutativity.
- The correlation is 0.95 for 𝒯nanda and 0.64 for 𝒯miiii.
- They are assumed to fully account for commutativity.

Figure 6: Positional embeddings for 𝒯nanda (top) and 𝒯miiii (bottom).

- For 𝒯nanda, the token embeddings are a linear combination of 5 frequencies.
- For 𝒯miiii, more frequencies indicate a larger table.
- Each task focuses on a unique prime (no overlap).
- As per Figure 7, the embeddings of 𝒯miiii are saturated.

Figure 7: 𝒯nanda (top) and 𝒯miiii (bottom) token embeddings in the Fourier basis.

Conclusion: the embeddings alone account for commutativity and the multitask structure.

6 | Neurons / the Ω-spike
- We plot neuron activations while varying x₀ and x₁.
- The activations are largely identical to those of 𝒯nanda.

Figure 8: Activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom).

- Some frequencies ω rise to significance (ω > μ + 2σ).
- But how many? And at what points in time?

Figure 9: FFT of the activations of the first three neurons for 𝒯nanda (top) and 𝒯miiii (bottom).
Figure 10: Number of neurons with active frequency ω (rows) through time (cols).

- The initial frequencies coincide with solving q = 2, 3, 5 and 7.
- There is a spike in active frequencies during generalization.
- And a decrease in active frequencies after generalization.

Table 1: Number of active frequencies ω through training.
epoch | 256 | 1024 | 4096 | 16384 | 65536
freqs |   0 |    0 |   10 |    18 |    10

Figure 11: Figure 10 (top) and the validation accuracy from Figure 5 (bottom).

- Previous work [2] shows that the final circuitry begins developing right away (no sudden phase shift).
- GrokFast [3] targets this circuitry, assuming the associated gradient updates to be slow-varying.
- With the Ω-spike we observe temporarily useful structures (not part of the final solution).
- We propose modifying GrokFast to allow dynamic targeting of temporarily useful circuitry.

References
[1] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177, Jan. 2022. doi: 10.48550/arXiv.2201.02177.
[2] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability," arXiv:2301.05217, Oct. 2023.
[3] J. Lee, B. G. Kang, K. Kim, and K. M. Lee, "Grokfast: Accelerated Grokking by Amplifying Slow Gradients," arXiv:2405.20233, May 2024.

A | Hyperparameters

Table 2: Hyperparameters for 𝒯miiii.
rate | λ   | wd  | d   | lr     | heads
1/10 | 1/2 | 1/3 | 256 | 3·10⁻⁴ | 4

B | Stochastic Signal Processing
We denote the weights of a model as θ, and the gradient of the loss with respect to θ at time t as g(t). As we train the model, g(t) fluctuates up and down; it can be thought of as a stochastic signal and represented in a Fourier basis. GrokFast posits that the slow-varying frequencies are the ones that contribute to grokking. The higher frequencies are therefore muted, and grokking is indeed accelerated.

C | Discrete Fourier Transform
A function can be expressed as a linear combination of cosine and sine waves. The same can be done for discrete data / vectors via the discrete Fourier transform.

D | Singular Value Decomposition
An m × n matrix M can be written as M = UΣVᵀ, where U is an m × m unitary matrix, Σ an m × n rectangular diagonal matrix (padded with zeros), and V an n × n unitary matrix. Multiplying by M can thus be viewed as first rotating in n-space with Vᵀ, then scaling by Σ, and then rotating in m-space with U.
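To make the 𝒯miiii task of Eq. 1.2 concrete: interpret (x₀, x₁) as base-p digits and predict the remainder modulo every prime q < p. A minimal sketch in plain Python; the function and variable names are illustrative, not taken from the MIIII codebase:

```python
def primes_below(n):
    """All primes strictly less than n (trial division; fine for small n)."""
    return [k for k in range(2, n) if all(k % d for d in range(2, k))]

def miiii_dataset(p):
    """All (x0, x1) pairs with multitask labels [(x0 + x1*p) mod q for q < p]."""
    qs = primes_below(p)
    X = [(x0, x1) for x0 in range(p) for x1 in range(p)]
    Y = [[(x0 + x1 * p) % q for q in qs] for (x0, x1) in X]
    return X, Y, qs

X, Y, qs = miiii_dataset(11)
print(qs)      # [2, 3, 5, 7] -- one prediction head per prime
print(len(X))  # 121, i.e. p**2 input pairs
```

Note the non-commutativity: swapping x₀ and x₁ changes x₀ + x₁p, unlike in Eq. 1.1.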
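The ω > μ + 2σ significance criterion used in the neuron analysis can be sketched as follows, assuming a one-dimensional activation signal (the thresholding rule is from the slides; the function name is illustrative):

```python
import numpy as np

def active_freqs(signal):
    """Frequencies whose FFT magnitude exceeds mean + 2 std of all magnitudes."""
    mags = np.abs(np.fft.rfft(signal))
    mu, sigma = mags.mean(), mags.std()
    return np.where(mags > mu + 2 * sigma)[0]

# A pure sine at frequency 3 has exactly one active frequency.
t = np.arange(64)
print(active_freqs(np.sin(2 * np.pi * 3 * t / 64)))  # [3]
```

Counting such active frequencies per neuron over checkpoints yields Figure 10 and Table 1.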
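Appendix B's claim, that boosting the slow-varying gradient component accelerates grokking, can be sketched with an EMA-style low-pass filter on the gradients. This is a minimal sketch of the idea, not the exact GrokFast implementation; alpha and lam are illustrative hyperparameter names:

```python
import numpy as np

def grokfast_ema(grad, ema, alpha=0.98, lam=2.0):
    """Boost the slow-varying gradient component via an exponential moving average."""
    ema = alpha * ema + (1 - alpha) * grad  # low-pass filter of the gradient signal
    return grad + lam * ema, ema            # boosted gradient, updated filter state

# With a constant gradient stream, the EMA converges to the gradient itself,
# so the boosted gradient approaches grad * (1 + lam).
g = np.ones(3)
ema = np.zeros(3)
for _ in range(500):
    boosted, ema = grokfast_ema(g, ema)
print(boosted)  # approaches [3. 3. 3.] as ema -> g
```

The proposal in Section 6 amounts to making lam (or the filter itself) dynamic, so temporarily useful circuitry can also be targeted.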
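Appendix C's point, that a vector decomposes exactly into a Fourier basis, in a few lines of NumPy:

```python
import numpy as np

# A vector, like a function, is a linear combination of sines and cosines:
# np.fft.fft gives the coefficients, ifft reconstructs the data exactly.
t = np.arange(8)
x = np.sin(2 * np.pi * t / 8) + 0.5 * np.cos(2 * np.pi * 2 * t / 8)
coeffs = np.fft.fft(x)
print(np.allclose(x, np.fft.ifft(coeffs).real))  # True: nothing is lost
# Only the constituent frequencies (and their mirror bins) carry weight.
print(np.where(np.abs(coeffs) > 1e-9)[0])  # [1 2 6 7]
```

This is the basis in which the token embeddings of Figure 7 and the neuron activations of Figure 9 are inspected.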
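Appendix D's factorization, checked numerically with NumPy:

```python
import numpy as np

# SVD: an m x n matrix M factors as U @ Sigma @ Vt, i.e. multiplying by M
# rotates with Vt in n-space, scales along the axes with Sigma, then
# rotates with U into m-space.
M = np.arange(6.0).reshape(2, 3)      # m = 2, n = 3
U, S, Vt = np.linalg.svd(M)           # U: 2x2, S: singular values, Vt: 3x3
Sigma = np.zeros((2, 3))
Sigma[:2, :2] = np.diag(S)            # pad to a rectangular diagonal matrix
print(np.allclose(M, U @ Sigma @ Vt))  # True: the factorization is exact
```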