Noah Syrkis
noah[at]syrkis.com
MIIII

January 8, 2026
Copenhagen, Denmark
Talk on Mechanistic Interpretability of Deep Learning Models presented at the University of Copenhagen
Outline

1 | Mechanistic Interpretability (MI)
2 | Grokking and Generalization
3 | Modular Arithmetic
4 | Grokking on 𝒯miiii
5 | Embeddings
6 | Neurons / the Ω-spike

"This disgusting pile of matrices is just a poorly written elegant algorithm" — Neel Nanda¹
¹ Not verbatim, but the gist of it

1 | Mechanistic Interpretability (MI)

- Deep learning (DL) is sub-symbolic
- There is no clear map from parameters to mathematical notation
- MI is about finding that map
- Step 1: train on a task. Step 2: reverse engineer the trained model
- Turning black boxes … opaque?

Figure 1: Activations of an MLP neuron trained on modular addition (x0 + x1 mod p = y)

1.1 | MI Style Questions

- When does the model learn what?
- Are the learned mechanisms static?
- How are the mechanisms learned?
- How do we write a learnt algorithm in math, e.g. f(x) = sin(w_e x) + cos(w_e x)?

Figure 2: Mapping model parameters to math

2 | Grokking and Generalization

- Grokking [1] is generalization after overfitting
- Mechanistic interpretability needs a mechanism
- The model's parameters move from archive to algorithm
- Figure 3 shows example train and evaluation curves

Figure 3: Example of grokking

3 | Modular Arithmetic

- In the following, assume p and q are prime
- The seminal MI work [2] uses Eq. 1.1 as its task
- We created the strictly harder task in Eq. 1.2
- Eq. 1.2 is multi-task and non-commutative

y = (x0 + x1) mod p                 (1.1)
y = (x0 + x1·p) mod q,   q < p      (1.2)

- Figure 4 shows a visualization of a subset of the data
- On top we see all (x0, x1) pairs for p = 7
- Below: (x0 + x1·p) mod q for p = 13, q = 11

Figure 4: Visualizing X for p = 7 (top) and Y for q = 11, p = 13 (bottom)

4 | Grokking on 𝒯miiii

- The model groks on 𝒯miiii (Figure 5)
- The final hyper-parameters are listed in Table 2
- GrokFast [3] posits that the gradient series is made of:
  1. a fast-varying overfitting component
  2. a slow-varying generalizing component
- Grokking is sped up¹ by boosting the latter
¹ Our model did not converge without GrokFast

Figure 5: Training (top) and validation (bottom) accuracy during training on 𝒯miiii

5 | Embeddings

- The positional embeddings in Figure 6 show commutativity
- The correlation is 0.95 for 𝒯nanda and 0.64 for 𝒯miiii
- This is assumed to fully account for commutativity

Figure 6: Positional embeddings for 𝒯nanda (top) and 𝒯miiii (bottom)

- For 𝒯nanda, the token embeddings are a linear combination of 5 frequencies
- For 𝒯miiii, more frequencies indicate a larger table
- Each task focuses on a unique prime (no overlap)
- As per Figure 7, the embeddings of 𝒯miiii are saturated
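The two tasks from Eq. 1.1 and Eq. 1.2 can be sketched as plain functions (the names t_nanda and t_miiii are illustrative labels for the two tasks, not code from the project):

```python
def t_nanda(x0: int, x1: int, p: int) -> int:
    """Seminal task (Eq. 1.1): commutative modular addition."""
    return (x0 + x1) % p

def t_miiii(x0: int, x1: int, p: int, q: int) -> int:
    """Harder task (Eq. 1.2): non-commutative, since x1 is scaled by p."""
    assert q < p, "Eq. 1.2 requires q < p"
    return (x0 + x1 * p) % q

# Swapping the arguments changes the answer in Eq. 1.2 but not in Eq. 1.1.
print(t_nanda(3, 5, 7), t_nanda(5, 3, 7))            # → 1 1
print(t_miiii(3, 5, 13, 11), t_miiii(5, 3, 13, 11))  # → 2 0
```

The final two lines make the non-commutativity claim concrete: (3 + 5·13) mod 11 = 2 while (5 + 3·13) mod 11 = 0.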
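The GrokFast idea of boosting the slow-varying gradient component can be sketched as an exponential-moving-average low-pass filter over per-parameter gradients. This is a minimal sketch of the EMA variant, not the project's training loop; the hyper-parameter values are illustrative:

```python
import numpy as np

def grokfast_ema(grads: dict, ema: dict, alpha: float = 0.98, lamb: float = 2.0):
    """One GrokFast-style EMA filtering step (values of alpha/lamb illustrative).

    ema tracks the slow-varying (generalizing) gradient component;
    the returned gradients amplify that component by a factor lamb.
    """
    new_ema = {k: alpha * ema[k] + (1 - alpha) * g for k, g in grads.items()}
    filtered = {k: g + lamb * new_ema[k] for k, g in grads.items()}
    return filtered, new_ema

# Usage: initialize the EMA at zero and filter each step's gradients
# before handing them to the optimizer.
grads = {"w": np.array([1.0])}
state = {"w": np.zeros(1)}
filtered, state = grokfast_ema(grads, state)
```

Because the fast-varying overfitting component averages out in the EMA while the slow generalizing component accumulates, adding lamb times the EMA preferentially boosts the latter, which is the mechanism the slide credits with speeding up grokking.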
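The two embedding diagnostics above can be sketched in a few lines: a correlation between the two positional-embedding vectors (the commutativity proxy), and a DFT over the token axis to find the dominant frequencies. The matrices here are random stand-ins; a real run would load the trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 13, 32  # illustrative sizes; a trained model supplies its own p and d

W_E = rng.normal(size=(p, d))   # token embeddings, one row per token (stand-in)
pos = rng.normal(size=(2, d))   # positional embeddings for the x0 and x1 slots

# Commutativity proxy: Pearson correlation between the two position vectors
# (the talk reports 0.95 for t_nanda and 0.64 for t_miiii).
corr = np.corrcoef(pos[0], pos[1])[0, 1]

# Frequency content: DFT over the token axis. Large norms mark the sinusoidal
# frequencies the embedding is built from — about 5 dominant ones for t_nanda,
# more (a "saturated" spectrum) for t_miiii, per Figure 7.
F = np.fft.fft(W_E, axis=0)
freq_norms = np.linalg.norm(F, axis=1)[: p // 2 + 1]
top5 = np.argsort(freq_norms)[::-1][:5]
```

On random stand-ins the spectrum is flat and the correlation near zero; the interesting structure only appears once the model has grokked.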