Noah Syrkis
noah[at]syrkis.com
MIIII
January 8, 2026
Copenhagen, Denmark
Talk on Mechanistic Interpretability of Deep Learning Models presented at the University of Copenhagen
1 | Mechanistic Interpretability (MI)
2 | Grokking and Generalization
3 | Modular Arithmetic
4 | Grokking on $\mathcal{T}_{\text{miiii}}$
5 | Embeddings
6 | Neurons / the $\Omega$-spike
“This disgusting pile of matrices is just a poorly written elegant algorithm” — Neel Nanda¹
¹ Not verbatim, but the gist of it
1 | Mechanistic Interpretability (MI)
▶ Deep learning (DL) is sub-symbolic
▶ No clear map from params to math notation
▶ MI is about finding that map
▶ Step 1: train on a task. Step 2: reverse-engineer it
▶ Turning black boxes … opaque?
Figure 1: Activations of an MLP neuron trained on modular addition ($x_0 + x_1 \bmod p = y$)
1.1 | MI-Style Questions
▶ When does the model learn what?
▶ Are the learned mechanisms static?
▶ How are the mechanisms learned?
▶ How do we write a learnt algorithm in math?
$f(x) = \sin(w_e x) + \cos(w_e x)$
Figure 2: Mapping model params to math
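The mapping in Figure 2 can be sanity-checked numerically. A minimal sketch, assuming a learned embedding column `emb` indexed by token value and a candidate frequency `w` (both names are mine): fit $a\sin(wx) + b\cos(wx)$ by least squares and report the variance explained.

```python
import numpy as np

def sinusoid_fit(emb, w):
    """How well is an embedding column explained by frequency w?"""
    x = np.arange(len(emb))
    A = np.stack([np.sin(w * x), np.cos(w * x)], axis=1)  # design matrix
    coef, *_ = np.linalg.lstsq(A, emb, rcond=None)        # least-squares fit
    r2 = 1 - np.sum((emb - A @ coef) ** 2) / np.sum((emb - emb.mean()) ** 2)
    return coef, r2
```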
2 | Grokking and Generalization
▶ Grokking [1] is generalization after overfitting
▶ Mech. interpretability needs a mechanism
▶ Model params move from archive to algorithm
▶ Figure 3 shows example train and eval curves
Figure 3: Example of grokking
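Where the curves in Figure 3 cross a threshold can be read off programmatically. A minimal sketch with hypothetical per-step accuracy arrays; the gap between the two returned indices is the grokking delay:

```python
import numpy as np

def grok_delay(train_acc, val_acc, thresh=0.99):
    """Steps at which train/val accuracy first reach thresh.

    Assumes both curves do reach thresh; np.argmax returns 0 otherwise.
    """
    fit = int(np.argmax(np.asarray(train_acc) >= thresh))   # memorization
    grok = int(np.argmax(np.asarray(val_acc) >= thresh))    # generalization
    return fit, grok
```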
3 | Modular Arithmetic
▶ In the following, assume $p$ and $q$ are prime
▶ Seminal MI work [2] uses Eq. 1.1 as its task
▶ We created the strictly harder task in Eq. 1.2
▶ Eq. 1.2 is multitask and non-commutative
$y = (x_0 + x_1) \bmod p$   (1.1)
$\vec{y} = (x_0 + x_1 p) \bmod q \quad \forall\, q < p$   (1.2)
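A minimal sketch of how the two tasks could be materialized, assuming the input set is all pairs $(x_0, x_1) \in \{0, \dots, p-1\}^2$; the function names are mine, not the project's API:

```python
import numpy as np

def nanda_task(p):
    """All (x0, x1) pairs with y = (x0 + x1) mod p  (Eq. 1.1)."""
    x0, x1 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    X = np.stack([x0.ravel(), x1.ravel()], axis=1)
    return X, (X[:, 0] + X[:, 1]) % p

def miiii_task(p):
    """Labels y_q = (x0 + x1 * p) mod q for every prime q < p  (Eq. 1.2)."""
    primes = [q for q in range(2, p) if all(q % d for d in range(2, q))]
    X, _ = nanda_task(p)
    n = X[:, 0] + X[:, 1] * p   # x1 acts as the high digit: non-commutative
    return X, np.stack([n % q for q in primes], axis=1), primes
```

Note how swapping $x_0$ and $x_1$ changes $n$, which is what breaks commutativity relative to Eq. 1.1.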
3 | Modular Arithmetic
▶ Figure 4 visualizes a subset of the data
▶ On top we see all $(x_0, x_1)$-pairs for $p = 7$
▶ Below, $(x_0 + x_1 p) \bmod q$ for $p = 13$, $q = 11$
Figure 4: Visualizing $X$ for $p = 7$ (top) and $Y$ for $q = 11$, $p = 13$ (bottom)
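The bottom panel of Figure 4 is just the multitask labels reshaped onto the $(x_0, x_1)$ grid. A sketch using the hypothetical `miiii_task` helper from above:

```python
import matplotlib.pyplot as plt

X, Y, primes = miiii_task(13)                  # primes q < 13: [2, 3, 5, 7, 11]
grid = Y[:, primes.index(11)].reshape(13, 13)  # labels mod 11 on the grid
plt.imshow(grid)                               # cf. Figure 4 (bottom)
plt.xlabel("$x_1$"); plt.ylabel("$x_0$")
plt.show()
```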
4 | Grokking on $\mathcal{T}_{\text{miiii}}$
▶ The model groks on $\mathcal{T}_{\text{miiii}}$ (Figure 5)
▶ Final hyper-params are listed in Table 2
▶ GrokFast [3] posits the gradient series is made of:
  1. A fast-varying overfitting component
  2. A slow-varying generalizing component
▶ Grokking is sped up¹ by boosting the latter (see the sketch below)
Figure 5: Training (top) and validation (bottom) accuracy during training on $\mathcal{T}_{\text{miiii}}$
¹ Our model did not converge without GrokFast
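A minimal sketch of the EMA variant of GrokFast [3]: keep an exponential moving average of past gradients (the slow component) and add a boosted copy of it to each new gradient. Gradients here are a plain dict of arrays; `alpha` and `lamb` mirror the paper's knobs, but the values below are illustrative, not our final hyper-params:

```python
def grokfast_ema(grads, ema, alpha=0.98, lamb=2.0):
    """Amplify the slow-varying (generalizing) gradient component."""
    ema = {k: alpha * ema[k] + (1 - alpha) * g for k, g in grads.items()}
    return {k: g + lamb * ema[k] for k, g in grads.items()}, ema
```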
5 | Embeddings
▶ Pos. embs in Figure 6 show commutativity
▶ Corr. is 0.95 for $\mathcal{T}_{\text{nanda}}$ and −0.64 for $\mathcal{T}_{\text{miiii}}$
▶ Assumed to fully account for commutativity
Figure 6: Positional embeddings for $\mathcal{T}_{\text{nanda}}$ (top) and $\mathcal{T}_{\text{miiii}}$ (bottom).
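The correlations above are plain Pearson correlations between the two learned position vectors. A sketch, assuming `pos_emb` is a `(2, d)` array holding the embeddings for positions $x_0$ and $x_1$ (name and layout are mine):

```python
import numpy as np

def pos_corr(pos_emb):
    """Near 1: positions encoded alike, so x0 + x1 is treated as commutative."""
    return np.corrcoef(pos_emb[0], pos_emb[1])[0, 1]
```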
5 | Embeddings
▶ For $\mathcal{T}_{\text{nanda}}$, token embs are a linear comb. of 5 freqs
▶ For $\mathcal{T}_{\text{miiii}}$, more freqs indicate a larger table
▶ Each task focuses on a unique prime (no overlap)
▶ As per Figure 7, the embs of $\mathcal{T}_{\text{miiii}}$ are saturated
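The frequency counts above come from looking at token embeddings in the Fourier basis over the token dimension. A minimal sketch, assuming `W_E` is the `(p, d)` token embedding matrix (name and layout are mine): compute the power per frequency and see how many frequencies carry most of the energy.

```python
import numpy as np

def fourier_spectrum(W_E):
    """Fraction of embedding energy per frequency over the token dimension."""
    F = np.fft.rfft(W_E, axis=0)             # DFT of each embedding column
    power = (np.abs(F) ** 2).sum(axis=1)     # total energy per frequency
    return power / power.sum()
```

A spectrum dominated by ~5 spikes matches the $\mathcal{T}_{\text{nanda}}$ finding; a denser spectrum would correspond to the saturation noted for $\mathcal{T}_{\text{miiii}}$.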