Noah Syrkis
noah[at]syrkis.com
MIIII
January 8, 2026
Copenhagen, Denmark
Talk on Mechanistic Interpretability of Deep Learning Models presented at the University of Copenhagen
1 | Mechanistic Interpretability (MI)
2 | Grokking and Generalization
3 | Modular Arithmetic
4 | Grokking on 𝒯_miiii
5 | Embeddings
6 | Neurons / the Ω-spike
“This disgusting pile of matrices is just a poorly written elegant algorithm”¹ — Neel Nanda
¹ Not verbatim, but the gist of it
1 | Mechanistic Interpretability (MI)
▶ Deep learning (DL) is sub-symbolic
▶ No clear map from params to math notation
▶ MI is about finding that map
▶ Step 1: train on a task. Step 2: reverse engineer
▶ Turning black boxes … opaque?
Figure 1: Activations of an MLP neuron trained on modular addition ((x₀ + x₁) mod p = y)
1.1 | MI Style Questions
▶ When does the model learn what?
▶ Are the learned mechanisms static?
▶ How are the mechanisms learned?
▶ How to write a learnt algo in math?
↓
f(x) = sin(wₑx) + cos(wₑx)
Figure 2: Mapping model params to math
2 | Grokking and Generalization
▶ Grokking [1] is generalization after overfitting
▶ Mech. interpretability needs a mechanism
▶ Model params move from archive to algorithm
▶ Figure 3 shows an example of train and eval curves
Figure 3: Example of grokking
3 | Modular Arithmetic
▶ In the following, assume p and q are prime
▶ Seminal MI work [2] uses Eq. 1.1 as its task
▶ We created the strictly harder Eq. 1.2 task
▶ Eq. 1.2 is multitask and non-commutative

y = (x₀ + x₁) mod p   (1.1)
y⃗ = (x₀ + x₁p) mod q, ∀ q < p   (1.2)
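The two tasks above can be sketched in a few lines of NumPy. The function names `nanda_task` and `miiii_task` are illustrative, not from the original codebase:

```python
import numpy as np

def nanda_task(p):
    """All (x0, x1) pairs with y = (x0 + x1) mod p  (Eq. 1.1)."""
    x0, x1 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    X = np.stack([x0.ravel(), x1.ravel()], axis=1)
    y = (X[:, 0] + X[:, 1]) % p
    return X, y

def primes_below(n):
    """Primes q < n, by trial division (fine for small n)."""
    return [q for q in range(2, n) if all(q % d for d in range(2, q))]

def miiii_task(p):
    """Vector target y_q = (x0 + x1 * p) mod q for every prime q < p  (Eq. 1.2)."""
    X, _ = nanda_task(p)
    qs = primes_below(p)
    Y = np.stack([(X[:, 0] + X[:, 1] * p) % q for q in qs], axis=1)
    return X, Y, qs
```

Note that because x₁ is multiplied by p, swapping x₀ and x₁ changes the answer, which is what makes Eq. 1.2 non-commutative.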
3 | Modular Arithmetic
▶ Figure 4 shows a visualization of a subset of the data
▶ On top we see all (x₀, x₁)-pairs for p = 7
▶ Below, (x₀ + x₁p) mod q for p = 13, q = 11
↓
Figure 4: Visualizing X for p = 7 (top) and Y for q = 11, p = 13 (bottom)
4 | Grokking on 𝒯_miiii
▶ The model groks on 𝒯_miiii (Figure 5)
▶ Final hyper-params are listed in Table 2
▶ GrokFast [3] posits the gradient series is made of:
1. A fast-varying overfitting component
2. A slow-varying generalizing component
▶ Grokking is sped up¹ by boosting the latter
Figure 5: Training (top) and validation (bottom) accuracy during training on 𝒯_miiii
¹ Our model did not converge without GrokFast
5 | Embeddings
▶ Positional embeddings in Figure 6 show commutativity
▶ Corr. is 0.95 for 𝒯_nanda and −0.64 for 𝒯_miiii
▶ Assumed to fully account for commutativity
Figure 6: Positional embeddings for 𝒯_nanda (top) and 𝒯_miiii (bottom)
5 | Embeddings
▶ For 𝒯_nanda, token embs are a linear comb of 5 freqs
▶ For 𝒯_miiii, more freqs indicate a larger table
▶ Each task focuses on a unique prime (no overlap)
▶ As per Figure 7, the embs of 𝒯_miiii are saturated
Figure 7: 𝒯_nanda (top) and 𝒯_miiii (bottom) token embeddings in Fourier basis
Conclusion: Embeddings alone account for commutativity and multitask(edness?)
6 | Neurons / the Ω-spike
▶ We plot neuron activations varying x₀ and x₁
▶ Activations are largely identical to those for 𝒯_nanda
Figure 8: Activations of first three neurons for 𝒯_nanda (top) and 𝒯_miiii (bottom)
6 | Neurons / the Ω-spike
▶ Some freqs ω rise to significance (ω > μ + 2σ)
▶ But how many? And at what points in time?
Figure 9: FFT of activations of first three neurons for 𝒯_nanda (top) and 𝒯_miiii (bottom)
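The ω > μ + 2σ significance criterion can be sketched as follows. This is a minimal NumPy version over one neuron's (x₀, x₁) activation grid; the actual analysis pipeline may differ:

```python
import numpy as np

def active_freqs(acts, n_std=2.0):
    """Flag frequencies whose FFT magnitude exceeds mu + n_std * sigma.

    acts: (p, p) grid of one neuron's activations over all (x0, x1) pairs.
    Returns a boolean (p, p) mask over 2D frequencies.
    """
    spectrum = np.abs(np.fft.fft2(acts))
    mags = spectrum.ravel()
    return spectrum > mags.mean() + n_std * mags.std()
```

For a neuron that is a pure cosine in x₀, the mask fires only at that frequency and its conjugate, matching the sparse spectra seen in Figure 9.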
Figure 10: Number of neurons with active freq ω (rows) through time (cols)
6 | Neurons / the Ω-spike
▶ Initial freqs coincide with solving 2, 3, 5 and 7
▶ Spike in active freqs during generalization
▶ Decrease in active freqs after generalization

epoch  256  1024  4096  16384  65536
freqs    0     0    10     18     10
Table 1: Number of active freqs ω through training
Figure 11: Figure 10 (top) and validation accuracy from Figure 5 (bottom)
6 | Neurons / the Ω-spike
▶ Previous work [2] shows the final circuitry begins developing right away (no sudden phase shift)
▶ GrokFast [3] targets this circuitry, assuming the associated gradient updates are slow-varying
▶ With the Ω-spike we observe temporarily useful structures (not part of the final solution)
▶ We propose modifying GrokFast to dynamically target temporarily useful circuitry
References
[1] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets,” arXiv:2201.02177, Jan. 2022. doi: 10.48550/arXiv.2201.02177.
[2] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, “Progress Measures for Grokking via Mechanistic Interpretability,” arXiv:2301.05217, Oct. 2023.
[3] J. Lee, B. G. Kang, K. Kim, and K. M. Lee, “Grokfast: Accelerated Grokking by Amplifying Slow Gradients,” arXiv:2405.20233, May 2024.
A | Hyperparameters

rate  λ    wd   d    lr      heads
1/10  1/2  1/3  256  3·10⁻⁴  4
Table 2: Hyperparams for 𝒯_miiii
B | Stochastic Signal Processing
We denote the weights of a model as θ, and the gradient of the loss with respect to θ at time t as g(t). As we train the model, g(t) varies, going up and down; it can be thought of as a stochastic signal and represented in a Fourier basis. GrokFast posits that the slow-varying frequencies contribute to grokking. Higher frequencies are then muted, and grokking is indeed accelerated.
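A minimal sketch of this idea, assuming the exponential-moving-average variant of the filter (the names `grokfast_ema`, `alpha`, and `lam` are illustrative; see the GrokFast paper for the exact implementation):

```python
import numpy as np

def grokfast_ema(grads, ema, alpha=0.98, lam=2.0):
    """Boost the slow-varying gradient component (minimal sketch).

    grads, ema: dicts mapping parameter names to arrays.
    Returns (filtered grads, updated ema).
    """
    new_ema, out = {}, {}
    for k, g in grads.items():
        # EMA acts as a low-pass filter over the gradient signal
        new_ema[k] = alpha * ema.get(k, np.zeros_like(g)) + (1 - alpha) * g
        # amplify the low-frequency component before the optimizer step
        out[k] = g + lam * new_ema[k]
    return out, new_ema
```

The filtered gradients are then handed to the usual optimizer in place of the raw ones.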
C | Discrete Fourier Transform
A function can be expressed as a linear combination of cosine and sine waves. A similar thing can be done for discrete data / vectors.
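For instance, with NumPy's FFT, a pure cosine concentrates its DFT energy in two conjugate frequency bins, and the inverse transform recovers the original vector:

```python
import numpy as np

# A length-16 vector sampling a pure frequency-3 cosine wave.
x = np.cos(2 * np.pi * 3 * np.arange(16) / 16)

# The DFT gives the coefficients of the complex exponentials
# that sum back to the data; energy sits at bins 3 and 16 - 3 = 13.
coeffs = np.fft.fft(x)

# The inverse transform reconstructs the vector exactly.
recon = np.fft.ifft(coeffs).real
```

This is the basis used in Figures 7 and 9: embeddings and activations are re-expressed as such coefficients.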
D | Singular Value Decomposition
An m × n matrix M can be factored as M = UΣV*, where U is an m × m complex unitary matrix, Σ an m × n rectangular diagonal matrix (padded with zeros), and V an n × n complex unitary matrix. Multiplying by M can thus be viewed as first rotating in n-space with V*, then scaling by Σ, and then rotating in m-space with U.
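A quick NumPy check of these shapes (real case, so unitary reduces to orthogonal):

```python
import numpy as np

# A random 4x3 real matrix factored as M = U Σ Vᵀ (real case of U Σ V*).
rng = np.random.default_rng(0)
m, n = 4, 3
M = rng.normal(size=(m, n))
U, S, Vh = np.linalg.svd(M)    # U: (m, m), S: (min(m, n),), Vh: (n, n)

# Rebuild the rectangular diagonal Σ, padded with zeros.
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(S)
```

Multiplying `U @ Sigma @ Vh` reproduces M, and both `U` and `Vh` have orthonormal columns, matching the rotate–scale–rotate picture above.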