Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers
Noah Syrkis
February 24, 2025
1 | Mechanistic Interpretability (MI)
2 | Modular Arithmetic
3 | Grokking on 𝒯_miiii
4 | Embeddings
5 | Neurons
6 | The 𝜔-Spike
Figure 1: ℕ < 𝑝²: multiples of 13 or 27 (left), 11 (mid.), or primes (right)
“This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm” — Neel Nanda¹

¹ Not verbatim, but the gist of it
1 | Mechanistic Interpretability (MI)
▶ The sub-symbolic nature of deep learning obscures model mechanisms
▶ There is no obvious mapping from the weights of a trained model to math notation
▶ MI is about reverse engineering these models by looking closely at them
▶ Many low-hanging fruits: the science is still in its practical-botany phase
▶ How does a given model work? How can we train it faster? Is it safe?
1 | Grokking
▶ Grokking [1] is “sudden generalization”
▶ MI (often) needs a mechanism to study
▶ Grokking is thus convenient for MI
▶ Lee et al. (2024) speed up grokking by boosting slow gradients as per Eq. 1
▶ For more, see Appendix A
ℎ(𝑡) = 𝛼 ℎ(𝑡−1) + (1−𝛼) 𝑔(𝑡)   (1.1)
𝑔̂(𝑡) = 𝑔(𝑡) + 𝜆 ℎ(𝑡)   (1.2)
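Eq. 1 amounts to a low-pass filter on the gradient: an exponential moving average ℎ tracks the slow gradient component, which is then amplified and added back. A minimal numpy sketch (function and variable names are mine; 𝛼 = 0.98 and 𝜆 = 2.0 are the defaults reported by Lee et al., not necessarily the values used here):

```python
import numpy as np

def grokfast_ema(grads, h_prev, alpha=0.98, lam=2.0):
    """One step of the GrokFast-EMA gradient filter (Eq. 1)."""
    h = alpha * h_prev + (1.0 - alpha) * grads  # Eq. 1.1: EMA of the gradient
    g_hat = grads + lam * h                     # Eq. 1.2: boost the slow component
    return g_hat, h

# toy usage: a persistent (slow) gradient direction gets amplified over steps
h = np.zeros(3)
g = np.ones(3)
for _ in range(10):
    g_hat, h = grokfast_ema(g, h)
```

The optimizer then consumes 𝑔̂ in place of 𝑔; transient gradient directions average out in ℎ, while directions that persist across steps are boosted.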
1 | Visualizing
▶ MI needs creativity … but there are tricks:
▶ For two-token samples, plot them varying one token on each axis (Figure 2)
▶ When a matrix is periodic, use the Fourier transform
▶ Singular value decomposition (Appendix C)
▶ Takeaway: get comfy with esch-plots
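The Fourier and SVD tricks above can be sketched on a toy matrix. Here each column of a stand-in embedding matrix is periodic in the token index (the frequencies and names are illustrative, not the learned ones):

```python
import numpy as np

# toy stand-in for a learned embedding matrix: each column is periodic
# in the token index with a known frequency omega
p = 113
omegas = [3, 7, 19]
tokens = np.arange(p)
W = np.stack([np.cos(2 * np.pi * w * tokens / p) for w in omegas], axis=1)

# trick 1: FFT along the token axis; periodicity shows up as sharp peaks
spectrum = np.abs(np.fft.rfft(W, axis=0))
peaks = spectrum.argmax(axis=0)  # dominant frequency per column → [3, 7, 19]

# trick 2: SVD; the top singular vectors span the periodic structure
U, S, Vt = np.linalg.svd(W, full_matrices=False)
```

On a real weight matrix the peaks are less clean, but the same two maps (token space → frequency space, and matrix → top singular vectors) are what Figure 2 plots.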
Figure 2: Top singular vectors 𝐔 of 𝑊_E for 𝒯_nanda (top); varying 𝑥₀ and 𝑥₁ in sample (left) and freq. (right) space in 𝑊_out for 𝒯_miiii
Figure 3: Shameless plug: visit github.com/syrkis/esch for more esch plots
2 | Modular Arithmetic
▶ “Seminal” MI paper by Nanda et al. (2023) focuses on modular addition (Eq. 2)
▶ Their final setup trains on 𝑝 = 113
▶ They train a one-layer transformer
▶ We call their task 𝒯_nanda
▶ And ours, seen in Eq. 3, we call 𝒯_miiii
(𝑥₀ + 𝑥₁) mod 𝑝, ∀𝑥₀, 𝑥₁   (2)
(𝑥₀𝑝⁰ + 𝑥₁𝑝¹) mod 𝑞, ∀𝑞 < 𝑝   (3)
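The two label sets can be built in a few lines. A sketch, assuming (per the next slide) that 𝑞 ranges over the primes up to 109; the helper names are mine:

```python
import numpy as np

p = 113  # prime used by Nanda et al.

x0, x1 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")

# Eq. 2: T_nanda, a single modular-addition task over all (x0, x1) pairs
y_nanda = (x0 + x1) % p

# Eq. 3: T_miiii, one task per prime modulus q < p
def primes_below(n):
    return [q for q in range(2, n) if all(q % d for d in range(2, int(q**0.5) + 1))]

qs = primes_below(p)
y_miiii = {q: (x0 * p**0 + x1 * p**1) % q for q in qs}
```

Note the asymmetry in Eq. 3: 𝑥₁ is weighted by 𝑝¹, so swapping the two tokens generally changes the label, which is exactly the non-commutativity discussed next.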
2 | Modular Arithmetic
▶ 𝒯_miiii is non-commutative …
▶ … and multi-task: 𝑞 ranges from 2 to 109¹
▶ 𝒯_nanda uses a single-layer transformer
▶ Note that these tasks are synthetic and trivial to solve with conventional programming
▶ They are used in the MI literature to make black boxes transparent
¹ The largest prime less than 𝑝 = 113
3 | Grokking on 𝒯_miiii
▶ The model groks on 𝒯_miiii (Figure 4)
▶ GrokFast [2] was needed to stay within the compute budget
▶ Final hyperparams are seen in Table 1
rate: 1/10   𝜆: 1/2   wd: 1/3   𝑑: 256   lr: 3·10⁻⁴   heads: 4

Table 1: Hyperparams for 𝒯_miiii
Figure 4: Training (top) and validation (bottom) accuracy during training on 𝒯_miiii
4 | Embeddings
▶ The position embeddings of Figure 5 reflect that 𝒯_nanda is commutative and 𝒯_miiii is not
▶ Maybe: this corrects for the non-commutativity of 𝒯_miiii?
▶ The correlation is 0.95 for 𝒯_nanda and −0.64 for 𝒯_miiii
Figure 5: Positional embeddings for 𝒯_nanda (top) and 𝒯_miiii (bottom).
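The reported correlations are plain Pearson correlations between the two learned position-embedding vectors. A sketch with hypothetical stand-in embeddings (the real values come from the trained models; everything below is illustrative):

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation between two flattened embedding vectors."""
    u, v = u.ravel() - u.mean(), v.ravel() - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical stand-ins for the embeddings of positions 0 and 1:
# two nearly identical vectors, mimicking the commutative T_nanda case
rng = np.random.default_rng(0)
base = rng.normal(size=256)
pos0 = base + 0.1 * rng.normal(size=256)
pos1 = base + 0.1 * rng.normal(size=256)
r = pearson(pos0, pos1)  # near 1 for nearly identical embeddings
```

A correlation near 1 means the model treats the two positions interchangeably (fine when the task is commutative); the negative value for 𝒯_miiii indicates the two positions are encoded almost oppositely.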
4 | Embeddings
▶ For 𝒯_nanda, token embeddings are essentially linear combinations of 5 frequencies (𝜔)
▶ For 𝒯_miiii, more frequencies are in play
▶ Each 𝒯_miiii subtask targets a unique prime
▶ Possibility: one basis per prime task
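One way to make the “5 frequencies” claim concrete is to FFT the token-embedding matrix along the token axis and count the bins that carry most of the spectral mass. A sketch on a synthetic 𝑊_E built from exactly five frequencies (the 0.5 cutoff and all names are my illustrative choices):

```python
import numpy as np

def active_frequencies(W_E, thresh=0.5):
    """Count Fourier bins whose norm exceeds thresh * the largest bin norm.

    W_E: (p, d) token-embedding matrix; the FFT runs over the token axis.
    """
    spec = np.linalg.norm(np.fft.rfft(W_E, axis=0), axis=1)
    spec[0] = 0.0  # drop the DC bin (the mean embedding)
    return int((spec > thresh * spec.max()).sum())

# synthetic embeddings: a linear combination of exactly 5 frequencies,
# with a random phase per embedding dimension
p, d = 113, 128
rng = np.random.default_rng(0)
t = np.arange(p)[:, None]
W = sum(np.cos(2 * np.pi * w * t / p + rng.normal(size=d))
        for w in (5, 11, 23, 31, 47))
n = active_frequencies(W)  # → 5
```

Applied to the real 𝑊_E, this count should come out near 5 for 𝒯_nanda and noticeably higher for 𝒯_miiii, consistent with one frequency basis per prime subtask.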