Mechanistic Interpretability on (multi-task) Irreducible Integer Identifiers
Noah Syrkis
January 23, 2025
1 | Mechanistic Interpretability (MI)
2 | Modular Arithmetic
3 | Grokking on 𝒯_miiii
4 | Embeddings
5 | Neurons
6 | The 𝜔-Spike
Figure 1: ℕ<𝑝²; multiples of 13 or 27 (left), 11 (mid.), or primes (right)
“This disgusting pile of matrices is actually just an astoundingly poorly written, elegant and concise algorithm” — Neel Nanda¹
¹ Not verbatim, but the gist of it
1 | Mechanistic Interpretability (MI)
▶ Sub-symbolic nature of deep learning obscures model mechanisms
▶ No obvious mapping from the weights of a trained model to math notation
▶ MI is about reverse engineering these models, and looking closely at them
▶ Many low-hanging fruits / practical botany phase of the science
▶ How does a given model work? How can we train it faster? Is it safe?
1 | Grokking
▶ Grokking [1] is “sudden generalization”
▶ MI (often) needs a mechanism
▶ Grokking is thus convenient for MI
▶ Lee et al. (2024) speeds up grokking by boosting slow gradients as per Eq. 1 (sketched below)
▶ For more see Appendix A
ℎ(𝑡) = ℎ(𝑡−1)𝛼 + 𝑔(𝑡)(1 − 𝛼)   (1.1)
𝑔̂(𝑡) = 𝑔(𝑡) + 𝜆ℎ(𝑡)   (1.2)
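A minimal JAX sketch of the Eq. 1 filter: an exponential moving average of the gradients (Eq. 1.1), added back as a boost (Eq. 1.2). The values of 𝛼 and 𝜆 here are illustrative, not taken from the paper.

```python
import jax
import jax.numpy as jnp

def grokfast(grads, h_prev, alpha=0.98, lamb=2.0):
    # Eq. 1.1: h(t) = h(t-1) * alpha + g(t) * (1 - alpha)
    h = jax.tree_util.tree_map(lambda h, g: h * alpha + g * (1 - alpha), h_prev, grads)
    # Eq. 1.2: g_hat(t) = g(t) + lambda * h(t)
    g_hat = jax.tree_util.tree_map(lambda g, h: g + lamb * h, grads, h)
    return g_hat, h  # feed g_hat to the optimizer; carry h to the next step

# Before the first step: h = jax.tree_util.tree_map(jnp.zeros_like, grads)
```

Slow-moving gradient components accumulate in ℎ(𝑡), so directions that persist across steps are amplified by 𝜆.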
1 | Visualizing
▶ MI needs creativity … but there are tricks:
▶ For two-token samples, plot them varying one on each axis (Figure 2; see the sketch after this list)
▶ When a matrix is periodic, use Fourier
▶ Singular value decomp. (Appendix C)
▶ Takeaway: get comfy with esch-plots
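A sketch of the two-axis trick, with a toy periodic function standing in for the trained network (in the real setting `apply` would be the model's forward pass, and `p` the task's prime):

```python
import jax.numpy as jnp

p = 113  # illustrative prime

def apply(x0, x1):
    # Toy stand-in for the trained model's output on the pair (x0, x1)
    return jnp.cos(2 * jnp.pi * 7 * (x0 + x1) / p)

# Vary x0 along one axis and x1 along the other
x0, x1 = jnp.meshgrid(jnp.arange(p), jnp.arange(p), indexing="ij")
grid = apply(x0, x1)  # shape (p, p)

# Periodicity shows up as a few bright cells in frequency space
freq = jnp.abs(jnp.fft.fft2(grid))
```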
Figure 2: Top singular vectors 𝐔 of 𝑊_E for 𝒯_nanda (top); varying 𝑥₀ and 𝑥₁ in sample (left) and freq. (right) space in 𝑊_out for 𝒯_miiii
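A minimal sketch of the SVD trick behind Figure 2 (top), with a random matrix standing in for the trained embedding 𝑊_E; the shapes are illustrative. On a trained model, periodic singular vectors make `spectrum` spike at a few key frequencies.

```python
import jax
import jax.numpy as jnp

p, d = 113, 128  # illustrative vocab and embedding sizes
W_E = jax.random.normal(jax.random.PRNGKey(0), (p, d))  # stand-in for trained weights

# Top left-singular vectors: plot each column against the token index 0..p-1
U, S, Vt = jnp.linalg.svd(W_E, full_matrices=False)
top = U[:, :4]

# Periodic singular vectors concentrate mass on a few frequencies
spectrum = jnp.abs(jnp.fft.rfft(top, axis=0))
```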