With regard to DeepSeek, it seems to me that the success of mixture-of-experts (MoE) could be considered an update in favor of methods like gradient routing. If you could reliably localize specific kinds of knowledge to specific experts, you could dynamically toggle off / ablate the experts holding unnecessary dangerous knowledge. E.g. toggle off the experts knowledgeable about human psychology so the AI doesn't manipulate you.
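To make the toggling concrete, here's a minimal PyTorch sketch of a toy top-k MoE layer with an ablation switch that masks selected experts out of the router. Everything here (the class name, sizes, and the mask-logits-to-`-inf` trick) is my own illustrative choice, not DeepSeek's actual architecture or the gradient routing paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AblatableMoE(nn.Module):
    """Toy top-k MoE layer whose experts can be switched off at inference time.
    Purely illustrative: real routers (including DeepSeek's) differ in detail."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # True = expert is available, False = ablated.
        self.register_buffer("active", torch.ones(n_experts, dtype=torch.bool))

    def ablate(self, expert_ids):
        """Disable experts, e.g. those trained to hold 'dangerous' knowledge."""
        for i in expert_ids:
            self.active[i] = False

    def forward(self, x):  # x: (batch, d_model)
        logits = self.router(x)
        # Ablated experts get -inf scores, so the router can never pick them.
        logits = logits.masked_fill(~self.active, float("-inf"))
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

# Usage: suppose "human psychology" knowledge has been localized to experts
# 3 and 5 during training (however that is achieved); drop them at deployment.
moe = AblatableMoE()
moe.ablate([3, 5])
y = moe(torch.randn(16, 64))
```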
I like this approach because, if you get it working well, it's a general tool that could help address a lot of different catastrophe stories in a way that seems pretty robust. E.g. to prevent a malicious AI from gaining root access to its datacenter, ablate knowledge of the OS it's running on. To hinder sophisticated collusion between AIs that are supposed to be monitoring one another, ablate knowledge of game theory. Etc. (The broader point is that unlearning seems very generally useful. But the "Expand-Route-Ablate" style approach from the gradient routing paper strikes me as particularly promising, and seems like it could harmonize well with MoE; see the sketch below.)
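For reference, one common way to implement the "Route" step is gradient masking: the forward pass is unchanged, but a stop-gradient mask decides which units a given batch is allowed to update, so the targeted knowledge accumulates in a region you can later ablate. The sketch below is my own simplification under assumed toy sizes and a dummy loss; the function name and training-loop details are hypothetical, and the full Expand-Route-Ablate recipe in the paper has more moving parts.

```python
import torch

def route_gradients(h, unit_mask):
    """Forward pass is unchanged, but gradients only flow into the hidden
    units where unit_mask == 1; everywhere else the activation is detached.
    (A minimal sketch of gradient masking, not the paper's full ERA recipe.)"""
    return h * unit_mask + h.detach() * (1 - unit_mask)

# Hypothetical usage: on batches tagged as "dangerous topic", only the last
# 16 hidden units are allowed to learn, so that knowledge concentrates there
# and the block can later be ablated.
hidden = torch.randn(32, 64, requires_grad=True)
mask = torch.zeros(64)
mask[-16:] = 1.0

loss = route_gradients(hidden, mask).pow(2).mean()
loss.backward()

print(hidden.grad[:, :-16].abs().max())  # exactly 0: no learning outside the block
print(hidden.grad[:, -16:].abs().max())  # nonzero: the routed block still learns
```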
I think a good research goal would be to eventually replicate DeepSeek's work, except with highly interpretable experts. The idea is to produce a "high-assurance" model that can be ablated so that undesired behaviors, like deception, are virtually impossible to elicit via jailbreaks (since the weights that implement the behavior are simply inaccessible). I think the gradient routing paper is a good start. To achieve sufficient safety we'll need new methods that are more robust and easier to deploy, and those should probably be prototyped on toy problems first.