With regard to DeepSeek, it seems to me that the success of mixture-of-experts (MoE) could be considered an update in favor of methods like gradient routing. If you could reliably localize specific kinds of knowledge to specific experts, you could dynamically toggle off / ablate the experts holding unnecessary dangerous knowledge. E.g. toggle off the experts knowledgeable about human psychology so the AI doesn't manipulate you.
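To make the toggling concrete, here's a minimal PyTorch sketch of a toy top-k MoE layer with an ablation switch that masks selected experts out of the router. Everything here (the class name, sizes, and the mask-logits-to-`-inf` trick) is my own illustrative choice, not DeepSeek's actual architecture or the gradient routing paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AblatableMoE(nn.Module):
    """Toy top-k MoE layer whose experts can be switched off at inference time.
    Purely illustrative: real routers (including DeepSeek's) differ in detail."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # True = expert is available, False = ablated.
        self.register_buffer("active", torch.ones(n_experts, dtype=torch.bool))

    def ablate(self, expert_ids):
        """Disable experts, e.g. those trained to hold 'dangerous' knowledge."""
        for i in expert_ids:
            self.active[i] = False

    def forward(self, x):  # x: (batch, d_model)
        logits = self.router(x)
        # Ablated experts get -inf scores, so the router can never pick them.
        logits = logits.masked_fill(~self.active, float("-inf"))
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

# Usage: suppose "human psychology" knowledge has been localized to experts
# 3 and 5 during training (however that is achieved); drop them at deployment.
moe = AblatableMoE()
moe.ablate([3, 5])
y = moe(torch.randn(16, 64))
```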
I like this approach because, if you get it working well, it's a general tool that could help address a lot of different catastrophe stories in a way that seems pretty robust. E.g. to prevent a malicious AI from gaining root access to its datacenter, ablate knowledge of the OS it's running on. To hinder sophisticated collusion between AIs that are supposed to be monitoring one another, ablate knowledge of game theory. Etc. (The broader point is that unlearning seems very generally useful. But the "Expand-Route-Ablate" style approach from the gradient routing paper strikes me as particularly promising, and seems like it could harmonize well with MoE; see the sketch below.)
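For reference, one common way to implement the "Route" step is gradient masking: the forward pass is unchanged, but a stop-gradient mask decides which units a given batch is allowed to update, so the targeted knowledge accumulates in a region you can later ablate. The sketch below is my own simplification under assumed toy sizes and a dummy loss; the function name and training-loop details are hypothetical, and the full Expand-Route-Ablate recipe in the paper has more moving parts.

```python
import torch

def route_gradients(h, unit_mask):
    """Forward pass is unchanged, but gradients only flow into the hidden
    units where unit_mask == 1; everywhere else the activation is detached.
    (A minimal sketch of gradient masking, not the paper's full ERA recipe.)"""
    return h * unit_mask + h.detach() * (1 - unit_mask)

# Hypothetical usage: on batches tagged as "dangerous topic", only the last
# 16 hidden units are allowed to learn, so that knowledge concentrates there
# and the block can later be ablated.
hidden = torch.randn(32, 64, requires_grad=True)
mask = torch.zeros(64)
mask[-16:] = 1.0

loss = route_gradients(hidden, mask).pow(2).mean()
loss.backward()

print(hidden.grad[:, :-16].abs().max())  # exactly 0: no learning outside the block
print(hidden.grad[:, -16:].abs().max())  # nonzero: the routed block still learns
```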
I think a good research goal would be to eventually replicate DeepSeek's work, except with highly interpretable experts. The idea is to produce a "high-assurance" model that can be ablated so that undesired behaviors, like deception, are virtually impossible to elicit via jailbreaks (since the weights that implement the behavior are simply inaccessible). I think the gradient routing paper is a good start. To achieve sufficient safety we'll need new methods that are more robust and easier to deploy, and those should probably be prototyped on toy problems first.