The basic reasoning is that SGD is an extremely powerful optimizer, and even the real-world imperfections of SGD that mesa-optimizers could exploit are detectable without much interpretability progress at all. Moreover, capabilities groups have an incentive to improve SGD, so we have good reason to expect these flaws to become less worrisome over time.
In particular, SGD is basically immune to acausal-trade or blackmail setups by mesa-optimizers.
Some choice quotes from Beren’s post below:
The key intuition is that gradient descent optimizes the entire causal graph leading to the loss (property 1) and leaves no slack (properties 2 and 3). Everything that can be optimized will be, and in the locally maximal way. If a parameter is contributing to the loss, it is optimized by its exact sensitivity to decrease the loss the most. If a circuit results in bad behaviour, every element of the circuit is optimized away independently. It is easy to construct gradient hacking examples in cases where these conditions do not hold. For instance, if the gradient hacker’s internal circuitry and external connections to other parts of the network are shielded from GD, then it is easy to construct a gradient hacker.
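To make the shielding condition concrete, here is a minimal PyTorch sketch (my own illustration, not code from Beren’s post): a parameter on a differentiable path to the loss receives a gradient, while a hypothetically “shielded” parameter, whose contribution is detached from the graph, receives none and so is never optimized.

```python
import torch

torch.manual_seed(0)
w_on_path = torch.randn(4, requires_grad=True)   # ordinary parameter on a differentiable path
w_shielded = torch.randn(4, requires_grad=True)  # hypothetical "shielded" parameter

x = torch.randn(4)
# The shielded parameter reaches the output only through a detached value,
# mimicking a circuit whose connection to the gradient flow has been cut.
out = (w_on_path * x).sum() + (w_shielded.detach() * x).sum()
loss = out ** 2
loss.backward()

print(w_on_path.grad)   # nonzero: gradient descent will optimize this parameter
print(w_shielded.grad)  # None: no gradient signal ever reaches it
```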
In general, if a gradient hacker H is on part of a differentiable path through the network, and tries to manipulate some other part of the network X to output some incorrect output, then gradient descent will simultaneously and independently do all of: a.) destroy the internal representations of the gradient hacker H to stop it doing this behaviour. b.) modify the input connections to H such that the gradient hacker becomes confused about how to do its negative behaviour. c.) modify the output connections from H to X to reduce the impact of its behaviour and causally cut off the gradient hacker, d.) modify the internal structure of X so that it is less affected by whatever H outputs, and e.) modify the outputs of X (i.e. to the loss) to reduce its influence, potentially trying to cut off the entire causal network path that the gradient hacker is on.
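As an editorial illustration of points a.) through e.) (again my sketch, not Beren’s code), the toy chain below shows that a single backward pass produces gradients simultaneously on H’s internal weights, the H-to-X connections, X’s internals, and X’s output connections to the loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = nn.Linear(8, 8)          # internal circuitry of a hypothetical gradient hacker H
H_to_X = nn.Linear(8, 8)     # output connections from H into X
X = nn.Linear(8, 8)          # the manipulated subnetwork X
X_to_loss = nn.Linear(8, 1)  # X's output connections to the loss

inp = torch.randn(8)
out = X_to_loss(torch.tanh(X(torch.tanh(H_to_X(torch.tanh(H(inp)))))))
loss = (out ** 2).sum()
loss.backward()

# Every stage of the causal path receives a gradient at the same time.
for name, module in [("H", H), ("H->X", H_to_X), ("X", X), ("X->loss", X_to_loss)]:
    print(name, module.weight.grad.norm().item())
```

All four gradient norms come out nonzero: every element of the path is adjusted independently, rather than through any single bottleneck the hacker could defend.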
This leads to an important point. Better optimization algorithms lead to both better capabilities and greater safety from mesaoptimizers. They do this because they reduce the slack available in the optimization process that a mesaoptimizer could exploit for its own purposes. For instance, gradient descent is very sensitive to the conditioning of the loss landscape around a parameter. A gradient hacker could exist for a very long time in an extremely flat region of poor conditioning before being optimized away, potentially persisting over an entire training run. However, this threat can be removed with better preconditioners or second order optimizers, which make gradient descent much less sensitive to local conditioning and so remove the slack caused by poor conditioning.
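The conditioning point can be seen in a toy experiment. The sketch below (my illustration; the 1e-4 curvature is an arbitrary choice) compares plain SGD with Adam, a crude adaptive preconditioner, on a quadratic loss with one well-conditioned and one very flat direction:

```python
import torch

curvatures = torch.tensor([1.0, 1e-4])  # well-conditioned vs. poorly conditioned direction

def run(opt_cls, **kwargs):
    p = torch.tensor([1.0, 1.0], requires_grad=True)
    opt = opt_cls([p], **kwargs)
    for _ in range(500):
        opt.zero_grad()
        loss = (curvatures * p ** 2).sum()
        loss.backward()
        opt.step()
    return p.detach()

print(run(torch.optim.SGD, lr=0.1))   # flat coordinate barely moves: slack persists
print(run(torch.optim.Adam, lr=0.1))  # both coordinates driven near zero
```

After the same number of steps, SGD has barely moved the poorly conditioned coordinate, which is exactly the slack a gradient hacker could hide in, while the preconditioned optimizer drives both coordinates toward zero.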
The full post is linked below: