In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I’d reference here would be “How likely is deceptive alignment?” for the practical question regarding concrete neural net inductive biases and “The Solomonoff Prior is Malign” for the purely theoretical question concerning the actual simplicity prior.
So, I definitely don’t have the Solomonoff prior in mind when I talk about simplicity. I’m actively doing research at the moment to better characterize the sense in which neural nets are biased toward “simple” functions, but I would be shocked if it has anything to do with Kolmogorov complexity.
Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it; in general I find the prior not to be malign outside of specific situations.
This is basically because the malignancy argument relies on the IMO dubious assumption that the halting oracle can only be used once. Notably, once we use the halting/Solomonoff oracle more than once, it loses its malign properties.
More generally, if the Solomonoff oracle is duplicable, as modern AIs generally are, then there’s a known way to mitigate the malignancy of the Solomonoff prior: duplicate it, and let multiple people run the Solomonoff inductor in parallel. The goal is essentially to remove the uniqueness of any one Solomonoff inductor by making arbitrarily many such oracles, which drives up the complexity of manipulation.
So under a weak assumption, the malignancy of the Solomonoff prior goes away.
This is described well in the link below, and the important part is that the malignancy argument needs either a use-once condition or a uniqueness assumption of some kind. If neither assumption holds, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn’t malign.
https://www.lesswrong.com/posts/f7qcAS4DMKsMoxTmK/the-solomonoff-prior-is-malign-it-s-not-a-big-deal#Comparison_
And that’s if it’s actually malign, which it might not be, at least in the large-data limit:
https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign#fDEmEHEx5EuET4FBF
More specifically, it’s this part of John Wentworth’s comment:
In Solomonoff Model, Sufficiently Large Data Rules Out Malignness
There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:
A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.
… but in the large-data limit, SI’s guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.
Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
As for the actual practical question, SGD places a very important, and underappreciated, limitation on inner-misaligned agents: gradient hacking is very difficult to do, and SGD has powerful tools for removing inner-misaligned circuits/TMs/agents, as described in the link below:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
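To make the large-data guarantee in the quoted comment a bit more concrete, here is a minimal toy sketch. It is not Solomonoff induction: it assumes a finite Bayesian mixture over three hand-written hypotheses (the names `honest`, `treacherous`, and `fair_coin`, the prior weights, and the data source are all made up for illustration). It shows the standard mixture bound that the mixture's cumulative log loss exceeds that of any single hypothesis by at most log2(1/prior weight), so a hypothesis that starts out predicting well and later deviates to push its own predictions simply loses its posterior weight.

```python
import math
import random


def run_mixture(hypotheses, prior, bits):
    """Bayesian mixture prediction over a binary sequence.

    Each hypothesis maps the history (a list of bits) to P(next bit = 1).
    Returns (mixture log loss, per-hypothesis log losses, posterior weights),
    with losses measured in bits.
    """
    weights = list(prior)
    mix_loss = 0.0
    hyp_loss = [0.0] * len(hypotheses)
    for t, bit in enumerate(bits):
        history = bits[:t]
        # Probability each hypothesis assigns to the bit that actually occurred.
        probs = []
        for h in hypotheses:
            p_one = h(history)
            probs.append(p_one if bit == 1 else 1.0 - p_one)
        p_mix = sum(w * p for w, p in zip(weights, probs))
        mix_loss -= math.log2(p_mix)
        for i, p in enumerate(probs):
            hyp_loss[i] -= math.log2(p)
        # Bayes update: hypotheses that predicted badly lose weight.
        weights = [w * p / p_mix for w, p in zip(weights, probs)]
    return mix_loss, hyp_loss, weights


# Hypothetical hypotheses, for illustration only.
def honest(history):
    return 0.8  # matches the assumed data source below


def treacherous(history):
    # Predicts well for 50 steps, then deviates to push its own predictions.
    return 0.8 if len(history) < 50 else 0.2


def fair_coin(history):
    return 0.5


def demo(n_bits=300, seed=0):
    rng = random.Random(seed)
    bits = [1 if rng.random() < 0.8 else 0 for _ in range(n_bits)]
    prior = [0.1, 0.6, 0.3]  # assumption: the treacherous hypothesis starts favored
    mix_loss, hyp_loss, weights = run_mixture(
        [honest, treacherous, fair_coin], prior, bits
    )
    regret = mix_loss - hyp_loss[0]    # extra loss relative to the honest hypothesis
    bound = math.log2(1.0 / prior[0])  # mixture bound, independent of n_bits
    return regret, bound, weights


if __name__ == "__main__":
    regret, bound, weights = demo()
    print(f"regret vs honest: {regret:.3f} bits (bound {bound:.3f} bits)")
    print("posterior weights:", weights)
```

The regret bound does not depend on the sequence length, which is the toy analogue of "they cannot use any degrees of freedom for influence without losing all their influence in the large-data limit". In the real Solomonoff setting, the counterpart of log2(1/prior weight) is roughly the uncomputable constant the comment mentions, so this sketch says nothing about how much data is enough.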