I think it’s too easy for someone to skim this entire post and still completely miss the headline “this is strong empirical evidence that mesa-optimizers are real in practice”.
I’ll construct two scenarios in which EA mesa-objectives would likely conflict with reality; conditional on either, I expect the EA community to learn deceptive alignment with >50% probability:
Moral realism is correct, but the correct theory of ethics is non-utilitarian. Specifically, moral realism is the claim that there are mind-independent moral facts, just as physical reality is mind-independent: there is a fact of the matter about morality.
Bluntly, EA is a numbers movement, and only utilitarianism endorses using numbers. So if deontology or virtue ethics were right, I do not expect EA to become aligned to it; I expect it to become deceptively aligned instead.
Moral anti-realism is correct: there is no fact of the matter about which morality is correct, and everything is subjective. When people disagree about values, each side is right in its own view, and that’s it. There is no moral reality here.
Again, I expect failure to transmit that fact to the public. Admittedly, in this case EA doesn’t need to justify its values, and neither does anybody else, but I do expect EA to put up a front of objective truth even if it has none.
I think the answer is yes, primarily because I think this is an effective strategy for getting much of anything done in the real world.
Here’s a link:
https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent
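The paper’s core claim is that a linear self-attention layer can implement one step of gradient descent on an in-context linear-regression loss. A minimal numpy sketch of that equivalence, under standard assumptions (zero-initialized weights, squared loss; all variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 4, 8, 0.1
X = rng.normal(size=(n, d))        # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets y_i
x_q = rng.normal(size=d)           # query input

# One gradient step on L(W) = 0.5 * sum_i (W @ x_i - y_i)^2, starting from W = 0
W = np.zeros(d)
grad = (X @ W - y) @ X             # dL/dW
W_gd = W - eta * grad
pred_gd = W_gd @ x_q               # prediction after one GD step

# Linear attention over the context: values y_i, keys x_i, query x_q
pred_attn = eta * np.sum(y * (X @ x_q))

# The two predictions coincide exactly
assert np.allclose(pred_gd, pred_attn)
```

With W initialized at zero, the post-GD prediction reduces to eta * sum_i y_i (x_i . x_q), which is exactly an (unnormalized) linear-attention readout over the context — the sense in which the forward pass "is" a step of gradient descent, i.e. a learned optimizer sitting inside the model.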