Thanks for this post! I think it’s especially helpful to tease apart the theoretical possibility from the actual relationships (such as correlation) between goals and intelligence in different situations. As you say, the orthogonality thesis has almost no implications for the actual relations between them.
What’s important here is the actual probability of alignment. I think it would be very valuable to have at least a rough baseline/default value, since that is the natural starting point for prediction and forecasting. I’d love to know whether any work has been done on this (even just order-of-magnitude estimates) or whether someone can give it mathematical rigor. Most importantly, I think we need a prior over the size and dimensionality of the state space of possible values/goals. Combined with a randomly placed starting point for an AI in that space, this would let us calculate a baseline value, which we then update based on arguments like the ones you mention as well as on negative ones. If we get more elaborate in our modeling, we might also include an uninformative/neutral prior over the dynamics of that state space as a function of other variables such as intelligence, which would then be equally subject to updates from arguments and evidence.
Considering the fragility of human values, i.e. the King Midas problem (in brief: the difficulty of specifying human values completely and correctly, while they likely occupy only a tiny fraction of a huge state space of possible goals/values), I expect this baseline to be extremely low.
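To make that intuition concrete, here is a minimal toy sketch (entirely my own assumptions, not anything from the post: a unit-hypercube goal space, a uniform prior over an AI’s starting goals, and “aligned” meaning every value dimension lands within some tolerance eps of an assumed human target point) showing how such a baseline collapses as the dimensionality of the value space grows:

```python
import numpy as np

# Purely illustrative toy model: goal space = unit hypercube [0, 1]^d,
# uniform prior over starting goals, "aligned" = within eps of the target
# on every dimension. The baseline probability is then (2 * eps)^d.

rng = np.random.default_rng(0)

def baseline_estimate(d, eps, n_samples=1_000_000):
    target = np.full(d, 0.5)  # assumed human values (arbitrary placement)
    samples = rng.uniform(0.0, 1.0, size=(n_samples, d))
    aligned = np.all(np.abs(samples - target) <= eps, axis=1)
    return aligned.mean()

for d in (2, 5, 10):
    analytic = (2 * 0.1) ** d
    # At d = 10 the Monte Carlo estimate typically already comes out as zero
    # with a million samples, which is rather the point.
    print(f"d={d:2d}  Monte Carlo ~= {baseline_estimate(d, 0.1):.2e}  analytic = {analytic:.2e}")
```

Even with a generous 10% tolerance per dimension, the toy baseline is already around 1e-7 at ten dimensions; for anything like the real dimensionality of human values it would be astronomically smaller, which is the intuition behind my expectation above.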
Turning to your points on updating from there (reasons to expect correlation): while they are interesting, I feel very uncertain about each one of them. At the same time, instrumental convergence, inner alignment, Goodhart’s Law, and others seem (to different degrees) to be strong reasons to update in the opposite direction. I’d add the empirical evidence so far: smart people have not yet succeeded in finding satisfactory solutions to the problem. Of course, you’re right that 1. is probably the strongest reason to expect correlation. Yet, since the failure modes I’m most worried about involve accidental mistakes rather than bad intent, I’m not sure how much to update based on it. Overall I’m vastly uncertain about this point, since assessing AI developers’ impact seems to require predicting how successful we will be in both finding and implementing acceptable solutions to the alignment problem.
I feel the reasons you mention to expect correlation should be further caveated in light of a point Stuart Russell makes frequently: even if in practice there is some convergence towards human values on a set of important variables, to some approximation, it seems unlikely that this covers all the variables we care about, and we should expect the remaining unconstrained variables (about which we might still care) to be set to extreme values, since that is what optimizers tend to do.
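As a tiny illustration of that tendency (again my own toy setup, not from the post or from Russell): optimize a proxy objective over two variables, where one variable we care about is left out of the proxy except for a negligible instrumental coupling; the optimizer pins it to an extreme:

```python
import numpy as np

# Toy proxy optimization: variable a is what the proxy rewards directly;
# variable b is something we care about but left out of the proxy, except
# for a tiny instrumental benefit. The optimizer drives b to its bound.

a_vals = np.linspace(0.0, 1.0, 201)   # specified variable
b_vals = np.linspace(0.0, 1.0, 201)   # unspecified variable we still care about

A, B = np.meshgrid(a_vals, b_vals, indexing="ij")
proxy = -(A - 0.7) ** 2 + 0.001 * B   # proxy barely depends on b at all

i, j = np.unravel_index(np.argmax(proxy), proxy.shape)
print("chosen a:", a_vals[i])         # ~0.7, roughly as intended
print("chosen b:", b_vals[j])         # 1.0, an extreme value for the unconstrained variable
```

The same pattern is what makes me cautious about convergence claims: agreement on the measured variables says little about what happens to everything left out of the measurement.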
One remark on “convergence” in the terminology section. You write:

“lower intelligence could be more predictive [of motivations], but that thesis doesn’t seem very relevant whether or not it’s true.”
I disagree; I think you focus on the wrong case there (benevolent ostriches). If we knew lower intelligence to be more (and higher intelligence to be less) predictive of goals, that would be a substantial update to our baseline for how difficult alignment is for an AGI: it would make alignment harder, and progressively more so as we develop more intelligent systems. That would be important information.
Finally, like some previous commenters, I also don’t have the impression that the orthogonality thesis is (typically seen as) a standalone argument for the importance of AIS.