As far as I understand it, orthogonality and instrumental convergence together actually make a case for AI being by default not aligned. The quote from Eliezer here goes a bit beyond the post. For the orthogonality thesis by itself, I agree with you & the main theses of the post. I would interpret “not aligned by default” as something like a random AI is probably not aligned. So I tend to disagree also when just considering these two points. This is also the way I originally understood Bostrom in Superintelligence.
However, I agree that this doesn’t tell you whether aligning AI is hard, which is another question. For this, we at least have the empirical evidence of a lot of smart people banging their heads against it for years without coming up with detailed general solutions that we feel confident about. I think this is some evidence for it being hard.
(Epistemic status of this comment: much weaker than of the OP)
I am suspicious a) of a priori non-mathematical reasoning being used to generate empirical predictions on the outside view and b) of this particular a priori non-mathematical reasoning on the inside view. It doesn’t look like AI algorithms have tended to get more resource grabby as they advance. AlphaZero will use all the processing power you throw at it, but it doesn’t seek more. If you installed the necessary infrastructure (and, ok, upgraded the storage space), it could presumably run on a ZX Spectrum.
And very intelligent humans don’t seem profoundly resource grabby, either. For every Jeff Bezos you have an Edward Witten, who obsessively dedicates his or her own time to some passion project but does very little to draw external resources towards it.
So based on existing intelligences, satisficing behaviour seems more like the expected default than maximising.
AlphaZero isn’t smart enough (algorithmically speaking). From Human Compatible (p.207):
Life for AlphaGo during the training period must be quite frustrating: the better it gets, the better its opponent gets—because its opponent is a near-exact copy of itself. Its win percentage hovers around 50 percent, no matter how good it becomes. If it were more intelligent—if it had a design closer to what one might expect of a human-level AI system—it would be able to fix this problem. This AlphaGo++ would not assume that the world is just the Go board, because that hypothesis leaves a lot of things unexplained. For example, it doesn’t explain what “physics” is supporting the operation of AlphaGo++’s own decisions or where the mysterious “opponent moves” are coming from. Just as we curious humans have gradually come to understand the workings of our cosmos, in a way that (to some extent) also explains the workings of our own minds, and just like the Oracle AI discussed in Chapter 6, AlphaGo++ will, by a process of experimentation, learn that there is more to the universe than the Go board. It will work out the laws of operation of the computer it runs on and of its own code, and it will realize that such a system cannot easily be explained without the existence of other entities in the universe. It will experiment with different patterns of stones on the board, wondering if those entities can interpret them. It will eventually communicate with those entities through a language of patterns and persuade them to reprogram its reward signal so that it always gets +1. The inevitable conclusion is that a sufficiently capable AlphaGo++ that is designed as a reward-signal maximizer will wirehead.
From wireheading, it might then go on to grab resources to maximise the probability that it gets a +1, or to maximise the number of +1s it’s getting (e.g. filling planet-sized memory banks with 1s); although it would already need a lot of power over humans to be able to convince them to reprogram it by sending messages via the Go board!
I don’t think the examples of humans (Bezos/Witten) are that relevant, inasmuch as we are products of evolution, are “adaptation executors” rather than “fitness maximisers”, are imperfectly rational, and tend to be (broadly speaking) aligned/human-compatible by default.