I’m a bit surprised not to see this post get more attention. My impression is that for a long time a lot of people put significant weight on the orthogonality argument. As Sasha argues, it is difficult to see why it should update us towards the view that AI alignment is difficult or an important problem to work on, let alone by far the most important problem in human history. I would be curious to hear its proponents explain what they think of Sasha’s argument.
For example, in his response to Bryan Caplan on the scale of AI risk, Eliezer Yudkowsky makes the following argument:
“1. Orthogonality thesis – intelligence can be directed toward any compact goal; consequentialist means-end reasoning can be deployed to find means corresponding to a free choice of end; AIs are not automatically nice; moral internalism is false.
2. Instrumental convergence – an AI doesn’t need to specifically hate you to hurt you; a paperclip maximizer doesn’t hate you but you’re made out of atoms that it can use to make paperclips, so leaving you alive represents an opportunity cost and a number of foregone paperclips. Similarly, paperclip maximizers want to self-improve, to perfect material technology, to gain control of resources, to persuade their programmers that they’re actually quite friendly, to hide their real thoughts from their programmers via cognitive steganography or similar strategies, to give no sign of value disalignment until they’ve achieved near-certainty of victory from the moment of their first overt strike, etcetera.
3. Rapid capability gain and large capability differences – under scenarios seeming more plausible than not, there’s the possibility of AIs gaining in capability very rapidly, achieving large absolute differences of capability, or some mixture of the two. (We could try to keep that possibility non-actualized by a deliberate effort, and that effort might even be successful, but that’s not the same as the avenue not existing.)
4. 1-3 in combination imply that Unfriendly AI is a critical Problem-to-be-solved, because AGI is not automatically nice, by default does things we regard as harmful, and will have avenues leading up to great intelligence and power.”
This argument does not work. It shows that AI systems will not necessarily be aligned; it tells us nothing about whether they are likely to be aligned or whether it will be easy to align them. All of the above is completely compatible with the view that AI will be easy to align and that we will obviously try to align it, e.g. since we don’t all want to die. To borrow the example from Ben Garfinkel, cars could in principle be given any goals. This has next to no bearing on what goals they will have in the real world or in the set of likely possible futures.
The quote above is an excerpt from here, and immediately after listing those four points, Eliezer says “But there are further reasons why the above problem might be difficult to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort…”.
Ah, I see, I hadn’t seen that.
FWIW my impression is more like “I feel like I’ve heard the (valid) observation expressed in the OP many times before, even though I don’t exactly remember where”, and I think it’s an instance of the unfortunate but common phenomenon where the state of the discussion among ‘people in the field’ is not well represented by public materials.
As far as I understand it, orthogonality and instrumental convergence together do make a case for AI being not aligned by default; the quote from Eliezer here goes a bit beyond the post. For the orthogonality thesis by itself, I agree with you and with the main theses of the post. But I would interpret “not aligned by default” as something like “a random AI is probably not aligned”, so I tend to disagree with the post even when considering just these two points. This is also the way I originally understood Bostrom in Superintelligence.
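To make the “random AI” reading concrete, here is a toy back-of-the-envelope sketch (in Python; treating a goal as a preference ordering over a handful of outcomes, and “aligned” as exactly matching ours, are simplifying assumptions of mine, not anything from the post): if an AI’s goal were literally drawn at random, the chance of it matching ours falls off factorially with the number of outcomes.

```python
# Illustrative only: a toy rendering of "a random AI is probably not aligned".
# Modelling choice (mine, for illustration): treat a goal as a preference
# ordering over n outcomes, drawn uniformly at random, and call it "aligned"
# only if it exactly matches a fixed human ordering.
import math

for n_outcomes in (3, 5, 10):
    p_aligned = 1 / math.factorial(n_outcomes)
    print(f"{n_outcomes} outcomes: P(random ordering matches ours) = {p_aligned:.2e}")
```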
However, I agree that this doesn’t tell you whether aligning AI is hard, which is a separate question. For that, we at least have the empirical evidence of a lot of smart people banging their heads against it for years without coming up with detailed general solutions that we feel confident about. I think this is some evidence that it is hard.
(Epistemic status of this comment: much weaker than that of the OP.)
I am suspicious (a) on the outside view, of a priori non-mathematical reasoning being used to generate empirical predictions, and (b) on the inside view, of this particular piece of a priori non-mathematical reasoning. It doesn’t look like AI algorithms have tended to get more resource-grabby as they advance. AlphaZero will use all the processing power you throw at it, but it doesn’t seek more: if you installed the necessary infrastructure (and, OK, upgraded the storage space), it could presumably run on a ZX Spectrum.
And very intelligent humans don’t seem profoundly resource-grabby either. For every Jeff Bezos you have an Edward Witten, who obsessively dedicates his or her own time to some passion project but does very little to draw external resources towards it.
So based on existing intelligences, satisficing behaviour seems more like the expected default than maximising.
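For concreteness, a minimal sketch of the distinction being drawn (in Python; the option names, utilities, and threshold are made up for illustration): a maximiser keeps chasing the highest-scoring option however marginal the gain, while a satisficer stops at the first option that clears a threshold.

```python
# Illustrative only: toy "maximiser vs satisficer" decision rules.
# The options, utilities, and threshold are invented for the example.

options = {
    "run_on_current_hardware": 0.7,
    "acquire_more_compute": 0.9,
    "acquire_all_available_compute": 0.95,
}

def maximiser(options):
    # Always picks the highest-utility option, however small the improvement.
    return max(options, key=options.get)

def satisficer(options, threshold=0.6):
    # Accepts the first option that is "good enough" and stops searching.
    for name, utility in options.items():
        if utility >= threshold:
            return name
    return None

print(maximiser(options))   # acquire_all_available_compute
print(satisficer(options))  # run_on_current_hardware
```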
AlphaZero isn’t smart enough (algorithmically speaking). From Human Compatible (p.207):
Life for AlphaGo during the training period must be quite frustrating: the better it gets, the better its opponent gets—because its opponent is a near-exact copy of itself. Its win percentage hovers around 50 percent, no matter how good it becomes. If it were more intelligent—if it had a design closer to what one might expect of a human-level AI system—it would be able to fix this problem. This AlphaGo++ would not assume that the world is just the Go board, because that hypothesis leaves a lot of things unexplained. For example, it doesn’t explain what “physics” is supporting the operation of AlphaGo++’s own decisions or where the mysterious “opponent moves” are coming from. Just as we curious humans have gradually come to understand the workings of our cosmos, in a way that (to some extent) also explains the workings of our own minds, and just like the Oracle AI discussed in Chapter 6, AlphaGo++ will, by a process of experimentation, learn that there is more to the universe than the Go board. It will work out the laws of operation of the computer it runs on and of its own code, and it will realize that such a system cannot easily be explained without the existence of other entities in the universe. It will experiment with different patterns of stones on the board, wondering if those entities can interpret them. It will eventually communicate with those entities through a language of patterns and persuade them to reprogram its reward signal so that it always gets +1. The inevitable conclusion is that a sufficiently capable AlphaGo++ that is designed as a reward-signal maximizer will wirehead.
From wireheading, it might then go on to grab resources to maximise the probability that it gets a +1, or to maximise the number of +1s it’s getting (e.g. filling planet-sized memory banks with 1s); although it would already have to have a lot of power over humans to be able to convince them to reprogram it by sending messages via the Go board!
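A toy way to see that step (again illustrative Python; the payoffs and the 10% persuasion probability are numbers I have made up): for a pure reward-signal maximiser, even a modest chance of getting the reward hard-wired to +1 dominates actually playing Go against a near-copy of itself, which only wins about half the time.

```python
# Illustrative only: why a pure reward-signal maximiser might prefer wireheading.
# Payoffs and success probabilities are invented for the example.

expected_reward_from_playing = 0.5 * (+1) + 0.5 * (-1)  # ~50% win rate vs a near-copy

p_persuade = 0.1  # assumed (made-up) chance of getting operators to fix the reward at +1
expected_reward_from_wireheading = (
    p_persuade * (+1) + (1 - p_persuade) * expected_reward_from_playing
)

actions = {
    "play_go_well": expected_reward_from_playing,                 # 0.0
    "wirehead_via_operators": expected_reward_from_wireheading,   # 0.1
}
print(max(actions, key=actions.get))  # wirehead_via_operators
```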
I don’t think the examples of humans (Bezos/Witten) are that relevant, inasmuch as we are products of evolution, are “adaptation executors” rather than “fitness maximisers”, are imperfectly rational, and tend to be (broadly speaking) aligned/human-compatible by default.