I think something that people sometimes have in mind when they talk about the orthogonality of intelligence and goals is this picture of AI development where we’re creating systems that are, in some sense, smarter and smarter, and then there’s this separate project of trying to figure out what goals to give these AI systems. The way this works, I think, in some of the classic presentations of risk, is that there’s this deadline picture: there will come a day when we have extremely intelligent systems, and if we can’t figure out by that day how to give them the right goals, then we might give them the wrong goals and a disaster might occur. So we have this exogenous deadline, set by the creep of AI capability progress, and we need to solve this issue before that day arrives. That’s something that I, for the most part, disagree with.
I continue to have a lot of uncertainty about how likely it is that AI development will look like “there’s this separate project of trying to figure out what goals to give these AI systems” vs a development process where capability and goals are necessarily connected. (I didn’t find your arguments in favor of the latter very persuasive.) For example it seems GPT-3 can be seen as more like the former than the latter. (See this thread for background on this.)
To the extent that AI development is more like the latter than the former, that might be bad news for (a certain version of) the orthogonality thesis, but it can be even worse news for the prospect of AI alignment. Because instead of disaster striking only if we can’t figure out the right goals to give to the AI, it can also be the case that we know what goals we want to give it, but due to constraints of the development process, we can’t give it those goals and can only build AI with unaligned goals. So it seems to me that the latter scenario can also rightly be described as having an “exogenous deadline of the creep of AI capability progress”. (In both cases, we can try to refrain from developing/deploying AGI, but it may be a difficult coordination problem for humanity to stay in a state where we know how to build AGI but choose not to, and in any case this consideration cuts equally across both scenarios.)
I think that the comment you make above is right. In the podcast, we only discuss this issue in a super cursory way:
(From the transcript) A second related concern, which is a little bit different, is that you could think this is an argument against us naively going ahead and putting out into the world something as extremely misaligned as a dust minimizer or a paperclip maximizer, but we could still get to the point where we haven’t worked out alignment techniques. … No sane person would keep running the dust minimizer simulation once it’s clear this is not the thing we want to be making. But maybe not everyone is the same. Maybe someone wants to make a system that pursues some extremely narrow objective like this very effectively, even though it would be clear to anyone with normal values that you’re not in the process of making a thing you would actually want to use. Maybe somebody who wants to cause destruction could conceivably plough ahead. So that might be one way of rescuing a deadline picture. The deadline is not when people will have intelligent systems that they naively throw out into the world. It’s when someone who wants to create something that, in some sense, is intuitively pursuing a very narrow objective gains the ability to do that.
Fortunately, I’m not too worried about this possibility. Partly, as background, I expect us to have moved beyond using hand-coded reward functions (or, more generally, what Stuart Russell calls the “standard model”) by the time we have the ability to create broadly superintelligent and highly agential/unbounded systems. There are really strong incentives to do this, since there are loads of useful applications that seemingly can’t be developed using hand-coded reward functions. This is part of the sense in which, in my view, capabilities research and alignment research are mushed together. If progress is sufficiently gradual, I find it hard to imagine that the ability to create things like world-destroying paperclippers comes before (e.g.) the ability to make at least pretty good use of reward modeling techniques.
(To be clear, I recognize that loads of alignment researchers also think that there will be strong economic incentives for alignment research. I believe there’s a paragraph in Russell’s book arguing this, and I think DM’s “scalable agent alignment” paper also suggests that reward modeling is necessary to develop systems that can assist us in most “real world domains.” I don’t know how much optimism other people tend to draw from this observation, though; I don’t actually know, for example, whether Russell is less optimistic than I am.)
If we do end up in a world where people know they can create broadly superintelligent and highly agential/unbounded AI systems, but we still haven’t worked out alternatives to Russell’s “standard model,” then no sane person really has any incentive to create and deploy these kinds of systems. Training up a broadly superintelligent and highly agential system using something like a hand-coded reward function is likely to be an obviously bad idea; if it’s not obviously bad a priori, then it will likely become obviously bad during the training process. There wouldn’t be much of a coordination problem, since, at least in normal circumstances, no one has an incentive to knowingly destroy themselves.
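To make the contrast concrete, here’s a minimal sketch of the difference between a hand-coded reward function (the “standard model”) and a reward model learned from human comparisons, in the spirit of the reward modeling techniques mentioned above. It’s an illustrative toy in Python/PyTorch, not drawn from any particular system; all names are hypothetical.

```python
import torch
import torch.nn as nn

# "Standard model": the designer writes the objective down by hand. This works
# for narrow tasks, but there is no hand-coded formula for most things we
# actually want (e.g. "be broadly helpful", "write a good novel").
def hand_coded_reward(state: dict) -> float:
    return float(state["paperclips_produced"])  # hypothetical narrow objective

# Reward modeling: learn the objective from human feedback instead. A small
# network scores outcomes, trained so that outcomes humans preferred get
# higher scores than the ones they rejected.
class RewardModel(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, outcome_features: torch.Tensor) -> torch.Tensor:
        return self.net(outcome_features).squeeze(-1)

def preference_loss(model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: push the score of the outcome humans
    # preferred above the score of the one they rejected.
    return -torch.log(torch.sigmoid(model(preferred) - model(rejected))).mean()
```

The point of the contrast is just that, in the second setup, making the system more capable and making it pursue objectives you actually endorse are parts of the same training pipeline rather than separate projects.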
If I then try to tell a story where humanity goes extinct, due to a failure to move beyond the standard model in time, two main scenarios come to mind.
Doomsday Machine: States develop paperclipper-like systems, while thinking of them as doomsday machines, to serve as a novel alternative or complement to nuclear deterrents. They end up being used, either accidentally or intentionally.
Apocalyptic Residual: The ability to develop paperclipper-like systems diffuses broadly. Some of the groups that gain this ability have apocalyptic objectives, and they develop and deploy the systems with the active intention of destroying humanity.
The first scenario doesn’t seem very likely to me. Although this is obviously very speculative, paperclippers seem much worse, as deterrents, than nuclear or even biological weapons. First, your own probability of survival if you use a paperclipper may be much lower than your probability of survival if you use nukes or biological weapons. Second, and somewhat ironically, it may be hard to convince people that your paperclipper system can actually do a ton of damage; it seems hard to know that the result would really be as bad as feared without real-world experience of using it. States would also likely be slow to switch to this new deterrence strategy, providing even more time for alignment techniques to be worked out. As a further bit of friction/disincentive, these systems might also just be extremely expensive (depending on compute or environment design requirements). Finally, for doomsday to occur, a paperclipper system actually needs to be used, and its effect needs to be as bad as feared. The history of nuclear weapons suggests that the annual probability of use is probably pretty low.
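As a rough illustration of how a low annual probability of use compounds over time (the numbers below are purely illustrative placeholders, not estimates from the podcast or from me):

```python
# Chance of at least one use over N years, assuming a constant annual
# probability p of use: 1 - (1 - p) ** N. Placeholder numbers, for illustration only.
def cumulative_use_probability(annual_p: float, years: int) -> float:
    return 1 - (1 - annual_p) ** years

for annual_p in (0.001, 0.01):
    print(f"annual p = {annual_p}: {cumulative_use_probability(annual_p, 50):.1%} over 50 years")
```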
The second scenario also doesn’t seem very likely to me, since: (a) I think there would probably be an initial period where large quantities of resources (e.g. compute and skilled engineers) are required to make world-destroying paperclippers. (b) Only a very small portion of people want to destroy the world. (c) There would be unusually strong incentives for states to prevent apocalyptic groups or individuals from gaining access to the necessary resources.
Although see Asya’s “AGI in Vulnerable World” post for a discussion of some conditions under which malicious use concerns might loom larger.
My guess would be that if you play with GPT-3, it can talk about human values (or AI alignment, for that matter) about as well as it can talk about anything else. In that sense, it seems like stronger capabilities for GPT-3 also potentially help solve the alignment problem.
I don’t think I caught the point about GPT-3, although this might just be a matter of using concepts differently.
In my mind: To whatever extent GPT-3 can be said to have a “goal,” its goal is to produce text that it would be unsurprising to find on the internet. The training process both imbued it with this goal and made the system good at achieving it.
There are other things we might want spin-offs of GPT-3 to do: for example, compose better-than-human novels. Doing this would involve shifting both what GPT-3 is “capable” of doing and what its “goal” is. (There’s not really a clean practical or conceptual distinction between the two.) It would also probably require making progress on some sort of “alignment” technique, since we can’t (e.g.) write down a hand-coded reward function that quantifies novel quality.
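To gesture at the same point in code (a toy sketch assuming a generic autoregressive language model; none of this is meant to describe OpenAI’s actual training setup): GPT-3’s only training signal is “assign high probability to the next token of internet text,” so its capability and its “goal” of producing unsurprising text fall out of the same loss. Shifting a spin-off toward something like novel quality means changing that training signal, and since novel quality can’t be hand-coded, the new signal would itself have to be learned from human judgments.

```python
import torch
import torch.nn.functional as F

# (1) Pretraining: the single training signal is next-token prediction on
# internet text. The model's capability and its "goal" of producing
# unsurprising text both come from minimizing this loss.
def pretraining_loss(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); target_tokens: (batch, seq)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

# (2) A "better-than-human novelist" spin-off would need a different signal.
# Since novel quality can't be hand-coded, the reward would have to come from a
# learned model of human judgments, and fine-tuning might look roughly like a
# policy-gradient update against that learned reward.
def finetuning_loss(log_prob_of_sampled_text: torch.Tensor, learned_reward: torch.Tensor) -> torch.Tensor:
    return -(learned_reward.detach() * log_prob_of_sampled_text).mean()
```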
(Apologies for the super long response!)
Edit: More discussion here:
https://www.lesswrong.com/posts/BnDF5kejzQLqd5cjH/alignment-as-a-bottleneck-to-usefulness-of-gpt-3?commentId=vcPdcRPWJe2kFi4Wn