You say you disagree with the idea that the day we create AGI acts as a sort of ‘deadline’: if we don’t figure out alignment before then, we’re screwed.
A lot of your argument is about how increasing AI capability and alignment are intertwined processes, so that as we increase an AI’s capabilities we’re also increasing its alignment. You discuss how it’s not as though we’re going to create a superpowerful AI and then, at the end of the process, give it a module containing its goals.
I agree with that, but I don’t see it as substantially affecting the Bostrom/Yudkowsky arguments.
Isn’t the idea that we would have something that seemed aligned as we were training it (based on this continuous feedback we were giving it), but that only once it became extremely powerful would we realize it wasn’t actually aligned?
I think there are a couple different bits to my thinking here, which I sort of smush together in the interview.
The first bit is that, when developing an individual AI system, its goals and capabilities/intelligence tend to take shape together. This is helpful, since it increases the odds that we’ll notice issues with the system’s emerging goals before they result in truly destructive behavior. Even if someone didn’t expect a purely dust-minimizing house-cleaning robot to be a bad idea, for example, they’ll quickly realize their mistake as they train the system. The mistake will be clear well before the point when the simulated robot learns how to take over the world; it will probably be clear even before the point when the robot learns how to operate door knobs.
The second bit is that there are many contexts in which pretty much any possible hand-coded reward function will either quickly reveal itself as inappropriate or be obviously inappropriate before the training process even begins. This means that sane people won’t proceed in developing and deploying things like house-cleaning robots or city planners until they’ve worked out alignment techniques to some degree; they’ll need to wait until we’ve moved beyond “hand-coding” preferences, toward processes that more heavily involve ML systems learning what behaviors users or developers prefer.
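To make the ‘dust-minimizing’ worry concrete, here’s a minimal toy sketch; nothing in it comes from an actual system, and the reward function, states, and numbers are all hypothetical. The point is just that a reward hand-coded purely around dust removal ranks a destructive policy above a careful one, and the problem is visible immediately, long before the system is capable of anything genuinely dangerous.

```python
# Hypothetical hand-coded reward for a house-cleaning agent: it scores
# *only* the amount of dust removed, and nothing else about the house.
def hand_coded_reward(state_before, state_after):
    return state_before["dust"] - state_after["dust"]

# Two candidate behaviors in a crude simulated house.
start            = {"dust": 10, "vases_intact": 3}
gentle_sweep     = {"dust": 6,  "vases_intact": 3}   # removes some dust, breaks nothing
smash_and_vacuum = {"dust": 0,  "vases_intact": 0}   # removes all dust by wrecking the room

print(hand_coded_reward(start, gentle_sweep))        # 4
print(hand_coded_reward(start, smash_and_vacuum))    # 10 (the destructive policy scores higher)
```

An objective like this is obviously inappropriate before training even begins, which is the sense in which sane developers would hold off until something closer to learned preferences is available.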
It’s still conceivable that, even given these considerations, people will accidentally develop AI systems that commit omnicide (or cause similarly grave harms). But the likelihood at least goes down. For this to happen, it needs to be the case both that (a) training processes that use apparently promising alignment techniques will still converge on omnicidal systems, and that (b) people won’t notice that these training processes have serious issues until they’ve actually made omnicidal AI systems.
I’m skeptical of both (a) and (b). My intuition, regarding (a), is that a method that involves learning human preferences would need to be really terrible to result in systems doing things on the order of mass murder, although some arguments related to mesa-optimization may push against this intuition.
Then my intuition, regarding (b), is that the techniques would likely display serious issues before anyone creates a system capable of omnicide. For example, if these techniques tend to induce systems to engage in deceptive behaviors, I would expect there to be some signs that this is an issue early on; I would expect some failed or non-catastrophic acts of deception to be observed first. However, again, my intuition is closely tied to my expectation that progress will be pretty continuous. A key thing to keep in mind about highly continuous scenarios is that there’s not just one single consequential ML training run, where the ML system might look benign at the start but turn around and take over the world at the end. We’re instead talking about countless training runs, used to develop a wide variety of different systems of intermediate generality and competency, deployed across a wide variety of domains, over a period of multiple years. We would have many more opportunities to notice issues with available techniques than we would in a “brain in a box” scenario. In a more discontinuous scenario, the risk would presumably be higher.
This seems to be a disagreement about “how hard is AI alignment?”.
This might just be a matter of semantics, but I don’t think “how hard is AI alignment?” is the main question I have in mind here. I’m mostly thinking about the question of whether we’ll unwittingly create existentially damaging systems, if we don’t work out alignment techniques first. For example, if we don’t know how to make benign house cleaners, city planners, or engineers by year X, will we unwittingly create omnicidal systems instead? Certainly, the harder it is to work out alignment techniques, the higher the risks become. But it’s possible for accident risk to be low even if alignment techniques are very hard to work out.