Check whether the model works with Paul Christiano-type assumptions about how AGI will go.
Great post! I had a similar thought reading through your article. My gut reaction is that your setup can be made to work as-is with a more gradual takeoff story, with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but it's a bit unnatural and some of the phrasing doesn't quite fit.
Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.
Paul says rather that e.g.
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively.
or
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
On his view (and this is somewhat similar to my view), the background assumption is more like 'deploying an unaligned AGI on your first critical try (i.e. an AGI that is capable of taking over) implies doom'. That still means there is an eventual deadline by which these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:
Technical alignment research success by the time of the first critical try (possibly AI assisted)
Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)
On the Paul view, your three pillars would still eventually have to be satisfied to reach a stable regime where unaligned AGI cannot pose a threat. But we would only need to get to those 100 points after a period in which less capable AGIs are running around, either helping or hindering: motivating us to respond better, or causing damage that degrades our response, to varying extents depending on how we respond in the meantime and on how long the AI takeoff period lasts.
Also, crucially, the actions of pre-AGI AI may push the point at which these problems become critical to higher AI capability levels, as well as potentially assist on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul's view isn't that this is enough to actually postpone the need for a complete solution forever: rather, the effects of pre-AGI AI 'could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal'.
This adds another element of uncertainty and complexity to all of the takeover/success stories, which makes a lot of predictions more difficult.
Essentially, the time/level of AI capability at which we must reach 100 points to succeed becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars.
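To make the moving-target idea concrete, here is a minimal toy sketch (my own illustration, not anything from your post or from Paul; the function name simulate_takeoff, the step sizes and the thresholds are all made up for illustration): the pillars accumulate points while capability advances, and the danger threshold itself drifts up or down, so success just means reaching 100 points before capability crosses whatever the threshold turns out to be.

```python
# Hypothetical toy sketch (not the post's actual model): success requires the
# three pillars to reach 100 combined points before AI capability crosses a
# "danger threshold" -- and that threshold is itself a moving free variable
# that pre-AGI AI can push up (e.g. by making takeover harder) or that a
# degraded response can effectively pull down.
import random

def simulate_takeoff(steps: int = 50, seed: int = 0) -> bool:
    """Return True if the pillars hit 100 points before capability crosses the danger threshold."""
    rng = random.Random(seed)
    pillar_points = 0.0        # combined progress on alignment, deployment decisions, coordination
    capability = 0.0           # current frontier AI capability (arbitrary units)
    danger_threshold = 100.0   # capability level at which takeover becomes possible

    for _ in range(steps):
        capability += rng.uniform(1.0, 4.0)         # capabilities advance each step
        pillar_points += rng.uniform(0.0, 5.0)      # pre-AGI AI may help the pillars...
        pillar_points -= rng.uniform(0.0, 2.0)      # ...or cause damage that degrades the response
        danger_threshold += rng.uniform(-1.0, 2.0)  # the critical point itself moves up or down

        if pillar_points >= 100.0:
            return True                             # stable regime reached in time
        if capability >= danger_threshold:
            return False                            # the first critical try arrives before we're ready

    return pillar_points >= 100.0

if __name__ == "__main__":
    outcomes = [simulate_takeoff(seed=s) for s in range(1000)]
    print(f"success rate across toy runs: {sum(outcomes) / len(outcomes):.2f}")
```

The numbers here are arbitrary; the only point is that both the rate of pillar progress and the danger threshold are uncertain and interacting, which is exactly what makes predictions harder in the takeoff period.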
Thanks for this!
My thinking has moved somewhat in this direction as well since writing this. I'm working on a post which tells a story more or less following what you lay out above, in doc form here: https://docs.google.com/document/d/1msp5JXVHP9rge9C30TL87sau63c7rXqeKMI5OAkzpIA/edit#
I agree this danger level for capabilities could be an interesting addition to the model.
I do feel like the model remains useful in my thinking, so I might try a re-write + some extensions at some point (but probably not very soon).