Thank you for the post. Taking a look at realistic scenarios for positive outcomes is interesting and, I think, neglected work.
I have a relatively easy time imagining what a stable failure mode looks like—if everyone’s dead, for instance, it seems like they’re likely to stay dead. I’m somewhat less certain about how to model a truly stable success mode. What you describe seems to be, in essence, one successfully aligned AGI and how we might get there. Do you think that is sufficient for a stable good state regarding AI safety—i.e. a state in which the collective field of AI safety can take a breath and say ‘we did it, let’s pack it up’? I ask because it seems important not to be hasty about defining success modes; a false sense of security seems generally dangerous.
I would imagine you might think that one aligned AGI can prevent less well-aligned AGIs from coming into existence; but that, of course, might come with a concerningly powerful influence on the world. Or perhaps that there is a general baseline interest in not building unaligned AGI, so that once the alignment problem is solved, there’s just no reason for unaligned AGI to come into existence? Especially in a very slow, multipolar take-off scenario, an isolated success in aligning one AGI doesn’t necessarily seem to translate into a global success story. (Even less so if you’re worried about how the unaligned and aligned AIs might interact.)
Another failure mode for an ostensibly stable good state is, of course, that you merely think the AI is aligned, and the actions suggested by its value function only later come apart from what we think it should do (this doesn’t even need to be a particularly treacherous or a particularly big turn). Accordingly, some success modes might be more stable than others—i.e. in how certain we can be that the AI is actually correctly aligned and not just seemingly so.
This is a bit of a random collection of thoughts—the TL;DR question version might be: How stable do you think the success in your success stories is?
I think this is a very important question that should probably get its own post.
I’m currently very uncertain about it, but I imagine the most realistic scenario is a mix of a lot of different approaches that never feels fully stable. I guess it might be similar to nuclear weapons today but on steroids, i.e. different actors have control over the technology, there are some norms and rules that most actors abide by, there are some organizations that care about non-proliferation, etc. But overall, a small perturbation could still blow up the system.
A really stable scenario probably requires either some very tough governance, e.g. preventing all but one actor from getting to AGI, or high-trust cooperation between actors, e.g. by working on the same AGI jointly.
Overall, I currently don’t see a realistic scenario that feels more stable than nuclear weapons seem today, which is not very reassuring.
I’d argue it’s even less stable than nukes, but one reassuring point: there will ultimately be a very weird future with thousands, millions, or billions of AIs, posthumans, and genetically engineered beings, and the borders between them will be very porous and dissolvable, which is important to keep in mind. Also, we don’t need arbitrarily long alignment; aligning AI for 50–100 years is enough. Ultimately, nothing needs to be stable in the long term; short-term stability amid the chaos is enough.
I’m not sure why this should be reassuring. It doesn’t sound clearly good to me. In fact, it sounds pretty controversial.
Because it’s possible that even in unstable, diverse futures, catastrophe can be avoided. As to the long-term future after the Singularity, that’s a question we will deal with when we get there.
I don’t think “dealing with it when we get there” is a good approach to AI safety. I agree that bad outcomes could be averted in unstable futures but I’d prefer to reduce the risk as much as possible nonetheless.