Edited the post substantially (and, hopefully, transparently, via strikethrough and “edit:” and such) to reflect the parts of this and the previous comment that I agree with.
Regarding this:
I don’t see many risk scenarios where a technical solution to the AI alignment problem is sufficient to solve AGI-related risk. For accident-related risk models (in the sense of this framework) solving safety problems is necessary. But even when technical solutions are available, you still need all relevant actors to adopt those solutions, and we know from the history of nuclear safety that the gap between availability and adoption can be big — in that case decades. In other words, even if technical AI alignment researchers somehow “solve” the alignment problem, government action may still be necessary to ensure adoption (whether government-affiliated labs or private sector actors are the developers).
I’ve heard this elsewhere, including at an event for early-career longtermists interested in policy where a very policy-skeptical, MIRI-type figure was giving a Q&A. A student asked: if we solved the alignment problem, wouldn’t we need to enforce its adoption? The MIRI-type figure said something along the lines of:
“Solving the alignment problem” probably means figuring out how to build an aligned AGI. The top labs all want to build an aligned AGI; they just think the odds of the AGIs they’re working on being aligned are much higher than I think they are. But if we have a solution, we can just go to the labs and say, here, this is how you build it in a way that we don’t all die, and I can prove that this makes us not all die. And if you can’t say that, you don’t actually have a solution. And they’re mostly reasonable people who want to build AGI and make a ton of money and not die, so they will take the solution and say, thanks, we’ll do it this way now.
So, was MIRI-type right? Or would we need policy levers to enforce adoption of the solution, even in this model? The post you cite chronicles how long it took for safety advocates to address glaring risks in the nuclear missile system. My initial model says that if the top labs resemble today’s OpenAI and DeepMind, it would be much easier to convince them than the entrenched, securitized bureaucracy described in the post: the incentives are much better aligned, and the cultures are more receptive to suggestions of change. But this does seem like a cruxy question. If the MIRI-type figure is wrong, this would justify a lot of investigation into what those levers would be, and into how to prepare governments to develop and pull them. If he’s right, this would support more focus on buying time in the first place, as well as on trying to make sure the top firms at the time the alignment problem is solved are receptive.
(E.g., if the MIRI-type model is right, a US lead over China seems really important: if we expect that the solution will come from alignment researchers in Berkeley, maybe it’s more likely that they implement it if they are private, “open”-tech-culture companies, who speak the same language and live in the same milieu and broadly have trusting relationships with the proponent, etc. Or maybe not!)
I put some credence on the MIRI-type view, but we can’t simply assume that is how it will go down. What if AGI gets developed in the context of an active international crisis or conflict? Could not a government — the US, China, the UK, etc. — come in and take over the tech, and race to get there first? To the extent that there is some “performance” penalty to a safety implementation, or that implementing safety measures takes time that could be used by an opponent to deploy first, there are going to be contexts where not all safety measures are going to be adopted automatically. You could imagine similar dynamics, though less extreme, in an inter-company or inter-lab race situation, where (depending on the perceived stakes) a government might need to step in to prevent premature deployment-for-profit.
The MIRI-type view bakes in a bunch of assumptions about several dimensions of the strategic situation, including: (1) it’s going to be clear to everyone that the AGI system will kill everyone without the safety solution, (2) the safety solution is trusted by everyone and not seen as a potential act of sabotage by an outside actor with its own interest, (3) the external context will allow for reasoned and lengthy conversation about these sorts of decisions. This view makes sense within one scenario in terms of the actors involved, their intentions and perceptions, the broader context, the nature of the tech, etc. It’s not an impossible scenario, but to bet all your chips on it in terms of where the community focuses its effort (I’ve similarly witnessed some MIRI staff’s “policy-skepticism”) strikes me as naive and irresponsible.
You can essentially think of it as two separate problems:
Problem 1: Conditional on us having a technical solution to AI alignment, how do we ensure the first AGI built implements it?
Problem 2: Conditional on us having a technical solution to AI alignment, how do we ensure no AGI is ever built that does NOT implement it, or some other equivalent solution?
I feel like you are talking about Problem 1, and Locke is talking about Problem 2. I agree with the MIRI-type that Problem 1 is easy to solve, and the hard part of that problem is having the solution. I do believe the existing labs working on AGI would implement a solution to AI alignment if we had one. That still leaves Problem 2 that needs to be solved—though at least if we’re facing Problem 2, we do have an aligned AGI to help with the problem.
Hmm. I don’t have strong views on unipolar vs multipolar outcomes, but I think MIRI-type thinks Problem 2 is also easy to solve, given the last couple of clauses of your comment.
Agreed; it strikes me that I’ve probably been over-anchoring on this model.