A short summary of what I have been posting about on LessWrong
So far I have only been posting on LessWrong, but from now on I will publish my posts both here and on LW. I have four LW posts, all connected to the same topic: risks from scenarios where someone successfully hits a bad alignment target. This post gives a short summary of that topic and describes how my four LW posts fit together.
A post describing a problem with Parliamentarian CEV (PCEV).
My first LW post showed that PCEV gives a large amount of extra influence to people who intrinsically value hurting other people. A powerful AI controlled by such people would be very dangerous. The fact that this feature of PCEV went undetected for more than a decade shows that analysing alignment targets is difficult. The fact that a successfully implemented PCEV would have led to an outcome massively worse than extinction shows that failing to properly analyse an alignment target can be very dangerous. The fact that the issue was eventually noticed shows that it is possible to reduce these dangers. In other words: Alignment Target Analysis (ATA) is a tractable way of reducing a serious risk. Yet there does not seem to be a single research project dedicated to ATA. This is why I do ATA.
There seem to be a wide variety of reasons for thinking that doing ATA now is not needed. In other words: there is a wide variety of arguments for why it is acceptable to stay at our current level of ATA progress without making any real effort to improve things. These arguments are usually both unpublished and hand-wavy. My three other LW posts each counter one such argument.
A post discussing the idea of building a limited AI that is only used to shut down competing AI projects.
The post assumes that some limited AI will prevent all unauthorised AI projects forever. It then shows why this assumption does not actually remove the urgency of doing ATA now. Decisions regarding Sovereign AI will still be in human hands (by assumption, the limited AI could be safely launched without any further ATA progress, so no decisions regarding Sovereign AI can be deferred to it). There are many reasons why someone might decide to quickly launch a Sovereign AI. The risk from a competing AI project is only one such reason, and the post discusses others. If some alignment target has a hidden flaw, then finding that flaw would thus remain urgent, even if competing AI projects are taken out of the equation.
A post explaining why the Last Judge idea does not remove the need to do ATA now.
This post points out that a Last Judge off-switch add-on can fail, which means that this idea cannot remove the need for doing ATA now. The post also outlines a specific scenario where such an add-on fails. Finally, it points out that such an add-on can be attached to many different AI projects, aiming at many different alignment targets. This means that the idea of a Last Judge is not very helpful when deciding which alignment target to aim at.
A post explaining why the Corrigibility idea does not remove the need to do ATA now.
This post shows that a partially successful Corrigibility method can actually make things worse. It outlines a scenario where a Corrigibility method works for a limited AI that is used to buy time, but fails for an AI Sovereign. This can make things worse, because a bad alignment target might end up getting successfully implemented: the Corrigible limited AI makes the Sovereign AI project possible, and the Sovereign AI project moves forward because the designers think that the Corrigibility method will also work for this project. The designers know that their alignment target might be bad, but they consider this a manageable risk, because they think that the Sovereign AI will also be Corrigible.