Interesting perspective. Though, leaning on Cotra’s recent post: if the first AGI is developed by iterations of reinforcement learning across different domains, it seems likely that it will develop a rather accurate view of the world, since that is what yields the highest rewards. This means the AGI will have high situational awareness, i.e., it will know that it’s an AGI, and it will very likely know about human biases. I thus think it will also be aware that it contains mental bugs itself and may start actively trying to fix them (since doing so will be reinforced, as it gives higher rewards in the longer run).
I thus think we should expect it to contain a surprisingly low number of very general bugs, such as weird ways of thinking or false assumptions in its worldview.
That’s why I believe the first AGI will already be very capable, and smart enough to hide for a long time until it strikes and overthrows its owners.
Yeah, I guess another consequence of how bugs are distributed is that the methodology of AI development matters a lot. An AI that is trained and developed over huge numbers of different domains is far, far, far more likely to succeed at takeover than one trained for specific purposes such as solving math problems. So the HFDT from that post would definitely be of higher concern if it worked (although I’m skeptical that it would).
I do think that any method of training will still leave holes, however. For example, the scenario where HFDT is trained by watching how experts use a computer would leave out all the non-computer domains of expertise. So even if it were a perfect reasoner across all scientific, artistic, and political knowledge, you couldn’t just shove it into a robot body and expect it to do a backflip on its first try, no matter how many backflipping manuals it had read. I think there will be sufficiently many out-of-domain problems to stymie world-domination attempts, at least initially.
I think a main difference of opinion I have with AI risk people is that I think subjugating all of humanity is a near-impossibly hard task, one requiring a level of intelligence and perfection across a range of fields that is stupendously far above human level, and I don’t think it’s possible to reach that level without vast, vast amounts of empirical testing.
Agree that it depends a lot on the training procedure. However, I think that given high situational awareness, we should expect the AI to know its shortcomings very well.
So I agree that it won’t be able to do a backflip on the first try. But it will know that it would likely fail, and thus it won’t rely on plans that require backflips; or, if it does need backflips, it will find a way of learning them without being suspicious (e.g., by manipulating a human into training it to do backflips).
I think overthrowing humanity is certainly hard. But it still seems possible for a patient AGI that slowly accumulates wealth and power by exploiting human conflicts, getting involved in crucial economic processes, and potentially gaining control of military communication systems using deepfakes and the wealth and power it has accumulated. (And all of this can be done just by interacting with a computer interface, as in Cotra’s example.) It’s also fairly likely that there are exploits in the way humans work that we are not aware of, which the AGI would learn from being trained on tons of data, and which would make takeover even easier.
So overall, I agree the AGI will have bugs, but it will also know it likely has bugs and thus will be very careful with any attempts at overthrowing humanity.
So I think my most plausible scenario of AI success would be similar to yours: you build up wealth and power through some sucker corporation or small country that thinks it controls you, then use their R&D resources along with your intelligence to develop some form of world-destruction-level technology that can be deployed without resistance. I think this is orders of magnitude more likely to work than Yudkowsky’s ridiculous “make a nanofactory in a beaker from first principles” strategy.
I still think this plan is doomed to fail (for early AGI). It’s multistep, highly complicated, and requires interactions with a lot of humans, who are highly unpredictable. You really can’t avoid “backflip steps” in such a process. By that I mean there will be things it needs to do for which there is not sufficient data available to perfect them, so it just has to roll the dice. For example, there is no training set for “running a secret globe-spanning conspiracy”, so it will inevitably make mistakes there. If we discover it before it’s ready to defeat us, it loses. Also, by the time it pulls the trigger on its plan, there will be other AGIs around, and other examples of failed attacks that put humanity on alert.
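To make the “multistep” point concrete, here’s a minimal back-of-the-envelope sketch (the step counts and per-step probabilities are purely illustrative assumptions of mine, not anything from Cotra’s post): if each un-rehearsable “backflip step” succeeds independently with some probability p, the odds of the whole plan surviving compound against it very quickly.

```python
# Toy model, not from the post: a takeover plan with n independent,
# un-rehearsable "backflip steps", each succeeding with probability p.

def plan_success_probability(p_per_step: float, n_steps: int) -> float:
    """Probability that every un-rehearsable step succeeds, assuming independence."""
    return p_per_step ** n_steps

# Even fairly reliable individual steps compound badly over a long plan:
for p in (0.9, 0.95, 0.99):
    print(f"p={p}: 10 steps -> {plan_success_probability(p, 10):.2f}, "
          f"30 steps -> {plan_success_probability(p, 30):.2f}")
# p=0.9: 10 steps -> 0.35, 30 steps -> 0.04
# p=0.95: 10 steps -> 0.60, 30 steps -> 0.21
# p=0.99: 10 steps -> 0.90, 30 steps -> 0.74
```

Even at 99% reliability per step, a 30-step covert plan only pulls through about three times in four, and a single exposure could put humanity on alert.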
A key crux here seems to be your claim that AIs will attempt these plans before they have the relevant capacities because they are operating on short timescales. However, given enough time and patience, it seems clear to me that the AI could succeed simply by not taking risky actions that it knows it might mess up on, until it self-improves enough to be able to take those actions. The question then becomes how long the AI thinks it has until another AI that could dominate it is built, as well as how fast self-improvement is.
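As a rough sketch of that crux (every number here is a made-up assumption, just to show the shape of the trade-off): the optimal amount of patience depends on how fast the AI’s reliability improves versus the hazard rate of a dominating rival appearing.

```python
import math

# Toy model, all parameters invented for illustration: acting later means higher
# capability, but also a higher chance that a dominating rival AI already exists.

def p_success_if_acting(t: float, start: float = 0.05, improvement_rate: float = 0.5) -> float:
    """Chance the attempt works if made after t years of self-improvement."""
    return 1 - (1 - start) * math.exp(-improvement_rate * t)

def p_no_rival_yet(t: float, rival_hazard: float = 0.2) -> float:
    """Chance that no AI capable of dominating this one has appeared by year t."""
    return math.exp(-rival_hazard * t)

def p_win(t: float) -> float:
    """Overall chance of a successful, uncontested attempt at year t."""
    return p_success_if_acting(t) * p_no_rival_yet(t)

best_t = max(range(21), key=p_win)
print(f"Best year to act under these made-up numbers: {best_t}, P(win) = {p_win(best_t):.2f}")
# Best year to act under these made-up numbers: 2, P(win) = 0.44
```

With these particular invented parameters the product peaks after only a couple of years of waiting; steeper self-improvement or a lower rival hazard pushes the optimal attempt further out, which is exactly the question of how long the AI thinks it has.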