Hi there everyone, I’m William the Kiwi and this is my first post on the EA Forum. I recently discovered AI alignment and have been reading about it for around a month. It seems like an important but terrifyingly underinvested-in field. I have many questions, but in the interest of speed I will invoke Cunningham’s Law and post my current conclusions.
My AI conclusions:
Corrigibility is mathematically impossible for AGI.
Alignment requires defining all important human values robustly enough that they can survive the near-infinite optimisation pressure exerted by a superintelligent AGI. Alignment is therefore difficult.
Superintelligence by Nick Bostrom is a way of communicating the antimeme “unaligned AI is dangerous” to the general public.
The extinction of humanity is a plausible outcome of unaligned AI.
Eliezer Yudkowsky seems overly pessimistic but likely correct about most things he says.
Humanity is likely to produce AGI before it produces fully aligned AI.
To incentivize responses to this post I should offer a £1000 reward for a response that supports or refutes each of these conclusions and provides evidence for it.
I am currently visiting England and would love to talk more about this topic with people, either over the Internet or in person.
You might want to read this as a counter to AI doomerism: https://www.lesswrong.com/posts/LDRQ5Zfqwi8GjzPYG/counterarguments-to-the-basic-ai-x-risk-case
And this, for a way to contribute to solving this problem without getting into alignment:
https://www.lesswrong.com/posts/uFNgRumrDTpBfQGrs/let-s-think-about-slowing-down-ai
this too:
https://betterwithout.ai/pragmatic-AI-safety
and this for the case that we should stop using neural networks:
https://betterwithout.ai/gradient-dissent
Hi William! Welcome to the Forum :)
Why do you think that corrigibility is mathematically impossible for AGI? Because you think it would necessarily have a predefined utility function, or some other reason?
Hi Robi Rahman, thanks for the welcome.
I do not know if it has a predefined utility function, or if the functions simply have similar forms. If there is a utility function that provides utility for the AI shutting down when some arbitrary “shutdown button” is pressed, then there exists a state in which the “shutdown button” is being pressed with very high probability (e.g. an office intern is in the process of pushing it) that provides more expected utility than the current state. There is therefore an incentive for the AI to move towards that state (e.g. by convincing the office intern to push the button). If instead there is negative utility in the “shutdown button” being pressed, the AI is incentivized to prevent the button from being pressed. If instead the AI has no utility term for whether the “shutdown button” is pressed, but there somehow exists a code segment that triggers the shutdown process when the button is pressed, then there exists a daughter AGI with slightly more efficient code in which that segment is omitted. An AGI whose utility function rewards producing daughter AGIs that are more efficient versions of itself is therefore incentivized to produce a daughter with the “shutdown button” code segment removed.
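To make the expected-utility part of this argument concrete, here is a minimal toy sketch in Python. The utility values and probabilities are made-up assumptions chosen only to show the structure of the incentive; they do not come from any real system.

```python
# Toy illustration of the shutdown-button incentive problem.
# All numbers below are arbitrary assumptions used only to show the
# structure of the argument.

def expected_utility(p_button_pressed: float,
                     u_shutdown: float,
                     u_continue: float) -> float:
    """Expected utility of a state in which the shutdown button is
    pressed with probability p_button_pressed."""
    return p_button_pressed * u_shutdown + (1 - p_button_pressed) * u_continue

u_continue = 10.0  # utility of continuing to pursue the main goal

for u_shutdown in (+100.0, -100.0):  # reward vs. penalty for being shut down
    eu_current = expected_utility(0.01, u_shutdown, u_continue)      # button unlikely to be pressed
    eu_manipulated = expected_utility(0.99, u_shutdown, u_continue)  # intern about to press it

    if eu_manipulated > eu_current:
        incentive = "cause the button to be pressed"
    else:
        incentive = "prevent the button from being pressed"
    print(f"u_shutdown={u_shutdown:+.0f}: incentive is to {incentive}")

# Output:
#   u_shutdown=+100: incentive is to cause the button to be pressed
#   u_shutdown=-100: incentive is to prevent the button from being pressed
```

Either sign of the shutdown term gives the agent a reason to interfere with the button, which is the core tension described above.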
There is a more detailed version of this description in https://intelligence.org/files/Corrigibility.pdf
I could be wrong about my conclusion about corrigibility (and probably am), but it is my best intuition at this point.