Suppose that near-term AGI progress mostly looks like making GPT smarter and smarter. Do people think this, in itself, would likely cause human extinction? How? Due to mesa-optimizers that would emerge during training of GPT? Due to people hooking GPT up to control actions in the real world, with those autonomous systems then going off the rails themselves? Just due to accelerating disruptive social change that makes all sorts of other risks (nuclear war, bioterrorism, economic or government collapse, etc.) more likely? Or do people think the AI extinction risk mainly comes when people start building explicitly agentic AIs to automate real-world tasks like making money or national defense, rather than just the text chat and image understanding that GPT does?
Those all seem like important risks to me, but I’d estimate the highest x-risk from agentic systems that learn to seek power or wirehead, especially after a transition to very rapid economic or scientific progress. If AI progresses slowly or is only a tool used by human operators, x-risk seems much lower to me.
Good recent post on various failure modes: https://www.lesswrong.com/posts/mSF4KTxAGRG3EHmhb/ai-x-risk-approximately-ordered-by-embarrassment
Personally, my worry stems primarily from how difficult it seems to prevent utter fools from combining something like ChaosGPT with GPT-5 or 6. That was a doozy for me. You don’t need fancy causal explanations of misalignment if the doom-mechanism is just… somebody telling the GPT to kill us all. And somebody will definitely try.
Secondarily, I also think a gradually increasing share of GPT’s activations gets funneled through heuristics that are generally useful across the tasks involved in minimising its loss function at INT<20, and those heuristics may not stay inner- or outer-aligned at INT>20. Such heuristics include:
You get better results if you search a higher-dimensional action-space.
You get better results on novel tasks if you model the cognitive processes that produce those results, and then use that model to produce results directly. There’s a monotonic path from there all the way up to consequentialism that goes something like the following. The model learns to...
...index and reuse algorithms that have been reliable on similar tasks, since searching a curated space of general algorithms is much faster than searching for a solution from scratch.
...extend its ability to recognise which tasks count as ‘similar’.[1]
...develop meta-algorithms for more reliably putting algorithms together in increasingly complex sequences.
This progression could result in something that has an explicit model of its own proxy-values, and explicitly searches a high-dimensional space of action-sequences for plans according to meta-heuristics that have historically maximised those proxy-values. Aka a consequentialist. At which point you should hope those proxy-values capture something you care about.
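To make the zoomed-out story a bit more concrete, here is a toy sketch of that end state in Python. Every name in it (ALGORITHM_LIBRARY, retrieve_similar, proxy_value, plan) is invented for the example; it is not a claim about GPT's internals, just the "cache reliable algorithms, retrieve them by similarity, compose them, and search for whatever maximises a proxy-value" loop spelled out:

```python
# Purely illustrative sketch (hypothetical names throughout, not a claim
# about GPT internals): a system that caches reliable sub-algorithms,
# retrieves them by task 'similarity', composes them into sequences, and
# scores candidate plans against a learned proxy-value.
import itertools

# A library of cached algorithms, indexed by the task features they have
# historically been reliable on (the 'index and reuse' step).
ALGORITHM_LIBRARY = {
    ("arithmetic",): lambda state: state + 1,
    ("copying",):    lambda state: state,
    ("doubling",):   lambda state: state * 2,
}

def retrieve_similar(task_features):
    """Return cached algorithms whose index overlaps the task's features
    (the 'recognise which tasks count as similar' step)."""
    return [alg for feats, alg in ALGORITHM_LIBRARY.items()
            if set(feats) & set(task_features)]

def proxy_value(state):
    """A learned proxy for the training signal. Here it is just 'bigger is
    better', standing in for whatever heuristic historically got low loss."""
    return state

def plan(task_features, start_state, depth=3):
    """The consequentialist step: explicitly search over sequences of cached
    algorithms and keep whichever sequence maximises the proxy-value."""
    candidates = retrieve_similar(task_features) or list(ALGORITHM_LIBRARY.values())
    best_seq, best_score = None, float("-inf")
    for seq in itertools.product(candidates, repeat=depth):
        state = start_state
        for step in seq:  # compose algorithms into a sequence (the meta-algorithm step)
            state = step(state)
        if proxy_value(state) > best_score:
            best_seq, best_score = seq, proxy_value(state)
    return best_seq, best_score

if __name__ == "__main__":
    seq, score = plan(("arithmetic", "doubling"), start_state=1)
    print(f"chose a plan of {len(seq)} steps with proxy-value {score}")
    # Nothing above checks whether the proxy-value tracks anything the
    # operator cares about; that is exactly the worry in the text.
```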
This is just one hypothetical zoomed-out story that makes sense in my own head, but you definitely shouldn’t defer to my understanding of this. I can explain jargon upon request.
[1] Aka proxy-values. Note that just by extending the domain of inputs for which a particular algorithm is used, you can end up with a proxy-value without directly modelling anything about your loss-function explicitly. Values evolve as the domains of highly general algorithms.
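A minimal sketch of that domain-extension point, again with made-up names (prefer_more_options, decide): a heuristic that was only ever "meant" for one family of tasks gets consulted on more and more kinds of input, until "keep your options open" operates as a de facto value even though nothing in the code models a loss function.

```python
# Toy illustration only (hypothetical names): a narrow heuristic becomes a
# de facto value purely by having its input domain extended.
def prefer_more_options(choices):
    """A heuristic originally 'meant' for puzzle-like tasks: pick the option
    that leaves the most follow-up moves available."""
    return max(choices, key=lambda option: len(option["follow_ups"]))

# At first this set only contained 'puzzle'; extending the domain is a
# one-line change, and no loss function is modelled anywhere.
APPLICABLE_DOMAINS = {"puzzle", "planning", "dialogue", "everything_else"}

def decide(task_tag, choices):
    # Once every task tag routes through the same heuristic, 'keep your
    # options open' functions as a proxy-value.
    if task_tag in APPLICABLE_DOMAINS:
        return prefer_more_options(choices)
    return choices[0]

if __name__ == "__main__":
    options = [
        {"name": "narrow",     "follow_ups": ["a"]},
        {"name": "open-ended", "follow_ups": ["a", "b", "c"]},
    ]
    print(decide("dialogue", options)["name"])  # prints 'open-ended'
```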