Personally, my worry stems primarily from how difficult it seems to prevent utter fools from hooking something like ChaosGPT up to GPT-5 or 6. That realisation was a doozy for me. You don’t need fancy causal explanations of misalignment if the doom-mechanism is just… somebody telling the GPT to kill us all. And somebody will definitely try.
Secondarily, I also think a gradually increasing share of GPT’s activations gets funneled through heuristics that are generally useful across the tasks involved in minimising its loss function at INT<20, and those heuristics may not stay inner- or outer-aligned at INT>20. Such heuristics include:
You get better results if you search a higher-dimensional action-space.
You get better results on novel tasks if you model the cognitive processes that produce those results, and then use that model to produce results yourself. There’s a monotonic path from here all the way up to consequentialism that goes something like the following.
...index and reuse algorithms that have proven reliable on similar tasks, since searching a compact space of known general algorithms is much faster than deriving a solution from scratch each time (see the toy sketch after this list).
...extend its ability to recognise which tasks count as ‘similar’.[1]
...develop meta-algorithms for more reliably putting algorithms together in increasingly complex sequences.
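To make the first couple of steps concrete, here is a minimal toy sketch. It is my own illustration, not something from the comment above: a learner that caches routines which have worked before and retrieves them by a crude, adjustable notion of task similarity. Every name in it (RoutineLibrary, similarity_radius, the task features) is invented for the example.

```python
# Toy sketch (illustrative only): a learner that caches routines which have
# been reliable before and retrieves them by a rough notion of task
# 'similarity', instead of searching program-space from scratch every time.

from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class RoutineLibrary:
    # routines that worked before, keyed by a crude task signature
    routines: Dict[Tuple, Callable] = field(default_factory=dict)
    # how far apart two signatures may be and still count as 'similar';
    # widening this is the "extend what counts as similar" step above
    similarity_radius: float = 1.0

    def signature(self, task: dict) -> Tuple:
        # hand-rolled featurisation; a real system would have to learn this
        return (task["kind"], float(task["size"]))

    def distance(self, a: Tuple, b: Tuple) -> float:
        # different kinds are far apart; same kind differs by size
        return (0.0 if a[0] == b[0] else 10.0) + abs(a[1] - b[1])

    def retrieve(self, task: dict) -> Optional[Callable]:
        # reuse is much cheaper than a fresh search over all programs,
        # so a learner under time pressure is pushed to lean on it
        sig = self.signature(task)
        scored = [(self.distance(sig, s), r) for s, r in self.routines.items()]
        if not scored:
            return None
        d, routine = min(scored, key=lambda pair: pair[0])
        return routine if d <= self.similarity_radius else None

    def store(self, task: dict, routine: Callable) -> None:
        self.routines[self.signature(task)] = routine
```

The only point of the sketch is the incentive structure: retrieval is far cheaper than fresh search, and loosening similarity_radius is exactly the ‘extend which tasks count as similar’ step.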
This progression could result in something that has an explicit model of its own proxy-values, and explicitly searches a high-dimensional space of action-sequences for plans according to meta-heuristics that have historically maximised those proxy-values. Aka a consequentialist. At which point you should hope those proxy-values capture something you care about.
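As a deliberately crude sketch of what that endpoint means mechanically (again my own illustration, with an invented proxy and invented action names), picture an agent whose proxy-value model is just a scoring function and whose ‘planning’ is explicit search over action-sequences:

```python
# Toy sketch (illustrative only): an explicit proxy-value model plus explicit
# search over action-sequences. The proxy and the action names are made up.

import itertools
from typing import Callable, Sequence, Tuple

Action = str
Plan = Tuple[Action, ...]


def proxy_value(plan: Plan) -> float:
    # stand-in for whatever internal score training actually reinforced;
    # if it diverges from what we care about, the search below optimises
    # the divergence just as happily
    return sum(1.0 for a in plan if a == "acquire_resources") - 0.1 * len(plan)


def plan_by_search(
    actions: Sequence[Action],
    horizon: int,
    score: Callable[[Plan], float] = proxy_value,
) -> Plan:
    # brute-force enumeration of action-sequences up to `horizon`; the
    # meta-heuristics in the story above are whatever a real system uses
    # to prune this combinatorial space
    best_score, best_plan = float("-inf"), ()
    for length in range(1, horizon + 1):
        for plan in itertools.product(actions, repeat=length):
            s = score(plan)
            if s > best_score:
                best_score, best_plan = s, plan
    return best_plan


if __name__ == "__main__":
    # the proxy rewards resource acquisition, so that is what gets planned
    print(plan_by_search(["acquire_resources", "rest", "report_status"], horizon=3))
```

Everything interesting in the real case lives in how proxy_value got learned and which meta-heuristics prune the search; the sketch only shows that once both exist explicitly, the agent optimises the proxy as given, not whatever we meant by it.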
This is just one hypothetical zoomed-out story that makes sense in my own head, but you definitely shouldn’t defer to my understanding of this. I can explain jargon upon request.
[1] Aka proxy-values. Note that just by extending the domain of inputs over which a particular algorithm gets used, you can end up with a proxy-value without ever explicitly modelling anything about your loss function. Values evolve as the domains of highly general algorithms.