Personally, my worry stems primarily from how difficult it seems to prevent utter fools from hooking something like ChaosGPT up to GPT-5 or 6. That realisation was a doozy for me. You don’t need fancy causal explanations of misalignment if the doom-mechanism is just… somebody telling the GPT to kill us all. And somebody will definitely try.
Secondarily, I also think a gradually increasing share of GPT’s activations gets funneled through heuristics that are generally useful across the tasks involved in minimising its loss function at INT<20, and those heuristics may not stay inner- or outer-aligned at INT>20. Such heuristics include:
You get better results if you search a higher-dimensional action-space.
You get better results on novel tasks if you model the cognitive processes that produce those results, and then use that model to produce results yourself. There’s a monotonic path from here all the way up to consequentialism that goes something like the following.
...index and reuse algorithms that have proven reliable on similar tasks, since searching a compact space of known general algorithms is much faster than deriving a solution from scratch each time (see the toy sketch after this list).
...extend its ability to recognise which tasks count as ‘similar’.[1]
...develop meta-algorithms for more reliably putting algorithms together in increasingly complex sequences.
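To make the first couple of steps concrete, here is a minimal toy sketch. It is my own illustration, not something from the comment above: a learner that caches routines which have worked before and retrieves them by a crude, adjustable notion of task similarity. Every name in it (RoutineLibrary, similarity_radius, the task features) is invented for the example.

```python
# Toy sketch (illustrative only): a learner that caches routines which have
# been reliable before and retrieves them by a rough notion of task
# 'similarity', instead of searching program-space from scratch every time.

from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class RoutineLibrary:
    # routines that worked before, keyed by a crude task signature
    routines: Dict[Tuple, Callable] = field(default_factory=dict)
    # how far apart two signatures may be and still count as 'similar';
    # widening this is the "extend what counts as similar" step above
    similarity_radius: float = 1.0

    def signature(self, task: dict) -> Tuple:
        # hand-rolled featurisation; a real system would have to learn this
        return (task["kind"], float(task["size"]))

    def distance(self, a: Tuple, b: Tuple) -> float:
        # different kinds are far apart; same kind differs by size
        return (0.0 if a[0] == b[0] else 10.0) + abs(a[1] - b[1])

    def retrieve(self, task: dict) -> Optional[Callable]:
        # reuse is much cheaper than a fresh search over all programs,
        # so a learner under time pressure is pushed to lean on it
        sig = self.signature(task)
        scored = [(self.distance(sig, s), r) for s, r in self.routines.items()]
        if not scored:
            return None
        d, routine = min(scored, key=lambda pair: pair[0])
        return routine if d <= self.similarity_radius else None

    def store(self, task: dict, routine: Callable) -> None:
        self.routines[self.signature(task)] = routine
```

The only point of the sketch is the incentive structure: retrieval is far cheaper than fresh search, and loosening similarity_radius is exactly the ‘extend which tasks count as similar’ step.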
This progression could result in something that has an explicit model of its own proxy-values, and explicitly searches a high-dimensional space of action-sequences for plans according to meta-heuristics that have historically maximised those proxy-values. Aka a consequentialist. At which point you should hope those proxy-values capture something you care about.
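As a deliberately crude sketch of what that endpoint means mechanically (again my own illustration, with an invented proxy and invented action names), picture an agent whose proxy-value model is just a scoring function and whose ‘planning’ is explicit search over action-sequences:

```python
# Toy sketch (illustrative only): an explicit proxy-value model plus explicit
# search over action-sequences. The proxy and the action names are made up.

import itertools
from typing import Callable, Sequence, Tuple

Action = str
Plan = Tuple[Action, ...]


def proxy_value(plan: Plan) -> float:
    # stand-in for whatever internal score training actually reinforced;
    # if it diverges from what we care about, the search below optimises
    # the divergence just as happily
    return sum(1.0 for a in plan if a == "acquire_resources") - 0.1 * len(plan)


def plan_by_search(
    actions: Sequence[Action],
    horizon: int,
    score: Callable[[Plan], float] = proxy_value,
) -> Plan:
    # brute-force enumeration of action-sequences up to `horizon`; the
    # meta-heuristics in the story above are whatever a real system uses
    # to prune this combinatorial space
    best_score, best_plan = float("-inf"), ()
    for length in range(1, horizon + 1):
        for plan in itertools.product(actions, repeat=length):
            s = score(plan)
            if s > best_score:
                best_score, best_plan = s, plan
    return best_plan


if __name__ == "__main__":
    # the proxy rewards resource acquisition, so that is what gets planned
    print(plan_by_search(["acquire_resources", "rest", "report_status"], horizon=3))
```

Everything interesting in the real case lives in how proxy_value got learned and which meta-heuristics prune the search; the sketch only shows that once both exist explicitly, the agent optimises the proxy as given, not whatever we meant by it.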
This is just one hypothetical zoomed-out story that makes sense in my own head, but you definitely shouldn’t defer to my understanding of this. I can explain jargon upon request.
[1] Aka proxy-values. Note that just by extending the domain of inputs over which a particular algorithm gets used, you can end up with a proxy-value without ever explicitly modelling anything about your loss function. Values evolve as the domains of highly general algorithms.