Great question! I wrote a (draft) post kind of answering this recently, basically arguing that even though an AGI would converge on some instrumental goals, it would likely face powerful competing motivations against disempowering humanity or causing extinction. Note that my argument only applies to AGI ‘misalignment/accident’ risks, and doesn’t rule out AGI misuse risks.
Would love to hear your view on the post!
Thanks! I’ve commented on your post. I think you are assuming that the major unsolved problems in alignment (reward hacking, corrigibility, inner alignment, outer alignment) are somehow magically solved (it reads as though you are unaware of what the major problems in AI alignment are, sorry).