Thought provoking post, thanks. I think the Orthogonality Thesis (in its theoretical form—your “Motte”) is useful to counter the common naive intuition that sufficiently intelligent AI will be benevolent by default (or at least open to being “reasoned with”).
But (as Steven Byrnes says), it is just one component of the argument that AGI x-risk is a significant threat. Others include Goodhart’s Law and the fragility of human values (what Stuart Russell refers to as the “King Midas problem”), Instrumental Convergence, Mesa-optimisation, and the second species argument (what Stuart Russell refers to as the “gorilla problem”); as well as differential technological development (capabilities research outstripping alignment research), arms races, and (lack of) global coordination amidst a rapid increase in available compute (increasing hardware overhang).
I guess you could argue that strong convergence on human-compatible values by default would make most of these concerns moot, but there is little to suggest that this is likely. Going through your “Some reasons to expect correlation”, I think that 1-4 don’t address the risks from mesa-optimisation and instrumental convergence (on seeking resources etc.; think turning the world into “computronium”). In general, it seems that things have to be completely watertight in order to avoid x-risk. We might asymptote toward human-compatibility of ML systems, but all the doom flows through the gap between the curve and the axis. Making it completely watertight is an incredibly difficult challenge, especially as it needs to be done on the first try when deploying AGI.
5-8 are interesting: perhaps (to use your terms) if some form of moral exclusivism bottoming out to valence utilitarianism is true, and the superintelligence discovers it by default, we might be ok (but even then, your 9 may apply).
Riffing on possible reasons to be hopeful, I recently compiled a list of potential “miracles” (including empirical “crucial considerations” [/wishful thinking]) that could mean the problem of AGI x-risk is bypassed:
Possibility of a failed (unaligned) takeoff scenario where the AI fails to model humans accurately enough (i.e. fails to realise that smart humans could detect its “hidden” activity in a certain way). [This may only set things back a few months to years; or could lead to some kind of Butlerian Jihad if there is a sufficiently bad (but ultimately recoverable) global catastrophe (and then much more time for Alignment the second time around?)].
Valence realism being true. Binding problem vs AGI Alignment.
Omega experiencing every possible consciousness and picking the best? [Could still lead to x-risk in terms of a Hedonium Shockwave].
Moral Realism being true (and the AI discovering it and the true morality being human-compatible).
Natural abstractions leading to Alignment by Default?
Rohin’s links here.
AGI discovers new physics and exits to another dimension (like the creatures in Greg Egan’s Crystal Nights).
Simulation/anthropics stuff.
Alien Information Theory being true!? (And the aliens having solved alignment).
I don’t think I put more than 10% probability on them collectively though, and my P(doom) is high enough to consider it “crunch time”.