(sorry if the comment is unclear! Musing out loud)
Thanks for the post and the general sharpening of ideas! One potential disagreement I have with your analysis: it seems like you tie the “first critical try” concept to the “Decisive Strategic Advantage” (DSA) concept, but those two seem separable to me. Or rather, my understanding is that DSA motivates some of the first-critical-try arguments but is not necessary for them. For example, suppose at time t we set in motion a series of actions that are irrecoverable (e.g., we make unaligned AIs integral to the world economy). We may not realize what we did until time t+5, at which point it’s too late.
In my understanding of the Yudkowsky/Soares framework, this is like saying “I know with 99%+ certainty that Magnus Carlsen can beat you in chess, even if I can’t predict how.” Similarly, the superhuman agent(s) may end up “beating” humanity through any of a variety of channels, of which a violent/military takeover is just one example. In that sense, the creation/deployment of those agents was the “first critical try” that we messed up, even if we had hardened the world against a military coup specifically.
When looking at the world today, and at the ways smart and amoral people expropriate resources from less intelligent people, sometimes it looks like obviously and transparently nonconsensual or deceptive behavior. But often it looks more mundane: payday loans, money pumps like warranties and insurance for small items, casinos, student loan forgiveness, and so forth. (The harms are somewhat limited in practice due to a mixture of a) smart humans not being that much smarter than other humans, and b) partial alignment of values.)
Similarly, we may end up living in a world where it eventually becomes possible for either agents or processes to wrest control from humanity. In that world, whether we have a “first critical try” or multiple tries then depends on specific empirical details: how many transition points there are, and which ones end up being preventable in practice.