Conditional on a Misaligned AGI being exposed to high-impact inputs, it will scale (in aggregate) to the point of permanently disempowering roughly all of humanity
I answered 70% for this question, but the wording doesn't feel quite right. I put >80% that a sufficiently capable misaligned AI would disempower humanity, but the first AGI deployed is likely not to be maximally capable unless takeoff is really fast. It could neither initiate a pivotal act/process nor disempower humanity; then, over the next days to years (depending on takeoff speeds), different systems could become powerful enough to disempower humanity.
One way in which Unaligned AGI might cease to be a risk is if we develop a test for Misalignment, such that Misaligned AGIs are never superficially attractive to deploy. What is your best guess for the year when such a test is invented?
Such a test might not end the acute risk period, because people might not trust the results and could still deploy misaligned AGI. The test would also have to extrapolate into the real world, farther than any currently existing benchmark. It would probably need to rely on transparency tools far in advance of what we have today, and because this region of the transparency tech tree also contains alignment solutions, the development of this test should not be treated as uncorrelated with other alignment solutions.
Even then, I also think there's a good chance this test is very difficult to develop before AGI. The misalignment test and the alignment problem aren't research problems we are likely to solve independently of AGI; they're dramatically sped up by being able to iterate on AI systems and get more than one try at difficult problems.
Also, conditional on aligned ASI being deployed, I expect this test to be developed within a few days. So the question should say "conditional on AGI not being developed".
One way in which Unaligned AGI might cease to be a risk is if we have a method which provably creates Aligned AGIs (âsolving the Alignment Problemâ). What is your best guess for the year when this is first accomplished?
I.e. the year when it becomes possible (not necessarily practical/economic) to build an AGI and know it is definitely Aligned.
Solving the alignment problem doesn't mean we can create a provably aligned AGI. Nate Soares says:
Following Eliezer, I think of an AGI as "safe" if deploying it carries no more than a 50% chance of killing more than a billion people:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to "less than certainty of killing literally everyone". Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI "this will not kill literally everyone".
Notably absent from this definition is any notion of "certainty" or "proof". I doubt we're going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI's strategy is a common misconception about MIRI).