I’m guessing you’ve already made up your mind on this since it’s been a few months, but since you mentioned that computational complexity is your research field, you might be interested to know that Scott Aaronson was persuaded by Jan Leike to spend a year at OpenAI to
… think about the theoretical foundations of AI safety and alignment. What, if anything, can computational complexity contribute to a principled understanding of how to get an AI to do what we want and not do what we don’t want?
(Scott admitted, like you, that he basically needed to be nerd-sniped into working on problems; “this is very important so you must work on it” doesn’t work in practice.)
Quoting Scott a bit more (and adding bullets):
So, what projects will I actually work on at OpenAI? Yeah, I’ve been spending the past week trying to figure that out. I still don’t know, but a few possibilities have emerged.
- First, I might work out a general theory of sample complexity and so forth for learning in dangerous environments—i.e., learning where making the wrong query might kill you.
- Second, I might work on explainability and interpretability for machine learning: given a deep network that produced a particular output, what do we even mean by an “explanation” for “why” it produced that output? What can we say about the computational complexity of finding that explanation?
- Third, I might work on the ability of weaker agents to verify the behavior of stronger ones. Of course, if P≠NP, then the gap between the difficulty of solving a problem and the difficulty of recognizing a solution can sometimes be enormous. And indeed, even in empirical machine learning, there’s typically a gap between the difficulty of generating objects (say, cat pictures) and the difficulty of discriminating between them and other objects, the latter being easier. But this gap typically isn’t exponential, as is conjectured for NP-complete problems: it’s much smaller than that. And counterintuitively, we can then turn around and use the generators to improve the discriminators. How can we understand this abstractly? Are there model scenarios in complexity theory where we can prove that something similar happens? How far can we amplify the generator/discriminator gap—for example, by using interactive protocols, or debates between competing AIs?
That said, these mostly lean toward the theory-builder side, and you mentioned upthread that you’re more of a problem-solver, so they probably aren’t as interesting to you.
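(Tangent, but to make the solve/verify gap in that third item concrete: here is a minimal Python sketch, using a made-up toy CNF formula rather than anything from Scott’s post. Checking a candidate assignment takes time linear in the formula size, while the naive solver has to enumerate all 2^n assignments; the point of Scott’s question is that the empirical generator/discriminator gap is nowhere near that bad.)

```python
import itertools

# Toy illustration of the solve/verify asymmetry. The formula below is a
# hypothetical example, not taken from Scott's post: a list of clauses,
# where each clause is a list of signed literals (3 means x3 is True,
# -3 means x3 is False).
FORMULA = [[1, -2, 3], [-1, 2], [2, -3], [-2, -3, 1]]
NUM_VARS = 3

def verify(formula, assignment):
    """Check an assignment (dict var -> bool) in time linear in the formula size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def brute_force_solve(formula, num_vars):
    """Find a satisfying assignment by trying all 2**num_vars candidates."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = {i + 1: bits[i] for i in range(num_vars)}
        if verify(formula, assignment):
            return assignment
    return None  # unsatisfiable

if __name__ == "__main__":
    solution = brute_force_solve(FORMULA, NUM_VARS)
    print("found:", solution)  # exponential-time search in the worst case
    if solution is not None:
        print("checks out:", verify(FORMULA, solution))  # linear-time verification
```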