Thanks for sharing this list, a bunch of great people! I have a background in cognitive science and am interested in exploring the strategy of understanding human intelligence for designing aligned AIs.
Some quotes on this intersection from Paul Christiano that I read a couple of months ago:
From The easy goal inference problem is still hard:
The possible extra oomph of Inverse Reinforcement Learning comes from an explicit model of the human’s mistakes or bounded rationality. It’s what specifies what the AI should do differently in order to be “smarter,” what parts of the human’s policy it should throw out. So it implicitly specifies which of the human behaviors the AI should keep. The error model isn’t an afterthought — it’s the main affair.
and
It’s not clear to me whether or exactly how progress in AI will make this problem [of finding any reasonable representation of any reasonable approximation to what that human wants] easier. I can certainly see how enough progress in cognitive science might yield an answer, but it seems much more likely that it will instead tell us “Your question wasn’t well defined.” What do we do then?
From Clarifying “AI alignment”:
“What [the human operator] H wants” is even more problematic [...]. Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we may not be able to build an AI that tries to do what H wants it to do.
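To make the error-model point concrete, here is a toy sketch of my own (not from Paul's post): Bayesian IRL over a two-action choice with a Boltzmann-rational noise model. The softmax form and the rationality coefficient beta are assumptions I am introducing purely for illustration; the point is that beta is the error model, and changing it changes which reward we infer from the same observed behavior.

```python
# Toy illustration (my own assumptions, not from the quoted post): Bayesian IRL
# over a two-action choice, with a Boltzmann-rational error model. The
# rationality coefficient beta encodes how "noisy" we assume the human is.
import numpy as np

def boltzmann_likelihood(actions, rewards, beta):
    """P(observed actions | candidate rewards) under softmax (Boltzmann) rationality."""
    logits = beta * rewards                          # higher beta = human assumed closer to optimal
    log_probs = logits - np.logaddexp.reduce(logits)
    return np.exp(np.sum(log_probs[actions]))        # choices treated as i.i.d.

observed = np.array([0, 0, 0, 0, 1])                 # human picks action 0 four times, action 1 once

hypotheses = {
    "prefers action 0": np.array([1.0, 0.0]),
    "indifferent":      np.array([0.5, 0.5]),
}

for beta in (0.1, 5.0):                              # near-random human vs. near-optimal human
    likes = {name: boltzmann_likelihood(observed, r, beta) for name, r in hypotheses.items()}
    total = sum(likes.values())                      # uniform prior, so normalized likelihoods = posterior
    print(f"beta={beta}: " + ", ".join(f"{k} {v / total:.2f}" for k, v in likes.items()))
```

With beta = 0.1 the data barely favor either hypothesis, while with beta = 5 the single deviant choice is treated as strong evidence of indifference rather than as a mistake. The inferred preference flips with the assumed error model, which is exactly the sense in which “the error model isn’t an afterthought.”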