Full-time independent deconfusion researcher (https://www.alignmentforum.org/posts/5Nz4PJgvLCpJd6YTA/looking-deeper-at-deconfusion) in AI Alignment. (Also: PhD in the theory of distributed computing.)
If you’re interested in some of the research ideas you see in my posts, know that I keep private docs with the most compressed versions of my deconfusion ideas, currently in the process of getting feedback. I can give you access if you PM me!
A list of topics I’m currently doing deconfusion on:
Goal-directedness for discussing AI Risk
Myopic Decision Theories for dealing with deception (with Evan Hubinger)
Universality for many of Paul Christiano’s alignment ideas
Deconfusion itself to get better at it
Models of Language Models to clarify the alignment issues surrounding them.
Thanks for the thoughtful comment!
This sounds like a potentially good analogy, but one has to be careful that it doesn’t rely on assumptions that only apply to humans, or to quite bounded agents.
The topic of persuasion (both by AIs and of AIs) is indeed important in alignment. There’s a general risk that optimization is very easily spent on manipulating humans, whether intentionally (training an AI which actually ends up wanting to do something else, and so has reason to manipulate us) or unintentionally (training an AI such that it’s incentivized to answer what we would prefer to hear rather than give the most accurate and appropriate answer).
As for the persuasion of AIs by AIs, there are some initial thoughts around memetics for AIs, but they’re not fully formed yet.
I don’t know much about this literature, but it makes me think of more structural takes on the alignment problem, which emphasize the importance of the structure of society in funneling and pushing optimization, rather than the individual power of agents to alter it.
So, as can be seen above, none of these ideas sounds bad or impossible to make work, but judging them correctly would require far more effort put into analyzing them. Maybe you should apply for the fellowship, especially for the behavioral work, on which you’re more of an expert? ;)
It’s a very good question, and shamefully I don’t have an answer that’s completely satisfying. But here are the next best things, some resources that will give you a more rounded perspective on alignment:
Richard Ngo’s AGI safety from first principles, a condensed starter that presents the main lines of argument in a modern (post-ML-revolution) way.
Rob Miles’s YouTube channel on alignment, with great videos on many different topics.
Andrew Critch and David Krueger’s ARCHES, a survey of alignment problems and perspectives that puts more emphasis than most on structural approaches.