Do you have any canonical referece for AI aligment research? I have read Eliezer Yudkowsky FAQ and I have been surprised of how little technical details are commented. His arguments are very much “we are building alien squids and they will eat us all”. But they are not squids, and we have not trained them to prey on mammals, but to navigate across symbols. The IAs we are training are not as alien as giant a squid, but far more: they are not even trained for self-preservation.
MR suggests that there is not peer reviewed literature on AI risk:
But perhaps I can read something comprehensive (a pdf, if possible), and not depend on navigating posts, FAQs and similar stuff. Currently my understanding of AI risk is based in technical knowledge about Reinforcement Learning for games and multiagent systems. I have no knowledge nor intuition on other kind of systems, and I want to engage with the “state of the art” (in compact format) before I make a post focused on the AI alignement side.
Yes, it is definitely a little confusing how EA and AI safety often organize themselves via online blog posts instead of papers / books / etc like other fields! Here are two papers that seek to give a comprehensive overview of the problem:
Alternatively, this paper by Joseph Carlsmith at Open Philanthropy is a more philosophical overview that tries to lay out the big-picture argument that powerful, agentic AI is likely to be developed and that safe deployment/control would present a number of difficulties.
There are also lots of papers and reports and such about individual technical topics in the behavior existing AI systems—Research in goal misgeneralization (Shah et al., 2022); power-seeking (Turner et al., 2021); specification gaming (Krakovna et al., 2020); mechanistic interpretability (Olsson et al. (2022), Meng et al. (2022)); ML safety divided into robustness, monitoring, alignment and external safety (Hendrycks et al., 2022). But these are probably more in-the-weeds than you are looking for.
Not technically a paper (yet?), but there have been several surveys of expert machine-learning researchers on questions like “when do you think AGI will be developed?”, “how good/bad do you think this will be for humanity overall?”, etc, which you might find interesting.
Dear Mr. Wagner,
Do you have any canonical referece for AI aligment research? I have read Eliezer Yudkowsky FAQ and I have been surprised of how little technical details are commented. His arguments are very much “we are building alien squids and they will eat us all”. But they are not squids, and we have not trained them to prey on mammals, but to navigate across symbols. The IAs we are training are not as alien as giant a squid, but far more: they are not even trained for self-preservation.
MR suggests that there is not peer reviewed literature on AI risk:
https://marginalrevolution.com/marginalrevolution/2023/04/from-the-comments-on-ai-safety.html
“The only peer-reviewed paper making the case for AI risk that I know of is: https://onlinelibrary.wiley.com/doi/10.1002/aaai.12064. Though note that my paper (the second you linked) is currently under review at a top ML conference.”
But perhaps I can read something comprehensive (a pdf, if possible), and not depend on navigating posts, FAQs and similar stuff. Currently my understanding of AI risk is based in technical knowledge about Reinforcement Learning for games and multiagent systems. I have no knowledge nor intuition on other kind of systems, and I want to engage with the “state of the art” (in compact format) before I make a post focused on the AI alignement side.
Yes, it is definitely a little confusing how EA and AI safety often organize themselves via online blog posts instead of papers / books / etc like other fields! Here are two papers that seek to give a comprehensive overview of the problem:
This one, by Richard Ngo at OpenAI along with some folks from UC Berkeley and the University of Oxford, is a technical overview of why modern deep-learning techniques might lead to various alignment problems, like deceptive behavior, that could be catastrophic in very powerful systems.
Alternatively, this paper by Joseph Carlsmith at Open Philanthropy is a more philosophical overview that tries to lay out the big-picture argument that powerful, agentic AI is likely to be developed and that safe deployment/control would present a number of difficulties.
There are also lots of papers and reports and such about individual technical topics in the behavior existing AI systems—Research in goal misgeneralization (Shah et al., 2022); power-seeking (Turner et al., 2021); specification gaming (Krakovna et al., 2020); mechanistic interpretability (Olsson et al. (2022), Meng et al. (2022)); ML safety divided into robustness, monitoring, alignment and external safety (Hendrycks et al., 2022). But these are probably more in-the-weeds than you are looking for.
Not technically a paper (yet?), but there have been several surveys of expert machine-learning researchers on questions like “when do you think AGI will be developed?”, “how good/bad do you think this will be for humanity overall?”, etc, which you might find interesting.