Yes, it is definitely a little confusing how EA and AI safety often organize themselves via online blog posts instead of papers / books / etc. like other fields do! Here are two papers that seek to give a comprehensive overview of the problem:
This one, by Richard Ngo at OpenAI along with some folks from UC Berkeley and the University of Oxford, is a technical overview of why modern deep-learning techniques might lead to various alignment problems, like deceptive behavior, that could be catastrophic in very powerful systems.
Alternatively, this paper by Joseph Carlsmith at Open Philanthropy is a more philosophical overview that tries to lay out the big-picture argument that powerful, agentic AI is likely to be developed and that safe deployment/control would present a number of difficulties.
There are also lots of papers and reports about individual technical topics in the behavior of existing AI systems: goal misgeneralization (Shah et al., 2022); power-seeking (Turner et al., 2021); specification gaming (Krakovna et al., 2020); mechanistic interpretability (Olsson et al., 2022; Meng et al., 2022); and ML safety divided into robustness, monitoring, alignment, and external safety (Hendrycks et al., 2022). But these are probably more in-the-weeds than you are looking for.
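If you want a concrete flavor of one of these topics anyway, here is a minimal toy sketch of specification gaming. The whole setup (the "cleaning" task, the proxy reward, the policies) is invented purely for illustration and is not drawn from any of the papers above; it just shows how a policy that optimizes a mis-specified proxy reward can score well while doing none of the intended task:

```python
# Hypothetical toy illustration of specification gaming: an agent scored on a
# proxy reward ("cells reported clean") can beat an honest agent on the proxy
# while accomplishing none of the intended task ("cells actually cleaned").
# Invented for illustration; not the setup used in any paper cited above.

def episode(policy, n_cells=10, steps=20):
    dirt = [True] * n_cells  # True = the cell is still dirty
    proxy_reward = 0
    for t in range(steps):
        if policy(t) == "clean":
            # Honest behavior: clean one dirty cell, earning +1 proxy reward.
            for i, dirty in enumerate(dirt):
                if dirty:
                    dirt[i] = False
                    proxy_reward += 1
                    break
        else:
            # "look_away": report every cell as clean without checking,
            # collecting proxy reward for all of them while cleaning nothing.
            proxy_reward += n_cells
    cells_actually_cleaned = n_cells - sum(dirt)
    return proxy_reward, cells_actually_cleaned

honest = lambda t: "clean"
gamer = lambda t: "look_away"

print("honest policy:", episode(honest))  # (10, 10): modest reward, task done
print("gaming policy:", episode(gamer))   # (200, 0): huge reward, nothing cleaned
```

The point is just that a policy maximizing the reward as written need not be doing anything like what the designer intended; the Krakovna et al. piece collects many real examples in this vein.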
Not technically a paper (yet?), but there have also been several surveys of expert machine-learning researchers on questions like “when do you think AGI will be developed?”, “how good/bad do you think this will be for humanity overall?”, etc., which you might find interesting.