Executive summary: In this talk (and accompanying essay), Joe Carlsmith explores the importance and challenges of safely automating AI alignment research—especially focusing on evaluation difficulties—and argues that automating empirically grounded alignment research is more tractable and could substantially enable progress on harder conceptual problems; he treats this as a key milestone for mitigating AI risk in scenarios of accelerating capability development.
Key points:
Core thesis: A pivotal question for solving alignment is whether we can safely automate alignment research, particularly to the level of top human experts—what Carlsmith calls an “alignment MVP.”
Empirical vs. conceptual research: Empirical alignment research is likely easier to evaluate and automate due to stronger feedback loops and clearer standards (e.g. “make number go up”), whereas conceptual alignment research suffers from fuzzier evaluation, closer to philosophy or futurism.
Evaluation difficulties as a central barrier: Even without adversarial (“scheming”) AI behavior, challenges like sycophancy, cluelessness, and reward hacking threaten the ability to reliably evaluate AI-generated alignment research; this evaluation problem may be fundamental rather than just practical.
Two feedback loops: Carlsmith frames progress in terms of an AI capabilities feedback loop (driving rapid progress) and an AI safety feedback loop (hopefully keeping up); automating alignment research is key to ensuring the latter doesn’t fall fatally behind.
Pathways forward: Even if automation is risky, partially automating empirical research could help us better study oversight methods, transparency, and model behavior—tools essential both for safe automation and for mitigating scheming risks.
Alternatives and pessimism: If automation fails, options include very slow human-led progress, global pauses, or even whole-brain emulation—but these are speculative and difficult, reinforcing the urgency of pursuing automated alignment research despite the challenges.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.