Executive summary: This post summarizes 29 project proposals for the 2024 AI Safety Camp, listing the goals, desired skills, and teams for each one. The projects span a variety of alignment methods like debate, constitutional AI, and asymmetric control.
Key points:
Many projects focus on restricting uncontrollable AI through methods like operational design domains, injunctions against data laundering, and congressional messaging.
Multiple projects aim to improve mechanistic interpretability of LLMs through analysis of toy models, activation engineering, and out-of-context learning.
Evaluating and steering LLMs towards alignment is another theme, with projects on reflectivity benchmarks, situational awareness datasets, tiny model evals, steering techniques, and more.
Additional areas include agent foundations, with projects on actuation spaces, optimization and agency, and agent detection.
Miscellaneous alignment methods being explored include non-maximizing agents, debate improvements, personalized fine-tuning, self-other overlap, and asymmetric control.
Supplementary projects address policy-based model access, economic safety nets for AGI deployment, and organizing virtual AI safety unconferences.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.