Executive summary: The author announces a substantially revised version of “Intro to Brain-Like-AGI Safety,” arguing that brain-like AGI poses a distinct, unsolved technical alignment problem centered on reward function design, continual learning, and model-based reinforcement learning, and that recent AI progress does not resolve these risks.
Key points:
The series still aims to bring non-experts to the frontier of open problems in brain-like AGI safety, with a core thesis that such systems will have explicit reward functions whose design is critical for alignment.
The author argues that today’s LLMs are not AGI and that focusing on benchmarks or “book smarts” obscures large gaps in autonomous, long-horizon planning and execution.
A central neuroscience claim is that the cortex largely learns from scratch, while evolved steering mechanisms in the hypothalamus and brainstem ultimately ground all human motivations, including prosocial ones.
The update expands critiques of interpretability as a standalone solution, emphasizing scale, continual learning, and competitive pressures as unresolved obstacles.
The author maintains that instrumental convergence is not inevitable but becomes likely for sufficiently capable RL agents with consequentialist preferences, making naive debugging approaches unsafe at high capability levels.
The revised conclusion elevates “reward function design” as a priority research program for alignment, complementing efforts to reverse-engineer human social instincts.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.