Executive summary: Joe Carlsmith argues that while AI systems trained with current methods may have alien motivations, they need not be human-like to be safe if designed as corrigible instruction-followers rather than long-term consequentialist “sovereigns.” The essay critiques Yudkowsky and Soares’s claim in If Anyone Builds It, Everyone Dies that alien drives make alignment impossible, suggesting instead that safe, non-consequentialist behavior may generalize adequately from training to deployment.
Key points:
The essay distinguishes between “sovereign” AIs that optimize long-term goals and “corrigible” AIs that safely follow instructions; only the latter are realistic and desirable for alignment.
It argues that AI safety depends more on rejecting rogue actions than on having human-like long-term motivations, making deontological or virtue-like architectures acceptable.
It challenges the “alien motivations” argument in If Anyone Builds It, Everyone Dies, noting that current AIs trained on human data already show moderately reliable generalization and human-legible behavior.
The fragility-of-value concern applies mainly to long-term consequentialist AIs, not to instruction-following systems whose safety tolerates some motivational error.
The essay emphasizes “option control” (limiting AIs’ ability to act dangerously) and balancing motivational weights such that ambition is outweighed by inhibition and failure aversion.
It doubts that coherence theorems or rational-agent models require AIs to be perfectly consequentialist, seeing corrigibility as a legitimate form of non-consequentialism.
Out-of-distribution generalization from safe to dangerous contexts may be difficult but is empirically tractable, more comparable to image-classification robustness than to an impossibly fragile target.
While alien motivations are still “extremely scary,” the essay argues that iterative testing, red-teaming, and using intermediate-capability AIs to advance alignment research could make safe progress before any superintelligence is built.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.