Executive summary: Max Harms argues that Bentham’s Bulldog substantially underestimates AI existential risk by relying on flawed multi-stage probabilistic reasoning and overconfidence in alignment-by-default and warning-shot scenarios, while correctly recognizing that even optimistic estimates still imply an unacceptably dire situation that warrants drastic action to slow or halt progress toward superintelligence.
Key points:
Harms claims Bentham’s Bulldog commits the “multiple-stage fallacy” by decomposing doom into conditional steps whose probabilities are multiplied, masking correlated failures, alternative paths to catastrophe, and systematic under-updating.
He argues If Anyone Builds It, Everyone Dies makes an object-level claim about superintelligence being lethal if built with modern methods, not a meta-claim that readers should hold extreme confidence after one book.
Harms rejects the idea that alignment will emerge “by default” from RLHF or similar methods, arguing these techniques select for proxy behaviors, overfit to training contexts, and fail to robustly encode human values.
He contends that proposed future alignment solutions double-count existing methods, underestimate interpretability limits, and assume implausibly strong human verification of AI-generated alignment schemes.
The essay argues that “warning shots” are unlikely to mobilize timely global bans and may instead accelerate state-led races toward more dangerous systems.
Harms maintains that once an ambitious superintelligence exists, it is unlikely to lack the resources, pathways, or strategies needed to disempower humanity, even without overt warfare.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.