Executive summary: Carlsmith argues that aligning advanced AI will require building systems that are capable of, and disposed toward, doing “human-like philosophy,” because safely generalizing human concepts and values to radically new situations depends on contingent, reflective practices rather than objective answers alone.
Key points:
The author defines “human-like philosophy” as the kind of reflective equilibrium humans would endorse on reflection, emphasizing that this may be contingent rather than objectively correct.
Philosophy matters for AI alignment because it underpins out-of-distribution generalization, including how concepts like honesty, harm, or manipulation extend to unfamiliar cases.
Carlsmith argues that philosophical capability in advanced AIs will likely arise by default, but that disposition—actually using human-like philosophy rather than alien alternatives—is the main challenge.
He rejects views that alignment requires solving all deep philosophical questions in advance or building “sovereign” AIs whose values must withstand unbounded optimization.
Some philosophical failures could be existential, especially around manipulation, honesty, or early locked-in policy decisions where humans cannot meaningfully intervene.
The author outlines research directions such as training on top-human philosophical examples, scalable oversight, transparency, and studying generalization behavior to better elicit human-like philosophy.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.