Executive summary: Joe Carlsmith is leaving Open Philanthropy to join Anthropic to help design Claude’s model specification. They argue (with explicit uncertainty) that working inside a frontier lab may currently be their highest-impact way to improve AI safety, while stressing substantial existential risks, support for capability restraint, and a continued commitment to independent, transparent thinking. The post blends reflection with exploratory analysis.
Key points:
Open Phil recap: Carlsmith highlights the “AI soon/fast/big/bad” research program (e.g., timelines, takeoff, growth, misalignment) and its ethos of rigorous, public-facing worldview investigation that others can scrutinize.
Why Anthropic: they think helping specify Claude’s “character/constitution/spec” could materially reduce risk—both directly (avoiding “King Midas” failures even when models obey the spec) and indirectly (interacting with obedience, evaluation, and misuse)—and it uniquely fits their skills and learning goals, though they remain unsure it’s optimal.
On the “spec vs obedience” debate: while much existential risk may stem from AIs disobeying specs, they argue spec content still matters for catastrophic outcomes and for shaping broader policies (e.g., transparency) and misuse by humans.
Working at labs: they tentatively judge Anthropic net positive in expectation, but treat their personal marginal impact as the decisive factor; they favor pursuing “safety progress” alongside capability restraint and acknowledge risks of talent concentration, comms constraints, and epistemic/financial distortion—laying out safeguards and conditions for leaving.
Risk stance: they assign a nontrivial (double-digit) chance that frontier AI could destroy or disempower humanity, see no actor (Anthropic included) with an adequate plan to safely build superintelligence, and believe current benefits don’t justify the risk absent race dynamics—hence their support for strong, coordinated capability restraint.
Commitments and disagreements: they expect to be more vocal than the Anthropic median about misalignment and regulation, aim to keep writing independently (with limited sign-off for Anthropic-specific work), and pledge to update publicly and exit if their impact, freedom to speak, or risk assessment warrants it.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.