I appreciate the concern that you (and clearly many other Forum users) have, and I do empathise. Still, I'd like to present a somewhat different perspective to others here.
EA seems far too friendly toward AGI labs and feels completely uncalibrated to the actual existential risk (from an EA perspective)
I think that this implicitly assumes that there is such a thing as "an EA perspective", but I don't think this is a useful abstraction. EA has many different strands, and in general seems a lot more fractured post-FTX.
e.g. you ask "Why aren't we publicly shaming AI researchers every day?", but if you're an AI-sceptical EA working in GH&D, that seems entirely useless to your goals! If you take "we" to mean all EAs already convinced of AI doom, then that's assuming the conclusion; whether there is an action-significant amount of doom is the question here.
Why are we friendly with Anthropic? Anthropic actively accelerates the frontier, currently holds the best coding model, and explicitly aims to build AGI, yet somehow EAs rally behind them? I'm sure almost everyone agrees that Anthropic could contribute to existential risk, so why do they get a pass? Do we think their AGI is less likely to kill everyone than that of other companies?
Anthropic's alignment strategy, at least the publicly facing version, can be found here.[1] I think Chris Olah's tweets about it, found here, include one particularly useful chart:
The probable cruxes here are that "Anthropic", or various employees there, are much more optimistic about the difficulty of AI safety than you are. They also likely believe that empirical feedback from actual frontier models is crucial to a successful science of AI Safety. I think if you hold these two beliefs, then working at Anthropic makes a lot more sense from an AI Safety perspective.
For the record, the more technical work I've done, and the more understanding I have of AI systems as they exist today, the more "alignment optimistic" I've become, and the more skeptical I get of OG-MIRI-style alignment work, or AI Safety work done in the absence of actual models. We must have contact with reality to make progress,[2] and I think the AI Safety field cannot update on this point strongly enough. Beren Millidge has really influenced my thinking here, and I'd recommend reading Alignment Needs Empirical Evidence and other blog posts of his to get this perspective (which I suspect many people at Anthropic share).
Finally, pushing the frontier of model performance isn't a priori bad, especially if you don't accept MIRI-style arguments. Like, I don't see Sonnet 3.7 as increasing the risk of extinction from AI. In fact, it seems to be a highly capable model that's also very well aligned according to Anthropic's HHH criteria. All of my experience using Claude and engaging with the research literature about the model has pushed my distribution over AI Safety difficulty towards the "Steam Engine" level in the chart above, rather than the P vs NP/Impossible level.
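To make the "shifting my distribution" framing concrete, here's a purely illustrative sketch. Only "Steam Engine" and "P vs NP / Impossible" are levels named above; the middle label and every probability are my own invented placeholders, not Anthropic's or Chris Olah's actual numbers. The point is just what it means to move probability mass between the chart's buckets.

```python
# Toy "distribution over alignment difficulty": all numbers are made up for illustration.
difficulty_levels = ["Steam Engine", "Apollo Program", "P vs NP / Impossible"]  # middle label is my own placeholder

prior = {"Steam Engine": 0.2, "Apollo Program": 0.5, "P vs NP / Impossible": 0.3}
posterior = {"Steam Engine": 0.5, "Apollo Program": 0.4, "P vs NP / Impossible": 0.1}

# Both assignments are proper distributions (they sum to 1); the "update" is just
# mass moving from the hardest bucket towards the easier one.
for level in difficulty_levels:
    print(f"{level:25s} {prior[level]:.2f} -> {posterior[level]:.2f}")
```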
Spending time in the EA community does not calibrate me to the urgency of AI doomerism or the necessary actions that should follow
Finally, on the "necessary actions" point: even if we had a clear empirical understanding of what the current p(doom) is, there are no clear necessary actions. There are still lots of arguments to be had here! See, for example, Matthew Barnett arguing in these comments that one can make utilitarian arguments for AI acceleration even in the presence of AI risk,[3] or Nora Belrose arguing that pause-style policies will likely be net-negative. You don't have to agree with either of these, but they do mean that there aren't clear "necessary actions", at least from my PoV.
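To illustrate why a known p(doom) doesn't settle the action question by itself, here's a deliberately over-simplified expected-value sketch. It loosely mirrors the shape of the acceleration argument referenced above, but every number and modelling choice is my own invention for illustration, not Barnett's or Belrose's actual analysis.

```python
# Toy expected-value comparison, not anyone's actual model: all inputs are invented.
# "accelerate" brings benefits sooner but with higher assumed risk; "pause" lowers the
# assumed risk but delays/discounts the benefits.

def expected_value(p_doom: float, value_if_ok: float, value_if_doom: float = 0.0) -> float:
    """Expected utility of a strategy given a probability of catastrophe."""
    return p_doom * value_if_doom + (1 - p_doom) * value_if_ok

accelerate = expected_value(p_doom=0.10, value_if_ok=100)  # benefits arrive sooner (hypothetical)
pause      = expected_value(p_doom=0.05, value_if_ok=80)   # lower risk, delayed benefits (hypothetical)

print(f"accelerate: {accelerate:.1f}, pause: {pause:.1f}")  # 90.0 vs 76.0 under these made-up numbers
```

Flip the assumed payoffs or the amount of risk reduction and the ranking flips too, which is exactly why I say the same p(doom) doesn't imply one set of "necessary actions".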
Of course, if one has completely lost trust in Anthropic as an actor, then this isn't useful information to you at all. But I think that's conceptually a separate problem, because I think I have given information that answers the questions you raise, even if perhaps not to your satisfaction.
Theory will only take you so far
Though this isn't what motivates Anthropic's thinking afaik
To the extent that word captures the classic "single superintelligent model" form of risk