evhub

Karma: 1,775

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

evhub 23 May 2025 1:17 UTC
18 points
1 ∶ 2
in reply to: MikhailSamin’s comment on: Samin’s Shortform
This is false. Our ASL-4 thresholds are clearly specified in the current RSP (see “CBRN-4” and “AI R&D-4”). We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.
What links here?
- Mo Putera's comment on Anthropic is Quietly Backpedalling on its Safety Commitments by garrison (LessWrong; 23 May 2025 8:43 UTC; 8 points)

evhub 19 Apr 2025 5:44 UTC
32 points
13 ∶ 0
in reply to: Jeroen Willems🔸’s comment on: Jeroen_W’s Shortform
The situation doesn’t seem very similar to Anthropic. Regardless of whether you think Anthropic is good or bad (I think Anthropic is very good, but I work at Anthropic, so take that as you will), Anthropic was founded with the explicitly altruistic intention of making AI go well. Mechanize, by contrast, seems to mostly not be making any claims about altruistic motivations at all.

evhub 1 Mar 2024 3:14 UTC
27 points
5 ∶ 1
on: Counting arguments provide no evidence for AI doom
I won’t repeat my full LessWrong comment here in detail; instead I’d just recommend heading over there and reading it and the associated comment chain. The bottom-line summary is that, in trying to cover some heavy information theory regarding how to reason about simplicity priors and counting arguments without actually engaging with the proper underlying formalism, this post commits a subtle but basic mathematical mistake that makes the whole argument fall apart.

Introducing Alignment Stress-Testing at Anthropic

evhub 12 Jan 2024 23:51 UTC

80 points

0 comments · 2 min read · EA link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub 12 Jan 2024 19:51 UTC

65 points

0 comments · 3 min read · EA link

(arxiv.org)

evhub 23 Nov 2023 1:33 UTC
2 points
0 ∶ 0
in reply to: David Krueger’s comment on: RSPs are pauses done right
I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:
- At least currently, scaffolding-related performance improvements don’t seem to generally be that large (e.g. chain-of-thought is just not that helpful on most tasks), especially relative to the gains from scaling.
- You can evaluate pretty directly for the sorts of capabilities that would help make scaffolding way better, like the model being able to correct its own errors, so you don’t have to just evaluate the whole system + scaffolding end-to-end.
- This is mostly just a problem for large-scale model deployments. If you instead keep your largest model mostly in-house for alignment research, or only give it to a small number of external partners whose scaffolding you can directly evaluate, it makes this problem way less bad.
That last point is probably the most important here, since it demonstrates that you easily can (and should) absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models’ ability to do self-correction, and once your models pass that threshold you restrict deployment except in contexts where you can directly evaluate the relevant scaffolding that will be used in advance.

evhub 26 Oct 2023 1:03 UTC
9 points
3 ∶ 0
on: Responsible Scaling Policies Are Risk Management Done Wrong
Cross-posted with LessWrong.

I found this post very frustrating, because it’s almost all dedicated to whether current RSPs are sufficient or not (I agree that they are insufficient), but that’s not my crux and I don’t think it’s anyone else’s crux either. And for what I think is probably the actual crux here, you only have one small throwaway paragraph:

Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: An extremely ambitious version yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risks, which gives us very little time to update regulations, which is a core assumption on which RSPs are relying without providing evidence of being realistic.

As I’ve now discussed extensively, I think enacting RSPs in policy now makes it easier, not harder, to get even better future regulations enacted. It seems that your main reason for disagreement is that you believe in extremely short timelines / fast takeoff, such that we will never get future opportunities to revise AI regulation. That seems pretty unlikely to me: my expectation especially is that as AI continues to heat up in terms of its economic impact, new policy windows will keep arising in rapid succession, and that we will see many of them before the end of days.
What links here?
- evhub's comment on Responsible Scaling Policies Are Risk Management Done Wrong by simeon_c (LessWrong; 26 Oct 2023 1:02 UTC; 32 points)

evhub 14 Oct 2023 17:41 UTC
1 point
0 ∶ 1
in reply to: Greg_Colbourn ⏸️ ’s comment on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope

The RSP angle is part of the corporate “big AI” “business as usual” agenda. To those of us playing the outside game it seems very close to safetywashing.

I’ve written up more about why I think this is not true here.

RSPs are pauses done right

evhub 14 Oct 2023 4:06 UTC

93 points

6 comments · 7 min read · EA link

evhub 13 Oct 2023 23:37 UTC
4 points
1 ∶ 1
in reply to: Greg_Colbourn ⏸️ ’s comment on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope

Have the resumption condition be a global consensus on an x-safety solution or a global democratic mandate for restarting (and remember there are more components of x-safety than just alignment—also misuse and multi-agent coordination).

This seems basically unachievable and even if it was achievable it doesn’t even seem like the right thing to do—I don’t actually trust the global median voter to judge whether additional scaling is safe or not. I’d much rather have rigorous technical standards than nebulous democratic standards.

I think it’s pushing it a bit at this stage to say that they, as companies, are primarily concerned with reducing x-risk.

That’s why we should be pushing them to have good RSPs! I just think you should be pushing on the RSP angle rather than the pause angle.
What links here?
- Greg_Colbourn ⏸️ 's comment on Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope by Greg_Colbourn ⏸️ (14 Oct 2023 9:40 UTC; 2 points)

evhub 13 Oct 2023 23:00 UTC
17 points
2 ∶ 0
in reply to: Greg_Colbourn ⏸️ ’s comment on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope

This is soon enough to be pushing as hard as we can for a pause right now!

I mean, yes, obviously we should be doing everything we can right now. I just think that an RSP-gated pause is the right way to do a pause. I’m not even sure what it would mean to do a pause without an RSP-like resumption condition.

Why try and take it right down to the wire with RSPs?

Because it’s more likely to succeed. RSPs provide very clear and legible risk-based criteria that are much more plausibly things that you could actually get a government to agree to.

The tradeoff for a few tens of $Bs of extra profit really doesn’t seem worth it!

This seems extremely disingenuous and bad faith. That’s obviously not the tradeoff and it confuses me why you would even claim that. Surely you know that I am not Sam Altman or Dario Amodei or whatever.

The actual tradeoff is the probability of success. If I thought e.g. just advocating for a six month pause right now was more effective at reducing existential risk, I would do it.

evhub 13 Oct 2023 22:01 UTC
7 points
1 ∶ 1
in reply to: Greg_Colbourn ⏸️ ’s comment on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope
Perhaps the crux is related to how dangerous you think current models are? I’m quite confident that we have at least a couple additional orders of magnitude of scaling before the world ends, so I’m not too worried about stopping training of current models, or even next-generation models. But I do start to get worried with next-next-generation models.

So, in my view, the key is to make sure that we have a well-enforced Responsible Scaling Policy (RSP) regime that is capable of preventing scaling unless hard safety metrics are met (I favor understanding-based evals for this) before the next two scaling generations. That means we need to get good RSPs into law with solid enforcement behind them and—at least in very short timeline worlds—that needs to happen in the next few years. By far the best way to make that happen, in my opinion, is to pressure labs to put out good RSPs now that governments can build on.

evhub 13 Oct 2023 18:15 UTC
10 points
3 ∶ 0
in reply to: Greg_Colbourn ⏸️ ’s comment on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope
I guess I’m not really sure what your objection is to Responsible Scaling Policies? I see that there’s a bunch of links, but I don’t really see a consistent position being staked out by the various sources you’ve linked to. Do you want to describe what your objection is?

I guess the closest there is “the danger is already apparent enough” which, while true, doesn’t really seem like an objection. I agree that the danger is apparent, but I don’t think that advocating for a pause is a very good way to address that danger.

evhub 13 Oct 2023 5:17 UTC
25 points
3 ∶ 2
on: Timelines are short, p(doom) is high: a global stop to frontier AI development until x-safety consensus is our only reasonable hope
I tend to put P(doom) around 80%, so I think I’m on the pessimistic side, and I tend to think short timelines are at least a real and serious possibility that we should be planning for. Nevertheless, I disagree with a global stop or a pause being the “only reasonable hope”—global stops and pauses seem basically unworkable to me. I’m much more excited about governmentally enforced Responsible Scaling Policies, which seem like the “better option” that you’re missing here.

evhub 6 Oct 2023 5:24 UTC
8 points
1 ∶ 1
in reply to: Nora Belrose’s comment on: AI Pause Will Likely Backfire

In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.

I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I’d reference here would be “How likely is deceptive alignment?” for the practical question regarding concrete neural net inductive biases and “The Solomonoff Prior is Malign” for the purely theoretical question concerning the actual simplicity prior.

The Hubinger lectures on AGI safety: an introductory lecture series

evhub 22 Jun 2023 0:59 UTC

44 points

0 comments · 1 min read · EA link

(www.youtube.com)
