evhub

Karma: 1,730

Evan Hubinger (he/​him/​his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

We must be very clear: fraud in the service of effective altruism is unacceptable

evhub · 10 Nov 2022 23:31 UTC
709 points
85 comments · 3 min read · EA link

You can talk to EA Funds before applying

evhub · 28 Sep 2021 20:39 UTC
104 points
7 comments · 1 min read · EA link

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
97 points
7 comments · 1 min read · EA link

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC
80 points
0 comments · 1 min read · EA link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub · 12 Jan 2024 19:51 UTC
65 points
0 comments · 1 min read · EA link
(arxiv.org)

The Hubinger lectures on AGI safety: an introductory lecture series

evhub · 22 Jun 2023 0:59 UTC
44 points
0 comments · 1 min read · EA link

Discovering Language Model Behaviors with Model-Written Evaluations

evhub · 20 Dec 2022 20:09 UTC
25 points
0 comments · 1 min read · EA link

FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

evhub · 1 Jul 2020 20:59 UTC
13 points
2 comments · 1 min read · EA link
(futureoflife.org)