jsteinhardt

Karma: 821

Building Technology to Drive AI Governance

jsteinhardt 18 Feb 2026 22:35 UTC

14 points

2 comments, 7 min read, EA link

jsteinhardt 18 Jun 2024 6:35 UTC
40 points
6 ∶ 1
in reply to: Ariel Simnegar 🔸’s comment on: [Linkpost] An update from Good Ventures
I generally agree with the spirit of empathy in this comment, but I also think you may be misinterpreting Dustin in a similar way to how others are. My understanding is that Dustin is not primarily driven by how other actors might use his funding / public comments against him. Instead, it is something like the following:
“Dustin doesn’t want to be continually funding stuff that he doesn’t endorse, because he thinks that doing things well and being responsible for the consequences of your actions is intrinsically important. He is a virtue ethicist and not a utilitarian in this regard. He feels that OP has funded things he doesn’t endorse enough times in enough areas to not want to extend blanket trust, and thus feels more responsibility than before to evaluate cases himself, to make sure that both individual grants and higher-level funding strategies are aligned with his values. He believes in doing fewer things well than more things poorly, which is why some areas are being cut.”
Obviously this could be wrong and I don’t want Dustin to feel any obligation to confirm/not confirm it. I’m writing it because I’m fairly confident that it’s at least more right than the prevailing narrative currently in the comments, and because the reasoning makes a fair amount of sense to me (and much more sense than the PR-based narrative that many are currently projecting).

jsteinhardt 2 Apr 2023 22:55 UTC
26 points
7 ∶ 0
in reply to: Neel Nanda’s comment on: Critiques of prominent AI safety labs: Redwood Research
To push back on this point, presumably even if grantmaker time is the binding resource and not money, Redwood also took up grantmaker time from OP (indeed I’d guess that OP’s grantmaker time on RR is much higher than for most other grants given the board member relationship). So I don’t think this really negates Omega’s argument—it is indeed relevant to ask how Redwood looks compared to grants that OP hasn’t made.
Personally, I am pretty glad Redwood exists and think their research so far is promising. But I am also pretty disappointed that OP hasn’t funded some academics that seem like slam dunks to me and think this reflects an anti-academia bias within OP (note they know I think this and disagree with me). Presumably this is more a discussion for the upcoming post on OP, though, and doesn’t say whether OP was overvaluing RR or undervaluing other grants (mostly the latter imo, though it seems plausible that OP should have been more critical about the marginal $1M to RR especially if overhiring was one of their issues).

jsteinhardt 2 Apr 2023 20:28 UTC
13 points
1 ∶ 0
in reply to: Neel Nanda’s comment on: Critiques of prominent AI safety labs: Redwood Research
I agree with this.

jsteinhardt 2 Apr 2023 16:34 UTC
18 points
1 ∶ 0
in reply to: Omega’s comment on: Critiques of prominent AI safety labs: Redwood Research
Thanks for this! I think we still disagree though. I’ll elaborate on my position below, but don’t feel obligated to update the post unless you want to.
* The adversarial training project had two ambitious goals, which were the unrestricted threat model and also a human-defined threat model (e.g. in contrast to synthetic L-infinity threat models that are usually considered).
* I think both of these were pretty interesting goals to aim for and at roughly the right point on the ambition-tractability scale (at least a priori). Most research projects are less ambitious and more tractable, but I think that’s mostly a mistake.
* Redwood was mostly interested in the first goal and the second was included somewhat arbitrarily iirc. I think this was a mistake and it would have been better to start with the simplest case possible to examine the unrestricted threat model. (It’s usually a mistake to try to do two ambitious things at once rather than nailing one, more so if one of the things is not even important to you.)
* After the original NeurIPS paper Redwood moved in this direction and tried a bunch of simpler settings with unrestricted threat models. I was an advisor on this work. After several months with less progress than we wanted, we stopped pursuing this direction. It would have been better to get to a point where we could make this call sooner (after 1-2 months). Some of the slowness was indeed due to unfamiliarity with the literature, e.g. being stuck on something for a few weeks that was isomorphic to a standard gradient hacking issue. My impression (not 100% certain) is Redwood updated quite a bit in the direction of caring about related literature as a result of this, and I’d guess they’d be a lot faster doing this a second time, although still with room to improve.
Note by academic standards the project was a “success” in the sense of getting into NeurIPS, although the reviewers seemed to most like the human-defined aspect of the threat model rather than the unrestricted aspect.

jsteinhardt 1 Apr 2023 5:31 UTC
149 points
29 ∶ 1
on: Critiques of prominent AI safety labs: Redwood Research
I’ll briefly comment on a few parts of this post since my name was mentioned (lack of comment on other parts does not imply any particular position on them). Also, thanks to the authors for their time writing this (and future posts)! I think criticism is valuable, and having written criticism myself in the past, I know how time-consuming it can be.
I’m worried that your method for evaluating research output would make any ambitious research program look bad, especially early on. Specifically:
“The failure of Redwood’s adversarial training project is unfortunately wholly unsurprising given almost a decade of similarly failed attempts at defenses to adversarial robustness from hundreds or even thousands of ML researchers.”
I think for any ambitious research project that fails, you could tell a similarly convincing story about how it’s “obvious in hindsight” it would fail. A major point of research is to find ideas that other people don’t think will work and then show that they do work! For many of my most successful research projects, people gave me advice not to work on them because they thought it would predictably fail, and if I had failed then they could have said something similar to what you wrote above.
I think Redwood’s failures here are ones of execution and not of problem selection—I thought the problem they picked was pretty interesting but they could have much more quickly realized the particular approaches they were taking to it were unlikely to pan out. If they had done that, perhaps they would have switched to other approaches that ended up succeeding, or just pivoted to interpretability faster. In any case, I definitely wouldn’t want to discourage them or future organizations from using a similar problem selection process.
(If you asked a random ML researcher if the problem seemed feasible, they would have said no. But I wouldn’t have used that as a reason not to work on the project.)
“CTO Buck Shlegeris has 3 years of software engineering experience and a limited ML research background.”
My personal judgment is that Buck is a stronger researcher than most people with ML PhDs. He is weaker at empirical ML than this baseline, but very strong conceptually in ways that translate well to machine learning. I do think Buck will do best in a setting where he’s either paired with a good empirical ML researcher or gains more experience there himself (he’s already gotten a lot better in the past year). But overall I view Buck as on par with a research scientist at a top ML university.

jsteinhardt 23 Jan 2023 5:10 UTC
23 points
6 ∶ 8
on: Doing EA Better
Thanks for this thoughtful and excellently written post. I agree with the large majority of what you had to say, especially regarding collective vs. individual epistemics (and more generally on the importance of good institutions vs. individual behavior), as well as concerns about insularity, conflicts of interest, and underrating expertise and overrating “value alignment”. I have similarly been concerned about these issues for a long time, but especially concerned over the past year.
I am personally fairly disappointed by the extent to which many commenters seem to be dismissing the claims or disagreeing with them in broad strokes, as they generally seem true and important to me. I would value the opportunity to convince anyone in a position of authority in EA that these critiques are both correct and critical to address. I don’t read this forum often (was linked to this thread by a friend), but feel free to e-mail me (jacob.steinhardt@gmail.com) if you’re in this position and want to chat.
Also, to the anonymous authors, if there is some way I can support you please feel free to reach out (also via e-mail). I promise to preserve your anonymity.

jsteinhardt 19 Dec 2022 1:58 UTC
3 points
0 ∶ 0
in reply to: Phosphorous’s comment on: Update on spending for CEA-run events
This is kind of tangential, but anyone who is FODMAP-sensitive would be unable to eat any of Soylent, Huel, or Mealsquares as far as I’m aware.

jsteinhardt 8 Oct 2022 16:06 UTC
6 points
0 ∶ 0
on: Deliberate practice for research?
Relevant blog post I wrote: https://bounded-regret.ghost.io/film-study/

jsteinhardt 20 Apr 2022 15:34 UTC
25 points
0 ∶ 0
on: Longtermist EA needs more Phase 2 work
Thanks for writing this! One thing that might help would be more examples of Phase 2 work. For instance, I think that most of my work is Phase 2 by your definition (see here for a recent round-up). But I am not entirely sure, especially given the claim that very little Phase 2 work is happening. Other stuff in the “I think this counts but not sure” category would be work done by Redwood Research, Chris Olah at Anthropic, or Rohin Shah at DeepMind (apologies to any other people who I’ve unintentionally left out).
Another advantage of examples is it could help highlight what you want to see more of.

jsteinhardt 15 Jan 2022 16:10 UTC
7 points
0 ∶ 0
on: Where is a good place to start learning about Forecasting?
I’m teaching a class on forecasting this semester! The notes will all be online: http://www.stat157.com/

jsteinhardt 31 Dec 2021 15:16 UTC
5 points
0 ∶ 0
in reply to: Davidmanheim’s comment on: Democratising Risk—or how EA deals with critics
It seems clear that none of the content in the paper comes anywhere close to your examples. These are also more like “instructions” than “arguments”, and Rubi was calling for suppressing arguments on the danger that they would be believed.

jsteinhardt 29 Dec 2021 22:27 UTC
19 points
0 ∶ 0
in reply to: Peter Slattery 🔸’s comment on: Democratising Risk—or how EA deals with critics
“At the same time, what occurred mostly sounded reasonable to me, even if it was unpleasant. Strong opinions were expressed, concerns were made salient, people may have been defensive or acted with some self-interest, but no one was forced to do anything. Now the paper and your comments are out, and we can read and react to them. I have heard much worse in other academic and professional settings.”
I don’t think “the work got published, so the censorship couldn’t have been that bad” really makes sense as a reaction to claims of censorship. You won’t see work that doesn’t get published, so this is basically a catch-22 (either it gets published, in which case there isn’t censorship, or it doesn’t get published, in which case no one ever hears about it).
Also, most censorship is soft rather than hard, and comes via chilling effects.
(I’m not intending this response to make any further object-level claims about the current situation, just that the quoted argument is not a good argument.)

jsteinhardt 29 Dec 2021 22:17 UTC
10 points
0 ∶ 0
in reply to: Will Bradshaw’s comment on: Democratising Risk—or how EA deals with critics
I also agree with you. I would find it very problematic if anyone was trying to “ensure harmful and wrong ideas are not widely circulated”. Ideas should be argued against, not suppressed.

jsteinhardt 21 Dec 2021 22:54 UTC
14 points
0 ∶ 0
on: Bayesian Mindset
Re: Bayesian thinking helping one to communicate more clearly. I agree that this is a benefit, but I don’t think it’s the fastest route or the one with the highest marginal value. For instance, when you write:
A lot of expressed beliefs are “fake beliefs”: things people say to express solidarity with some group (“America is the greatest country in the world”), to emphasize some value (“We must do this fairly”), to let the listener hear what they want to hear (“Make America great again”), or simply to sound reasonable (“we will balance costs and benefits”) or wise (“I don’t see this issue as black or white”).
I’m immediately reminded of Orwell’s essay Politics and the English Language. I would generally expect people to learn more about clear, truth-seeking communication from reading Orwell (and other good books on writing) than by being Bayesian. Indeed, I find many Bayesian rationalists to be highly obscurantist in practice, perhaps more so than the average similarly-educated person, and I feel that rationalist community norms tend to reward rather than punish this, because many people are drawn to deep but difficult-to-understand truths.
I would say that the value of the rationalist project so far has been in generating important hypotheses, rather than in clear communication around those hypotheses.

jsteinhardt 7 Apr 2021 16:43 UTC
35 points
0 ∶ 0
in reply to: Habryka [Deactivated]’s comment on: EA Debate Championship & Lecture Series
I just don’t think this is very relevant to whether outreach to debaters is good. A better metric would be to look at life outcomes of top debaters in high school. I don’t have hard statistics on this but the two very successful debaters I know personally are both now researchers at the top of their respective fields, and certainly well above average in truth-seeking.

I also think the above arguments are common tropes in the “maths vs fuzzies” culture war, and given EA’s current dispositions I suspect we’re systematically more likely to hear and be receptive to anti-debate than to pro-debate talking points. (I say this as someone who loved to hate on debate in high school, especially as it was one of the main competitors with math team for recruiting smart students. But with hindsight from seeing my classmates’ life outcomes I think most of the arguments I made were overrated.)

jsteinhardt 3 Apr 2021 1:44 UTC
1 point
0 ∶ 0
in reply to: Linch’s comment on: Please stand with the Asian diaspora
Thanks, and sorry for not responding to this earlier (was on vacation at the time). I really appreciated this and agree with willbradshaw’s comment below :).

jsteinhardt 21 Mar 2021 20:08 UTC
14 points
0 ∶ 0
in reply to: Wei Dai’s comment on: Please stand with the Asian diaspora
I think we just disagree about what a downvote means, but I’m not really that excited to argue about something that meta :).

As another data point, I appreciated Dicentra’s comment elsewhere in the thread. I haven’t decided whether I agree with it, but I thought it demonstrated empathy for all sides of a difficult issue even while disagreeing with the OP, and articulated an important perspective.

jsteinhardt 21 Mar 2021 16:04 UTC
5 points
0 ∶ 0
in reply to: Wei Dai’s comment on: Please stand with the Asian diaspora
I think your characterization of my thought process is completely false for what it’s worth. I went out of my way multiple times to say that I was not expressing disapproval of Dale’s comment.

Edit: Maybe it’s helpful for me to clarify that I think it’s both good for Dale to write his comment, and for Khorton to write hers.

jsteinhardt 21 Mar 2021 7:20 UTC
32 points
0 ∶ 0
in reply to: JKM’s comment on: Please stand with the Asian diaspora
I didn’t downvote Dale, nor do I wish to express social disapproval of his post (I worry that the length of this thread might lead Dale to feel otherwise, so I want to be explicit that I don’t feel that way).
To your question, if I were writing a post similar to Dale’s, what I would do differently is be more careful to make sure I was responding to the actual content of the post. The OP asked people to support Asian community members who were upset, while at least the last paragraph of Dale’s post seemed to assume that OP was arguing that we should be searching for ways to reduce violence against Asians. Whenever I engage on an emotionally charged topic I re-read the original post and my draft response to make sure that I actually understood the original post’s argument, and I think this is good practice.
Another mistake I think Dale’s post makes is assuming that whether the Atlanta attacks are racially motivated is a crux for most people’s emotional response. I think Dale’s claim may well be correct (I could see both arguments), but the larger context is a significant increase in violent incidents against Asians, at least some of which seem obviously racially motivated (the increase is also larger than for other races). These have taken a constant emotional toll on Asians for a while now, and the particular Atlanta shootings are simply the first instance where it actually penetrated the broader public consciousness.
I can’t think of an easy-to-implement rule that would avoid this mistake. The best would be “try harder to think from the perspective of the listener”, but this is of course very difficult especially when there is a large gap in experience between the speaker and the listener. If I were trying super-hard I would run the post by an Asian friend to see if they felt like it engaged with the key arguments, but I think it would be unreasonable to expect, or expend, that level of effort for forum comments.
Again, I think people make communication mistakes like this all the time and do not find them particularly blameworthy and would normally not bother to comment on them. I am only pointing them out in detail because you asked me to.
