Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.blogspot.com
richard_ngo
Lessons from my time in Effective Altruism
AGI safety career advice
Three intuitions about EA: responsibility, scale, self-improvement
Some thoughts on vegetarianism and veganism
Brainstorming ways to make EA safer and more inclusive
Scope-sensitive ethics: capturing the core intuition motivating utilitarianism
EDIT: I’ve now written up my own account of how we should do epistemic deference in general, which fleshes out more clearly a bunch of the intuitions I outline in this comment thread.
I think that a bunch of people are overindexing on Yudkowsky’s views; I’ve nevertheless downvoted this post because it seems like it’s making claims that are significantly too strong, based on a methodology that I strongly disendorse. I’d much prefer a version of this post which, rather than essentially saying “pay less attention to Yudkowsky”, is more nuanced about how to update based on his previous contributions; I’ve tried to do that in this comment, for example. (More generally, rather than reading this post, I recommend people read this one by Paul Christiano, which outlines specific agreements and disagreements. Note that the list of agreements there, which I expect that many other alignment researchers also buy into, serves as a significant testament to Yudkowsky’s track record.)
The part of this post which seems most wild to me is the leap from “mixed track record” to
In particular, I think, they shouldn’t defer to him more than they would defer to anyone else who seems smart and has spent a reasonable amount of time thinking about AI risk.
For any reasonable interpretation of this sentence, it’s transparently false. Yudkowsky has proven to be one of the best few thinkers in the world on a very difficult topic. Insofar as there are others who you couldn’t write a similar “mixed track record” post about, it’s almost entirely because they don’t have a track record of making any big claims, in large part because they weren’t able to generate the relevant early insights themselves. Breaking ground in novel domains is very, very different from forecasting the weather or events next year; a mixed track record is the price of entry.
Based on his track record, I would endorse people deferring more towards the general direction of Yudkowsky’s views than towards the views of almost anyone else. I also think that there’s a good case to be made that Yudkowsky tends to be overconfident, and this should be taken into account when deferring; but when it comes to making big-picture forecasts, the main value of deference is in helping us decide which ideas and arguments to take seriously, rather than the specific credences we should place on them, since the space of ideas is so large. The EA community has ended up strongly moving in Yudkowsky’s direction over the last decade, and that seems like much more compelling evidence than anything listed in this post.
Protests are by nature adversarial and high-variance actions prone to creating backlash, so I think that if you’re going to be organizing them, you need to be careful to actually convey the right message (and in particular, way more careful than you need to be in non-adversarial environments—e.g. if news media pick up on this, they’re likely going to twist your words). I don’t think this post is very careful on that axis. In particular, two things I think are important to change:
“Meta’s frontier AI models are fundamentally unsafe.”
I disagree; the current models are not dangerous at anywhere near the level that most AI safety people are concerned about. Since “current models are not dangerous yet” is one of the main objections people have to prioritizing AI safety, it seems really important to be clearer about what you mean by “safe” so that it doesn’t sound like the protest is about language models saying bad things, etc.
Suggestion: be very clear that you’re protesting Meta’s policy of releasing model weights because of future capabilities that models could have, rather than their previous decisions to release model weights.
“Stop free-riding on the goodwill of the open-source community. Llama models are not and have never been open source, says the Open Source Initiative.”
This basically just seems like a grab-bag accusation… you’re accusing them of not being open-source enough? That’s the exact opposite of the other objections; I think it’s both quite disingenuous and also a plausible way things might backfire (e.g. if this is the one phrase that the headlines run with).
AGI Safety Fundamentals curriculum and application
[Question] How much EA analysis of AI safety as a cause area exists?
Alignment 201 curriculum
The career and the community
[Question] What are your main reservations about identifying as an effective altruist?
Technical AGI safety research outside AI
(COI note: I work at OpenAI. These are my personal views, though.)
My quick take on the “AI pause debate”, framed in terms of two scenarios for how the AI safety community might evolve over the coming years:
1. AI safety becomes the single community that’s the most knowledgeable about cutting-edge ML systems. The smartest up-and-coming ML researchers find themselves constantly coming to AI safety spaces, because that’s the place to go if you want to nerd out about the models. It feels like the early days of hacker culture. There’s a constant flow of ideas and brainstorming in those spaces; the core alignment ideas are standard background knowledge for everyone there. There are hackathons where people build fun demos, and people figure out ways of using AI to augment their research. Constant interaction with the models allows people to gain really good hands-on intuitions about how they work, which they leverage into doing great research that helps us actually understand them better. When the public ends up demanding regulation, there’s a large pool of competent people who are broadly reasonable about the risks, and can slot into the relevant institutions and make them work well.
2. AI safety becomes much more similar to the environmentalist movement. It has broader reach, but alienates a lot of the most competent people in the relevant fields. ML researchers who find themselves in AI safety spaces are told they’re “worse than Hitler” (which happened to a friend of mine). People get deontological about AI progress; some hesitate to pay for ChatGPT because it feels like they’re contributing to the problem (another true story); others overemphasize the risks of existing models in order to whip up popular support. People are sucked into psychological doom spirals similar to how many environmentalists think about climate change: if you’re not depressed then you obviously don’t take it seriously enough. Just like environmentalists often block some of the most valuable work on fixing climate change (e.g. nuclear energy, geoengineering, land use reform), safety advocates block some of the most valuable work on alignment (e.g. scalable oversight, interpretability, adversarial training) due to acceleration or misuse concerns. Of course, nobody will say they want to dramatically slow down alignment research, but there will be such high barriers to researchers getting and studying the relevant models that it has similar effects. The regulations that end up being implemented are messy and full of holes, because the movement is more focused on making a big statement than figuring out the details.
Obviously I’ve exaggerated and caricatured these scenarios, but I think there’s an important point here. One really good thing about the AI safety movement until recently has been that its focus on the problem of technical alignment nudged it away from the second scenario (although it wasn’t particularly close to the first scenario either, because the “nerding out” was typically more about decision theory or agent foundations than ML itself). That’s changed a bit lately, in part because a bunch of people seem to think that making technical progress on alignment is hopeless. I think this is just not an epistemically reasonable position to take: history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems. Either way, I do think public advocacy for strong governance measures can be valuable, but I also think that “pause AI” advocacy runs the risk of pushing us towards scenario 2. Even if you think that’s a cost worth paying, I’d urge you to think about ways to get the benefits of the advocacy while reducing that cost and keeping the door open for scenario 1.
I think “it’s easy to overreact on a personal level” is an important lesson from covid, but much more important is “it’s easy to underreact on a policy level”. I.e. given the level of foresight that EAs had about covid, I think we had a disappointingly small influence on mitigating it, in part because people focused too much on making sure they didn’t get it themselves.
In this case, I’ve seen a bunch of people posting about how they’re likely to leave major cities soon, and basically zero discussion of whether there are things people can do to make nuclear war overall less likely and/or systematically help a lot of other people. I don’t think it’s bad to be trying to ensure your personal survival as a key priority, and I don’t want to discourage people from seriously analysing the risks from that perspective, but I do want to note that the overall effect is a bit odd, and may indicate some kind of community-level blind spot.
AGI safety from first principles
[Question] What are the key ongoing debates in EA?
I think I agree with the core claims Buck is making. But I found the logical structure of this post hard to follow. So here’s my attempt to re-present the core thread of the argument I think Buck is making:
In his original post, Will conditions on long futures being plausible, since these are the worlds that longtermists care about most. Let’s assume from now on that this is the case. Will claims, based on his uniform prior over hinginess, that we should require extraordinary evidence to believe in our century’s hinginess, conditional on long futures being plausible. But there are at least two reasons to think that we shouldn’t use a uniform prior. Firstly, it’s more reasonable to instead have a prior that early times in human history (such as our time) are more likely to be hingey—for example because we should expect humanity to expand over time, and also because of considerations about technological advances.
Secondly: if we condition on long futures being plausible, then xrisk must be near-zero in almost every century (otherwise there’s almost no chance we’d survive for that long). So observing any nonnegligible amount of (preventable) xrisk in our present time becomes very strong evidence of this century being an extreme outlier in terms of xrisk, which implies that it’s also an extreme outlier in terms of hinginess. So using the uniform prior on hinginess means we have to choose between two very implausible options—either current xrisk is in fact incredibly low (far lower than seems plausible, and far lower than Will himself claims to believe it is), or else we’re in a situation that the uniform prior judges as extremely improbable and “fishy”. Instead of biting either of these bullets, it seems preferable to use a prior which isn’t so dogmatic—e.g. a prior which isn’t so surprised by early times in human history being outliers.
Toby gives an example of an alternative (and in my mind better) prior as a reply to Will’s original post.
Note that anyone who conditions on a long future being possible should afterwards doubt the evidence for current xrisk to some degree. But Will is forced to do so to an extreme extent because his uniform prior on hinginess is such a bold one—whereas people with exponentially diminishing priors on hinginess like Toby’s won’t update as much after conditioning. All of this analysis remains roughly the same if you replace Will’s uniformity-of-hinginess prior with a uniformity-of-(preventable)-xrisk prior, and Toby’s exponentially-decreasing-hinginess prior with an exponentially-decreasing-(preventable)-xrisk prior. I add “preventable” here and above because if our current xrisk isn’t preventable, then it’s still possible that we’re in a low-hinginess period.
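To make that asymmetry concrete, here is a minimal numerical sketch (my own, not from Will’s or Toby’s posts; the number of centuries, the decay rate, and the likelihood ratio are made-up illustrative values, and the conditioning on long futures is folded into the assumed likelihood ratio rather than modelled explicitly). Under a uniform prior over which century is hingey, even fairly strong evidence about our own century barely moves the posterior, whereas under an exponentially decaying prior it moves it a lot:

```python
import numpy as np

# Minimal sketch (illustrative numbers only): N centuries, exactly one of which
# is "hingey"; we ask how strongly evidence about our own century should update us.
N = 1_000_000            # hypothetical number of centuries humanity might last
t = 0                    # our century's index (i.e. we're very early)
likelihood_ratio = 1e3   # assumed strength of the evidence that our century is hingey

# Uniform prior over which century is hingey (Will-style)
uniform_prior = np.full(N, 1.0 / N)

# Exponentially decaying prior that favours early centuries (Toby-style)
decay = 0.99
exp_prior = decay ** np.arange(N)
exp_prior /= exp_prior.sum()

def posterior_our_century_hingey(prior):
    """Bayesian update on the hypothesis 'our century is the hingey one'."""
    numerator = prior[t] * likelihood_ratio
    return numerator / (numerator + (1 - prior[t]))

print(posterior_our_century_hingey(uniform_prior))  # ~0.001: needs "extraordinary" evidence
print(posterior_our_century_hingey(exp_prior))      # ~0.9: modest evidence suffices
```

The qualitative point is just that the uniform prior puts so little mass on our century that any realistic evidence gets swamped, which is why it forces a choice between deflating current xrisk estimates and accepting a “fishy” conclusion.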
Lastly, it seems to me that conditioning on long futures being plausible was what caused a lot of the messiness here, and so for pedagogical purposes it’d probably be better to spell out all the different permutations of options more explicitly, and be clearer about when the conditioning is happening.
[Edit: this was in response to the original version of the parent comment, not the new edited version]
Strong −1, the last line in particular seems deeply inappropriate given the live possibility that these events were caused by large-scale fraud on the part of FTX, and I’m disappointed that so many people endorsed it. (Maybe because the reasons to suspect fraud weren’t flagged in the original post?) At a point where the integrity of leading figures in the movement has been called into question, it is particularly important that we hold ourselves to high standards rather than reflexively falling back on tribalist instincts.