One implication I strongly disagree with is that people should be getting jobs in AI labs. I don’t see you connecting that to actual safety impact, and I sincerely doubt working as a researcher gives you any influence on safety at this point (if it ever did). There are definite costs to working at a lab: capture and NDA-walling. Already so many EAs work at Anthropic that it is shielded from scrutiny within EA, and the attachment to “our player” Anthropic has made it hard for many EAs to do the obvious thing and support PauseAI. Put simply: I see no meaningful path to impact on safety working as an AI lab researcher, and I see serious risks to individual and community effectiveness and mission focus.
I see no meaningful path to impact on safety working as an AI lab researcher
This is a very strong statement. I’m not following technical alignment research that closely, but my general sense is that exciting work is being done. I just wrote this comment advertising a line of research which strikes me as particularly promising.
I noticed the other day that the people who are particularly grim about AI alignment also don’t seem to be engaging much with contemporary technical alignment research. That missing intersection seems suspicious. I’m interested in any counterexamples that come to mind.
My subjective sense is there’s a good chance we lose because all the necessary insights to build aligned AI were lying around; they just didn’t get sufficiently developed or implemented. This seems especially true for techniques like gradient routing, which would need to be baked into a big, expensive training run.
(I’m also interested in arguments for why unlearning won’t work. I’ve thought about this a fair amount, and it seems to me that sufficiently good unlearning kind of just oneshots AI safety, as elaborated in the comment I linked.)
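To make the gradient-routing idea concrete, here is a minimal toy sketch in PyTorch. This is my own illustration of the general technique, not the implementation from the gradient routing paper, and every name in it is hypothetical: the idea is just that gradients from a designated “forget” data subset are confined to a separate parameter slice, which can be ablated after training. That ablation step is also why this kind of unlearning has to be baked in from the start of a training run.

```python
import torch
import torch.nn as nn

class RoutedLinear(nn.Module):
    """Toy gradient routing: two parallel weight sets; each batch's
    gradients flow into only one of them, chosen by the data subset."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.main = nn.Linear(d_in, d_out)    # learns only from "retain" data
        self.routed = nn.Linear(d_in, d_out)  # learns only from "forget" data

    def forward(self, x, forget_batch: bool):
        if forget_batch:
            # detach() blocks gradients into `main`; only `routed` updates.
            return self.main(x).detach() + self.routed(x)
        # Block gradients into `routed`; only `main` updates.
        return self.main(x) + self.routed(x).detach()

model = RoutedLinear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(100):
    x = torch.randn(8, 16)
    forget = (step % 2 == 0)               # pretend half the batches are "forget" data
    loss = model(x, forget).pow(2).mean()  # stand-in objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Unlearning" then reduces to ablating the routed parameters:
with torch.no_grad():
    model.routed.weight.zero_()
    model.routed.bias.zero_()
```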
Here’s our crux:
My subjective sense is there’s a good chance we lose because all the necessary insights to build aligned AI were lying around; they just didn’t get sufficiently developed or implemented.
For both theoretical and empirical reasons, I would assign a probability as low as 5% to there being alignment insights just lying around that could protect us at the superintelligence capabilities level and that don’t require us to slow or stop development to implement in time.
I don’t see a lot of technical safety people engaging in advocacy, either? It’s not like they tried advocacy first and then decided on technical safety. Maybe you should question their epistemology.
My impression is that so far most of the impactful “public advocacy” work has been done by “technical safety” people. Some notable examples include Yoshua Bengio, Dan Hendrycks, Ian Hogarth, and Geoffrey Hinton.
Yeah good point. I thought Ebenezer was referring to more run-of-the-mill community members.
I agree with you that people seem to somewhat overrate getting jobs in AI companies.
However, I do think there’s good work to do inside AI companies. Currently, a lot of the quality-adjusted safety research happens inside AI companies. And see here for my rough argument that it’s valuable to have safety-minded people inside AI companies at the point where they develop catastrophically dangerous AI.
What you write there makes sense, but it’s not free to have people in those positions, as I said. I did a lot of thinking about this when I was working on wild animal welfare. It seems superficially like you could get the right kind of WAW-sympathetic person into agencies like FWS and the EPA, and they would be there to, say, nudge the agency in ways no one else cared about to help animals when the time came. I did some interviews, looked into some historical cases, and concluded this is not a good idea.
The risk of being captured by the values and motivations of the org where they spend most of their daily lives, before they ever have the chance to provide that marginal difference, is high. Then that person is lost to the Safety cause, or converted into a further problem. I predict that you’ll get one successful Safety sleeper agent in, generously, 10 researchers who go to work at a lab. In that case your strategy is just feeding the labs talent and poisoning the ability of their circles to oppose them.
Even if it’s harmless, planting an ideological sleeper agent in a firm is generally not the best counterfactual use of that person, because their influence in a large org is low. Even relatively high-ranking people frequently have almost no discretion over what happens in the end. AI labs probably have more flexibility than US agencies, but I doubt the principle is that different.
Therefore I think trying to influence the values and safety of labs by working there is a bad idea that is unlikely to be pulled off.
My sense is that of the many EAs who have taken EtG jobs, quite a few have remained fairly value-aligned? I don’t have any data on this and am just going on vibes, but I would guess significantly more than 10%. Which is some reason to think the same would be the case for AI companies. Though plausibly the finance company’s values are only orthogonal to EA, while the AI company’s values (or at least plans) might be more directly opposed.
In that case your strategy is just feeding the labs talent and poisoning the ability of their circles to oppose them.
It seems like your model only has such influence going one way. The lab worker will influence their friends, but not the other way around. I think two-way influence is a more accurate model.
Another option is to ask your friends to monitor you so you don’t get ideologically captured, and hold an intervention if it seems appropriate.
I think you, and this community, have no idea how difficult it is to resist value/mission drift in these situations. This is not a friend-to-friend exchange; it’s a small community of nonprofits and individuals up against the most valuable companies in the world. They aren’t just gonna pick up the values of a few researchers by osmosis.
From your other comment it seems like you have already been affected by the labs’ influence via the technical research community. The emphasis on technical solutions only benefits them, and it just so happens that to work on the big models you have to work with them. This is not an open exchange where they have been just as influenced by us. Sam and Dario sure want you and the US government to think they are the right safety approach, though.
“The emphasis on technical solutions only benefits them”
This is blatantly question-begging, right? In that it is only true if looking for technical solutions doesn’t lead to safe models, which is one of the main points in dispute between you and people with a higher opinion of doing safety work on the inside. Of course, it is true that if you don’t have your own opinion already, you shouldn’t trust people who work at leading labs (or want to) on the question of whether technical safety work will help, for the reasons you give. But “people have an incentive to say X” isn’t actually evidence that X is false; it’s just evidence you shouldn’t trust them. If all people outside labs thought technical safety work was useless, that would be one thing. But I don’t think that is actually true; people with relevant expertise seem divided even outside the labs. Now of course, there are subtler ways in which even people outside the labs might be incentivized to play down the risks. (Though they might also have other reasons to play them up.) But even that won’t get you to “therefore technical safety is definitely useless”; it’s all meta, not object-level.
There’s also a subtler point: even if “do technical safety work on the inside” is unlikely to work, it might still be the better strategy if confrontational lobbying from the outside is unlikely to work too (something I think is more true now that Trump is in power, although Musk is a bit of a wildcard in that respect).
I didn’t mean “there is no benefit to technical safety work”; I meant more like “there is only benefit to labs in emphasizing technical safety work to the exclusion of other things”, as in it benefits them and costs them nothing to do this.
What you seem to be hinting at, essentially espionage, may honestly be the best reason to work in a lab. But of course those people need to be willing to break NDAs and there are better ways to get that info than getting a technical safety job.
(Edited to add context for bringing up “espionage” and to elaborate on the implications.)
Already so many EAs work at Anthropic that it is shielded from scrutiny within EA
What makes you think this? Zach’s post is a clear counterexample here (though comments are friendlier to Anthropic) and I’ve heard of criticism of the RSPs (though I’m not watching closely).
Maybe you think there should be much more criticism?
There should be protests against them (PauseAI US will be protesting them in SF 2/28) and we should all consider them evil for building superintelligence when it is not safe! Dario is now openly calling for recursive self-improvement. They are the villains—this is not hard. The fact that you would think Zach’s post with “maybe” in the title is scrutiny is evidence of the problem.
Honestly this writeup did update me somewhat in favor of at least a few competent safety-conscious people working at major labs, if only so the safety movement has some access to what’s going on inside the labs if/when secrecy grows. The marginal extra researcher going to Anthropic, though? Probably not.
Connect the rest of the dots for me—how does that researcher’s access become community knowledge? How does the community do anything productive with this knowledge? How does people working at the labs detract from other strategies?
The primary benefit I’m imagining is a single well-placed whistleblower positioned to publicly sound the alarm on a particularly obvious and immediate threat, perhaps related to CBRN capabilities. A better answer requires a longer post, which is in the works but may take a while.
I agree having access to what the labs are doing and having the ability to blow the whistle would be super useful. I’ve just recently updated hugely in the direction of respecting the risk of value drift from having people embedded in the labs. We’re imagining cheaply having access to the labs, but the labs and their values will also have access back to us and our mission-alignment through these people.
I think EA should be setting an example to a more confused public of how dangerous this technology is, and being mixed up in making it makes that very difficult.
It seems like having genuinely safety-minded people within orgs is invaluable. Do you think that having them refuse to join is going to meaningfully slow things down?
It just takes one brave or terrified person in the know to say “these guys are internally deploying WHAT? I’ve got to stop this!”
I worry very much that we won’t have one such person in the know in OpenAI. I’m very glad we have them in Anthropic.
Having said that, I agree that Anthropic should not be shielded from criticism.
Your assumption that influence flows one way in organizations seems based on fear, not psychology. If someone believes AGI is a real risk, they should be motivated enough to resist some pressure from superiors who merely argue that they’re doing good stuff.
If you won’t actively resist changing your beliefs once you join a culture with importantly different beliefs, then don’t join an org.
While Anthropic’s plan is a terrible one, so is PauseAI’s. We have no good plans. And we mustn’t fight amongst ourselves.
Who’s “ourselves”? Anthropic doesn’t have “a terrible plan” for AI Safety—they are the AI danger.
The post is an excellent summary of the current situation, yet it ignores the elephant in the room: we need to call for the Pause! The upcoming summit in France is the ideal place to do so. That’s why I’m calling on you, the AI-worried reader, to join our Paris demonstration on the 10th: https://lu.ma/vo3354ab Each additional participant would be a tremendous asset, as we are a handful of people. Assemble!
Assembling from Victoria, BC, on Friday!
Feels like Anthropic has been putting out a lot of good papers recently that help build the case for various AI threats. Given this, “no meaningful path to impact” seems a bit strong.
What happens because of these papers? Do they influence Anthropic to stop developing powerful AI? Evidently not.
I agree with you but for reasons that are more basic and more heretical than you’re going for. In general I’m critical of the way EA seems to have a prior that being in a relevant place at a relevant time is doing God’s work. It’s a little defensible from the viewpoint of early career path navigation, but now we’re talking about 3-year timelines and still saying things like “so I guess you should ‘work on’ AI safety”. I don’t really grasp why this argument is unfolding such that you have the burden of proof.
The real plan on a three-year timeline is to hike Patagonia or something. But that conclusion is too radical, so we constrain the outcome space to selecting a job, like we always do. If you’re early career, you should probably assume AGI/SI won’t happen, to maximize utility.
People defending AI lab work on a 3-year timeline should probably be talking about how easy it is to get from start date to a staff+ engineer position.
One implication I strongly disagree with is that people should be getting jobs in AI labs. I don’t see you connecting that to actual safety impact, and I sincerely doubt working as a researcher gives you any influence on safety at this point (if it ever did).
Wouldn’t it allow access to the most advanced models which are not yet publicly available? I think these are the models that would pose the most risk?
Working at a frontier lab would also give the opportunity to reach people far less concerned about safety, and maybe their minds could be changed?
See my other comments. “Access” to do what? At what cost?
I’ll just highlight that it seems particularly cruxy whether to view such NDAs as covenants or as contracts that are not intrinsically immoral to break. It’s not obvious to me that it should be the former, especially when the NDA basically comes with a monetary incentive for not breaking it.
If the supposed justification for taking these jobs is so that they can be close to what’s going on, and then they never tell and (I predict) get no influence on what the company does, how could this possibly be the right altruistic move?
Great piece: a great prompt to rethink things, and good digests of the implications.
If you agree that mass movement building is a priority, check out PauseAI-US.org (I am executive director), or donate here: https://www.zeffy.com/donation-form/donate-to-help-pause-ai