(NOTE: Coming at this from a place of: a. ignorance of what the AI Safety community actually does and b. not wanting to take the ego hit of admitting that I have been wrong about my long-held skepticism of AI Safety)
I think it was and is fair to be skeptical of the shift to AI Safety in EA on the basis that it's not that tractable, and that there's not clear evidence that the AI Safety movement has had a positive effect on the trajectory of AI.
"But it brought the ideas into the mainstream"
I think the AI Safety community will be tempted to think they've normalised in the zeitgeist ideas about superintelligent AIs and the philosophical questions and risks that arise from them, but 2001: A Space Odyssey came out in 1968, Terminator in 1984, The Matrix in 1999, etc. The ideas of superintelligent AIs and the existential risks they pose are diffused through modern culture, and it's possible that the Pope and the UN would have made the same statements about them given the recent progress of LLMs regardless of the AI Safety movement.
Are there many ideas in If Anyone Builds It, Everyone Dies that weren't broadly covered in Terminator/The Matrix/2001: A Space Odyssey/Dune etc.?
"But the work they've done has set us on the right path"
I haven't seen strong evidence for the direct work of the AI Safety movement reducing existential risks from AI:
Amanda Askell's involvement with shaping the character of Claude sounds good. Has it made much difference or is it just putting a nice and brittle mask on the beast?
AI Safety organisations like MIRI and Redwood Research have been operating for 25 and 5 years respectively. As an outsider I couldn't point to any particular breakthrough they've made in AI alignment. Redwood seems to do some kinda interesting work on measuring rogue behaviour and creating checks. I dunno. Seems like any organisation trying to make a reliable AI product would be heavily incentivised to do this stuff regardless.
In Australia, Good Ancestors has probably contributed in some way to the government's decision to potentially open an AI Safety Institute here. The statements the government puts out about it seem to mostly emphasise deepfake porn and the threat to people's jobs rather than existential risks, which makes me think that this decision might have just happened anyway regardless of the AI Safety movement.
Interpretability research seems far from being able to understand more than a few components at a time. And also the companies making AI would likely have been incentivised to do this work regardless of the AI Safety movement because customers don't want a black box.
From the outside it seems there's a good argument that the AI situation would have evolved pretty similarly regardless of EA/AI Safety input.
From that position, it's easy to believe that if EA had just stuck to Earning To Give and malaria nets and decaging chickens then the impact would have been greater, both directly and because the movement might not have lost as much momentum when AI Safety alienated people.
For me personally, even just granting "they were right about the trajectory of AI" is a huge update. I thought AI was a nothingburger, that the bioanchors report saying AGI would be reached by 2047 was ludicrously optimistic. Now I think I was wrong and the AI safety community was right (even pessimistic!) about AI progress. Whether they have changed all that much about AI risk is a different debate, but even if they had done nothing on that front I would be inclined to agree with Dylan.
Maybe, but "if EA had just stuck to Earning To Give and malaria nets and decaging chickens then the impact would have been greater" doesn't clearly follow. Malaria nets look a lot worse if we all die in a few years from AI anyway, and cage-free pledges have ~0 value if humanity ends before the pledge can be fulfilled.
That's a fair point. At either extreme of outcomes, "ASI kills us all" or "ASI quickly uplifts everyone out of poverty", almost all decisions/actions we make today are pretty meaningless.
But if the next few decades fall somewhere between those two extremes, which I think they probably will, the impact of improving people's lives remains substantial.
Hmm, but in a success-without-dignity world, making interpretability a bit better, or governments a bit more interested, is relevant, right?
Yes, but my point is that whether the AI Safety community has moved the dial on interpretability or government interest is unclear and worth being skeptical of.
I suspect that I'm still misunderstanding you, but: e.g. interpretability tools are empirically able to identify misalignment, which feels like (a somewhat simple example of) the thing we want. Neel Nanda's 80k podcast goes over the state of the field; the tl;dr is roughly that there are pretty meaningful advances but also he's skeptical that it will be a silver bullet.
I agree with Ben Stewart that there's a galaxy-brain argument that these positive impacts are outweighed by accelerating progress, but it seems hard to argue that things like interpretability aren't making progress on their own terms.
I think Henry's skeptical that the AI safety community made a counterfactual difference in getting interpretability started earlier or growing faster, not questioning interpretability's prospects for reducing x-risk.
Thanks Ben. I actually suggested both in my original comment:
(a) that there is market incentive for the companies to do this themselves, so did the AI Safety movement really move the dial on this?
and also
(b) that I'm skeptical of the value of interpretability research (based only on not having seen anything impressive come from it, but I'm very ignorant of the field).
I see, thanks! I'm not sure exactly what you'd consider as evidence here, but e.g. here's the citation count on papers from the past year vs. AI Lab Watch safety rating[1]
[1] Raw data. Note that Anthropic doesn't use arXiv, which affects their citation counts. This is just coming from a dumb search of Semantic Scholar; I expect a lot of disagreement could be had over the exact criteria for considering something "interpretability", but I expect the Ant/GDM > OAI >> * ordering to be true for almost any definition.
Not familiar with this paper, so I sank to using Claude to give myself a summary. The misalignment "blind audit game" seems a bit sus.
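"Three auditing teams played this game under conditions mimicking those of AI developers conducting real pre-deployment alignment audits, including access to model weights and training data"
They had access to the training documents? That doesn't seem like detecting misalignment, that's just a search through files to find one with malicious instructions.
"A fourth team was given affordances that a lab-external auditors would have, namely, only API access to sample from the model, and failed to discover its hidden objective despite it being possible in principle to do so."
🤔 When they don't have the training documents they can't detect the misalignment?
If they're claiming that finding a file with malicious instructions is detecting misalignment then this would update me further against AI Safety research.
Edit: Actually it looks like the teams were using some clever auto-encoder method rather than just a search through documents. In any case this all seems pretty artificial. This method might detect misalignment if it is due to a few malicious documents mixed in with the training data and where the malicious behaviour is easy to detect. This feels like killing cancer cells in a petri dish: easy, and it doesn't tell you much.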
Table 1 shows the techniques used; the teams which were allowed to use SAEs (an interpretability technique) used them; the one which was prohibited from using them searched the data.
Also note that "training data" does not mean "instructions". Section 3 describes their training process.
I think there's a good case for AI safety having a pretty good counterfactual effect on a bunch of productive areas, but obviously that depends on a lot of details and there's plenty of room for debate.
I think a stronger line of critique could be that early-mid AI safety efforts/thinking made the frontier race start earlier, go faster, and be more intense (e.g. roles in getting key frontier leaders obsessed, introducing DeepMind cofounders, boosting OpenAI's founding, etc.). I haven't interrogated that history to know where to come down, but it's a plausible way that the whole of AI safety has been net-negative. (This claim doesn't really detract from the future impact of AI safety though, if the cat's out of the bag.)
Malaria nets only last 3 years anyway; their direct impact does not require the world to last longer than that (although perhaps you value saving a life less if you think the world will soon end).
The way the benefits calculation cashes out on an individual beneficiary basis essentially requires that they (mostly under-5s) live out full lives and enjoy 40 years of increased income; it isn't a function of how long the nets last.
The existence of existential threats does not in itself create a strong argument to redirect the effort. Otherwise EA should have been focusing on nuclear disarmament, climate change, asteroid defence, pandemic prevention etc. from the get-go.
I guess, as you disclaimed might be the case up front, I don't think these are the strongest or most informed examples of EA's impact on AI safety.
In many cases of such impact, one can quibble about many things:
Whether that impact was clearly positive, or whether it had some kind of indirectly harmful effect, most commonly via speeding up AI development. See Paul Christiano's reflections on the impact of Reinforcement Learning from Human Feedback as an example.
The counterfactuality and persistence of the impact: e.g., as you said for many of these, would this have happened (eventually) anyway?
How attributable that was to EA (and unfortunately in some cases, due to EA having a toxic brand in many places, it's actually best if it is not that attributable to EA).
And last, "Does any of that matter? All of EA's impact, for better or worse, has been its influence on Anthropic."
Yet, taken as a whole, I think EA has punched above its weight in many ways with respect to making AI go well. It's led to:
More and better staffed AI safety/security institutes
A richer non-profit ecosystem of safety research (like Truthful AI, FAR AI, Redwood Research, etc.)
More and better staffed third-party evaluations, auditing, and science (METR, AVERI)
Large amounts of field-building that encourages talented people to work on making AI go well (MATS, BlueDot, 80k)
A significant amount of policy advocacy and public communications about AI risk.
Probably other examples, too.
A lot of the effort to make this happen relied on EA-motivated people willing to take lower-paid or less glamorous jobs.[1] Some specific organizations, research or policy wins, or public communications would have happened otherwise, but some wouldn't, and even so, happening earlier is still better.
I started out in EA caring about global health, and my first EA job was as a Researcher at GWWC. Even after becoming pretty convinced by AI risk and longtermism, I was still fairly sympathetic to concerns like "AI Safety alienating people". For instance, I was pretty against 80,000 Hours becoming explicitly focused on longtermism, and also pretty skeptical/worried about its pivot last year into leaning even more into AI. Now, looking at just how fast AI progress is developing, how much there is still to be done to make it go well, and how valuable (I think) EA has been to date, I think I got a lot of that wrong.
[1] And of course, in some cases, they happened to get pretty well-paid jobs that ended up being fairly glamorous (even if they weren't in the beginning). I don't think that undermines the impact much. I don't really begrudge the quant finance folks who give >50% of their income to charities, even if they're still pretty rich at the end of the day.
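I'm not sure this addresses Henry's critiques? In general, every bullet listed under "I think EA has punched above its weight in many ways with respect to making AI go well" is a proxy somewhere in the middle of the ToC chain, while his comment is more end-of-ToC focused, as he's skeptical of the proxies actually being beneficial, and none of these bullets address the counterfactuality he brought up. In particular, you mentioned the founding of Redwood Research as an example of EA making AI go well despite Henry explicitly being skeptical of its impact so far:
"AI Safety organisations like MIRI and Redwood Research have been operating for 25 and 5 years respectively. As an outsider I couldn't point to any particular breakthrough they've made in AI alignment. Redwood seems to do some kinda interesting work on measuring rogue behaviour and creating checks. I dunno. Seems like any organisation trying to make a reliable AI product would be heavily incentivised to do this stuff regardless."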
To be clear, I'm not taking sides or anything, I'm just disheartened by what I perceive to be a lot of talking past each other between AIS advocates and skeptics on this forum, some of which seems easily preventable, like in this case.
Fair enough. I think I was trying to say something along the lines of "going through any specific example invites a lot of genuinely thorny and difficult questions about counterfactuality/sign of impact/attribution to EA" (and again many of these are hard to discuss on a public forum), but I think zooming out, you can see EA's fingerprints in various important places. I think this leads to an overall common-sense perspective that EA has helped improve the situation.
Also, I agree I pointed to work in the middle of the ToC chain, but that seems kind of reasonable to me given that AI is currently not that powerful and not really that scary. AI hasn't yet been capable of causing a disaster, so it's not really possible to have prevented one (yet).
On the specific example: Redwood Research is doing a lot of really valuable safety work. I think pioneering Control has been a fairly useful accomplishment, and I suspect if someone wanted to dig into the details, they'd find that it was fairly counterfactual.
Even if you're skeptical about the direct impact of AI safety work on reducing existential risk (a much longer conversation, and one I'm not fully qualified to have), there's a strong indirect case that the EA and EA-adjacent prioritization of AI in the mid-2010s will end up being hugely important for "traditional", non-speculative EA causes like global health and animal welfare. Most of Anthropic's co-founders and many of its early employees were deeply involved in the EA and rationalist communities, and it's at least plausible that this engagement is what led them to take AI seriously enough to found Anthropic in 2021 or to join early with substantial equity. As Sophie Kim's post documents, Anthropic's seven co-founders have pledged to donate 80% of their wealth, which at current valuations could amount to roughly $37.8B combined, nearly ten times what Coefficient Giving has disbursed in its entire history. Including employee equity already in DAFs, the total pool of EA-influenced philanthropic capital could reach nine or ten figures. It's not unreasonable to assume that a substantial fraction of this is likely to flow into non-AI causes. Many of these donors signed the GWWC pledge before AI was their focus and hold a worldview and values closely aligned with the broader effective altruism community (even outside EA, it isn't uncommon for wealthy individuals with modest altruistic inclinations to donate significant amounts to global health causes). Needless to say, this is an average estimate and not guaranteed. It's possible that Anthropic or the entire AI ecosystem collapses and these funds never materialize, but it's also possible that Anthropic's returns end up being even larger.