Thanks a lot for this profile!
It leaves me with a question: how likely is it that the work outlined in the article makes things worse rather than better? These concerns are fleshed out in more detail in this question and its comment threads, but the TL;DR is:
AI safety work is difficult: there are lots of hypotheses, experiments are hard to design, and we can't run RCTs to measure whether interventions work. Thus, there is uncertainty even about the sign of the impact.
AI safety work could plausibly speed up AI development, create information hazards, be used to greenwash regular AI companies, and so on, thereby increasing rather than decreasing AI risk.
I’d love to see a discussion of this concern, for example in the form of an entry under “Arguments against working on AI risk to which we think there are strong responses”, or some content about how to make sure that the work is actually beneficial.
Final note: I hope this didn't sound too adversarial. This isn't meant as a critique of the article, but rather as a genuine question that makes me hesitant to switch to AI safety work.
(Responding on Benjamin’s behalf, as he’s away right now):
Agree that it's hard to know what works in AI safety, and that it's easy to do things that make the situation worse rather than better. My personal view is that we should expect the field of AI safety to be net positive overall: people trying to optimise for an outcome will, in expectation, move things in its direction, even if they sometimes move away from it by mistake. It seems unlikely that the best thing to do is nothing, given that AI capabilities are racing forward regardless.
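To make the expected-value intuition concrete, here's a minimal toy sketch (the numbers are made up purely for illustration, not an empirical claim): if individual safety projects reduce risk more often than they increase it, the field's average impact comes out positive even though many individual efforts backfire.

```python
# Illustrative toy model with hypothetical numbers: each safety project's
# impact is noisy -- usually it helps a little, sometimes it backfires.
# As long as the mean per project is positive, the field's total expected
# impact stays positive despite frequent sign errors on individual projects.
import random

random.seed(0)

def project_impact():
    # 70% of projects reduce risk by 1 unit, 30% increase it by 1 unit
    # (hypothetical probabilities chosen only to make the point).
    return 1 if random.random() < 0.7 else -1

n_projects = 10_000
total = sum(project_impact() for _ in range(n_projects))
print(f"average impact per project: {total / n_projects:+.2f}")  # roughly +0.40
```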
I do think that the difficulty of telling what will work is a strike against pursuing a career in this area, because it makes the problem less tractable, but it doesn’t seem decisive to me.
Agree that a section on this could be good!
I appreciate the response, and I think I agree with your personal view, at least partially. “AI capabilities are racing forward regardless” is a strong argument, and it would mean that AI safety’s contribution to AI progress would be small, in relative terms.
That said, it seems that the AI safety field might be particularly prone to work that’s risky or neutral, for example:
Interpretability research: interpretability is a quasi-requirement for deploying powerful models. Research in this direction is likely to produce tools that increase confidence in AI models and lead to more of them being deployed, and deployed earlier.
Robustness research: similar to interpretability, robustness is a very useful property of any AI model. It makes models more widely applicable and will likely increase the use of AI.
AI forecasting: probably neutral, maybe negative, since it creates buzz about AI and increases investment.
It's puzzling that there is so much concern about AI risk, and yet so little awareness of the dual-use nature of all AI research. I would appreciate a stronger discussion of how we can make AI actually safer, as opposed to merely more interpretable, more robust, etc.
I think these are all great points! We should definitely worry about negative effects of work intended to do good.
That said, here are two places where maybe our intuitions differ:
You seem much more confident than I am that work on AI that is unrelated to AI safety is in fact negative in sign.
It seems hard to conclude that a counterfactual world where any one or more of “no work on AI safety / no interpretability work / no robustness work / no forecasting work” held true would in fact have less x-risk from AI overall. That is, while I can see that these things have potential negative effects, when I genuinely try to imagine the counterfactual, the overall impact seems likely positive to me.
Of course, intuitions like these are much less concrete than actually trying to evaluate the claims, and I agree it seems extremely important for people evaluating or doing anything in AI safety to ensure they're doing positive work overall.
Thanks for pointing out these two places!
On the first point: work on AI drives AI risk. This is not equally true of all AI work, but the overall correlation is clear. There are good arguments that AI will not be aligned by default, and that current methods can produce bad outcomes if naively scaled up; these are cited in your problem profile. With that in mind, I would not say I'm confident that AI work is net negative, but the risk of negative outcomes is too large for me to feel comfortable.
On the second point: a world with more interpretability and robustness work is a world where powerful AI arrives faster (maybe good, maybe bad, certainly risky). I am echoing section 2 of the problem profile, which argues that the sheer speed of AI advances is cause for concern. Moreover, because interpretability and robustness work advances AI, traditional AI companies are likely to pursue it even without an 80000hours problem profile. This could be an opportunity for 80000hours to direct people to work that is even more central to safety.
As you say, these are currently just intuitions, not concretely evaluated claims. It’s completely OK if you don’t put much weight on them. Nevertheless, I think these are real concerns shared by others (e.g. Alexander Berger, Michael Nielsen, Kerry Vaughan), and I would appreciate a brief discussion, FAQ entry, or similar in the problem profile.
And now I'll stop bothering you :) Thanks for writing the problem profile; it's really nice work overall.