The Safety/Capabilities Ratio
People who do AI safety research sometimes worry that their research could also contribute to AI capabilities, thereby hastening a possible AI safety disaster. But when might this be a reasonable concern?
We can model a researcher i as contributing intellectual resources s_i to safety and c_i to capabilities, both real numbers. We let the total safety investment (of all researchers) be s = ∑_i s_i, and the total capabilities investment be c = ∑_i c_i. Then we assume that a good outcome is achieved if s > c/k, for some constant k, and a bad outcome otherwise.
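To make the bookkeeping concrete, here is a minimal sketch of this threshold model in Python; the particular values of k and the contribution lists are hypothetical, chosen only for illustration.

```python
# A minimal sketch of the threshold model. The value of k and the
# contribution lists below are hypothetical, for illustration only.

def good_outcome(safety_contributions, capability_contributions, k):
    """Good outcome iff total safety s exceeds total capabilities c divided by k."""
    s = sum(safety_contributions)
    c = sum(capability_contributions)
    return s > c / k

# Example: a world that puts roughly 3% of its effort into safety.
safety = [3.0]
capabilities = [97.0]
print(good_outcome(safety, capabilities, k=10))   # False: 3 > 9.7 fails
print(good_outcome(safety, capabilities, k=40))   # True:  3 > 2.425 holds
```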
The assumption that the threshold takes the form s > c/k could be justified by safety and capabilities research having diminishing returns. You could then have log-uniform beliefs (over some interval) about the level of capabilities c′ required to achieve AGI, and about the amount of safety research c′/k required for a good outcome. Within the support of c′ and c′/k, linearly increasing s/c will linearly increase the chance of safe AGI.
In this model, having a positive marginal impact doesn’t require us to completely abstain from contributing to capabilities. Rather, one’s impact is positive if the ratio of safety and capabilities contributions s_i/c_i is greater than the average of the rest of the world. For example, a 50% safety / 50% capabilities project is marginally beneficial if the AI world focuses only 3% on safety.
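The cutoff can be checked directly: adding a contribution whose own ratio beats the rest of the world’s aggregate ratio pulls s/c up, and one below it pulls s/c down. The sketch below uses hypothetical numbers for the 3%-safety world.

```python
# A sketch of the marginal-impact condition: adding a project whose s_i/c_i
# exceeds the rest of the world's s/c raises the aggregate ratio (and, in this
# model, the chance of clearing the s > c/k threshold). Numbers are hypothetical.

def aggregate_ratio(s, c):
    return s / c

world_s, world_c = 3.0, 97.0          # rest of the world: ~3% safety focus
print(aggregate_ratio(world_s, world_c))                 # ~0.031

# A 50% safety / 50% capabilities project: ratio 1.0 > 0.031, so it helps.
print(aggregate_ratio(world_s + 5.0, world_c + 5.0))     # ~0.078 > 0.031

# A 1% safety / 99% capabilities project: ratio ~0.010 < 0.031, so it hurts.
print(aggregate_ratio(world_s + 0.1, world_c + 9.9))     # ~0.029 < 0.031
```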
If the AI world does only focus 3% on safety, then when is nervousness warranted? Firstly, technical researchers might make a big capabilities contribution if they are led to fixate on dangerous schemes that lie outside of current paradigms, like self-improvement perhaps. This means that MIRI’s concerns about information security are not obviously unreasonable. Secondly, AI timeline research could lead one to understand the roots of AI progress, and thereby set in motion a wider trend toward more dangerous research. This could justify worries about the large-compute experiments of OpenAI. It could also justify worries about the hypothetical future in which an AIS person launches a large AI project for the government. Personally, I think it’s reasonable to worry about cases like these breaching the 97% barrier.
It is a high bar, however. And I think in the case of a typical AI safety researcher, these worries are a bit overblown. In this 97%-capabilities world, the median person should worry a bit less about abstaining from capability contributions, and a bit more about the size of their contribution to safety.
I propose an adjustment to this model: your safety/capabilities ratio has to be greater than the ratio of the rest of the world’s total contributions over time, under the action-relevant probability measure. What I mean by the action-relevant measure is the probability distribution in which worlds are weighted according to your expected impact in them, not just their probability.
So if you think there’s a decent chance that we’re barely going to solve alignment, and that in those worlds the world will pivot towards a much higher safety focus, you should be more cautious about contributing to capabilities.
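As a rough sketch of how that adjustment might be computed (the scenario list, the impact weights, and the choice to average ratios rather than totals are all assumptions of mine, not something the model pins down):

```python
# One way the "action-relevant measure" could be cashed out: weight each
# possible world by probability times the impact your marginal contribution
# has there, then compare your own ratio to the weighted benchmark.
# Scenarios and weights below are hypothetical.

scenarios = [
    # (probability, impact of your marginal contribution, world s, world c)
    (0.6, 0.1, 3.0, 97.0),    # business as usual: 3% safety, little leverage
    (0.4, 1.0, 30.0, 70.0),   # alignment barely solved: world pivots to 30% safety
]

weights = [p * impact for p, impact, _, _ in scenarios]
total = sum(weights)
benchmark = sum(w * (s / c) for w, (_, _, s, c) in zip(weights, scenarios)) / total
print(benchmark)   # ~0.38, much higher than the naive 3% world's ~0.03
```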
Interesting take, quick notes:
1) I worked on a similar model with Justin Shovelain a few years back. See: https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects
2) “Rather, one’s impact is positive if the ratio of safety and capabilities contributions s_i/c_i is greater than the average of the rest of the world.”
I haven’t quite followed your model, but this doesn’t seem exactly correct to me. For example, if the mean player is essentially “causing a lot of net-harm”, then “just causing a bit of net-harm” clearly isn’t a net-good.
It seems entirely possible that even with a 100:1 safety-to-capabilities researcher ratio, 100 capabilities researchers could kill everyone before the 10k safety researchers came up with a plan that didn’t kill everyone. It does not seem like a symmetric race.
Likewise, if the output of safety research is just “this is not safe to do” (as MIRI’s seems to be), capabilities work will continue, or in fact people will do MORE capabilities work so they can upskill and “help” with the safety problem.