Thanks for exploring this, I found it quite interesting.
I’m worried that casual readers might come away with the impression “these risk-compensation dynamics around safety work are obviously a big deal for AI risk”. But I think this is unclear, because we may not have the key property C ≪ ¯C (that you call assumption (b)).
Intuitively I’d describe this property as “meaningful restraint”, i.e. people are holding back a lot from what they might achieve if they weren’t worried about safety. I don’t think this is happening in the world at the moment. It seems plausible that it will never happen—i.e. the world will be approximately full steam ahead until it gets death or glory. In this case there is no compensation effect, and safety work is purely good in the straightforward way.
To spell out the scenario in which safety work now could be bad because of risk compensation: perhaps in the future everyone is meaningfully restrained, but if more work on how to build things safely has been done ahead of time, they’re less worried and so less restrained. I think this is a realistic possibility. But I think that world is made much safer by there being less variance across different actors’ models of how much risk there is, so that the actor who presses ahead isn’t an outlier who expects unusually little risk. Relatedly, I think we’re much more likely to reach such a scenario at all if many people have got onto a similar page about the levels of risk. But a lot of “technical safety” work at the moment (and certainly not just “evals”) is importantly valuable for helping people build a common picture of the character of the risk, and of how high risk levels are under various degrees of safety measures. So a lot of what people think of as safety work actually looks good even in exactly the scenario where we might get >100% risk compensation.
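As a very rough illustration of the outlier point (everything below is an assumption I’m inventing for the sketch, not anything from the post): suppose each actor forms a noisy estimate of the true level of risk, and the actor with the lowest estimate is the one who presses ahead. Holding the average belief fixed, more disagreement makes that frontrunner systematically more optimistic:

```python
# Toy Monte Carlo for the "outlier actor" point. All numbers and distributional
# assumptions here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
TRUE_RISK = 0.3        # assumed true probability of catastrophe
N_ACTORS = 10
N_TRIALS = 100_000

def frontrunner_estimate(belief_spread):
    """Average of the lowest risk estimate among actors with noisy beliefs around the truth."""
    estimates = TRUE_RISK + rng.normal(0.0, belief_spread, size=(N_TRIALS, N_ACTORS))
    return estimates.min(axis=1).mean()

for spread in (0.01, 0.05, 0.15):
    print(f"belief spread {spread:.2f}: frontrunner's average risk estimate "
          f"= {frontrunner_estimate(spread):.3f}")
```

Shrinking the spread of beliefs doesn’t change anyone’s average view, but it sharply limits how over-optimistic the most optimistic actor is likely to be, which is the sense in which getting people onto a similar page matters here.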
All of this isn’t to say “risk compensation shouldn’t be a concern”, but more like “I think we’re going to have to model this at a finer level of granularity to get a sense of when it might or might not be a concern for the particular case of technical AI safety work”.
I’ve just edited the intro to say: it’s not obvious to me one way or the other whether it’s a big deal in the AI risk case. I don’t think I know enough about the AI risk case (or any other case) to have much of an opinion, and I certainly don’t think anything here is specific enough to come to a conclusion in any case. My hope is just that something here makes it easier for people who do know about particular cases to get started thinking through the problem.
If I had to make a guess about the AI risk case, I’d emphasize my conjecture near the end, just before the “takeaways” section, namely that (as you suggest) there currently isn’t a ton of restraint, so (b) mostly fails, but that this has a good chance of changing in the future:
Today, while even the most advanced AI systems are neither very capable nor very dangerous, safety concerns are not constraining C much below ¯C. If technological advances unlock the ability to develop systems which offer utopia if their deployment is successful, but which pose large risks, then the developer’s choice of C at any given S is more likely to be far below ¯C, and the risk compensation induced by increasing S is therefore more likely to be strong.
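To make that concrete, here is a minimal numerical sketch; the functional forms and numbers are all assumptions invented for illustration, not the model from the post. A developer picks a capability level C, capped at ¯C, to maximise expected payoff, where the chance of catastrophe rises with C and falls with the safety level S:

```python
# Toy sketch of risk compensation. The risk curve, payoff, and numbers below are
# illustrative assumptions only.
import numpy as np

def catastrophe_prob(C, S):
    """Assumed risk curve: increasing in capability C, decreasing in safety level S."""
    return 1.0 - np.exp(-C / (1.0 + S))

def chosen_C(S, C_bar, catastrophe_cost=2.0, grid_points=4001):
    """Grid-search the capability level in [0, C_bar] that maximises expected payoff."""
    C = np.linspace(0.0, C_bar, grid_points)
    p = catastrophe_prob(C, S)
    payoff = (1.0 - p) * C - p * catastrophe_cost  # value if deployment succeeds, cost if not
    return C[np.argmax(payoff)]

for C_bar in (2.0, 20.0):
    print(f"C_bar = {C_bar}")
    for S in (1.0, 2.0, 4.0, 8.0):
        C_star = chosen_C(S, C_bar)
        print(f"  S = {S:>3}: chosen C = {C_star:5.2f}, "
              f"risk = {catastrophe_prob(C_star, S):.3f}")
```

When the cap binds (the chosen C is pinned at ¯C), extra S simply lowers risk; when there is lots of headroom below the cap, the chosen C rises with S, and under these particular assumed curves the compensation happens to be more than full. That’s not a prediction, just a demonstration that both regimes fall out of the same simple setup.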
If lots/most of AI safety work (beyond evals) is currently acting more “like evals” than like pure “increases to S”, great to hear—concern about risk compensation can just be an argument for making sure it stays that way!
Good to hear, thanks!