What you call the “lab’s” utility function isn’t really specific to the lab; it could just as well apply to safety researchers. One might assume that the parameters would be set in such a way as to make the lab more C-seeking (e.g. it takes less C to produce 1 util for the lab than for everyone else).
But at least in the case of AI safety, I don’t think this is the case. I doubt I could easily distinguish a lab capabilities researcher (or lab leadership, or some “aggregate lab utility function”) from an external safety researcher if you just gave me their utility functions over C and S. (AI safety has significant overlap with transhumanism; relative to the rest of humanity, they are far more likely to think there are huge benefits to the development of safe AGI.) In practice, the issue seems to be more one of epistemic disagreement.
You could still recover many of the conclusions in this post by positing that an increase to S leads to a proportional decrease in probability of non-survival, and the proportion is the same between the lab and everyone else, but the absolute numbers aren’t. I’d still feel like this was a poor model of the real situation though.
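For concreteness, here is a minimal sketch of the parameterization I have in mind, with made-up functional forms and numbers (the `baseline` values and exponential shapes are purely illustrative, not anything from the post): each party's perceived probability of non-survival factors into a party-specific level that depends on C, times a shared factor in S, so a given increase to S cuts both parties' perceived risk by the same proportion even though the absolute numbers differ.

```python
# Minimal sketch (hypothetical functional forms, not the post's model):
# perceived non-survival probability = party-specific risk level in C
# (different absolute numbers) x a shared factor in S, so an increase to S
# cuts both parties' perceived risk by the same proportion.

import math

def perceived_non_survival(c: float, s: float, baseline: float) -> float:
    risk_from_capabilities = baseline * (1 - math.exp(-0.1 * c))  # party-specific level
    shared_safety_factor = math.exp(-0.5 * s)                     # same for everyone
    return risk_from_capabilities * shared_safety_factor

lab    = perceived_non_survival(c=10.0, s=2.0, baseline=0.2)  # lab sees less absolute risk
others = perceived_non_survival(c=10.0, s=2.0, baseline=0.6)  # everyone else sees more

# An extra unit of S reduces both perceived risks by the same proportion.
print(f"lab: {lab:.3f}  others: {others:.3f}")
print(f"risk ratio after +1 S, lab:    {perceived_non_survival(10.0, 3.0, 0.2) / lab:.3f}")
print(f"risk ratio after +1 S, others: {perceived_non_survival(10.0, 3.0, 0.6) / others:.3f}")
```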
Okay great, good to know. Again, my hope here is to present the logic of risk compensation in a way that makes it easy to make up your mind about how you think it applies in some domain, not to argue that it does apply in any domain. (And certainly not to argue that a model stripped down to the point that the only effect going on is a risk compensation effect is a realistic model of any domain!)
As for the role of preference-differences in the AI risk case—if what you’re saying is that there’s no difference at all between capabilities researchers’ and safety researchers’ preferences (rather than just that the distributions overlap), that’s not my own intuition at all. I would think that if I learn
- that two people have similar transhumanist-y preferences, except that one discounts the distant future (or future generations), and so cares primarily about achieving amazing outcomes in the next few decades for people alive today, whereas the other cares primarily about the “expected value of the lightcone”; and
- that one works on AI capabilities and the other works on AI safety,
my guess about who was who would be a fair bit better than random.
But I absolutely agree that epistemic disagreement is another reason, and could well be a bigger reason, why different people put different values on safety work relative to capabilities work. I say a few words about how this does and doesn’t change the basic logic of risk compensation in the section on “misperceptions”: nothing much seems to change if the parties just disagree, in a proportional way, about the magnitude of the risk at any given levels of C and S—though this disagreement can change who prioritizes which kind of work, it doesn’t change how the risk compensation interaction plays out. What really changes things there is if the parties disagree about the effectiveness of marginal increases to S, or, more precisely, about how much increases to S blunt the degree to which increases to C lower P.
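To illustrate the distinction very roughly (with made-up functional forms, not the post's actual model), here is a toy sketch: the lab picks C to trade off a saturating benefit of capabilities against its perceived risk, and we compare how its preferred C responds to S when it (a) rescales the perceived risk proportionally versus (b) disagrees about how strongly S blunts the risk coming from C. All the names and numbers (`LOSS`, `scale`, `blunting`, the exponential shapes) are assumptions for the sketch.

```python
# Toy sketch (hypothetical forms) of the two kinds of disagreement:
# "proportional" disagreement rescales perceived risk at every (C, S);
# "cross-effect" disagreement changes how strongly S blunts the risk from C,
# which is the channel the risk-compensation response runs through.

import math

LOSS = 4.0  # disutility weight on perceived doom (assumed)

def benefit(c: float) -> float:
    # Saturating benefit of capabilities.
    return 1 - math.exp(-0.3 * c)

def perceived_doom(c: float, s: float, scale: float, blunting: float) -> float:
    # Risk rises in C and is cut by S; `scale` shifts the overall magnitude,
    # `blunting` controls how strongly S reduces the risk contributed by C.
    return scale * (1 - math.exp(-0.1 * c)) * math.exp(-blunting * s)

def preferred_c(s: float, scale: float, blunting: float) -> float:
    # Grid-search the lab's preferred capability level at safety level s.
    grid = [i * 0.1 for i in range(501)]
    return max(grid, key=lambda c: benefit(c) - LOSS * perceived_doom(c, s, scale, blunting))

for label, scale, blunting in [
    ("baseline beliefs         ", 1.0, 0.5),
    ("proportional disagreement", 0.5, 0.5),  # thinks all risk is half as large
    ("cross-effect disagreement", 1.0, 1.0),  # thinks S blunts C's risk more strongly
]:
    c1, c2 = preferred_c(1.0, scale, blunting), preferred_c(2.0, scale, blunting)
    print(f"{label}: C*(S=1)={c1:.1f}  C*(S=2)={c2:.1f}  extra C per unit of S={c2 - c1:.1f}")
```

Under these particular forms, the proportional rescaling shifts the level of C the lab prefers but not how much extra C it takes on per unit of S, whereas changing the blunting term does change that response; how far that carries over to other setups of course depends on the functional forms.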
In any event though, if what you’re saying is that a framing more applicable to the AI risk context would have made the epistemic disagreement bit central and the preference disagreement secondary (or swept under the rug entirely), fair enough! I look forward to seeing that presentation of it all if someone writes it up.
Tbc if the preferences are written in words like “expected value of the lightcone” I agree it would be relatively easy to tell which was which, mainly by identifying community shibboleths. My claim is that if you just have the input/output mapping of (safety level of AI, capabilities level of AI) --> utility, then it would be challenging. Even longtermists should be willing to accept some risk, just because AI can help with other existential risks (and of course many safety researchers—probably the majority at this point—are not longtermists).