I agree that this is an important distinction, but it seems hard to separate them in practice. We can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects. Some AI safety researchers are doing technical research on value learning/alignment, like (cooperative) inverse reinforcement learning, and doing this research may contribute to further research on the topic down the line and eventual risky ambitious value alignment, whether or not "we" end up concluding that it's too risky.
Furthermore, when it matters most, I think it's unlikely there will be a strong and justified consensus in favour of this kind of research (given wide differences in beliefs about the likelihood of worst cases and/or differences in ethical views), and I think there's at least a good chance there won't be any strong and justified consensus at all. To me, the appropriate epistemic state with regards to value learning research (or at least its publication) is one of complex cluelessness, and it's possible this cluelessness could end up infecting AGI safety as a cause in general, depending on how large the downside risks could be (which explicit modelling with sensitivity analysis could help us check).
Also, it's not just AI alignment research that I'm worried about, since I see potential for tradeoffs more generally between failure modes. Preventing unipolar takeover or extinction may lead to worse outcomes (s-risks/hyperexistential risks), but maybe (this is something to check) those worse outcomes are easier to prevent with different kinds of targeted work and we're sufficiently invested in those. I guess the question would be whether, looking at the portfolio of things the AI safety community is working on, we are increasing any risks (in a way that isn't definitely made up for by reductions in other risks). Each time we make a potential tradeoff with something in that portfolio, would (almost) every reasonable and informed person think it's a good tradeoff, or if it's ambiguous, is the downside made up for with something else?
We can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects.
This strikes me as too pessimistic. Suppose I bring a complicated new board game to a party. Two equally-skilled opposing teams each get a copy of the rulebook to study for an hour before the game starts. Team A spends the whole hour poring over the rulebook and doing scenario planning exercises. Team B immediately throws the rulebook in the trash and spends the hour watching TV.
Neither team has "strong evidence/feedback"; they haven't started playing yet. Team A could think they have good strategy ideas but in fact be engaging in arbitrary subjective judgments and motivated reasoning. Their strategy ideas, which seemed good on paper, could even turn out to be counterproductive!
Still, I would put my money on Team A beating Team B. Because Team A is trying. Their planning abilities don't have to be all that good to be strictly better (in expectation) than "not doing any planning whatsoever, we'll just wing it". That's a low bar to overcome!
So by the same token, it seems to me that vast swathes of AGI safety research easily surpass the (low) bar of doing better in expectation than the alternative of "Let's just not think about it in advance, we'll wing it".
For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is clearly positive, right?
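(To make the sign-error case concrete, here's a minimal, purely hypothetical sketch; it has nothing to do with the actual GPT-2 code. The point is just that one flipped sign turns "pick the best option under the reward" into "pick the worst one", which is exactly why it's worth someone having thought about it in advance.)

```python
import numpy as np

# Toy illustration only (hypothetical; not the actual GPT-2 training code):
# an agent picks whichever action scores highest under its reward signal.
# A single sign error when wiring up that reward makes it optimize for the
# worst outcome by our intended standard.

actions = ["helpful reply", "neutral reply", "harmful reply"]
intended_reward = np.array([1.0, 0.0, -1.0])  # what we meant to maximize

def best_action(reward):
    """Return the action the agent actually chooses under `reward`."""
    return actions[int(np.argmax(reward))]

print(best_action(intended_reward))   # "helpful reply" (intended behaviour)
print(best_action(-intended_reward))  # "harmful reply" (one flipped sign inverts the objective)
```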
You also bring up the idea that (IIUC) there may be objectively good safety ideas but they might not actually get implemented because there won't be a "strong and justified consensus" to do them. But again, the alternative is "nobody comes up with those objectively good safety ideas in the first place". That's even worse, right? (FWIW I consider "come up with crisp and rigorous and legible arguments for true facts about AGI safety" to be a major goal of AGI safety research.)
Anyway, I'm objecting to undirected general feelings of "gahhhh we'll never know if we're helping at all", etc. I think there's just a lot of stuff in the AGI safety research field which is unambiguously good in expectation, where we don't have to feel that way. What I don't object to, and indeed what I strongly endorse, is taking a more directed approach and asking "For AGI safety research project #732, what are the downside risks of this research, and how do they compare to the upsides?"
So that brings us to "ambitious value alignment". I agree that an ambitiously-aligned AGI comes with a couple of potential sources of s-risk that other types of AGI wouldn't have, specifically via (1) sign flip errors, and (2) threats from other AGIs. (Although I think (1) is less obviously a problem than it sounds, at least in the architectures I think about.) On the other hand, (A) I'm not sure anyone is really working on ambitious alignment these days … at least Rohin Shah & Paul Christiano have stated that narrow (task-limited) alignment is a better thing to shoot for (and last anyone heard, MIRI was shooting for task-limited AGIs too) (UPDATE: actually this was an overstatement, see e.g. 1, 2, 3); (B) my sense is that current value-learning work (e.g. at CHAI) is more about gaining conceptual understanding than creating practical algorithms/approaches that will scale to AGI. That said, I'm far from an expert on the current value learning literature; frankly I'm often confused by what such researchers are imagining for their longer-term game plan.
BTW I put a note on my top comment that I have a COI, in case you didn't notice. :)
For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is clearly positive, right?
If you aren't publishing anything, then sure, research into what to do seems mostly harmless (other than opportunity costs) in expectation, but it doesn't actually follow that it would necessarily be good in expectation, if you have enough deep uncertainty (or complex cluelessness); I think this example illustrates this well, and is basically the kind of thing I'm worried about all of the time now. In the particular case of sign flip errors, I do think it was useful for me to know about this consideration and similar ones, and I act differently than I would have otherwise as a result, but one of the main effects since learning about these kinds of s-risks is that I'm (more) clueless about basically every intervention now, and am looking to portfolios and hedging.
If you are publishing, and your ethical or empirical views are sufficiently different from others working on the problem so that you make very different tradeoffs, then that could be good, bad or ambiguous. For example, if you didn't really care about s-risks, then publishing a useful consideration for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.
Maybe you can eliminate this ambiguity or at least constrain its range to something relatively insignificant by building a model, doing a sensitivity analysis, etc., but a lot of things don't work out, and the ambiguity could be so bad that it infects everything else. This is roughly where I am now: I have considerations that result in complex cluelessness about AI-related interventions and I want to know how people work through this.
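(As a toy sketch of the kind of model-plus-sensitivity-analysis I have in mind, with every number made up for illustration: sweep the expected value of publishing some piece of research over a grid of "reasonable" inputs and check whether its sign is stable. If the sign flips across the grid, that's the ambiguity I'm describing; if it doesn't, the ambiguity is constrained.)

```python
import itertools

# Hypothetical sensitivity analysis (all numbers invented for illustration):
# expected value of publishing a piece of alignment research under a grid of
# "reasonable" parameter choices. If the sign of the EV flips across the grid,
# the comparison is ambiguous in the sense discussed above.

p_harm = [0.05, 0.2, 0.5]     # chance the work advances risky ambitious alignment
harm = [-10.0, -100.0]        # badness if it does (s-risk-weighted, arbitrary units)
p_benefit = [0.1, 0.3, 0.6]   # chance it meaningfully reduces other AI risk
benefit = [5.0, 50.0]         # goodness if it does (same arbitrary units)

evs = [pb * b + ph * h
       for ph, h, pb, b in itertools.product(p_harm, harm, p_benefit, benefit)]

print(f"EV across 'reasonable' inputs: {min(evs):.1f} to {max(evs):.1f}")
print("ambiguous sign" if min(evs) < 0.0 < max(evs) else "robust sign")
```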
For another source of pessimism, Luke Muehlhauser from Open Phil wrote:
Re: cost-effectiveness analyses always turning up positive, perhaps especially in longtermism. FWIW that hasn't been my experience. Instead, my experience is that every time I investigate the case for some AI-related intervention being worth funding under longtermism, I conclude that it's nearly as likely to be net-negative as net-positive given our great uncertainty, and therefore I end up stuck doing almost entirely "meta" things like creating knowledge and talent pipelines.
Of course, that doesn't mean he never finds good "direct work", or that the "direct work" already being funded isn't better than nothing in expectation overall, and I would guess he thinks it is.
Hmm, it seems to me (and you can correct me) that we should be able to agree that there are SOME technical AGI safety research publications that are positive under some plausible beliefs/values and harmless under all plausible beliefs/values, and then we don't have to talk about cluelessness and tradeoffs; we can just publish them.
And we both agree that there are OTHER technical AGI safety research publications that are positive under some plausible beliefs/values and negative under others. And then we should talk about your portfolios etc. Or more simply, on a case-by-case basis, we can go looking for narrowly-tailored approaches to modifying the publication in order to remove the downside risks while maintaining the upside.
I feel like we're arguing past each other: I keep saying the first category exists, and you keep saying the second category exists. We should just agree that both categories exist! :-)
Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)
(Luke says "AI-related", but my impression is that he mostly works on AGI governance, not technical research, and the link is definitely about governance, not technical research. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)
For example, if you didn't really care about s-risks, then publishing a useful consideration for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.
This points to another (possible?) disagreement. I think maybe you have the attitude where (to caricature somewhat) if there's any downside risk whatsoever, no matter how minor or far-fetched, you immediately jump to "I'm clueless!". Whereas I'm much more willing to say: OK, I mean, if you do anything at all there's a "downside risk" in a sense, just because life is uncertain and who knows what will happen, but that's not a good reason to just sit on the sidelines, let nature take its course, and hope for the best. If I have a project whose first-order effect is a clear and specific and strong upside opportunity, I don't want to throw that project out unless there's a comparably clear and specific and strong downside risk. (And of course we are obligated to try hard to brainstorm what such a risk might be.) Like if a firefighter is trying to put out a fire, and they aim their hose at the burning interior wall, they don't stop and think, "Well, I don't know what will happen if the wall gets wet, anything could happen, so I'll just not pour water on the fire, y'know, don't want to mess things up."
The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.
If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.
Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)
Ya, I think this is the crux. Also, considerations like the cosmic-ray bit flip tend to force a lot of things into the second category when they otherwise wouldn't have been, although I'm not specifically worried about cosmic ray bit flips, since they seem sufficiently unlikely and easy to avoid.
(Luke says "AI-related", but my impression is that he mostly works on AGI governance, not technical research, and the link is definitely about governance, not technical research. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)
(Fair.)
The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.
This is actually what I'm thinking is happening, though (not like the firefighter example), but we aren't really talking much about the specifics. There might indeed be specific cases where I would agree we shouldn't be clueless if we worked through them, but I think there are important potential tradeoffs between incidental and agential s-risks, between s-risks and other existential risks, even between the same kinds of s-risks, etc. There is so much uncertainty in the expected harm from these risks that it's inappropriate to use a single distribution (without sensitivity analysis over "reasonable" distributions, and with this sensitivity analysis, things look ambiguous), similar to this example. We're talking about "sweetening" one side or the other, but that's totally swamped by our uncertainty.
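(One way to sketch the "sweetening swamped by uncertainty" point formally, using imprecise probabilities; the notation below is mine, just for illustration, not anything from the linked example: evaluate the expected value under every distribution in a set of "reasonable" ones and see whether the sign is stable.)

```latex
% Sketch with a credal set \mathcal{P} of "reasonable" distributions (illustrative notation).
\[
\underline{\mathbb{E}}[U(a)] = \min_{p \in \mathcal{P}} \mathbb{E}_p[U(a)],
\qquad
\overline{\mathbb{E}}[U(a)] = \max_{p \in \mathcal{P}} \mathbb{E}_p[U(a)].
\]
% If the sign is unstable and the spread is large relative to a small
% sweetening \epsilon > 0, the sweetening does not settle the comparison:
\[
\underline{\mathbb{E}}[U(a)] < 0 < \overline{\mathbb{E}}[U(a)],
\quad
\min\bigl(|\underline{\mathbb{E}}[U(a)]|,\ \overline{\mathbb{E}}[U(a)]\bigr) \gg \epsilon
\;\Longrightarrow\;
\underline{\mathbb{E}}[U(a)] + \epsilon < 0 < \overline{\mathbb{E}}[U(a)] + \epsilon .
\]
```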
If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.
What I have in mind is more symmetric in upsides and downsides (or at least, I'm interested in hearing why people think it isn't in practice), and I don't really distinguish between effects by order*. My post points out potential reasons that I actually think could dominate. The standard I'm aiming for is "Could a reasonable person disagree?", and I default to believing a reasonable person could disagree when I point out such tradeoffs until we actually carefully work through them in detail and it turns out it's pretty unreasonable to disagree.
*Although thinking more about it now, I suppose longer chains are more fragile and likely to have unaccounted-for effects going in the opposite direction, so maybe we ought to give them less weight, and maybe this solves the issue if we did this formally? I think ignoring higher-order effects is formally irrational under vNM rationality or stochastic dominance, although maybe fine in practice, if what we're actually doing is just an approximation of giving them far less weight with a skeptical prior and then they actually just get dominated completely by more direct effects.
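(For what "giving them far less weight" could look like formally; the discounting scheme below is hypothetical, not something I'm claiming anyone actually uses: discount the expected k-th-order effect by a skeptical factor that shrinks geometrically with k, so direct effects dominate unless higher-order effects grow extremely fast.)

```latex
% Illustrative discounting of effects by order k (hypothetical scheme).
\[
V(a) \approx \sum_{k \ge 1} w_k\, \mathbb{E}\bigl[\mathrm{effect}_k(a)\bigr],
\qquad
w_k = \delta^{\,k-1}, \quad 0 < \delta \ll 1,
\]
% so the k = 1 term dominates unless \mathbb{E}[\mathrm{effect}_k(a)] grows
% roughly like \delta^{-(k-1)} or faster as k increases.
```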
I don't really distinguish between effects by order*
I agree that direct and indirect effects of an action are fundamentally equally important (in this kind of outcome-focused context) and I hadn't intended to imply otherwise.