at present they represent deep theoretical limitations of current methods
+1 on disagreeing with this. It’s not clear that there’s enough deep theory of current methods for them to have deep theoretical limitations :P
More generally, I broadly agree with Rohin, but (as I think we’ve discussed) find this argument pretty dubious:
Almost every AI system we’ve created so far (not just deep RL systems) has some predefined, hardcoded, certain specification that the AI is trying to optimize for.
A superintelligent agent pursuing a known specification has convergent instrumental subgoals (the thing that Toby is worried about).
Therefore, if we want superintelligent AI systems that don’t have these problems, we need to change how AI is done.
Convergent instrumental subgoals aren’t the problem. Large-scale misaligned goals (instrumental or not) are the problem. Whether or not a predefined specification gives rise to those sorts of goals depends on the AI architecture and training process in a complicated way. Once you describe in more detail what it actually means for an AI system to “have some specification”, the “certain” bit also stops seeming like a problem.
I’d like to refer to a better argument here, but unfortunately there is no source online that makes the case that AGI will be dangerous in a satisfactory way. I think there are enough pieces floating around in people’s heads/private notes to make a compelling argument, but the fact that they haven’t been collated publicly is a clear failure of the field.
We have discussed this, so I’ll just give brief responses so that others know what my position is. (My response to you is mostly in the last section, the others are primarily explanation for other readers.)
Convergent instrumental subgoals aren’t the problem. Large-scale misaligned goals (instrumental or not) are the problem.
I’m not entirely sure what you mean by “large-scale”, but misaligned goals simply argue for “the agent doesn’t do what you want”. To get to “the agent kills everyone”, you need to bring in convergent instrumental subgoals.
Once you describe in more detail what it actually means for an AI system to “have some specification”, the “certain” bit also stops seeming like a problem.
The model of “there is a POMDP, it has a reward function, the specification is to maximize expected reward” is fully formal and precise (once you spell out the POMDP and reward), and the optimal solution usually involves convergent instrumental subgoals.
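(For concreteness, here is one minimal way to write that kind of specification down. The notation below is generic POMDP notation, not anything specific to this thread or to any particular system.)

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% A "certain specification" in RL terms: a POMDP plus a reward function,
% with the specification being "maximize expected discounted reward".
\[
  \mathcal{M} = (S, A, O, T, \Omega, r, \gamma), \qquad
  \pi^{*} \in \operatorname*{arg\,max}_{\pi}\;
    \mathbb{E}_{\tau \sim (\pi,\,\mathcal{M})}
    \Big[\textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big]
\]
% $S$: states, $A$: actions, $O$: observations, $T$: transitions,
% $\Omega$: observation function, $r$: reward, $\gamma \in [0,1)$: discount.
% The claim in the thread is then a claim about what $\pi^{*}$ tends to look
% like for rich environments and most choices of $r$, not about any one reward.
\end{document}
```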
Whether or not a predefined specification gives rise to those sorts of goals depends on the AI architecture and training process in a complicated way.
I’m assuming you agree with:
1. The stated goal of AI research would very likely lead to human extinction
I agree that it is unclear whether AI systems actually get anywhere close to optimal for the tasks we train them for. However, if you think that we will get AGI and be fine, but we’ll continue to give certain specifications of what we want, it seems like you also have to believe:
2. We will build AGI without changing the stated goal of AI research
3. AI research will not achieve its stated goal
The combination of 2 + 3 seems like a strange set of beliefs to have. (Not impossible, but unlikely.)
This discussion (incl. child comments) was one of the most interesting things I read in the last weeks, maybe months. - Thank you for having it publicly. :)
1. The stated goal of AI research would very likely lead to human extinction
I disagree pretty strongly with this. What does it even mean for a whole field to have a “stated goal”? Who stated it? Russell says in his book that “From the very beginnings of AI, intelligence in machines has been defined in the same way”, but then a) doesn’t give any citations or references to the definition he uses (I can’t find the quoted definition online from before his book); and b) doesn’t establish that building “intelligent machines” is the only goal of the field of AI. In fact there are lots of AI researchers concerned with fairness, accountability, transparency, and so on—not just intelligence. Insofar as those researchers aren’t concerned about existential risk from AI, it’s because they don’t think it’ll happen, not because they think it’s somehow outside their remit.
Now in practice, a lot of AI researcher time is spent trying to make things that better optimise objective functions. But that’s because this has been the hardest part so far—specification problems have just not been a big issue in such limited domains (and insofar as they are, that’s what all the FATE researchers are working on). So this observed fact doesn’t help us distinguish between “everyone in AI thinks that making AIs which intend to do what we want is an integral part of their mission, but that the ‘intend’ bit will be easy” vs “everyone in AI is just trying to build machines that can achieve hardcoded literal objectives even if it’s very difficult to hardcode what we actually want”. And without distinguishing them, the “stated goal of AI” has no predictive power (if it even exists).
We’ll continue to give certain specifications of what we want
What is a “certain specification”? Is training an AI to follow instructions, giving it strong negative rewards every time it misinterprets us, then telling it to do X, a “certain specification” of X? I just don’t think this concept makes sense in modern ML, because it’s the optimiser, not the AI, that is given the specification. There may be something to the general idea regardless, but it needs a lot more fleshing out, in a way that I don’t think anyone has done.
More constructively, I just put this post online. It’s far from comprehensive, but it points at what I’m concerned about more specifically than anything else.
I agree this is a fuzzy concept, in the same way that “human” is a fuzzy concept.
Is training an AI to follow instructions, giving it strong negative rewards every time it misinterprets us, then telling it to do X, a “certain specification” of X?
No, the specification there is to follow instructions. I am optimistic about these sorts of “meta” specifications; CIRL / assistance games can also be thought of as a “meta” specification to assist the human. But like, afaict this sort of idea has only recently become common in the AI community; I would guess partly because of people pointing out problems with the regular method of writing down specifications.
Broadly speaking, think of certain specifications as things that you plug in to hardcoded optimization algorithms (not learned ones which can have “common sense” and interpret you correctly).
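(As a toy illustration of that distinction, a minimal sketch with hypothetical names, not anyone’s actual proposal: a “certain” specification hands the object-level objective straight to a hardcoded optimizer, while a “meta” specification only hardcodes “do what the instruction-giver wants”.)

```python
# Toy sketch only; the names (paperclip_reward, instruction_following_reward, etc.)
# are hypothetical placeholders, not any real system's API.

def hardcoded_optimizer(reward_fn, candidate_policies):
    # The optimizer itself has no "common sense": it maximizes whatever it is handed.
    return max(candidate_policies, key=reward_fn)

# "Certain" specification: the object-level goal itself is hardcoded.
def paperclip_reward(policy):
    return policy["paperclips_made"]

# "Meta" specification: what is hardcoded is only "follow the human's instructions";
# the object-level goal is supplied at run time and scored via (hypothetical) human feedback.
def instruction_following_reward(policy):
    return policy["human_rating_of_instruction_following"]

candidate_policies = [
    {"paperclips_made": 10,    "human_rating_of_instruction_following": 0.9},
    {"paperclips_made": 10**9, "human_rating_of_instruction_following": -1.0},  # gets there by overreaching
]

print(hardcoded_optimizer(paperclip_reward, candidate_policies))              # picks the overreaching policy
print(hardcoded_optimizer(instruction_following_reward, candidate_policies))  # picks the deferential one
```

(The sketch is obviously far too small to exhibit instrumental subgoals; it only illustrates which objective the optimizer is actually pointed at.)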
I just don’t think this concept makes sense in modern ML, because it’s the optimiser, not the AI, that is given the specification.
If you use a perfect optimizer and train in the real world with what you would intuitively call a “certain specification”, an existential catastrophe almost certainly happens. Given agreement on this fact, I’m just saying that I want a better argument for safety than “it’s fine because we have a less-than-perfect optimizer”, which as far as I can tell is ~the argument we have right now, especially since in the future we will presumably have better optimizers (where more compute during training is a type of better optimization).
More constructively, I just put this post online. It’s far from comprehensive, but it points at what I’m concerned about more specifically than anything else.
I also find that to be the most plausible route by which you actually get to extinction, but it’s way more speculative (to me) than the arguments I’m using above.
So this observed fact doesn’t help us distinguish between “everyone in AI thinks that making AIs which intend to do what we want is an integral part of their mission, but that the ‘intend’ bit will be easy” vs “everyone in AI is just trying to build machines that can achieve hardcoded literal objectives even if it’s very difficult to hardcode what we actually want”.
??? I agree that you can’t literally rule the first position out, but I’ve talked to many people in AI, and the closest people get to this position is saying “well maybe the ‘intend’ bit will be easy”; I haven’t seen anyone argue for it.
I feel like you’re equivocating between what AI researchers want (obviously they don’t want extinction) and what they actually do (things that, if extrapolated naively, would lead to extinction).
I agree that they will start (and have started) working on the ‘intend’ bit once it’s important, but to my mind that means at that point they will have started working on the category of work that we call “AI safety”. This is consistent with my statement above:
Therefore, if we want superintelligent AI systems that don’t have these problems, we need to change how AI is done.
(We in that statement was meant to refer to humanity as a whole.)
And without distinguishing them, the “stated goal of AI” has no predictive power (if it even exists).
I specifically said this was not a prediction for this reason:
This doesn’t tell you the probability with which superintelligent AI has convergent instrumental subgoals, since maybe we were always going to change how AI is done
Nonetheless, it still establishes “AI safety work needs to be done by someone”, which seems like the important bit.
Perhaps you think that to motivate work by EAs on AI safety, you need to robustly demonstrate that a) there is a problem AND b) the problem won’t be solved by default. I think this standard eliminates basically all x-risk prevention efforts, because you can always say “but if it’s so important, someone else will probably solve it” (a thing that I think is approximately true).
(I don’t think this is actually your position though, because the same critique could be applied to your new post.)
If you use a perfect optimizer and train in the real world with what you would intuitively call a “certain specification”, an existential catastrophe almost certainly happens. Given agreement on this fact, I’m just saying that I want a better argument for safety than “it’s fine because we have a less-than-perfect optimizer”
I think this is the central point of disagreement. I agree that perfect optimisers are pathological. But we are not going to train anything that is within light-years of perfect optimisation. Perfect optimisation is a totally different type of thing to what we’re doing. This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.” It may be true that building sufficiently big bombs will destroy the earth, but the mechanism in the limit of size is not the relevant one, and is only very loosely analogous to the mechanism we’re actually worried about. (In the case of AI, to be very explicit, I’m saying that inner misalignment is the thing which might kill us, and that outer misalignment of perfect optimizers is the thing that’s only very loosely analogous to it. Outer misalignment of imperfect optimisers is somewhere in the middle).
The rest of this comment is more meta.
The reason I am particularly concerned about spreading arguments related to perfect optimisers is threefold. Firstly because it feels reminiscent of the utility-maximisation arguments made by Yudkowsky—in both cases the arguments are based on theoretical claims which are literally true but in practice irrelevant or vacuous. This is specifically what made the utility-maximisation argument so misleading, and why I don’t want another argument of this type to gain traction.
Secondly because I think that five years ago, if you’d asked a top ML researcher why they didn’t believe in the existing arguments for AI risk, they’d have said something like:
Well, the utility function thing is a trivial mathematical result. And the argument about paperclips is dumb because the way we train AIs is by giving them rewards when they do things we like, and we’re not going to give them arbitrarily high rewards for building arbitrarily many paperclips. What if we write down the wrong specification? Well, we do that in RL but in supervised learning we use human-labeled data, so if there’s any issue with written specifications we can use that approach.
I think that these arguments would have been correct rebuttals to the public arguments for AI risk which existed at that time. We may have an object-level disagreement about whether a top ML researcher would actually have said something like this, but I am now strongly inclined to give the benefit of the doubt to mainstream ML researchers when I try to understand their positions. In particular, if I were in their epistemic position, I’m not sure I would make specific arguments for why the “intends” bit will be easy either, because it’s just the default hypothesis: we train things, then if they don’t do what we want, we train them better.
Thirdly, because I am epistemically paranoid about giving arguments which aren’t actually the main reason to believe in a thing. I agree that the post I linked is super speculative, but if someone disproved the core intuitions that the post is based on, that’d make a huge dent in my estimates of AI risk. Whereas I suspect that the same is not really the case for you and the argument you give (although I feel a bit weird asserting things about your beliefs, so I’m happy to concede this point if you disagree). Firstly because (even disregarding my other objections) it doesn’t establish that AI safety work needs to be done by someone, it just establishes that AI researchers have to avoid naively extrapolating their current work. Maybe they could extrapolate it in non-naive ways that don’t look anything like safety work. “Don’t continue on the naively extrapolated path” is often a really low bar, because naive extrapolations can be very dubious (if we naively extrapolate a baby’s growth, it’ll end up the size of the earth pretty quickly). Secondly because the argument is also true for image classifiers, since under perfect optimisation they could hack their loss functions. Insofar as we’re much less worried about them than RL agents, most of the work needed to establish the danger of the latter must be done by some other argument. Thirdly because I do think that counterfactual impact is the important bit, not “AI safety work needs to be done by someone.” I don’t think there needs to be a robust demonstration that the problem won’t be solved by default, but there do need to be some nontrivial arguments. In my scenario, one such argument is that we won’t know what effects our labels will have on the agent’s learned goals, so there’s no easy way to pay more to get more safety. Other arguments that fill this role are appeals to fast takeoff, competitive pressures, etc.
I specifically said this was not a prediction for this reason
I didn’t read this bit carefully enough, mea culpa. I’m still not sure what the value of a “default assumption” is if it’s not predictive, though.
(We in that statement was meant to refer to humanity as a whole.)
I also didn’t pick up on the we = humanity thing, sorry. Makes more sense now.
I agree that perfect optimisers are pathological. But we are not going to train anything that is within light-years of perfect optimisation. Perfect optimisation is a totally different type of thing to what we’re doing.
If you replace “perfect optimization” with “significantly-better-than-human optimization” in all of my claims, I’d continue to agree with them.
This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.”
If somehow I knew that this fact were true, but I didn’t know at what size the bombs form a black hole and destroy us all, I would in fact see this as a valid and motivating argument for not building bigger bombs, and for trying to figure out how to build bombs that don’t destroy the Earth (or coordinate to not build them at all).
Firstly because it feels reminiscent of the utility-maximisation arguments made by Yudkowsky—in both cases the arguments are based on theoretical claims which are literally true but in practice irrelevant or vacuous.
I strongly disagree with this.
The utility-maximization argument that I disagree with is something like:
“AI is superintelligent” implies “AI is EU-maximizing” implies “AI has convergent instrumental subgoals”.
This claim is not true even theoretically. It’s not a question of what’s happening in practice.
There is a separate argument which goes
“Superintelligent AI is built by humans” implies “AI is goal-directed” implies “AI has convergent instrumental subgoals”
And I place non-trivial weight on this claim, even though it is a conceptual, fuzzy claim that we’re not sure yet will be relevant in practice, and one of the implications doesn’t apply in the case where the AI is pursuing some “meta” goal that refers to the human’s goals.
(You might disagree with this analysis as well, but I’d guess you’d be in the minority amongst AI safety researchers.)
The argument I gave is much more like the second kind—a conceptual claim that depends on fuzzy categories like “certain specifications”.
Secondly [...]
Sorry, I don’t understand your point here. It sounds like “the last time we made an argument, we were wrong, therefore we shouldn’t make more arguments”, but that can’t be what you’re saying.
Maybe your point is that ML researchers are more competent than we give them credit for, and so we should lower our probability of x-risk? If so, I mostly just want to ignore this; I’m really not making a probabilistic argument. I’m making an argument “from the perspective of humanity / the full AI community”.
I think spreading the argument “if we don’t do X, then we are in trouble because of problem Y” seems better than spreading something like “there is a p% chance of having problem Y, where I’ve taken into account the fact that people will try to solve Y, and that won’t be sufficient because of Z; therefore we need to put more effort into X”. The former is easier to understand and more likely to be true / correctly reasoned.
(I would also defend “the chance is not so low that EAs should ignore it”, but that’s a separate conversation, and seems not very relevant to what arguments we should spread amongst the AI community.)
Thirdly, because I am epistemically paranoid about giving arguments which aren’t actually the main reason to believe in a thing. [...] I suspect that the same is not really the case for you and the argument you give.
It totally is. I have basically two main concerns with AI alignment:
1. We’re aiming for the wrong thing (outer alignment)
2. Even if we aim for the right thing, we might generalize poorly (inner alignment)
If you told me that inner alignment was magically not a problem—we always generalize in the way that the reward function would have incentivized—I would still be worried; though it would make a significant dent in my AI risk estimate.
If you told me that outer alignment was magically not a problem (we’re actually aiming for the right thing), that would make a smaller but still significant dent in my estimate of AI risk. It’s only smaller because I expect the work to solve this problem to be done by default, whereas I feel less confident about that for inner alignment.
it doesn’t establish that AI safety work needs to be done by someone, it just establishes that AI researchers have to avoid naively extrapolating their current work.
Why is “not naively extrapolating their current work” not an example of AI safety work? Like, presumably they need to extrapolate in some as-yet-unknown way, figuring out that way sounds like a central example of AI safety work.
It seems analogous to “biologists just have to not publish infohazards, therefore there’s no need to work on the malicious use category of biorisk”.
Secondly because the argument is also true for image classifiers, since under perfect optimisation they could hack their loss functions. So insofar as we’re not worried about them, then the actual work is being done by some other argument.
I’m not worried about them because there are riskier systems that will be built first, and because there isn’t much economic value in having strongly superintelligent image classifiers. If we really tried to build strongly superintelligent image classifiers, I would be somewhat worried (though less so, since the restricted action space provides some safety).
(You might also think that image classifiers are safe because they are myopic, but in this world I’m imagining that we make non-myopic image classifiers, because they will be better at classifying images than myopic ones.)
Thirdly because I do think that counterfactual impact is the important bit, not “AI safety work needs to be done by someone.”
I do think that there is counterfactual impact in expectation. I don’t know why you think there isn’t counterfactual impact. So far it sounds to me like “we should give the benefit of the doubt to ML researchers and assume they’ll solve outer alignment”, which sounds like a claim about norms, not a claim about the world.
I think the better argument against counterfactual impact is “there will be a strong economic incentive to solve these problems” (see e.g. here), and that might reduce it by an order of magnitude, but that still leaves a lot of possible impact. But also, I think this argument applies to inner alignment as well (though less strongly).
A few more meta points:
I’m very surprised that we’re six levels deep into a disagreement and still actively confused about each other’s arguments. I thought our opinions were much more similar. This suggests that we should schedule a time to talk in person, and/or an adversarial collaboration trying to write a version of the argument that you’re thinking of. (The latter might be more efficient than this exchange, while also producing useful public records).
Thanks for the thorough + high-quality engagement, I really appreciate it.
Due to time constraints I’ll just try hit two key points in this reply (even though I don’t think your responses resolved any of the other points for me, which I’m still very surprised by).
If you replace “perfect optimization” with “significantly-better-than-human optimization” in all of my claims, I’d continue to agree with them.
We are already at significantly-better-than-human optimisation, because none of us can take an environment and output a neural network that does well in that environment, but stochastic gradient descent can. We could make SGD many many times better and it still wouldn’t produce a malicious superintelligence when trained on CIFAR, because there just isn’t any gradient pushing it in the direction of intelligence; it’ll train an agent to memorise the dataset far before that. And if the path to tampering is a few dozen steps long, the optimiser won’t find it before the heat death of the universe (because the agent has no concept of tampering to work from, all it knows is CIFAR). So when we’re talking about not-literally-perfect optimisers, you definitely need more than just amazing optimisation and hard-coded objective functions for trouble to occur—you also need lots of information about the world, maybe a bunch of interaction with it, maybe a curriculum. This is where the meat of the argument is, to me.
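(To make the “SGD is already a better-than-human optimizer over network weights” point concrete, here is a minimal sketch of the kind of loop being described: a linear classifier trained by SGD on stand-in data with CIFAR-like shapes. The data is random so the snippet runs without any downloads; it is my illustration, not part of the original comment. The relevant feature is that the objective only ever references the (image, label) pairs.)

```python
# Minimal SGD sketch: search over classifier weights to fit a fixed labeled dataset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3 * 32 * 32))   # stand-in for flattened 32x32 RGB images
y = rng.integers(0, 10, size=512)         # stand-in for 10 class labels
W = np.zeros((3 * 32 * 32, 10))           # linear classifier weights

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr, batch = 0.1, 64
for step in range(200):
    idx = rng.integers(0, len(X), size=batch)
    probs = softmax(X[idx] @ W)
    probs[np.arange(batch), y[idx]] -= 1.0     # gradient of cross-entropy w.r.t. logits
    W -= lr * (X[idx].T @ probs) / batch       # SGD step: follow the loss gradient

# The only "environment" the optimizer ever sees is this dataset and loss; there is
# no gradient signal toward modeling or influencing anything outside of it.
```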
I think spreading the argument “if we don’t do X, then we are in trouble because of problem Y” seems better. … The former is easier to understand and more likely to be true / correctly reasoned.
I previously said:
I’m still not sure what the value of a “default assumption” is if it’s not predictive, though.
And I still have this confusion. It doesn’t matter if the argument is true and easy to understand if it’s not action-guiding for anyone. Compare the argument: “if we (=humanity) don’t remember to eat food in 2021, then everyone will die”. Almost certainly true. Very easy to understand. Totally skips the key issue, which is why we should assign high enough probability to this specific hypothetical to bother worrying about it.
So then I guess your response is something like “But everyone forgetting to eat food is a crazy scenario, whereas the naive extrapolation of the thing we’re currently doing is the default scenario”. (Also, sorry if this dialogue format is annoying, I found it an easy way to organise my thoughts, but I appreciate that it runs the risk of strawmanning you).
To which I respond: there are many ways of naively extrapolating “the thing we are currently doing”. For example, the thing we’re currently doing is building AI with a 100% success record at not taking over the world. So my naive extrapolation says we’ll definitely be fine. Why should I pay any attention to your naive extrapolation?
I then picture you saying: “I’m not using these extrapolations to make probabilistic predictions, so I don’t need to argue that mine is more relevant than yours. I’m merely saying: once our optimisers get really really good, if we give them a hard-coded objective function, things will go badly. Therefore we, as humanity, should do {the set of things which will not lead to really good optimisers training on hard-coded objective functions}.”
To which I firstly say: no, I don’t buy the claim that once our optimisers get really really good, if we give them a hard-coded objective function, “an existential catastrophe almost certainly happens”. For reasons which I described above.
Secondly, even if I do accept your claim, I think I could just point out: “You’ve defined what we should do in terms of its outcomes, but in an explicitly non-probabilistic way. So if the entire ML community hears your argument, agrees with it, and then commits to doing exactly what they were already doing for the next fifty years, you have no grounds to complain, because you have not actually made any probabilistic claims about whether “exactly what they were already doing for the next fifty years” will lead to catastrophe.” So again, why is this argument worth making?
Man, this last point felt really nitpicky, but I don’t know how else to convey my intuitive feeling that there’s some sort of motte and bailey happening in your argument. Again, let’s discuss this higher-bandwidth.
This suggests that we should schedule a time to talk in person, and/or an adversarial collaboration trying to write a version of the argument that you’re thinking of.
Sounds good, I’ll just clarify my position in this response, rather than arguing against your claims.
So then I guess your response is something like “But everyone forgetting to eat food is a crazy scenario, whereas the naive extrapolation of the thing we’re currently doing is the default scenario”.
It’s more like “there isn’t any intellectual work to be done / field building to do / actors to coordinate to get everyone to eat”.
Whereas in the AI case, I don’t know how we’re going to fix the problem I outlined; and as far as I can tell nor does anyone else in the AI community, and therefore there is intellectual work to be done.
We are already at significantly-better-than-human optimisation
Sorry, by optimization there I meant something more like “intelligence”. I don’t really care whether it comes from better SGD, some hardcoded planning algorithm, or a mesa optimizer; the question is whether it is significantly more capable than humans at pursuing goals.
I thought our opinions were much more similar.
I think our predictions of how the world will go concretely are similar; but I’d guess that I’m happier with abstract arguments that depend on fuzzy intuitive concepts than you are, and find them more compelling than more concrete ones that depend on a lot of specific details.
(FWIW, when reading the above discussion I independently had almost exactly the same reaction as the following before reading it in Richard’s latest comment:
This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.”
)
Just want to say that I’ve found this exchange quite interesting, and would be keen to read an adversarial collaboration between you two on this sort of thing. Seems like that would be a good addition to the set of discussions there’ve been about key cruxes related to AI safety/alignment.
(ETA: Actually, I’ve gone ahead and linked to this comment thread in that list as well, for now, as it was already quite interesting.)