If you use a perfect optimizer and train in the real world with what you would intuitively call a “certain specification”, an existential catastrophe almost certainly happens. Given agreement on this fact, I’m just saying that I want a better argument for safety than “it’s fine because we have a less-than-perfect optimizer”
I think this is the central point of disagreement. I agree that perfect optimisers are pathological. But we are not going to train anything that is within light-years of perfect optimisation. Perfect optimisation is a totally different type of thing to what we’re doing. This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.” It may be true that building sufficiently big bombs will destroy the Earth, but the mechanism in the limit of size is not the relevant one, and is only very loosely analogous to the mechanism we’re actually worried about. (In the case of AI, to be very explicit, I’m saying that inner misalignment is the thing which might kill us, and that outer misalignment of perfect optimisers is the thing that’s only very loosely analogous to it. Outer misalignment of imperfect optimisers is somewhere in the middle.)
The rest of this comment is more meta.
The reason I am particularly concerned about spreading arguments related to perfect optimisers is threefold. Firstly because it feels reminiscent of the utility-maximisation arguments made by Yudkowsky—in both cases the arguments are based on theoretical claims which are literally true but in practice irrelevant or vacuous. This is specifically what made the utility-maximisation argument so misleading, and why I don’t want another argument of this type to gain traction.
Secondly because I think that five years ago, if you’d asked a top ML researcher why they didn’t believe in the existing arguments for AI risk, they’d have said something like:
Well, the utility function thing is a trivial mathematical result. And the argument about paperclips is dumb because the way we train AIs is by giving them rewards when they do things we like, and we’re not going to give them arbitrarily high rewards for building arbitrarily many paperclips. What if we write down the wrong specification? Well, we do that in RL but in supervised learning we use human-labeled data, so if there’s any issue with written specifications we can use that approach.
I think that these arguments would have been correct rebuttals to the public arguments for AI risk which existed at that time. We may have an object-level disagreement about whether a top ML researcher would actually have said something like this, but I am now strongly inclined to give the benefit of the doubt to mainstream ML researchers when I try to understand their positions. In particular, if I were in their epistemic position, I’m not sure I would make specific arguments for why the “intends” bit will be easy either, because it’s just the default hypothesis: we train things, then if they don’t do what we want, we train them better.
Thirdly, because I am epistemically paranoid about giving arguments which aren’t actually the main reason to believe in a thing. I agree that the post I linked is super speculative, but if someone disproved the core intuitions that the post is based on, that’d make a huge dent in my estimates of AI risk. Whereas I suspect that the same is not really the case for you and the argument you give (although I feel a bit weird asserting things about your beliefs, so I’m happy to concede this point if you disagree).

Firstly because (even disregarding my other objections) it doesn’t establish that AI safety work needs to be done by someone, it just establishes that AI researchers have to avoid naively extrapolating their current work. Maybe they could extrapolate it in non-naive ways that don’t look anything like safety work. “Don’t continue on the naively extrapolated path” is often a really low bar, because naive extrapolations can be very dubious (if we naively extrapolate a baby’s growth, it’ll end up the size of the Earth pretty quickly).

Secondly because the argument is also true for image classifiers, since under perfect optimisation they could hack their loss functions. Insofar as we’re much less worried about them than RL agents, most of the work needed to establish the danger of the latter must be done by some other argument.

Thirdly because I do think that counterfactual impact is the important bit, not “AI safety work needs to be done by someone.” I don’t think there needs to be a robust demonstration that the problem won’t be solved by default, but there do need to be some nontrivial arguments. In my scenario, one such argument is that we won’t know what effects our labels will have on the agent’s learned goals, so there’s no easy way to pay more to get more safety. Other arguments that fill this role are appeals to fast takeoff, competitive pressures, etc.
I specifically said this was not a prediction for this reason
I didn’t read this bit carefully enough, mea culpa. I’m still not sure what the value of a “default assumption” is if it’s not predictive, though.
(We in that statement was meant to refer to humanity as a whole.)
I also didn’t pick up on the we = humanity thing, sorry. Makes more sense now.
I agree that perfect optimisers are pathological. But we are not going to train anything that is within light-years of perfect optimisation. Perfect optimisation is a totally different type of thing to what we’re doing.
If you replace “perfect optimization” with “significantly-better-than-human optimization” in all of my claims, I’d continue to agree with them.
This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.”
If somehow I knew that this fact were true, but I didn’t know at what size the bombs form a black hole and destroy us all, I would in fact see this as a valid and motivating argument for not building bigger bombs, and for trying to figure out how to build bombs that don’t destroy the Earth (or coordinate to not build them at all).
Firstly because it feels reminiscent of the utility-maximisation arguments made by Yudkowsky—in both cases the arguments are based on theoretical claims which are literally true but in practice irrelevant or vacuous.
I strongly disagree with this.
The utility-maximization argument that I disagree with is something like:
“AI is superintelligent” implies “AI is EU-maximizing” implies “AI has convergent instrumental subgoals”.
This claim is not true even theoretically. It’s not a question of what’s happening in practice.
There is a separate argument which goes
“Superintelligent AI is built by humans” implies “AI is goal-directed” implies “AI has convergent instrumental subgoals”
And I place non-trivial weight on this claim, even though it is a conceptual, fuzzy claim that we’re not sure yet will be relevant in practice, and one of the implications doesn’t apply in the case where the AI is pursuing some “meta” goal that refers to the human’s goals.
(You might disagree with this analysis as well, but I’d guess you’d be in the minority amongst AI safety researchers.)
The argument I gave is much more like the second kind—a conceptual claim that depends on fuzzy categories like “certain specifications”.
Secondly [...]
Sorry, I don’t understand your point here. It sounds like “the last time we made an argument, we were wrong, therefore we shouldn’t make more arguments”, but that can’t be what you’re saying.
Maybe your point is that ML researchers are more competent than we give them credit for, and so we should lower our probability of x-risk? If so, I mostly just want to ignore this; I’m really not making a probabilistic argument. I’m making an argument “from the perspective of humanity / the full AI community”.
I think spreading the argument “if we don’t do X, then we are in trouble because of problem Y” seems better than spreading something like “there is a p% chance of having problem Y, where I’ve taken into account the fact that people will try to solve Y, and that won’t be sufficient because of Z; therefore we need to put more effort into X”. The former is easier to understand and more likely to be true / correctly reasoned.
(I would also defend “the chance is not so low that EAs should ignore it”, but that’s a separate conversation, and seems not very relevant to what arguments we should spread amongst the AI community.)
Thirdly, because I am epistemically paranoid about giving arguments which aren’t actually the main reason to believe in a thing. [...] I suspect that the same is not really the case for you and the argument you give.
It totally is. I have basically two main concerns with AI alignment:
We’re aiming for the wrong thing (outer alignment)
Even if we aim for the right thing, we might generalize poorly (inner alignment)
If you told me that inner alignment was magically not a problem—we always generalize in the way that the reward function would have incentivized—I would still be worried; though it would make a significant dent in my AI risk estimate.
If you told me that outer alignment was magically not a problem (we’re actually aiming for the right thing), that would make a smaller but still significant dent in my estimate of AI risk. It’s only smaller because I expect the work to solve this problem to be done by default, whereas I feel less confident about that for inner alignment.
it doesn’t establish that AI safety work needs to be done by someone, it just establishes that AI researchers have to avoid naively extrapolating their current work.
Why is “not naively extrapolating their current work” not an example of AI safety work? Like, presumably they need to extrapolate in some as-yet-unknown way, and figuring out that way sounds like a central example of AI safety work.
It seems analogous to “biologists just have to not publish infohazards, therefore there’s no need to work on the malicious use category of biorisk”.
Secondly because the argument is also true for image classifiers, since under perfect optimisation they could hack their loss functions. So insofar as we’re not worried about them, then the actual work is being done by some other argument.
I’m not worried about them because there are riskier systems that will be built first, and because there isn’t much economic value in having strongly superintelligent image classifiers. If we really tried to build strongly superintelligent image classifiers, I would be somewhat worried (though less so, since the restricted action space provides some safety).
(You might also think that image classifiers are safe because they are myopic, but in this world I’m imagining that we make non-myopic image classifiers, because they will be better at classifying images than myopic ones.)
Thirdly because I do think that counterfactual impact is the important bit, not “AI safety work needs to be done by someone.”
I do think that there is counterfactual impact in expectation. I don’t know why you think there isn’t counterfactual impact. So far it sounds to me like “we should give the benefit of the doubt to ML researchers and assume they’ll solve outer alignment”, which sounds like a claim about norms, not a claim about the world.
I think the better argument against counterfactual impact is “there will be a strong economic incentive to solve these problems” (see e.g. here), and that might reduce it by an order of magnitude, but that still leaves a lot of possible impact. But also, I think this argument applies to inner alignment as well (though less strongly).
A few more meta points:

I’m very surprised that we’re six levels deep into a disagreement and still actively confused about each other’s arguments. I thought our opinions were much more similar. This suggests that we should schedule a time to talk in person, and/or an adversarial collaboration trying to write a version of the argument that you’re thinking of. (The latter might be more efficient than this exchange, while also producing useful public records).
Thanks for the thorough + high-quality engagement, I really appreciate it.
Due to time constraints I’ll just try to hit two key points in this reply (even though I don’t think your responses resolved any of the other points for me, which I’m still very surprised by).
If you replace “perfect optimization” with “significantly-better-than-human optimization” in all of my claims, I’d continue to agree with them.
We are already at significantly-better-than-human optimisation, because none of us can take an environment and output a neural network that does well in that environment, but stochastic gradient descent can. We could make SGD many, many times better and it still wouldn’t produce a malicious superintelligence when trained on CIFAR, because there just isn’t any gradient pushing it in the direction of intelligence; it’ll train an agent to memorise the dataset long before that. And if the path to tampering is a few dozen steps long, the optimiser won’t find it before the heat death of the universe (because the agent has no concept of tampering to work from; all it knows is CIFAR). So when we’re talking about not-literally-perfect optimisers, you definitely need more than just amazing optimisation and hard-coded objective functions for trouble to occur—you also need lots of information about the world, maybe a bunch of interaction with it, maybe a curriculum. This is where the meat of the argument is, to me.
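To make concrete what “trained on CIFAR” with a hard-coded objective looks like here, below is a minimal sketch of the kind of training loop being described (assuming a standard PyTorch + torchvision setup; the model and hyperparameters are illustrative choices of mine, not anything from this exchange). The only signal SGD ever receives in this setup is the classification loss on labelled images, so the gradient rewards fitting the labels and nothing else.

```python
# Minimal sketch: SGD on CIFAR-10 with a hard-coded objective (cross-entropy on labels).
# Illustrative only; the architecture and hyperparameters are arbitrary.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(  # deliberately tiny; the point is the objective, not the model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()  # the entire "specification" the optimiser sees

for images, labels in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), labels)  # gradient exists only w.r.t. fitting these labels
    loss.backward()
    opt.step()
```

However good the optimiser in the last four lines gets, the objective it is applied to never mentions anything beyond these images and labels, which is the sense in which amazing optimisation plus a hard-coded objective is not by itself enough for the failure mode under discussion.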
I think spreading the argument “if we don’t do X, then we are in trouble because of problem Y” seems better. … The former is easier to understand and more likely to be true / correctly reasoned.
I previously said:
I’m still not sure what the value of a “default assumption” is if it’s not predictive, though.
And I still have this confusion. It doesn’t matter if the argument is true and easy to understand if it’s not action-guiding for anyone. Compare the argument: “if we (=humanity) don’t remember to eat food in 2021, then everyone will die”. Almost certainly true. Very easy to understand. Totally skips the key issue, which is why we should assign high enough probability to this specific hypothetical to bother worrying about it.
So then I guess your response is something like “But everyone forgetting to eat food is a crazy scenario, whereas the naive extrapolation of the thing we’re currently doing is the default scenario”. (Also, sorry if this dialogue format is annoying; I found it an easy way to organise my thoughts, but I appreciate that it runs the risk of strawmanning you).
To which I respond: there are many ways of naively extrapolating “the thing we are currently doing”. For example, the thing we’re currently doing is building AI with a 100% success record at not taking over the world. So my naive extrapolation says we’ll definitely be fine. Why should I pay any attention to your naive extrapolation?
I then picture you saying: “I’m not using these extrapolations to make probabilistic predictions, so I don’t need to argue that mine is more relevant than yours. I’m merely saying: once our optimisers get really really good, if we give them a hard-coded objective function, things will go badly. Therefore we, as humanity, should do {the set of things which will not lead to really good optimisers training on hard-coded objective functions}.”
To which I firstly say: no, I don’t buy the claim that once our optimisers get really really good, if we give them a hard-coded objective function, “an existential catastrophe almost certainly happens”. For reasons which I described above.
Secondly, even if I do accept your claim, I think I could just point out: “You’ve defined what we should do in terms of its outcomes, but in an explicitly non-probabilistic way. So if the entire ML community hears your argument, agrees with it, and then commits to doing exactly what they were already doing for the next fifty years, you have no grounds to complain, because you have not actually made any probabilistic claims about whether ‘exactly what they were already doing for the next fifty years’ will lead to catastrophe.” So again, why is this argument worth making?
Man, this last point felt really nitpicky, but I don’t know how else to convey my intuitive feeling that there’s some sort of motte and bailey happening in your argument. Again, let’s discuss this higher-bandwidth.
This suggests that we should schedule a time to talk in person, and/or an adversarial collaboration trying to write a version of the argument that you’re thinking of.
Sounds good, I’ll just clarify my position in this response, rather than arguing against your claims.
So then I guess your response is something like “But everyone forgetting to eat food is a crazy scenario, whereas the naive extrapolation of the thing we’re currently doing is the default scenario”.
It’s more like “there isn’t any intellectual work to be done / field building to do / actors to coordinate to get everyone to eat”.
Whereas in the AI case, I don’t know how we’re going to fix the problem I outlined, and as far as I can tell neither does anyone else in the AI community; therefore there is intellectual work to be done.
We are already at significantly-better-than-human optimisation
Sorry, by optimization there I meant something more like “intelligence”. I don’t really care whether it comes from better SGD, some hardcoded planning algorithm, or a mesa optimizer; the question is whether it is significantly more capable than humans at pursuing goals.
I thought our opinions were much more similar.
I think our predictions of how the world will go concretely are similar; but I’d guess that I’m happier with abstract arguments that depend on fuzzy intuitive concepts than you are, and find them more compelling than more concrete ones that depend on a lot of specific details.
(FWIW, when reading the above discussion I independently had almost exactly the same reaction as the following before reading it in Richard’s latest comment:
This argument feels to me like saying “We shouldn’t keep building bigger and bigger bombs because in the limit of size they’ll form a black hole and destroy the Earth.”

)
Just want to say that I’ve found this exchange quite interesting, and would be keen to read an adversarial collaboration between you two on this sort of thing. Seems like that would be a good addition to the set of discussions there’ve been about key cruxes related to AI safety/alignment.
(ETA: Actually, I’ve gone ahead and linked to this comment thread in that list as well, for now, as it was already quite interesting.)