The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly: In brief, the book mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence. But most researchers now recognize that this argument is not applicable to modern ML systems which learn values, along with everything else, from vast amounts of human-generated data.
For what it’s worth, the book does discuss value learning as a way of an AI acquiring values—you can see chapter 13 as being basically about this.
I would describe the core argument of the book as the following (going off of my notes of chapter 8, “Is the default outcome doom?”):
It is possible to build AI that’s much smarter than humans.
This process could loop in on itself, leading to takeoff that could be slow or fast.
A superintelligence could gain a decisive strategic advantage and form a singleton.
Due to the orthogonality thesis, this superintelligence would not necessarily be aligned with human interests.
Due to instrumental convergence, an unaligned superintelligence would likely take over the world.
Because of the possibility of a treacherous turn, we cannot reliably check the safety of an AI on a training set.
There are things to complain about in this argument (a lot of “could”s that don’t necessarily cash out to high probabilities), but I don’t think it (or the book) assumes that we will manually program a set of values into an AGI.
Yep I am aware of the value learning section of Chapter 12, which is why I used the “mostly” qualifier. That said he basically imagines something like Stuart Russell’s CIRL, rather than anything like LLMs or imitation learning.
If we treat the Orthogonality Thesis as the crux of the book, I also think the book has aged poorly. In fact it should have been obvious when the book was written that the Thesis is basically a motte-and-bailey where you argue for a super weak claim (any combo of intelligence and goals is logically possible), which is itself dubious IMO but easy to defend, and then pretend like you’ve proven something much stronger, like “intelligence and goals will be empirically uncorrelated in the systems we actually build” or something.
I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that’s remotely favorable to us is the evidence we have from evolution[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won’t apply to any AI trained with something resembling modern methods.
Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
Anticipating the argument that, since we’re doing the training, we can shape the goals of the systems—this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don’t have either, right now.
Which, mind you, is still unfavorable; I think the goals of most animal species, were they to be extrapolated outward to superhuman levels of intelligence, would not result in worlds that we would consider very good. Just not nearly as unfavorable as what I think the actual distribution we’re facing is.
What does this even mean? I’m pretty skeptical of the realist attitude toward “goals” that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system’s behavior in some domains. But I think the concept is a leaky abstraction that will lead you astray if you take it too seriously or apply it outside the domain it was designed for.
We clearly can steer AI’s behavior really well in the training environment. The question is just whether this generalizes. So it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they’ll generalize pretty well to unseen domains. And as I said in the essay I don’t think the whole jailbreaking thing is any evidence for pessimism— it’s exactly what you’d expect of aligned human mind uploads in the same situation.
I could make this same argument about capabilities, and be demonstrably wrong. The space of neural network values that don’t produce coherent grammar is unimaginably, ridiculously vast compared to the “tiny target” of ones that do. But this obviously doesn’t mean that ChatGPT is impossible.
The reason is that we aren’t randomly throwing a dart at possibility space, but using a highly efficient search mechanism to rapidly toss out bad designs until we hit the target. But when these machines are trained, we simultaneously select for capabilities and for alignment (murderbots are not efficient translators). For ChatGPT, this leads to an “aligned” machine, at least by some definitions.
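To make the dart-versus-search contrast concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the 10-dimensional parameter space, the quadratic `loss`, the tolerance, and the function names `random_search` and `guided_search` are all hypothetical, and simple hill climbing stands in for training rather than reproducing how LLMs are actually trained. It only shows that a tiny target region can be practically unreachable by independent random draws yet easy to reach with a search guided by a loss signal. It says nothing about whether alignment provides a comparably usable signal, which is the point under dispute in this thread.

```python
import random

def loss(params, target):
    # Squared distance to the hidden target; "hitting the target" means loss < tol.
    return sum((p - t) ** 2 for p, t in zip(params, target))

def random_search(target, tol, tries, rng):
    # Throw independent darts at the cube [-1, 1]^dim and count the hits.
    dim = len(target)
    hits = 0
    for _ in range(tries):
        guess = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if loss(guess, target) < tol:
            hits += 1
    return hits

def guided_search(target, tol, max_steps, rng, step=0.02):
    # Hill climbing: propose a small perturbation, keep it only if the loss drops.
    # Returns the number of proposals used, or None if the target was never reached.
    dim = len(target)
    current = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    current_loss = loss(current, target)
    for n in range(max_steps):
        if current_loss < tol:
            return n
        proposal = [p + rng.gauss(0.0, step) for p in current]
        proposal_loss = loss(proposal, target)
        if proposal_loss < current_loss:
            current, current_loss = proposal, proposal_loss
    return None

rng = random.Random(0)  # fixed seed so the toy run is repeatable
target = [rng.uniform(-0.5, 0.5) for _ in range(10)]
darts = random_search(target, tol=0.01, tries=20_000, rng=rng)
steps = guided_search(target, tol=0.01, max_steps=20_000, rng=rng)
print("random darts that hit:", darts)
print("guided search proposals needed:", steps)
```

The target region here occupies a vanishingly small fraction of the cube, so the random darts essentially never land in it, while the guided search gets there with a modest number of proposals because each accepted step uses feedback from the loss.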
Where I think the motte and bailey often occurs is jumping between “aligned enough not to exterminate us”, and “aligned with us nearly perfectly in every way” or “unable to be misused by bad actors”. The former seems like it might happen naturally over development, whereas the latter two seem nigh impossible.
The argument w.r.t. capabilities is disanalogous.
Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities—though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].
We are not doing the same thing to select for alignment, because “alignment” is:
an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc, rather than “observable external behavior” (which is what we use to evaluate capabilities & steer training)
something that might be inextricably tied to the form of general intelligence which by default puts us in the “dangerous capabilities” regime, or if not strongly bound in theory, then strongly bound in practice
I do think this disagreement is substantially downstream of a disagreement about what “alignment” represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it’s an open empirical question. The way I understand it is that there’s a reward signal (human feedback) that’s shaping different parts of the neural network that determines GPT-4’s outputs, and we don’t have good enough interpretability techniques to know whether some parts of the neural network are representations of “goals”, and even less so what specific goals they are.
I would’ve thought it’s an open question whether even base models have internal representations of “goals”, either always active or only active in some specific context. For example if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.
(would love to be corrected :D)
I don’t know if it’s commonly agreed upon; that’s just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
Or, to rephrase: how is “secretly planning to murder all humans” improving the model’s scores on a benchmark? First, if you think about it, what gradient from the training set would even lead to an inner cognitive process looking for a chance to betray? What force is causing this cognitive process to come out of the random initial weights?
Humans seem to have such a force but it’s because modeling “if I kill this rival then the reward for me is...” was evolutionarily useful. Also it’s probably a behavior that is learned.
And second, yeah, SGD should push “neutral” weights inside the network that are not contributing to correct answers towards weights that do increase the odds of a correct output distribution. So it should actively destroy “unnecessary” cognitive processes inside the model.
You could prove this. Make a psychopathic model designed to “betray” in a game-like world, and then see how many rounds of training on a new dataset it takes to clear the model’s ability to kill when doing so improves its score.
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes the model instrumentally value power, among other things → a very advanced system that is great at long-term planning might conclude that “murdering all humans” is useful for power or other instrumentally convergent goals.
I think with our current interpretability techniques we wouldn’t be able to robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would turn to betrayal in many other environments.
Please stop saying that mind-space is an “enormously broad space.” What does that even mean? How have you established a measure on mind-space that isn’t totally arbitrary?
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
Why don’t you make the positive case for the space of possible (or, if you wish, likely) minds being minds which have values compatible with the fulfillment of human values? I think we have pretty strong evidence that not all minds are like this even within the space of minds produced by evolution.
Concepts do seem to be convergent to some degree (though note that ontological shifts at increasing levels of intelligence seem likely), but I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
The positive case is just super obvious, it’s that we’re trying very hard to make these systems aligned, and almost all the data we’re dumping into these systems is generated by humans and is therefore dripping with human values and concepts.
I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here).
Not really sure what you’re getting at here/why this is supposed to help your side.
What do you mean by this? (Compare “we don’t know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model”. Is this the same thing?) Is there a good writeup anywhere of why we should expect this to happen? This seems speculative and unlikely to me.
The fact that natural selection produced species with different goals/values/whatever isn’t evidence that that’s the only way to get those values, because “selection pressure” isn’t a mechanistic explanation. You need more info about how values are actually implemented to rule out that a proposed alternative route to natural selection succeeds in reproducing them.
Re: ontological shifts, see this arbital page: https://arbital.com/p/ontology_identification.
I’m not claiming that evolution is the only way to get those values, merely that there’s no reason to expect you’ll get them by default by a totally different mechanism. The fact that we don’t have a good understanding of how values form even in the biological domain is a reason for pessimism, not optimism.
The point I was trying to make is that natural selection isn’t a “mechanism” in the right sense at all. It’s a causal/historical explanation, not an account of how values are implemented. What is the evidence from evolution? The fact that species with different natural histories end up with different values really doesn’t tell us much without a discussion of mechanisms. We need to know 1) how different the mechanisms actually used to point biological and artificial cognitive systems toward ends are, and 2) how many possible mechanisms for doing so there are.
One reason for pessimism would be that human value learning has too many messy details. But LLMs are already better behaved than anything in the animal kingdom besides humans and are pretty good at intuitively following instructions, so there is not much evidence for this problem. If you think they are not so brainlike, then this is evidence that not-so-brainlike mechanisms work. And there are also theories that value learning in current AI works roughly similarly to value learning in the brain.
Which is just to say I don’t see the prior for pessimism, just from looking at evolution.
The orthogonality thesis is trivially a motte and bailey—you’re using it as one right here! The original claim by Bostrom was a statement against logical necessity: ‘an artificial mind need not care intrinsically about any of those things’ (emphasis mine); yet in your comment you’re equivocating with a statement that’s effectively about probability: ‘sampling from an extremely tiny section of an enormously broad space’.
You might be right in your claim, but your claim is not what the arguments in the orthogonality thesis papers purport to show.
I would also like to make a stronger counterclaim: I think a priori arguments about ‘probability space’ (dis)prove way too much. If you disregard empirical data, you can use them to disprove anything, like ‘the height of Earth fauna is contingent on specific details of various evolutionary pressures and environmental circumstances, and is sampled from a tiny section on the number line, so we should expect that alien fauna we encounter will be arbitrarily tall (or perhaps have negative height)’. If Earth-evolved intelligence tends even weakly to have e.g. sympathy towards non-kin, that is evidence that Earth-evolved intelligence is a biased sample, but also evidence that there exists some pull towards non-kin-sympathy in intelligence space.
My sense is that (as your footnote hints at), the more intelligent animals are, the more examples we seem to see of individual non-reciprocal altruism to non-kin (for example, there are many clear examples of non-reciprocal altruism across species in cetaceans, and less numerous but still convincing examples of it in corvids).
As a side note, the actual things that break this loop are:
(1) We don’t use superintelligent singletons and probably won’t, I hope. We instead create context-limited model instances of a larger model and tell each one only about our task, and the model doesn’t retain information. This “break an ASI into a billion instances, each of which lives only in the moment” is a powerful alignment method.
(2) It seems to take an absolutely immense amount of compute hardware to host even today’s models, which are significantly below human intelligence in some expensive-to-fix ways. (For example, how many H100s would you need for useful realtime video perception?)
This means a “rogue” singleton would have nowhere to exist, as it would be too heavy in weights and required bandwidth to run on a botnet.
This breaks everything else.
It’s telling that Bostrom’s PhD is in philosophy and I don’t see any industry experience on his wiki page. He is correct if you ignore real-world limitations on AI.
FYI, current cutting-edge large language models are trained on a massive amount of text on the internet (in the case of GPT-4, likely approximately all the text OpenAI could get their hands on). So they certainly have tons of information about stuff other than the task at hand.
This is not what that statement means.
What it means is the model has no context of its history since training. It has no way of knowing whether the task it has been given is “real”. It does not know if other copies of itself or other AIs are checking its outputs for correctness, with serious consequences if it sabotages the output. It doesn’t know it’s not still in training. It doesn’t know if there are a billion instances of it or just 1.
We can scrub all this information fairly easily and we already do this as of right now.
We can also make trick output where we try to elicit latent deception by giving information that would tell the model it’s time to betray.
We can also work backwards and find what the adversarial inputs are. When will the model change its answer for this question?