A working attempt to sketch a simple three-premise argument for the claim that TAI (transformative AI) will result in human extinction, and to offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.
The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.
1. Goal-directed superhuman AI systems will be built (let’s say conditioned on TAI).
2. If goal-directed superhuman AI systems are built, their values will result in human extinction if realized.
3. If goal-directed superhuman AI systems are built, they’ll be able to realize their values — even if their values would result in human extinction if realized.
Thus: Humanity will go extinct.
I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
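To make the arithmetic behind the ~6% explicit, here is a minimal sketch. It treats the three premises as a chain of conditional probabilities, which is my reading of how the numbers combine rather than anything stated in the argument itself:

```python
# Minimal sketch (my own, for transparency): multiplying the three premise
# probabilities as a chain of conditionals.
p1 = 0.98  # P(goal-directed superhuman AI systems are built | TAI)
p2 = 0.07  # P(their values, if realized, result in human extinction | premise 1)
p3 = 0.83  # P(they are able to realize their values | premises 1 and 2)

p_doom = p1 * p2 * p3
print(round(p_doom, 3))  # 0.057, i.e. roughly the ~6% quoted above
```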
Some more specific objections:
Obviously Premise 2 is doing a lot of the work here. I think that one of the main arguments for believing in Premise 2 is a view like Rob’s, which holds that current ML is on track to produce systems which are, “in the ways that matter”, more like ‘randomly sample (simplicity-weighted) plans’ than anything recognizably human. If future systems are sampling from simplicity-weighted plans to achieve arbitrary goals, then Premise 2 does start to look very plausible.
This basically just seems like an extremely strong claim about the inductive biases of ML systems, and my (likely unsatisfying) response boils down to: (1) I don’t see any strong argument for believing it, and (2) I see some arguments for the alternative conclusion.
I find myself really confused when trying to think about this debate. In a discussion of Rob’s post, Daniel Kokotajlo says: “IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine.”
I think I just don’t get the intuition behind his argument (tagging @kokotajlod in case he wants to correct any misunderstandings). I don’t really like ‘burden of proof’ talk, but my instinct is to say “look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
Perhaps there are near-term predictions which would bear on the dispute between the two hypotheses? I currently interpret the disagreement here as a disagreement about the relevant outcome space over which we should be uncertain, which feels hard to adjudicate. But, right now, I struggle to see the argument for the more doomy outcome space.
More on Premise 2: Paul Christiano offers various considerations which count against doom, and which appear to go through without having “solved alignment”. These considerations feel less forceful to me than the points in the bullet point above, but they still serve to make Premise 2 seem less likely.
“Given how small the [resource costs of keeping humans around] are, even tiny preferences one way or the other will dominate incidental effects from grabbing more resources”.
“There are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world”
“Most humans and human societies would be willing to spend much more than 1 trillionth of their resources (= $100/year for all of humanity) for a ton of random different goals”
Paul also mentions “decision-theoretic arguments for cooperation”, including a passing reference to ECL (evidential cooperation in large worlds).
I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.
AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.
Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.
I believe that, as you develop gradually more capable agentic systems, there are dynamic pressures towards a certain kind of coherence. But I don’t think that claim alone establishes the existence of dynamic pressures towards ‘dangerous maximizing cognition’.
I think that AGI cognition (like our own) may well involve schemas, like (say) being loyal, or virtuous. We don’t argmax(virtue). Rather, the virtue schema also applies to the process by which we search over plans.
So I don’t see why ‘having superhuman AIs run Walmart’ necessarily leads to doom, because they might just be implementing schemas like “be a good business professional”, rather than “find the function f(.) which is most ‘business-professional-like’, then maximize f(.), regardless of whether any human would consider f(.) to represent anything corresponding to a ‘good business professional’”.
Alex Turner has a related comment here.
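To make the contrast concrete, here is a toy sketch of the two selection procedures being gestured at. It is entirely my own illustration: the plan set, the scoring function, and the fits_schema predicate are hypothetical stand-ins, not anything proposed in the post.

```python
# Toy illustration (not from the post): two idealized ways an agent might pick a plan.

def maximizer_agent(plans, score):
    # "Dangerous maximizer": return whichever plan scores highest on the proxy
    # function, however alien the top-scoring plan happens to be.
    return max(plans, key=score)

def schema_agent(plans, score, fits_schema):
    # Schema-guided search: the schema constrains which plans are generated and
    # considered at all, not just how the surviving candidates are ranked.
    considered = [p for p in plans if fits_schema(p)]
    return max(considered, key=score, default=None)
```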
On Premise 3: I feel unsatisfied, so far, by accounts of AI takeover scenarios. Admittedly, it seems kinda mad for me to say “I’m confident that an AI with greater cognitive power than all of humanity couldn’t kill us if it wanted to”, which is one reason that I’m only at ~⅙ chance that we’d survive in that situation.
But I also don’t know how much my conclusion is swayed by a sense of wanting to avoid the hubris of “man, it would be Really Dumb if you said a highly capable AI couldn’t kill us if it wanted to, and then we end up dead”, rather than a more obviously virtuous form of cognition.
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’. The arguments attempting to move from ‘AI with superhuman cognitive abilities’ to ‘human extinction’ feel fuzzier than I’d like.
If superhuman systems don’t foom, we might have marginally superhuman systems, which can be thwarted before they kill literally everyone (while still doing a lot of damage). Constraints like ‘accessing the relevant physical infrastructure’ might dominate the gains from greater cognitive efficiency.
I also feel pretty confused about how much actual real-world power would be afforded to AIs in light of their highly advanced cognition (a relevant recent discussion), which further brings down my confidence in Premise 3.
I’m also assuming that: conditioned on an AI instrumentally desiring to kill all humans, deceptive alignment is likely. I haven’t read posts like this one which might challenge that assumption. If I came to believe that deceptive alignment was highly unlikely, this could lower the probability of either Premise 2 or Premise 3.
Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:
Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.
Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.
Psychologists talk of ‘g’ because performance on tasks we intuitively think of as cognitive is correlated across tasks, and also correlates with some important life outcomes. I don’t know how well the unidimensional notion of intelligence will transfer to advanced AI systems. The fact that some AIs perform decently on IQ tests without being good at much else is at least some weak evidence against the generality of the more unidimensional ‘intelligence’ concept.
However, I agree that there’s a well-defined sense in which we can say that AIs are more cognitively capable than all of humanity combined. I also think that my earlier point about expecting future systems to exhibit human-like inductive biases makes the argument in the bullet point above substantially weaker.
I remain uneasy about the extent to which a unidimensional notion of ‘capabilities’ can feed into claims about takeoffs and takeover scenarios, and I’m currently unclear on whether this makes a practical difference.
(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)
Nice analysis!
I think a main point of disagreement is that I don’t think systems need to be “dangerous maximizers” in the sense you described in order to predictably disempower humanity and then kill everyone. Humans aren’t dangerous maximizers, yet we’ve killed many species of animals, the Neanderthals, and various other human groups (genocide, wars, oppression of populations by governments, etc.). Katja’s scenario sounds plausible to me except for the part where somehow it all turns out OK in the end for humans. :)
Another, related point of disagreement:
“look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
I actually agree that current and future systems will have human-like concepts, human-like inductive biases, etc. -- relative to the space of all possible minds at least. But their values will be sufficiently alien that humanity will be in deep trouble. (Analogy: Suppose we bred some octopi to be smarter and smarter, in an environment where they were e.g. trained with Pavlovian conditioning + artificial selection to be really good at reading internet text and predicting it, and then eventually writing it also... They would indeed end up a lot more human-like than regular wild octopi. But boy would it be scary if they started getting generally smarter than humans and being integrated deeply into lots of important systems, with humans starting to trust them a lot, etc.)
thnx! : )
Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc.) to inherit human-like properties wrt cognition (including normative cognition, like plan search), and I’d put a small-but-non-negligible chance on us ending up with extinction (or worse).
On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story consists in either the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren’t necessary for human extinction, but their emergence does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.
Nice. Well, I guess we just have different intuitions then—for me, the chance of extinction or worse in the Octopcracy case seems a lot bigger than “small but non-negligible” (though I also wouldn’t put it as high as 99%).
Human groups struggle against each other for influence/power/control constantly; why wouldn’t these octopi (or AIs) also seek influence? You don’t need to be an expected utility maximizer to instrumentally converge; humans instrumentally converge all the time.
Oh, also, you might be interested in Joe Carlsmith’s report on power-seeking AI; it has a relatively thorough discussion of the overall argument for risk.
This is a good summary, thanks for writing it up!
I do agree with this, in principle:
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.
...though I don’t think it buys us more than a couple of points; I think people dramatically underestimate how high the ceiling is for humans, and I think that a reasonably smart human familiar with the right ideas would stand a decent chance of executing a takeover if placed into the position of an AI (assuming a speedup of cognition, plus whatever actuators current systems typically possess).
However, I think this is wrong:
LLMs distill human cognition
LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the training task (next token prediction). This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next token prediction seems very different from whatever it is that humans do. (Humans are capable of next token prediction, but notably we are much worse at it than even GPT-3.)
Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn’t expect an LLM to want anything remotely relevant to humans. (Most of the things that humans want are lossy proxies for things that improved IGF (inclusive genetic fitness) in the ancestral environment, many of which generalized extremely poorly out of distribution. What are the lossy proxies for minimizing prediction loss that a sufficiently intelligent LLM would end up with? I don’t know, but I don’t see why they’d have anything to do with the very specific things that humans value.)
Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments, I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:
Does your position rule out the claim that “humans model other human beings using the same architecture that they use to model themselves”?
To me, this seems like an instance where ‘value reasoning’ and ‘descriptive reasoning’ rely on similar cognitive resources. If LLMs inherit this human-like property (Quintin claims they do), would that update you towards optimism? If not, why not?
I take it that the notion of ‘intelligence’ we’re working with is related to planning. If future AI systems inherit human-like cognition wrt plan search, then I think this is a reason to expect that AI cognition will also inherit not-completely-alien-to-human values — even if there are, in some sense, distinct cognitive mechanisms undergirding ‘values’ and ‘non-values’ reasoning in humans.
This is because the ‘search over plans’ process has both normative and descriptive components. I don’t think the claim about LLMs distilling human cognition constitutes anything like a guarantee that future LLMs will have values we’d really like, nor is it a call for complacency about the emergence of misaligned goals. I just think it constitutes meaningful evidence against the human extinction claim.
As I write this, I’m starting to think that your claim about distinct cognitive mechanisms primarily seems like an argument for doom conditioned on ‘LLMs mostly don’t distill human cognition’, but doesn’t seem like an independent argument for doom conditioned on LLMs distilling human cognition. If LLMs distill the plan search component of human cognition, this feels like a meaningful update against doom. If LLMs mostly fail to distill the parts of human cognition involved in plan search, then cognitive convergence might happen because (e.g.) the Natural Abstraction Hypothesis is true, and ‘human values’ aren’t a natural abstraction. In that case, it seems correct to say that cognitive convergence constitutes, at best, a small update against doom. (The cognitive convergence would occur due to structural properties of patterns in the world, rather than arising as the result of LLMs distilling more specifically human thought patterns related to values.)
So I feel like ‘the degree to which we should expect future AIs to converge with human-like cognitive algorithms for plan search’ might be a crux for you?