Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments; I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:
Does your position rule out the claim that “humans model other human beings using the same architecture that they use to model themselves”?
To me, this seems like an instance where ‘value reasoning’ and ‘descriptive reasoning’ rely on similar cognitive resources. If LLMs inherit this human-like property (Quintin claims they do), would that update you towards optimism? If not, why not?
I take it that the notion of ‘intelligence’ we’re working with is related to planning. If future AI systems inherit human-like cognition wrt plan search, then I think this is a reason to expect that AI cognition will also inherit not-completely-alien-to-human values — even if there are, in some sense, distinct cognitive mechanisms undergirding ‘values’ and ‘non-values’ reasoning in humans.
This is because the ‘search over plans’ process has both normative and descriptive components. I don’t think the claim about LLMs distilling human cognition constitutes anything like a guarantee that future LLMs will have values we’d really like, nor is it a call for complacency about the emergence of misaligned goals. I just think it constitutes meaningful evidence against the human extinction claim.
As I write this, I’m starting to think that your claim about distinct cognitive mechanisms primarily seems like an argument for doom conditioned on ‘LLMs mostly don’t distill human cognition’, but doesn’t seem like an independent argument for doom conditioned on LLMs distilling human cognition. If LLMs distill the plan search component of human cognition, this feels like a meaningful update against doom. If LLMs mostly fail to distill the parts of human cognition involved in plan search, then cognitive convergence might happen because (e.g.) the Natural Abstraction Hypothesis is true, and ‘human values’ aren’t a natural abstraction. In that case, it seems correct to say that cognitive convergence constitutes, at best, a small update against doom. (The cognitive convergence would occur due to structural properties of patterns in the world, rather than arising as the result of LLMs distilling more specifically human thought patterns related to values.)
So I feel like ‘the degree to which we should expect future AIs to converge with human-like cognitive algorithms for plan search’ might be a crux for you?