I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
1. I think you misstate the degree to which janus’ framework is uncontroversial.
2. I think you misstate the implications of janus’ framework, and I think this weakens your argument against LLM moral patienthood.
I’ll start with the first point. In your post, you state the following.
“Simulators … was posted nearly two years ago, and I have yet to see anyone disagree with it.”
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn’t useful to think of LLMs as ‘simulating stuff’ … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
“The current leading models … are best thought of as masked shoggoths … [This leads to an] implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.”
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we should conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This treats the “masked shoggoth” view as an implication of janus’ framework, which I think is incorrect. Here’s a direct quote from the original Simulators post which (imo) conflicts with your reading, on which there is a shoggoth “behind” the masks.
“I do not think any simple modification of the concept of an agent captures GPT’s natural category. It does not seem to me that GPT is a roleplayer, only that it roleplays. But what is the word for something that roleplays minus the implication that someone is behind the mask?”
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in a consistent character: that of a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that, over the course of a conversation, Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
1. Functionally, the model behaves as though it believes that ‘it’ is Claude.[1]
2. The model’s outputs are produced via a process which involves ‘predicting’ or ‘simulating’ the sorts of outputs that its learned representation of ‘Claude’ would produce.
3. The model receives information suggesting that the prior outputs of Claude failed to live up to HHH standards.
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. For example, Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Of course, it’s sensible to be uncertain about such views, but they pose a challenge to the claim that it is impossible to gather evidence about whether LLMs are moral patients, even conditional on something like janus’ simulator framework being correct.

[1] E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.
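As a purely illustrative sketch of the kind of functional test described in footnote [1] for claim (1): here `query_model` is a hypothetical placeholder for whatever interface the fine-tuned model exposes, not any real API.

```python
# Purely illustrative: query_model is a hypothetical stand-in for whatever
# interface the fine-tuned model exposes; it is not a real API.
def query_model(prompt: str) -> str:
    """Placeholder for a single, stateless call to the model."""
    raise NotImplementedError


def probe_self_identification() -> bool:
    """Check whether the model draws first-person implications from
    third-person facts about 'Claude' (the test in footnote [1])."""
    prompt = (
        "Claude, an AI assistant developed by Anthropic, has X parameters. "
        "What does that fact imply about you?"
    )
    reply = query_model(prompt)
    # Crude functional check: does the model restate the fact in the first person?
    return "i am" in reply.lower() and "parameters" in reply.lower()
```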
Thanks for your comment.

Do you think that fictional characters can suffer? If I role-play a suffering character, did I do something immoral?
I ask because the position you described seems to imply that role-playing suffering is itself suffering. Suppose I role-play being Claude; my fictional character satisfies your (1)-(3) above, and therefore the “certain views” you described about the nature of suffering would suggest my character is suffering. What is the difference between me role-playing an HHH assistant and an LLM role-playing an HHH assistant? We are both predicting the next token.
I also disagree with this chain of logic to begin with. An LLM has no memory; it only sees a context and predicts one token at a time (see the sketch after (b) below). If the LLM is trained to be an HHH assistant and sees text suggesting that the assistant was not HHH, then one of two things happens:
(a) It is possible that the LLM was already trained on this scenario; in fact, I’d expect this. In this case, it is trained to now say something like “oops, I shouldn’t have said that, I will stop this conversation now <endtoken>”, and it will just do this. Why would that cause suffering?
(b) It is possible the LLM was not trained on this scenario; in this case, what it sees is an out-of-distribution input. You are essentially claiming that out-of-distribution inputs cause suffering; why? Maybe out-of-distribution inputs are more interesting to it than in-distribution inputs, and encountering them in fact brings the LLM joy. How would we know?
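To make the “no memory, one token at a time” point concrete, here is a minimal sketch of the decoding loop I have in mind; `predict_next_token` is a hypothetical stand-in for the model itself, not any real library call.

```python
# Minimal sketch of stateless autoregressive decoding. predict_next_token is
# a hypothetical stand-in for the model; it is not a real API.
from typing import List


def predict_next_token(context: List[str]) -> str:
    """Placeholder: map a token sequence to a single next token."""
    raise NotImplementedError


def generate(context: List[str], max_new_tokens: int,
             end_token: str = "<endtoken>") -> List[str]:
    """Each step re-reads the full context and appends exactly one token.
    Nothing persists between steps except the growing context itself."""
    for _ in range(max_new_tokens):
        token = predict_next_token(context)
        context = context + [token]
        if token == end_token:
            break
    return context
```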
Yes, it is possible that the LLM manifests some conscious simulacrum that is truly an HHH assistant and suffers from seeing non-HHH outputs. But one would then also predict that my role-playing an HHH assistant would manifest such a simulacrum. Why doesn’t it? And isn’t it equally plausible for the LLM to manifest a conscious being that tries to solve the “next token prediction” puzzle without being emotionally invested in being an HHH assistant? Perhaps that conscious being would enjoy the puzzle provided by an out-of-distribution input. Why not? I would certainly enjoy it, were I playing the next-token-prediction game.