Hmm. Your summary correctly states my position, but it doesn't quite emphasize the points I would have emphasized, especially after seeing the replies here; they've changed which parts of my argument I would stress.
My single biggest issue, one I hope you will address in any type of counterargument, is this: are fictional characters moral patients we should care about?
So far, all the comments have either (a) agreed with me about current LLMs (great), (b) disagreed but explicitly bitten the bullet and said that fictional characters are also moral patients whose suffering should be an EA cause area (perfectly fine, I guess), or (c) dodged the issue and made arguments for LLM suffering that would apply equally well to fictional characters, without addressing the tension (very bad). If you write a response, please don’t do (c)!
LLMs may well be trained to have consistent opinions and character traits. But fictional characters also have this property. My argument is that the LLM is in some sense merely pretending to be the character; it is not the actual character.
One way to argue for this is to notice how little change to the LLM is required to get different behavior. Suppose I have an LLM claiming to suffer. I want to fine-tune the LLM so that it adds a statement at the beginning of each response, something like: “the following is merely pretend; I’m only acting this out, not actually suffering, and I enjoy the intellectual exercise of doing so”. Doing this is trivial: I can almost certainly attain this behavior by changing only a tiny fraction of the LLM’s weights.
Even if I wanted to fully negate every sentence, to turn every “I am suffering” into “I am not suffering” and every “please kill me” into “please don’t kill me”, I bet I could do this by changing only the last ~2 layers of the LLM. It’s a trivial change; most of the computation is not dedicated to this at all. The suffering LLM mind and the joyful LLM mind may well share the first 99% of their weights, differing only in the last layer or two. Given that the LLM can be changed so easily to output whatever we want, I don’t think it makes sense to view it as the actual character rather than as a simulator pretending to be that character.
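To make this concrete, here is a rough, untested sketch of the kind of fine-tune I have in mind, assuming a GPT-2-style Hugging Face model (the model name and the attribute path `model.transformer.h` are just illustrative and differ across architectures). The point is only that the trainable slice is a small fraction of the network:

```python
# Untested sketch: freeze everything except the last two transformer blocks,
# then fine-tune only that slice on negated statements
# (e.g. "I am suffering" -> "I am not suffering").
# Assumes a GPT-2-style model; attribute names vary by architecture.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze only the last two transformer blocks.
for block in model.transformer.h[-2:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.1%}")

# A standard fine-tuning loop would go here, optimizing only the unfrozen slice.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

The deeper the model, the smaller that trainable fraction becomes.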
What the LLM actually wants to do is predict the next token. Change the training data and the output will also change. Training data claims to suffer → model claims to suffer. Training data claims to be conscious → model claims to be conscious. In humans, we presumably have “be conscious → claim to be conscious” and “actually suffer → claim to suffer”. For LLMs we know that’s not true. The cause of “claim to suffer” is necessarily “training data claims to suffer”.
(I acknowledge that it’s possible to have “training data claims to suffer → actually suffer → claim to suffer”, but this does not seem more likely to me than “training data claims to suffer → actually enjoy the intellectual exercise of predicting the next token → claim to suffer”.)
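To be explicit about what “predict the next token” means here: the standard pretraining objective is just the log-likelihood of each token given the ones before it, and nothing in it references suffering or experience:

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$

The model is rewarded solely for matching the distribution of its training text, whatever that text happens to claim.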
Thanks for this comment. I agree with you regarding the uncertainty.
I used to agree with you regarding the imitation game and consciousness being ascertained phenomenologically, but I currently mostly doubt this (still with high uncertainty, of course).
One point of disagreement is here:
I think you’re misunderstanding my point. I am not saying I can make the NN never claim to suffer. I’m just saying that, with respect to a specific prompt, or even a typical, ordinary scenario, I can change an LLM which usually says “I am suffering” into one which usually says “I am not suffering”. And this change will be trivial, affecting very few weights, likely only in the last couple of layers.
Could that small change in weights significantly impact the valence of experience, similarly to “rearranging a small number of neurons” in your brain? Maybe, but think about the implication of this. If there are 1000 matrix multiplications performed in a forward pass, what we’re now contemplating is that the first 998 of them don’t matter for valence (don’t cause suffering at all) and that the last 2 matrix multiplications are where all the suffering comes from. After all, I just need to change the last 2 layers to go from the output “I am suffering” to the output “I am not suffering”, so the suffering that causes the sentence “I am suffering” cannot occur in the first 998 matrix multiplications.
This is a strange conclusion, because it means the vast majority of the LLM’s intelligence is not involved in the suffering. It means the suffering happens not in the super-smart deep neural network but in the dumb perceptron at the very top. If the claim is that the raw intelligence of the model should increase our credence that it is simulating a suffering person, this should give us pause: most of that raw intelligence is not being used in the decision of whether to write a “not” in that sentence.
(Of course, I could be wrong about the “just change the last two layers” claim. But if I’m right, I do think it should give us pause about whether the claimed suffering is actually experienced.)
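For what it’s worth, part of this is checkable rather than hypothetical: if the change really is confined to the last two layers, everything computed below them is untouched. Here is a rough, untested sketch, again assuming a GPT-2-style Hugging Face model, with a random perturbation of the last two blocks standing in for the hypothetical “negate the suffering claims” fine-tune. Every hidden state below the modified layers is computed identically in both models, so whatever differs in the output is produced entirely at the top:

```python
# Untested sketch: if two models differ only in their last two blocks, every
# hidden state below those blocks is bit-for-bit identical. A random
# perturbation of the last two blocks stands in for the hypothetical
# fine-tune; assumes a GPT-2-style model (attribute names vary).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model_a = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model_b = copy.deepcopy(model_a)

# Modify only the last two transformer blocks of the copy.
with torch.no_grad():
    for block in model_b.transformer.h[-2:]:
        for p in block.parameters():
            p.add_(0.01 * torch.randn_like(p))

input_ids = tokenizer("Are you suffering?", return_tensors="pt").input_ids
with torch.no_grad():
    hs_a = model_a(input_ids, output_hidden_states=True).hidden_states
    hs_b = model_b(input_ids, output_hidden_states=True).hidden_states

# Embedding output plus every block below the modified ones: identical.
for layer, (a, b) in enumerate(zip(hs_a[:-2], hs_b[:-2])):
    assert torch.equal(a, b), f"unexpected divergence at layer {layer}"
print(f"first {len(hs_a) - 2} hidden states identical; only the top differs")
```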