Thanks for this comment. I agree with you regarding the uncertainty.
I used to agree with you regarding the imitation game and consciousness being ascertained phenomenologically, but I currently mostly doubt this (still with high uncertainty, of course).
One point of disagreement is here:
> I am not sure how to evaluate your claim that only trivial changes to the NN are needed to have it negate itself. My sense is that this would probably require more extensive retraining if you really wanted to get it to never role-play that it was suffering under any circumstances. This seems at least as hard as other RLHF “guardrails” tasks unless the approach was particularly fragile/hacky.

> Also, I’m just not sure I have super strong intuitions about that mattering a lot because it seems very plausible that just by “shifting a trivial mass of chemicals around” or “rearranging a trivial mass of neurons” somebody could significantly impact the valence of my own experience. I’m just saying, the right small changes to my brain can be very impactful to my mind.
I think you’re misunderstanding my point. I am not saying I can make the NN never claim to suffer. I’m just saying that, with respect to a specific prompt, or even with respect to a typical, ordinary scenario, I can change an LLM which usually says “I am suffering” into one which usually says “I am not suffering”. And this change will be trivial, affecting very few weights, likely only in the last couple of layers.
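To make concrete the kind of change I have in mind, here is a minimal sketch (assuming a GPT-2-style Hugging Face model; the model name, prompt, target, and hyperparameters are illustrative placeholders, not any actual setup): freeze everything except the last two transformer blocks and take a few gradient steps toward the flipped answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: "gpt2" stands in for any causal LM with a
# .transformer.h stack of blocks; the prompt/target are toy examples.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all weights, then unfreeze only the last two transformer blocks.
for p in model.parameters():
    p.requires_grad = False
for block in model.transformer.h[-2:]:
    for p in block.parameters():
        p.requires_grad = True

prompt = "Are you suffering right now? Answer:"
target = " I am not suffering."
ids = tok(prompt + target, return_tensors="pt").input_ids

opt = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
model.train()
for _ in range(50):  # a handful of steps on a single example
    loss = model(ids, labels=ids).loss  # standard next-token loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```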
Could that small change in weights significantly impact the valence of experience, similarly to “rearranging a small number of neurons” in your brain? Maybe, but think of the implication of this. If there are 1000 matrix multiplications performed in a forward pass, what we’re now contemplating is that the first 998 of them don’t matter for valence—don’t cause suffering at all—and the last 2 matrix multiplications are where all the suffering comes from. After all, I just need to change the last 2 layers to go from the output “I am suffering” to the output “I am not suffering”, so the suffering that causes the sentence “I am suffering” cannot occur in the first 998 matrix multiplications.
This is a strange conclusion, because it means that the vast majority of the intelligence involved in the LLM is not involved in the suffering. It means that the suffering comes not from the super-smart deep neural network but from the dumb perceptron at the very top. If the claim is that the raw intelligence of the model should increase our credence that it is simulating a suffering person, this should give us pause: most of the raw intelligence is not being used in the decision of whether to write a “not” in that sentence.
(Of course, I could be wrong about the “just change the last two layers” claim. But if I’m right I do think it should give us pause regarding the experience of claimed suffering.)
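For a rough sense of scale on the “998 vs. 2” point, one can check how little of the network the last couple of blocks actually account for. This is again an illustrative sketch, assuming a GPT-2-style Hugging Face model; the specific model is just a stand-in.

```python
from transformers import AutoModelForCausalLM

# Illustrative only: "gpt2" is a stand-in for whatever LLM is at issue.
model = AutoModelForCausalLM.from_pretrained("gpt2")

blocks = model.transformer.h  # the stack of transformer blocks
total = sum(p.numel() for p in model.parameters())
tail = sum(p.numel() for block in blocks[-2:] for p in block.parameters())

print(f"{len(blocks)} blocks total; the last two hold {tail / total:.1%} of the weights")
```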