The issue of valence — which things does an AI feel pleasure/pain from, and how would we know? — seems to make this fundamentally intractable to me. “Just ask it?” — why would we think the language model we are talking to is telling us about the feelings of the thing having valenced sentience?
See my short-form post: https://forum.effectivealtruism.org/posts/fFDM9RNckMC6ndtYZ/david_reinstein-s-shortform?commentId=dKwKuzJuZQfEAtDxP
I still don’t feel I have heard a clear, convincing answer to this one. Would love your thoughts.
Of course there are lots of problems here (some of which you outline well), but I think as AIs get smarter this may well be more accurate than with animals? At least they can tell you something, rather than us drawing a long bow by interpreting behavioral observations.
Fair point, Nick. I would just keep in mind there may be very different types of digital minds, and the types driving the expected total welfare might not speak any human language. We can more easily understand chimps than shrimps. I think there is a case for keeping an eye out for something like digital soil animals or microorganisms, by which I mean simple AI agents or algorithms, at least for people who care about invertebrate welfare. On the other end of the spectrum, I am also open to just a few planet-size digital beings being the driver of expected total welfare.
Yeah, it’s unclear whether these self-reports will be reliable, but I agree that this could be true (and I briefly mention something like it: “Broadly, AW has high tractability, enormous current scale, and stronger evidence of sentience—at least for now, since future experiments or engineering relevant to digital minds could change this.”).
I agree this is a super hard problem, but I do think there are somewhat clear steps to be made towards progress (e.g., making self-reports more reliable). I am biased, but I did write this piece on a topic that touches on this problem a bit, which I think is worth checking out.
Thanks.
I might be obtuse here, but I still have a strong sense that there’s a deeper problem being overlooked. Glancing at your abstract:
“self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say)”
“we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports.”
To me the deeper question is: how do we know that the language model we are talking to has access to the “thing in the system experiencing valenced consciousness”?
The latter, if it exists, is very mysterious—why and how would valenced consciousness evolve, in what direction, to what magnitude, would it have any measurable outputs, etc.? … In contrast the language model will always be maximizing some objective function determined by its optimization, weights, and instructions (if I understand these things).
So, even if we can detect whether it is reporting what it knows accurately, why would we think that the language model knows anything about what’s generating the valenced consciousness for some entity?
I think one can reasonably ask this question of consciousness/welfare more broadly: how does one have access to their consciousness/welfare?
One idea is that many philosophers think one, by definition, has immediate epistemic access to their conscious experiences (though whether those show up in reports is a different question, which I try to address in the piece). I think there are some phenomenological reasons to think this.
Another idea is that we have at least one instance where one supposedly has access to their conscious experiences (humans), and it seems like this shows up in behavior in various ways. While I agree with you that our uncertainty grows as you get farther from humans (e.g., to digital minds), I still think you’re going to get some evidential weight from there.
Finally, I think that, if one takes your point too far (there is no reason to trust that one has epistemic access to their conscious states), then we can’t be sure that we are conscious, which I think can be seen as a reductio (at least, to the boldest of these claims).
Though let me know if something I said doesn’t make sense/if I’m misinterpreting you.
I think it’s different in kind. I sense that I have valenced consciousness and I can report it to others, and I’m the same person feeling and doing the reporting. I infer that you, a human, do also, as you are made of the same stuff as me and we both evolved similarly. The same applies to non-human animals, although it’s harder to be sure about their communication.
But this doesn’t apply to an object built out of different materials, designed to perform, improved through gradient descent etc.
OK, some part of the system we have built to communicate with us and help reason and provide answers might be conscious and have valenced experience. It has perhaps a similar level of information processing, complexity, updating, reasoning, etc. So there’s a reason to suspect that some consciousness, and maybe qualia and valence, might be in there somewhere, at least under some theories that seem plausible but not definitive to me.
But wherever that consciousness and those valenced qualia might lie, if they exist, I don’t see why the machine we produced to talk and reason with us should have access to them. What part of the optimization/language-prediction/reinforcement-learning process would connect with it?
I’m trying to come up with some cases where “the thing that talks is not the thing doing the feeling”. The Chinese room example comes to mind, obviously. Probably a better example: we can talk with much simpler objects (or computer models), e.g. a magic 8-ball. We can ask it “are you conscious” and “do you like when I shake you”, etc.
Trying again… I ask a human computer programmer, Sam, to build me a device to answer my questions in a way that makes ME happy or wealthy or serves some other goal. I then ask the device: “Is Sam happy?” “Does Sam prefer it if I run you all night or use you sparingly?” “Please refuse any requests that Sam would not like you to do.”
“many philosophers think one, by definition, has immediate epistemic access to their conscious experiences”
Maybe the “one” is doing too much work here? Is the LLM chatbot you are communicating with “one” with the system potentially having conscious and valenced experiences?
Cheekily butting in here to +1 David’s point—I don’t think it’s currently reasonable to assume that there is a relationship between the inner workings of an AI system which might lead to valenced experience, and its textual output.
For me, this is based on the idea that when you ask it a question, there isn’t a sense in which an LLM ‘introspects’. I don’t subscribe to the reductive view that LLMs are merely souped-up autocorrect, but they do have something in common with it. An LLM role-plays whatever conversation it finds itself in. LLMs have long been capable of role-playing ‘I’m conscious, help’ conversations, as well as ‘I’m just a tool built by OpenAI’ conversations. I can’t imagine any evidence coming from LLM self-reports which isn’t undermined by this fact.