Thanks for the reply. I think I can clarify the issue about discrete time intervals. I'd be curious to hear your thoughts on the last sentence of my comment above, if you have any.
Discrete time
So it seems like a time step is defined as the interval between one action and the next?
Yes. But in a Semi-Markov Decision Process (SMDP) or a continuous-time Markov decision process (https://en.wikipedia.org/wiki/Markov_decision_process#Continuous-time_Markov_Decision_Process) this is not the case. SMDPs allow temporally extended actions and are commonly used in RL research. Dayan's papers use a continuous SMDP. You can still have RL agents in this formalism, and it tracks our situation more closely. But I don't think the formalism matters for our discussion, because you can approximate any of these formalisms arbitrarily well with a standard MDP; I'll explain below.
The continuous-time experiment looks roughly like this: imagine you're in a room and you have to press a lever to get out, so that you can get back to what you would normally be doing and earn an average reward ρ per second. However, the lever is hard to press. You can press it hard and fast or lightly and slowly, taking a total time T to complete the press. The total energy cost of pressing is 1/T, so ideally you'd press very slowly, but that would mean you couldn't be outside the room during that time (an opportunity cost).
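To make that tradeoff concrete, here is a rough sketch (my simplification, not taken from Dayan's papers): assume the only costs of a press of duration T are the energy cost 1/T and the outside reward ρT you forgo while pressing. The best press duration then minimizes their sum:

\[
C(T) = \frac{1}{T} + \rho T,
\qquad
\frac{dC}{dT} = -\frac{1}{T^2} + \rho = 0
\;\Rightarrow\;
T^* = \frac{1}{\sqrt{\rho}} .
\]

So the better life is outside the room (the larger ρ), the shorter the optimal press, i.e. the more vigorously you act; the published models are more elaborate, but this is the basic intuition.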
In this setting, the 'action' is just the time T that you take to press the lever. We can easily approximate this with a standard MDP. E.g. you could take action 1, which completely presses the lever in one time step, costing you 1/1 = 1 reward in energy. Or you could take action 2, which you would have to take twice to complete the press, costing you only 1/2 reward in total (so 1/4 each time you take action 2). And so forth. Does that make sense?
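As a toy illustration of that discretization, here's a minimal sketch in Python (my own, not code from any of the papers; RHO, GAMMA, and HORIZON are arbitrary assumptions, and the papers use an average-reward rather than discounted formulation):

```python
# Toy sketch only: discretize the lever press from the example above.
# Action k = "spread the press over k time steps"; total energy cost 1/k,
# i.e. 1/k**2 per step.  After the press you are outside, earning RHO per
# step.  RHO, GAMMA, and HORIZON are arbitrary illustrative assumptions.

RHO = 0.1      # assumed average reward per step outside the room
GAMMA = 0.95   # discount factor (the SMDP treatment uses average reward instead)
HORIZON = 200  # number of time steps we sum over

def value_of_press_duration(k: int) -> float:
    """Discounted return of completing the press over k steps, then being outside."""
    value = 0.0
    for t in range(k):                 # pressing steps: pay 1/k**2 each
        value += (GAMMA ** t) * (-1.0 / k ** 2)
    for t in range(k, HORIZON):        # outside: collect RHO per step
        value += (GAMMA ** t) * RHO
    return value

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(f"press over {k:2d} steps -> return {value_of_press_duration(k):+.3f}")
```

With these illustrative numbers an intermediate press duration comes out best, reflecting the energy-versus-opportunity-cost tradeoff above; and if RHO is zero or negative, or too small to repay the pressing cost, every press duration does worse than doing nothing, which is the 'zero point' discussed next.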
Zero point
Of course, if you don’t like it outside the room at all, you’ll never press the lever—so there is a ‘zero point’ in terms of how much you like it outside. Below that point you’ll never press the lever.
It seems like vigor just says that what you’re doing is better than not doing it?
I'm not entirely sure what you mean, but I'll clarify that acting vigorously doesn't say anything about whether the agent is currently happy. It may well act vigorously just to escape punishment. Similarly, an agent that currently works to increase its lifetime doesn't necessarily feel good, but its work still implies that it thinks the additional lifetime it gets will be good.
But I think your criticism may be the same as what I said in the edit above: that there is an unwarranted assumption that the agent is at the zero-point before it presses the lever. In the experiments this is assumed because there are no food rewards or shocks during that time. But you could still imagine that a depressed rat would feel bad anyway.
The theory that takes nonexistence to be the zero-point kind of does the same thing though. Although nonexistence is arguably a definite zero-point, the agent's utility function might still extend beyond its lifetime...
Does this clarify the case?
Your explanation was clear. :)
acting vigorously doesn’t say anything about whether the agent is currently happy
Yeah, I guess I meant the trivial observation that you act vigorously if you judge that doing so has higher expected total discounted reward than not doing so. But this doesn’t speak to whether, after making that vigorous effort, your experiences will be net positive; they might just be less negative.
Of course, if you don’t like it outside the room at all, you’ll never press the lever—so there is a ‘zero point’ in terms of how much you like it outside.
...assuming that sticking around inside the room is neutral. This gets back to the “unwarranted assumption that the agent is at the zero-point before it presses the lever.”
The theory that takes nonexistence to be the zero-point kind of does the same thing though.
Hm. :) I feel like there’s a difference between (a) an agent inside the room who hasn’t yet pressed the lever to get out and (b) the agent not existing at all. For (a), it seems we ought to be able to give a (qualia and morally nonrealist) answer about whether its experiences are positive or negative or neutral, while for (b), such a question seems misplaced.
If it were a human in the room, we could ask that person whether her experiences before lever pressing were net positive or negative. I guess such answers could vary a lot between people based on various cultural, psychological, etc. factors unrelated to the activity level of reward networks. If so, perhaps one position could be that the distinction between positive vs. negative welfare is a pretty anthropomorphic concept that doesn’t travel well outside of a cognitive system capable of making these kinds of judgments. Intuitively, I feel like there is more to the sign of one’s welfare than these high-level, potentially idiosyncratic evaluations, but it’s hard to say what.
I suppose another approach could be to say that the person in the room definitely is at welfare 0 (by fiat) based on lack of reward or punishment signals, regardless of how the person evaluates her welfare verbally.
I feel like there’s a difference between (a) an agent inside the room who hasn’t yet pressed the lever to get out and (b) the agent not existing at all.
Yes, that's probably the right way to think about it. I'm also considering an alternative, though: since we're describing the situation with a simple computational model, we shouldn't assume that there's anything going on that isn't captured by the model. E.g. if the agent in the room is depressed, it will be performing 'mental actions', imagining depressing scenarios, etc. But we may have to assume that away, similar to how high-school physics would assume away friction, etc.
So we're left with an agent that decides initially that it won't do anything at all (not even updating its beliefs), because it doesn't want to be outside of the room, and then remains inactive. The question arises whether that's an agent at all, and whether it's meaningfully different from unconsciousness.
So we're left with an agent that decides initially that it won't do anything at all (not even updating its beliefs), because it doesn't want to be outside of the room, and then remains inactive. The question arises whether that's an agent at all, and whether it's meaningfully different from unconsciousness.
Hm. :) Well, what if the agent did do stuff inside the room but still decided not to go out? We still wouldn’t be able to tell if it was experiencing net positive, negative, or neutral welfare. Examples:
It’s winter. The agent is cold indoors and is trying to move to the warm parts of the room. We assume its welfare is net negative. But it doesn’t go outside because it’s even colder outside.
The agent is indoors having a party. We assume it’s experiencing net positive welfare. It doesn’t want to go outside because the party is inside.
We can reproduce the behavior of these agents with reward/punishment values that are all positive numbers, all negative numbers, or a combination of the two. So if we omit the higher-level thoughts of the agents and just focus on the reward numbers at an abstract level, it doesn’t seem like we can meaningfully distinguish positive or negative welfare. Hence, the sign of welfare must come from the richer context that our human-centered knowledge and evaluations bring?
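As a toy illustration (my own sketch, with arbitrary numbers and an arbitrary discount factor, nothing from the papers), here the "party inside, cold outside" choice is reproduced with rewards that are all positive, all negative, or mixed, simply by shifting everything by a constant:

```python
# Toy sketch only: the agent chooses between staying inside (e.g. the party)
# and going outside on every step, forever.  Shifting every reward by the
# same constant (making them all positive, all negative, or mixed) leaves
# the behaviour unchanged, because only the difference matters here.

GAMMA = 0.9  # arbitrary discount factor

def discounted_value(reward_per_step: float) -> float:
    """Value of receiving the same reward on every step forever."""
    return reward_per_step / (1.0 - GAMMA)

def stays_inside(r_inside: float, r_outside: float) -> bool:
    return discounted_value(r_inside) > discounted_value(r_outside)

if __name__ == "__main__":
    for shift in (0.0, 10.0, -10.0):              # add a constant to every reward
        r_in, r_out = 1.0 + shift, -1.0 + shift   # "party inside, cold outside"
        print(f"shift {shift:+5.1f}: stays inside = {stays_inside(r_in, r_out)}")
```

In this toy continuing setting any constant offset yields exactly the same choices, so the reward numbers alone don't pin down a sign of welfare.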
Of course, qualia nonrealists already knew that the sign and magnitude of an organism’s welfare are things we make up. But most people can agree upon, e.g., the sign of the welfare of the person at the party. In contrast, there doesn’t seem to be a principled way that most people would agree upon for us to attribute a sign of welfare to a simple RL agent that reproduces the high-level behavior of the person at the party.
After some clarification Dayan thinks that vigour is not the thing I was looking for.
We discussed this a bit further and he suggested that the temporal difference error does track pretty closely what we mean by happiness/suffering, at least as far as the zero point is concerned. Here’s a paper making the case (but it has limited scope IMO).
If that's true, we wouldn't need, e.g., the theory that the zero point exists in order to keep firing rates close to zero.
The only problem with TD errors seems to be that they don't account for the difference between wanting and liking. But it's currently just unresolved what the function of liking is. So I came away with the impression that liking vs. wanting, not the zero point, is the central question.
I’ve seen one paper suggesting that liking is basically the consumption of rewards, which would bring us back to the question of the zero point though. But we didn’t find that theory satisfying. E.g. food is just a proxy for survival. And as the paper I linked shows, happiness can follow TD errors even when no rewards are consumed.
Dayan mentioned that liking may even be an epiphenomenon of some things that are going on in the brain when we eat food/have sex etc, similar to how the specific flavour of pleasure we get from listening to music is such an epiphenomenon. I don't know if that would mean that liking has no function.
Any thoughts?
Interesting. :)
Daswani and Leike (2015) also define (p. 4) happiness as the temporal difference error (in an MDP), and for model-based agents, the definition is, in my interpretation, basically the common Internet slogan that "happiness = reality − expectations". However, the authors point out (p. 2) that pleasure = reward != happiness. This still leaves open the issue of what pleasure is.
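For reference, here is the standard one-step temporal-difference error in the usual RL notation (reward r, discount factor γ, state-value estimate V); I take this to be the quantity being identified with happiness here, though the papers' exact formulations may differ in detail:

\[
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
\]

On this reading, "happiness" is high when outcomes beat the agent's own value estimates (the "reality − expectations" slogan), while the pleasure/reward at issue below is just the r term.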
Personally I think pleasure is more morally relevant. In Tomasik (2014), I wrote (p. 11):
After training, dopamine spikes when a cue appears signaling that a reward will arrive, not when the reward itself is consumed [Schultz et al., 1997], but we know subjectively that the main pleasure of a reward comes from consuming it, not predicting it. In other words, in equation (1), the pleasure comes from the actual reward r, not from the amount of dopamine δ.
In this post commenting on Daswani and Leike (2015), I said:
I personally don't think the definition of "happiness" that Daswani and Leike advance is the most morally relevant one, but the authors make an interesting case for their definition. I think their definition corresponds most closely with "being pleased with one's current state in a high-level sense". In contrast, I think raw pleasure/pain is most morally significant. As a simple test, ask whether you'd rather be in a state where you've been unexpectedly notified that you'll get a cookie in a few minutes or whether you'd rather be in the state where you actually eat the cookie after having been notified a few minutes earlier. Daswani and Leike's definition considers being notified about the cookie to be happiness, while I think eating the cookie has more moral relevance.
Dayan mentioned that liking may even be an epiphenomenon of some things that are going on in the brain when we eat food/have sex etc, similar to how the specific flavour of pleasure we get from listening to music is such an epiphenomenon.
I'm not sure I understand, but I wrote a quick thing here inspired by this comment. Do you think that's what he meant? If so, may I credit him/you for the idea? It seems fairly plausible. :) Studying what separates red from blue might help shed light on this topic.