This topic seems to me both extremely important and neglected. (Maybe it’s neglected because it ~requires some combination of ML and philosophy backgrounds that people rarely have.)
My interpretation of the core hypothesis in this post is something like the following: A mesa-optimizer may receive evaluative signals that are computed by some subnetwork within the model (a subnetwork that was optimized by the base optimizer to give “useful” evaluative signals w.r.t. the base objective). Those evaluative signals can constitute morally relevant valenced experience. This hypothesis seems plausible to me.
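To make this concrete, here is a minimal sketch of the kind of architecture I have in mind (the names below are mine and purely hypothetical, not taken from the post or any particular paper): a policy network containing a small “evaluator” subnetwork whose scalar output feeds downstream action selection, while the base optimizer’s reward stays outside the network and only shapes the weights.

```python
# Hypothetical sketch (my own, not the post's architecture): a policy network with an
# internal "evaluator" subnetwork whose scalar output feeds downstream computation.
import torch
import torch.nn as nn

class PolicyWithInternalEvaluator(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Internal evaluative subnetwork: a learned "how well am I doing" signal.
        self.evaluator = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Downstream policy head conditions on both the features and the evaluation.
        self.policy_head = nn.Linear(hidden + 1, n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.features(obs)
        evaluation = self.evaluator(h)  # the candidate "valenced" signal
        return self.policy_head(torch.cat([h, evaluation], dim=-1))

# Example usage: the external reward never appears here; it only shapes the weights.
logits = PolicyWithInternalEvaluator(obs_dim=8, n_actions=4)(torch.randn(1, 8))
```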
Some further comments:
Re:
For instance, the DeepMind agent discussed in section 4 pre-processes an evaluation of its goal achievement. This evaluative signal is factored/non-integrated in a sense, and so it may not be interacting in the right way with the downstream, abstract processes to reach conscious processing.
I don’t follow. I’m not closely familiar with the Open-Ended Learning paper, but from a quick look my impression is that it’s basically standard RL in multi-agent environments, with more diversity in the training environments than in most prior work. I don’t understand what you mean when you say that the agent “pre-processes an evaluation of its goal achievement” (and why the analogy to humans & evolution is less salient here, if you think that).
Re:
Returning to Tomasik’s assumption, “RL operations are relevant to an agent’s welfare”, the functionalist must disagree. At best we can say that RL operations can be instrumentally valuable by (positively) modifying the valence system.
(I assume that an “RL operation” refers to things like an update of the weights of a policy network.) I’m not sure what you mean by “positively” here. An update to the weights of the policy network can negatively affect an evaluative signal.
[EDIT: Also, re: “Compare now to mesa-optimizers in which the reward signal is definitionally internal to the system”. I don’t think that the definition of mesa-optimizers involves a reward signal. (It’s possible that a mesa-optimizer will never receive any evidence about “how well it’s doing”.)]
Your interpretation is a good summary!

Re comment 1: Yes, sorry, this was just meant to point at a potential parallel, not to work out the parallel in detail. I think it’d be valuable to work out the potential parallel between the DM agent’s predicate predictor module (Fig. 12/p. 14) and my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I’m understanding this correctly, it seems to me that this module predicts which parts of the state prevent goal realization.
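For concreteness, here is roughly how I picture such a module, based only on my quick look at Fig. 12; the names, shapes, and training loss below are my assumptions, not the paper’s actual architecture. The idea is an auxiliary head that predicts the status of the goal’s predicates (one way of flagging what currently blocks goal realization).

```python
# Hypothetical sketch of a predicate-predictor-style auxiliary head (my assumptions,
# not the paper's implementation): predicts which goal predicates currently hold.
import torch
import torch.nn as nn

class PredicatePredictor(nn.Module):
    def __init__(self, state_dim: int, n_predicates: int):
        super().__init__()
        self.head = nn.Linear(state_dim, n_predicates)

    def forward(self, agent_state: torch.Tensor) -> torch.Tensor:
        # One probability per predicate in the goal/reward specification.
        return torch.sigmoid(self.head(agent_state))

# Assumed training signal: the environment's ground-truth predicate values,
# used only for an auxiliary loss (not fed to the policy as an observation).
predictor = PredicatePredictor(state_dim=128, n_predicates=6)
agent_state = torch.randn(32, 128)
true_predicates = torch.randint(0, 2, (32, 6)).float()
aux_loss = nn.functional.binary_cross_entropy(predictor(agent_state), true_predicates)
```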
Re comment 2: Yes, this should read “(positively/negatively)”. Thanks for pointing this out.
Re EDIT: Mesa-optimizers may or may not represent a reward signal—perhaps there’s a connection here with Demski’s distinction between search and control. But for the purposes of my point in the text, I don’t think this matters much. All I’m trying to say is that VPG-type optimizers have external reward signals, whereas mesa-optimizers can have internal reward signals.
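To illustrate the external/internal contrast I have in mind, here is a minimal vanilla-policy-gradient-style sketch (my own toy example, not from the post): the reward only ever enters through the weight update computed outside the network, never through the network’s forward pass, whereas a mesa-optimizer’s evaluative signal would be computed inside the forward pass.

```python
# Toy REINFORCE/VPG-style update (illustrative only): the reward is external to the
# network; it appears only in the loss used to update the weights.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def vpg_update(observations, actions, returns):
    """observations: [T, obs_dim]; actions: [T]; returns: [T], from external rewards."""
    logits = policy(observations)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * returns).mean()  # the only place the reward signal enters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# A mesa-optimizer, by contrast, could compute an evaluative signal inside
# policy(...) itself and use it at inference time, with no external reward present.
vpg_update(torch.randn(10, 4), torch.randint(0, 2, (10,)), torch.randn(10))
```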
Re:
I think it’d be valuable to work out the potential parallel between the DM agent’s predicate predictor module (Fig. 12/p. 14) and my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I’m understanding this correctly, it seems to me that this module predicts which parts of the state prevent goal realization.
I guess what I don’t understand is how the “predicate predictor” thing can make the setup less likely to yield models that support morally relevant valence (if you indeed think that). Suppose the environment is modified so that the observation the agent gets at each time step includes the value of every predicate in the reward specification. That would make the “predicate predictor” useless (I think; just from a quick look at the paper). Would that new setup be more likely than the original to yield models that have morally relevant valence?
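Concretely, the modification I’m imagining is something like the following sketch (purely hypothetical; I’m not claiming the actual environment exposes anything like this interface): a wrapper that appends the ground-truth value of each predicate in the reward specification to the observation, which would make a learned predicate predictor redundant.

```python
# Hypothetical sketch of the modified setup: append ground-truth predicate values to
# the observation. The wrapped env and its interface are assumptions, not the paper's.
import numpy as np

class PredicateAugmentedEnv:
    def __init__(self, env, predicate_fns):
        self.env = env
        # predicate_fns: one boolean function per predicate in the reward
        # specification, e.g. lambda state: state.agent_holds("purple sphere")
        self.predicate_fns = predicate_fns

    def step(self, action):
        # Assumed interface: the wrapped env exposes its ground-truth state.
        obs, reward, done, state = self.env.step(action)
        return self._augment(obs, state), reward, done

    def _augment(self, obs, state):
        predicate_values = np.array([float(p(state)) for p in self.predicate_fns])
        return np.concatenate([obs, predicate_values])
```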
Your new setup seems less likely to yield morally relevant valence. Essentially, the more the setup factors out valence-relevant computation (e.g., by separating out a module, or by accessing an oracle as in your example), the less likely it is that valenced processing happens within the agent.
Just to be explicit here, I’m assuming that estimates of goal achievement are valence-relevant. It’s not clear to me how generally this is true.
Re:
Essentially, the more the setup factors out valence-relevant computation (e.g., by separating out a module, or by accessing an oracle as in your example), the less likely it is that valenced processing happens within the agent.
I think the analogy to humans suggests otherwise. Suppose a human feels pain in their hand due to touching something hot. We can regard all the relevant mechanisms in their body outside the brain—those that cause the brain to receive the relevant signal—as mechanisms that have been “factored out from the brain”. And yet those mechanisms are involved in morally relevant pain. In contrast, suppose a human touches a radioactive material until they realize it’s dangerous. Here there are no relevant mechanisms that have been “factored out from the brain” (the brain needs to use ~general reasoning); and there is no morally relevant pain in this scenario.
Though generally, if “factoring out stuff” means that smaller/less-capable neural networks are used, then maybe it can reduce the risk of morally relevant valence.
Good clarification. Determining which kinds of factoring reduce valence is more subtle than I had thought. I agree with you that the DeepMind set-up seems more analogous to neural nociception (e.g. high-heat detection). My proposed set-up (Figure 5) seems significantly different from the DM/nociception case, because it factors out the step where nociceptive signals affect decision-making and motivation. I’ll edit my post to clarify.