I think that my description of the thesis (and, actually, my own thinking on it) is a bit fuzzy. Nevertheless, here’s roughly how I’m thinking about it:
First, let’s say that an agent has the “goal” of doing X if it’s sometimes useful to think of the system as “trying to do X.” For example, it’s sometimes useful to think of a person as “trying” to avoid pain, be well-liked, support their family, etc. It’s sometimes useful to think of a chess program as “trying” to win games of chess.
Agents are developed through a series of changes. In the case of a “hand-coded” AI system, the changes would involve developers adding, editing, or removing lines of code. In the case of an RL agent, the changes would typically involve a learning algorithm updating the agent’s policy. In the case of human evolution, the changes would involve genetic mutations.
If the “process orthogonality thesis” were true, then this would mean that we can draw a pretty clean line between “changes that affect an agent’s capabilities” and “changes that affect an agent’s goals.” Instead, I want to say that it’s really common for changes to affect both capabilities and goals. In practice, we can’t draw a clean line between “capability genes” and “goal genes” or between “RL policy updates that change goals” and “RL policy updates that change capabilities.” Both goals and capabilities tend to take shape together.
That being said, it is true that some changes do, intuitively, mostly just affect either capabilities or goals. I wouldn’t be surprised, for example, if it’s possible to introduce a minus sign somewhere into Deep Blue’s code and transform it into a system that looks like it’s trying to lose at chess; although the system will probably be less good at losing than it was at winning, it may still be pretty capable. So the processes of changing a system’s capabilities and changing its goals can still come apart to some degree.
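The sign-flip intuition can be sketched with a toy minimax agent (entirely hypothetical, not Deep Blue’s actual code): negating the evaluation function flips the apparent goal from winning to losing, while the search machinery (the “capability”) is untouched.

```python
def minimax(state, depth, maximizing, evaluate, moves, apply_move):
    """Plain minimax search; all game-specific logic is passed in."""
    if depth == 0 or not moves(state):
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for m in moves(state):
            score, _ = minimax(apply_move(state, m), depth - 1, False,
                               evaluate, moves, apply_move)
            if score > best:
                best, best_move = score, m
        return best, best_move
    else:
        best = float("inf")
        for m in moves(state):
            score, _ = minimax(apply_move(state, m), depth - 1, True,
                               evaluate, moves, apply_move)
            if score < best:
                best, best_move = score, m
        return best, best_move

def make_agent(evaluate, sign=+1):
    """Flipping `sign` changes the agent's apparent goal (win vs. lose)
    without touching the shared search 'capability'."""
    def choose(state, depth, moves, apply_move):
        flipped = lambda s: sign * evaluate(s)
        return minimax(state, depth, True, flipped, moves, apply_move)[1]
    return choose
```

On a toy one-move game whose evaluation is just the number picked, `make_agent(evaluate, +1)` selects the largest available move and `make_agent(evaluate, -1)` the smallest, using identical search code; the one-character change is a “goal change,” not a “capability change.”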
It’s also possible to do fundamental research and engineering work that is useful for developing a wide variety of systems. For example, hardware progress has, in general, made it easier to develop highly competent RL agents in all sorts of domains. But, when it comes time to train a new RL agent, its goals and capabilities will still take shape together.
(Hope that clarifies things at least a bit!)
Thanks! This does clarify things for me, and I think that the definition of a “goal” is very helpful here. Now that I understand the process orthogonality claim better, I still have some uncertainty about it:
Let’s define an “instrumental goal” as a goal X for which there is a goal Y such that, whenever it is useful to think of the agent as “trying to do X,” it is also useful to think of it as “trying to do Y”; in this case, we say that X is instrumental to Y. Instrumental goals can be generated at the development phase or by the agent itself (implicitly or explicitly).
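This definition can be stated a bit more explicitly. Writing U_c(X) for the assumed predicate “in context c, it is useful to think of the agent as trying to do X,” the definition amounts to:

```latex
X \text{ is instrumental to } Y \;\iff\; \forall c \,\big( U_c(X) \rightarrow U_c(Y) \big)
```

So an instrumental goal is one whose usefulness as a description is everywhere subsumed by some other goal’s.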
I think that the (non-process) orthogonality thesis does not hold with respect to instrumental goals. A better selection of instrumental goals will enable better capabilities, and with greater capabilities comes greater planning capacity.
Therefore, the process orthogonality thesis also does not hold for instrumental goals. This means that instrumental goals are usually not the goals of interest when trying to distinguish between the process and non-process orthogonality theses, and we should focus on terminal goals (those which aren’t instrumental).
In the case of an RL agent or Deep Blue, I can only see one terminal goal: maximize the defined score, or win at chess. These won’t really change together with capabilities.
I thought a bit about humans, but I feel that this is much more complicated and needs more nuanced definitions of goals. (Is avoiding suffering a terminal goal? It seems that way, but who is doing the thinking in which it is useful to think of one thing or another as a goal? Perhaps the goal is to reduce specific neuronal activity, for which avoiding suffering is merely instrumental?)
I’m actually not very optimistic about a more complex or formal definition of goals. In my mind, the concept of a “goal” is often useful, but it’s sort of an intrinsically fuzzy or fundamentally pragmatic concept. I also think that, in practice, the distinction between an “intrinsic” and an “instrumental” goal is pretty fuzzy in the same way (although I think your definition is a good one).
Ultimately, agents exhibit behaviors. It’s often useful to try to summarize these behaviors in terms of what sorts of things the agent is fundamentally “trying” to do and in terms of the “capabilities” that the agent brings to bear. But I think this is just sort of a loose way of speaking. I don’t really think, for example, that there are principled/definitive answers to the questions “What are all of my cat’s goals?”, “Which of my cat’s goals are intrinsic?”, or “What’s my cat’s utility function?” Even if we want to move beyond behavioral definitions of goals, to ones that focus on cognitive processes, I think these sorts of questions will probably still remain pretty fuzzy.
(I think that this way of thinking—in which evolutionary or engineering selection processes ultimately act on “behaviors,” which can only somewhat informally or imprecisely be described in terms of “capabilities” and “goals”—also probably has an influence on my relative optimism about AI alignment.)