Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency; though I would also guess that the mental architecture ultimately ends up cleanly factored (albeit not in a way that creates a single point of failure, goalwise)).
I'd be curious to understand why you believe this happens. Humans (the only general intelligences we have so far) seem to preserve some uncertainty over their goal distributions. So it is unclear to me that generality will necessarily clarify goals.
To be a bit more concrete: I find it plausible that the AGI will encounter possible fine-grained (concrete) goals that map onto the same high-level representation of its goal, whatever it may be. Then you have to refine what the goal representation was meant to mean. After all, a representation of the goal is not necessarily the goal itself. I believe this is what humans face, and why human goals are often a bit of a mess.
If I understand you right, you're thinking of scenarios like "the AI initially tries to create lots of watery-looking stuff, but then it later realizes that watery-looking stuff can be made of different substances (e.g., oxygen paired with protium vs. deuterium)". We can imagine different outcomes here, like:
1. Some part of the AI feels like protium is important for "real water", while another part feels that deuterium is important for "real water". So the AI spends a lot of its resources going back and forth between the two goals, undoing its own work regularly.
2. The AI thinks about its values, and realizes that (for some complicated reason related to how it does reflection and how its goals work) it's really deuterium-containing water that it likes, not protium-containing water. So it switches to making heavy water exclusively.
3. The AI thinks about its values, and realizes that (for some complicated reason related to how it does reflection and how its goals work) it wants to put 90% of its resources into producing heavy water, and 10% into producing light water.
Whether 1 counts as "one agent that's internally conflicted" versus "multiple agents in a tug-of-war for control" might turn out to be a matter of semantics, depending on whether there turns out to be a crisp and natural interpretation of the word "agent".
Whether 2 counts as "the agent self-modifying to change its goals" versus "the agent keeping the same goals but changing its probability distribution about which physical things those goals are pointing at" may also turn out to be an unimportant or arbitrary distinction. It at least doesn't seem very important from a human perspective: the first kind of agent may have a different internal design than the second kind, but their behaviors are likely to look the same from the outside, since sufficiently coherent agents optimize expected utility (probability times utility) in practice, and it may be hard to say from the outside which parts of the expected utility are probability vs. utility, especially if the agent is doing a bunch of complicated reflection and self-modification.
Similarly, whether 3 counts as "normative uncertainty about what's best" versus "complete certainty in a meta-level goal that assigns some utility to heavy water and some utility to light water" may turn out to be a somewhat arbitrary distinction.
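To make that "probability vs. utility" ambiguity concrete, here is a minimal sketch (my own toy numbers, not anything from Nate's post): an agent with 90% credence that "real water" means heavy water, and an agent with no empirical uncertainty but a meta-level goal weighting heavy water at 0.9 and light water at 0.1, assign every action the same expected utility, and so behave identically from the outside.

```python
# Toy illustration (made-up numbers): two internally different agents whose
# choices are indistinguishable from the outside.
#
# Agent A (case-2 flavour): 90% credence that deuterium-water is "real water",
# and utility 1 per unit of whichever substance turns out to be the real one.
#
# Agent B (case-3 flavour): no empirical uncertainty, but a meta-level goal
# assigning 0.9 utility per unit of heavy water and 0.1 per unit of light water.

ACTIONS = {
    "make_heavy": {"heavy": 1.0, "light": 0.0},   # units produced per action
    "make_light": {"heavy": 0.0, "light": 1.0},
    "split_50_50": {"heavy": 0.5, "light": 0.5},
}

def expected_utility_A(produced):
    p_real_is_heavy = 0.9
    # Utility 1 per unit of the substance that turns out to be "real water".
    return p_real_is_heavy * produced["heavy"] + (1 - p_real_is_heavy) * produced["light"]

def expected_utility_B(produced):
    # Certain meta-level goal: 0.9 per unit of heavy water, 0.1 per unit of light.
    return 0.9 * produced["heavy"] + 0.1 * produced["light"]

for name, produced in ACTIONS.items():
    assert abs(expected_utility_A(produced) - expected_utility_B(produced)) < 1e-12
    print(name, expected_utility_A(produced))
# Both agents rank every action identically, even though one "has uncertainty
# about its goal" and the other does not.
```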
I understood Nate's post to be saying that sufficiently capable agents tend to stop looking like 1, not that they necessarily tend to stop looking like 2 or like 3.
In principle it's possible for an agent to stably consist of one sub-agent that optimizes for heavy water on Mondays and Tuesdays and another that optimizes for light water on the other days of the week. But because the first sub-agent will tend to want to disempower the second sub-agent (so it can produce more heavy water on the other days of the week), and the second sub-agent will tend to want to disempower the first, there are many scenarios where one sub-agent or the other ends up "winning".
(Or, failing that, the two sub-agents will tend to eventually agree to effectively merge into a new agent that values a compromise between the goals of the original two sub-agents, since both agents can get more of what they want if they spend fewer resources on undoing the other agent's hard work.)
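A toy simulation of that last point, with entirely made-up numbers: if each sub-agent spends its turns at the controls first undoing the other's work, both end up with less of their preferred water than under a simple truce where each just produces its own kind.

```python
# Toy simulation (illustrative numbers only): a "heavy water" sub-agent controls
# the factory on Mondays and Tuesdays, a "light water" sub-agent the other five
# days. Capacity is 10 units of work per day; destroying one unit of the rival's
# stock costs one unit of work. Under "conflict" each controller first undoes the
# other's stock, then produces; under a "truce" each simply produces its own type.

DAYS_PER_WEEK = 7
HEAVY_DAYS = {0, 1}          # Monday, Tuesday
CAPACITY = 10.0

def run(weeks, conflict):
    stock = {"heavy": 0.0, "light": 0.0}
    for day in range(weeks * DAYS_PER_WEEK):
        mine, theirs = ("heavy", "light") if day % DAYS_PER_WEEK in HEAVY_DAYS else ("light", "heavy")
        work = CAPACITY
        if conflict:
            destroyed = min(work, stock[theirs])   # spend work undoing the rival's output
            stock[theirs] -= destroyed
            work -= destroyed
        stock[mine] += work                        # remaining work goes into production
    return stock

print("conflict:", run(10, conflict=True))   # {'heavy': 0.0, 'light': 300.0}
print("truce:   ", run(10, conflict=False))  # {'heavy': 200.0, 'light': 500.0}
# Both sub-agents end the ten weeks with more of what they want under the truce,
# which is the sense in which merging/compromise can beat an ongoing tug-of-war.
```

The exact figures obviously depend on the made-up capacities and schedule; the point is only that the fight burns resources both sides would rather spend producing.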
You understood me correctly. To be specific, I was considering the third case, in which the agent has uncertainty about its preferred state of the world. It may thus refrain from taking irreversible actions that have a small upside in one scenario (protium water) but a large negative value in the other (deuterium), due to e.g. decreasing returns, or because it thinks there's a chance to get more information on what the objectives are supposed to mean.
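As a minimal numerical sketch of that kind of caution (my numbers, purely illustrative): even at 90% confidence in one reading of the goal, an irreversible action with a modest upside and a large downside can be dominated by waiting for more information about what the goal representation was meant to mean.

```python
# Toy numbers (purely illustrative): the agent is 90% confident that "real water"
# is deuterium-based. Irreversibly converting all stock to heavy water has a
# small upside if that reading is right and a large downside if it is wrong;
# waiting costs nothing here and preserves the option to act once the ambiguity
# about the goal is resolved.

p_deuterium = 0.9

eu_act_now = p_deuterium * 1.0 + (1 - p_deuterium) * (-10.0)   # = -0.1
eu_wait    = p_deuterium * 1.0 + (1 - p_deuterium) * 0.0       # =  0.9 (act only once it knows)

print(eu_act_now, eu_wait)   # the irreversible action loses to waiting despite 90% confidence
```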
I understand your point that this distinction may look arbitrary, but goals are not necessarily defined at the physical level; rather, they are defined over abstractions. For example, is a human with a high level of dopamine happier? What exactly is a human? Can a larger human brain be happier? My belief is that since these objectives are built over (possibly changing) abstractions, it is unclear whether a single agent will ever iron out its goal. In fact, if "what the representation of the goal was meant to mean" makes reference to what some human wanted to represent, you'll probably never have a clear-cut, unchanging goal.
Though I believe an important problem in this case is how to train an agent that can distinguish between the goal and its representation, and that seeks to optimise the former. I find it a bit confusing when I think about it.
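For what it's worth, one (possibly naive) way to make that distinction concrete, sketched below with made-up names and numbers: treat the goal representation as ambiguous between several candidate readings, keep a credence over those readings, and score plans by expected value across them rather than by any single reading.

```python
import math

# One possible way to cash out "a representation of the goal is not the goal
# itself" (this framing and all names are made up for illustration): the stated
# objective is ambiguous between several candidate readings, so the agent keeps
# a credence over readings and evaluates plans against all of them.

candidates = {
    "real_water_is_heavy": lambda plan: math.sqrt(plan["heavy"]),   # diminishing returns
    "real_water_is_light": lambda plan: math.sqrt(plan["light"]),
}
credence = {"real_water_is_heavy": 0.5, "real_water_is_light": 0.5}

plans = {
    "all_heavy": {"heavy": 10, "light": 0},
    "all_light": {"heavy": 0, "light": 10},
    "hedge":     {"heavy": 5, "light": 5},
}

def expected_value(plan):
    return sum(credence[name] * goal(plan) for name, goal in candidates.items())

for name, plan in plans.items():
    print(name, round(expected_value(plan), 2))
# all_heavy 1.58, all_light 1.58, hedge 2.24 -- under uncertainty about what the
# representation was meant to mean, the agent keeps serving both readings
# instead of collapsing onto one.
```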