You understood me correctly. To be specific, I was considering the third case, in which the agent has uncertainty about its preferred state of the world. It may thus refrain from taking irreversible actions that have a small upside in one scenario (protonium water) but a large negative value in the other (deuterium), due e.g. to decreasing returns, or because it thinks there's a chance to get more information on what the objectives are supposed to mean.
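To make that concrete, here is a minimal toy sketch in Python of the expected-value argument. Every number in it is a placeholder I made up (the 50/50 prior, the payoffs, the cost of waiting); it only illustrates why asymmetric payoffs plus the option to learn more can make the cautious action dominate.

```python
# Toy expected-value sketch of the argument above. All hypotheses and
# payoff numbers are illustrative assumptions, not from the discussion.

# Hypotheses about what the goal's representation was "meant to mean".
prior = {"protonium_water": 0.5, "deuterium": 0.5}

# Payoffs of the irreversible action under each hypothesis: a small upside
# if the first reading is right (decreasing returns), a large loss if not.
irreversible_payoff = {"protonium_water": +1.0, "deuterium": -10.0}

# The reversible "wait and gather information" action: a small opportunity
# cost now, but it preserves the option to act correctly once the ambiguity
# is resolved.
wait_cost = -0.1
resolved_payoff = +1.0

ev_irreversible = sum(p * irreversible_payoff[h] for h, p in prior.items())
ev_wait = wait_cost + resolved_payoff

print(f"EV(act irreversibly now)   = {ev_irreversible:+.2f}")
print(f"EV(wait for clarification) = {ev_wait:+.2f}")
# With asymmetric payoffs, waiting dominates even at 50/50 uncertainty.
```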
I understand your point that this distinction may look arbitrary, but goals are not necessarily defined at the physical level, but rather over abstractions. For example, is a human with a high level of dopamine happier? What exactly is a human? Can a larger human brain be happier? My belief is that, since these objectives are built over (possibly changing) abstractions, it is unclear whether a single agent can ever iron out its goal. In fact, if "what the representation of the goal was meant to mean" makes reference to what some human wanted to represent, you'll probably never have a clear-cut, unchanging goal.
That said, I believe an important problem in this case is how to train an agent that can distinguish between the goal and its representation, and that seeks to optimise the former. I find it a bit confusing when I think about it.