AI operates in the single-minded pursuit of a goal that humans provide it. This goal is specified in something called the reward function.
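To make this concrete, here is a minimal sketch (not from the original post) of what "specifying the goal as a reward function" looks like in standard reinforcement learning. The gridworld, the `reach_goal_reward` function, and all names below are illustrative assumptions, not part of any real system discussed here.

```python
def reach_goal_reward(state, action, next_state):
    """Reward function: +1 for reaching the goal cell, a small penalty otherwise.

    Everything the designer wants the agent to do has to be squeezed into
    this single scalar signal.
    """
    GOAL = (3, 3)
    return 1.0 if next_state == GOAL else -0.01


def step(state, action):
    """Toy 4x4 gridworld transition: each action moves the agent one cell,
    clipped to the grid boundary."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    x, y = state
    return (min(max(x + dx, 0), 3), min(max(y + dy, 0), 3))


if __name__ == "__main__":
    state = (0, 0)
    for action in ["right", "right", "right", "down", "down", "down"]:
        next_state = step(state, action)
        r = reach_goal_reward(state, action, next_state)
        print(f"{state} --{action}--> {next_state}, reward={r}")
        state = next_state
```

The point of the sketch is only that the "goal" the agent is trained on is whatever this scalar function rewards, which may or may not match what the designer actually intended.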
It turns out the problem is a lot worse than this—even if we knew of a safe goal to give AI, we would have no idea how to build an AI that pursues that goal!
See this post for more detail. Another way of saying this, using the inner/outer alignment framework: reward is the outer optimization target, but optimizing it does not automatically induce inner optimization in the same direction.