Thanks! I think most of this made sense to me. I’m a bit fuzzy on the fourth bullet. Also, I’m still confused why a model would even develop an alternative goal to maximizing its reward function, even if it’s theoretically able to pursue one.