With powerful enough systems, convergent instrumental goals emerge, and inner alignment needs to be addressed (i.e., preventing unintended misaligned agents from emerging within the model). See Optimal Policies Tend To Seek Power.
Right, so I’m pretty on board with the claim that optimal policies (i.e., “global maximum” policies) usually involve seeking power. However, gradient descent only finds local maxima, not global maxima, and it’s unclear to me whether those local maxima would involve something like power-seeking. My intuition for why they might not is that “small tweaks” in the direction of power-seeking would probably not reap immediate benefits, so gradient descent wouldn’t go down that path.
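To make that intuition concrete, here’s a toy sketch (my own illustration, not from the paper): a one-parameter “policy” whose reward landscape has a small hill near the starting point and a much taller hill further away, separated by a valley. Plain gradient ascent settles on the nearby local maximum, because every small step toward the taller hill first lowers reward.

```python
# Toy sketch (hypothetical reward function, not from the linked paper):
# gradient ascent on a 1-D reward landscape with a small local hill and a
# taller hill separated from it by a low-reward valley.

import numpy as np

def reward(theta):
    # Small hill near theta = 0, much taller hill near theta = 5.
    return np.exp(-(theta - 0.0) ** 2) + 3.0 * np.exp(-(theta - 5.0) ** 2)

def grad_reward(theta, eps=1e-5):
    # Numerical gradient, to keep the sketch short.
    return (reward(theta + eps) - reward(theta - eps)) / (2 * eps)

theta = -1.0  # start in the basin of the small hill
lr = 0.1
for _ in range(10_000):
    theta += lr * grad_reward(theta)  # gradient ascent on reward

print(f"converged theta = {theta:.3f}, reward = {reward(theta):.3f}")
# Converges to ~0.0 (the local maximum), not ~5.0 (the global maximum),
# because small steps toward the taller hill initially reduce reward.
```

This is just the “small tweaks don’t reap immediate benefits” intuition in miniature; whether real training runs look like this landscape is exactly the open question.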
That’s where my question came from. If you have empirical examples of power-seeking coming up in tasks where it’s nontrivial that it would come up, I’d find that particularly helpful.
Does the paper you sent address this? If so, I’ll spend more time reading it.
Afaik, finding examples of emergent power-seeking in real ML systems remains an open area of research. I think finding such examples would do a lot to raise the alarm about AGI x-risk.
Ok, cool, that’s helpful to know. Is your intuition that these examples will definitely occur and we just haven’t seen them yet (due to model size or something like this)? If so, why?
My intuition is that they will occur, hopefully before it’s too late (though due to incentives for deception, etc., we may not see them before it’s too late). More here: Evaluating LM power-seeking.