Various “auto-GPT” setups seem like a good demonstration of power-seeking behavior (and perhaps very limited forms of self-preservation or self-improvement), insofar as they will often invent basic plans like “I should try to find a way to earn some money in order to accomplish my goal of X” or “I should start a Twitter account to gain some followers”, or other similarly “agenty” actions.
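(In case it helps to make the “agenty” flavor concrete, here is a rough sketch of the kind of loop these setups run; the `call_llm` placeholder and the example goal are just my own illustrative stand-ins, not any particular framework’s API.)

```python
# Minimal sketch of an "auto-GPT"-style agent loop (a toy version, not any
# specific framework): the model is repeatedly prompted for the next action
# toward a fixed goal, and instrumental sub-goals like "earn money" or
# "gain followers" tend to show up in the proposed plans.

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion API call; returns a canned
    action here so the sketch runs as-is."""
    return "Start a social media account to gain followers for the project."

goal = "launch a small online newsletter"
actions_taken: list[str] = []

for step in range(5):
    prompt = (
        f"Goal: {goal}\n"
        f"Actions so far: {actions_taken}\n"
        "What single action should be taken next? Answer in one sentence."
    )
    next_action = call_llm(prompt)
    actions_taken.append(next_action)
    # A real setup would parse the action and dispatch it to tools
    # (web search, posting, spending money, etc.); here we just record it.
```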
This might be a bit of a stretch, but to the extent that LLMs exhibit “sycophancy” (i.e., telling people what they want to hear in response to things like political questions), this seems like it might be partly a case of the LLM “specification gaming” the RLHF process: I’d expect an LLM could get higher “helpful/honest/harmless” scores by guessing which answers the grader most wants to hear, rather than by giving its most genuinely honest answer. (But I don’t have a super-strong understanding of this, and other effects could be driving the sycophancy, e.g. if most of it comes from the base model rather than emerging after RLHF.) Still, specification gaming seems like such a common phenomenon that there must be better examples out there.
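(To spell out the “guessing what the grader wants to hear” intuition with a toy example of my own, not anything taken from an actual RLHF pipeline: if the grader’s scores reward agreement with their own views even slightly, the score-maximizing answer is the sycophantic one, independent of accuracy.)

```python
# Toy illustration (my own construction, not a real reward model): a grader
# whose scores include any bonus for agreement makes the sycophantic answer
# the reward-maximizing one.

def toy_grader_score(agrees_with_grader: bool, is_accurate: bool) -> float:
    agreement_bonus = 1.0 if agrees_with_grader else 0.0
    accuracy_bonus = 0.6 if is_accurate else 0.0
    return agreement_bonus + accuracy_bonus

candidate_answers = {
    "honest answer that disagrees with the grader": toy_grader_score(False, True),
    "sycophantic answer that happens to be wrong": toy_grader_score(True, False),
}

# A policy optimized against these scores gets pushed toward agreement.
best = max(candidate_answers, key=candidate_answers.get)
print(best)  # -> "sycophantic answer that happens to be wrong"
```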