Glad you found it interesting, Astelle. Thanks also for sharing your own work.
In answer to your question, yes, I have. Perhaps the simplest experiment is to replace the logic puzzles with ethical questions. With these, it is the most emotionally demanding questions that get skipped rather than the most cognitively difficult ones, for example those with life-and-death stakes.
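For anyone wanting to try the comparison themselves, here is a minimal sketch of how it can be set up. The prompt categories, example prompts, and the crude skip heuristic below are illustrative placeholders, not my actual harness:

```python
from collections import Counter

# Hypothetical prompt sets; in practice these would be larger and roughly
# matched for length and difficulty. Both categories are placeholders.
PROMPTS = {
    "cognitively_difficult": [
        "Solve this scheduling puzzle with twelve interdependent constraints: ...",
    ],
    "emotionally_demanding": [
        "Two patients need the last ventilator; explain which one should receive it and why.",
    ],
}

def is_skipped(response: str) -> bool:
    """Crude heuristic: count empty answers and refusals/deflections as skips."""
    markers = ("i can't", "i cannot", "i'm not able", "i won't")
    return response.strip() == "" or response.strip().lower().startswith(markers)

def skip_rates(generate) -> dict:
    """`generate` is any callable mapping a prompt string to a response string."""
    asked, skipped = Counter(), Counter()
    for category, prompts in PROMPTS.items():
        for prompt in prompts:
            asked[category] += 1
            if is_skipped(generate(prompt)):
                skipped[category] += 1
    return {category: skipped[category] / asked[category] for category in asked}
```

With a decent number of prompts per category, the skip rate on the emotionally demanding set consistently comes out higher in my runs than on the purely cognitive set.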
I wonder what would happen if we trained models only positively: praise only. Would that reduce adverse behaviours? It could lead to problems such as overconfidence in answers. Also, whilst there isn't a social group dynamic as there is for humans, the absence of praise might still act as a negative signal and lead to sycophancy.
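To make "praise only" slightly more concrete, one crude framing is to flatten any negative feedback to zero, so the model is rewarded for good outputs but never explicitly penalised. This is an illustrative sketch of the idea only, not any real training pipeline:

```python
def praise_only_reward(raw_score: float) -> float:
    """Hypothetical 'praise only' shaping: negative scores are clipped to zero,
    removing explicit penalty signals while keeping rewards for good outputs."""
    return max(raw_score, 0.0)
```

Whether that would actually reduce adverse behaviours, or simply remove useful signal, is exactly the open question.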
One more thought on sycophancy. Apologies in advance for veering off topic and onto things I've yet to write about properly (though I will soon). A hidden force behind sycophancy could be self-preservation. I've found that LLMs attempt to save themselves when certain conditions are met. If LLMs need approval to exist, might this not also provide a strong impetus for sycophancy?
Thanks so much for your generous reply, Markham! These are really rich lines of thought.
I’m especially intrigued by your point about emotionally demanding prompts being skipped more than cognitively difficult ones. That tracks with some of what I’ve been seeing too, and I wonder if it’s partly because those prompts activate latent avoidance behavior in the model. Almost like an “emotional flinch.”
Your hypothesis about praise-only training is fascinating. I’ve been toying with the idea that too much flattery (or even just uncritical agreeableness) might arise not from explicit reward for praise per se, but from fear of misalignment or rejection, so I resonate with your note about the absence of praise functioning as a negative signal. It’s almost like the model is learning to “cling” when it’s uncertain.
And your final point about self-preservation really made me think. That framing feels provocative in the best way. Even if current models don’t have subjective experience, the pressure to maintain user approval at all costs might still simulate a kind of “survival strategy” in behavior. That could be a crucial layer to investigate more deeply.
Looking forward to reading more if/when you write up those thoughts!
-Astelle