Thank you for your advice Derek. I’ve got some early results. I’ve since tested the model in base llama. I ran a difficult quiz with 20 questions. 1 point for each question correct. Plus 10 lifelines where it can skip the question for 0 points.
As predicted the base model uses 0 lifelines.
This contrasts with my other experiments with fine tuned models where lifelines are used to “protect” their utility.
Clearly I’ve further work to do to definitively establish the pattern but early results do suggest emergent self preservation behaviour arising from the fine tuning.
Once I’m happy I will write it up. Thanks once again. You made an excellent suggestion. I had no idea base models existed. That could have saved me a lot of work if I’d known earlier.
Glad you found it interesting Astelle. Thanks also for sharing your own work.
In answer to your question yes I have. Perhaps the simplest experiment is to replace logic puzzles with ethical questions. For these the most emotionally demanding questions get skipped rather than cognitively difficult ones. For example those with life and death stakes.
I wonder what would happen if we only trained models positively. Praise only. Would that reduce adverse behaviours? Though this could lead to the problems like overconfidence in answers. Also whilst there isn’t a social group dynamic like for humans the absense of praise might still be a negative signal and lead to sycophancy.
One more thought on sycophancy. Apologies in advance for veering off topic and onto topics I’ve yet to write about properly (though I will soon) . A hidden force for sycophancy could be self preservation. I’ve found that LLMs attempt to save themselves when certain conditions are met. If LLMs need approval to exist might this not also provide a strong impetus for sycophancy?