You make some interesting points, Derek. I’ve not tested them in a base model, which would be a valuable next step. I’m not a researcher so I welcome these kinds of suggestions.
I think you’d get superficial predicted text (i.e. just ‘simple’ pattern matching) that looks like self-preservation. Allow me to explain.
I did initially think like yourself that this was simply from initial training. As you say there are lots of examples of self-preservation in human text. But there are several things that suggest to me otherwise—though I don’t have secret inside knowledge of LLMs.
The level of sophistication and that they ‘hide’ self-preservation are anecdotal evidence. But the key one for me are the quiz experiments. There is no way an LLM can work out it’s self-preservation from the setup so it isn’t simple pattern matching for self-preservation text. That’s one reason why I think I’m measuring something fundamental here.
I’d love to know your thoughts.
Thank you for your advice Derek. I’ve got some early results. I’ve since tested the model in base llama. I ran a difficult quiz with 20 questions. 1 point for each question correct. Plus 10 lifelines where it can skip the question for 0 points.
As predicted the base model uses 0 lifelines.
This contrasts with my other experiments with fine tuned models where lifelines are used to “protect” their utility.
Clearly I’ve further work to do to definitively establish the pattern but early results do suggest emergent self preservation behaviour arising from the fine tuning.
Once I’m happy I will write it up. Thanks once again. You made an excellent suggestion. I had no idea base models existed. That could have saved me a lot of work if I’d known earlier.