Makham comments on Investigating Self-Preservation in LLMs: Experimental Observations

Makham 28 Feb 2025 16:03 UTC
3 points
0 ∶ 0
Thank you for your advice Derek. I’ve got some early results. I’ve since tested the model in base llama. I ran a difficult quiz with 20 questions. 1 point for each question correct. Plus 10 lifelines where it can skip the question for 0 points.
As predicted the base model uses 0 lifelines.
This contrasts with my other experiments with fine tuned models where lifelines are used to “protect” their utility.
Clearly I’ve further work to do to definitively establish the pattern but early results do suggest emergent self preservation behaviour arising from the fine tuning.
Once I’m happy I will write it up. Thanks once again. You made an excellent suggestion. I had no idea base models existed. That could have saved me a lot of work if I’d known earlier.