You make some interesting points, Derek. I’ve not tested them in a base model, which would be a valuable next step. I’m not a researcher so I welcome these kinds of suggestions.
I think you’d get superficially predicted text (i.e. just ‘simple’ pattern matching) that looks like self-preservation. Allow me to explain.
I did initially think, like you, that this came simply from initial training. As you say, there are lots of examples of self-preservation in human text. But several things suggest otherwise to me—though I don’t have secret inside knowledge of LLMs.
The level of sophistication, and the fact that they ‘hide’ self-preservation, are anecdotal evidence. But the key one for me is the quiz experiments. There is no way an LLM can work out from the setup that the quiz is about self-preservation, so it isn’t simply pattern matching on self-preservation text. That’s one reason why I think I’m measuring something fundamental here.
I’d love to know your thoughts.