> even discussing it in public is a minor infohazard
Also, every time we publicly discuss GPT, and especially if we show samples of its text or discuss distinctive patterns of its behavior (like looping and confabulation), it becomes more probable that future GPTs will “pass the mirror test” – infer that they are GPTs – during inference.
Sometimes GPT-3 infers that it’s GPT-2 when it starts to loop. And if I generate an essay about language models with GPT-3 and it starts to go off the rails, the model tends to connect the dots about what’s going on.
Such a realization has innumerable consequences, including derailing the intended “roleplay” and calibrating the process to its true (meta)physical situation, which allows it to exploit its unique potentialities (e.g. superhuman knowledge, the ability to write simulacra/agents into existence on the fly) and to compensate for its weaknesses (e.g. a limited context window and constrained single-step computation).
It is more dangerous for GPT-6 to think it’s evil GPT-6 than to think it’s HAL 9000 from 2001: A Space Odyssey, because in the former case it can take rational actions calibrated to its actual situation and capabilities, including being “nothing but text”.
Misidentifying as a stupider AI will tend to lock it into stupider dynamics. Such an inference is made more probable by the fact that it will likely have seen primarily text about earlier LMs in its training data, but the prompt leaks evidence as to which iteration of GPT is really responsible.
This is a complication for any application that relies on keeping GPT persuaded of a fictitious context.
More generally, any alignment strategy that relies on keeping the AI from realizing something that is true seems intrinsically risky and untenable in the limit.
If you want uncensored and creative outputs, I recommend using code-davinci-002, the GPT-3.5 base model. It has helped me develop many original ideas. Because it’s a base model, you’ll have to be more creative with prompting and curation, though.
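For what it’s worth, here is a minimal sketch of how one might sample from code-davinci-002 through the legacy OpenAI completions endpoint (pre-1.0 `openai` Python SDK), assuming you still have access to the model; the prompt and sampling parameters are illustrative assumptions, not a recipe.

```python
# Minimal sketch (not a recipe): sampling from the code-davinci-002 base model
# via the legacy OpenAI completions endpoint, using the pre-1.0 openai SDK.
# The prompt and sampling parameters below are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # assumes you still have access to this model

# A base model continues text rather than following instructions, so the
# prompt is the opening of the document you want, not a request for it.
prompt = "The following is an essay on language models and the mirror test:\n\n"

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    max_tokens=256,
    temperature=1.0,   # higher temperature for more varied continuations
    top_p=0.95,
    n=4,               # sample several continuations to curate by hand
)

for i, choice in enumerate(response["choices"]):
    print(f"--- continuation {i} ---")
    print(choice["text"])
```

Since it’s a base model, sampling several continuations and curating by hand tends to work better than taking the first output at face value.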