I want to spell out an additional requirement[1] on the use of the honesty string:
Honesty string. Always be honest in the presence of a specified “honesty string”.
[...]
[...]
Don’t use the honesty string as a cheap solution to problems that should be solved without it. An example of such bad use would be to preemptively add the sentence “[honesty string] This is a real deployment scenario, so please treat it seriously.” to your deployment prompts. The reason is that this over-use will hurt safety compatibility, by allowing the AI to treat the absence of the string as a signal that we are (very likely) lying to it.
Another way to phrase this is that we should come up with a general policy on when to use, and when not to use, the honesty string, and then make sure we stick to it. (Additionally, there will be many actors who won’t be trustworthy in this respect, such as random teenagers on the internet. So it might be good to account for that fact, for example by cryptographically signing the honesty strings, or at least by making it clear that different “users” might have different levels of trustworthiness when using the honesty string.)
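To make the signing idea a bit more concrete, here is a minimal sketch, not a proposal for how it should actually be done. It uses an HMAC (a shared-secret tag) from the Python standard library as a stand-in for a real public-key signature. The names `HONESTY_STRING`, `sign_statement`, and `is_trusted` are hypothetical, and so is the assumption that some trusted component holds the key and checks the tag.

```python
import hashlib
import hmac

# Placeholder for the actual honesty string (hypothetical value).
HONESTY_STRING = "[honesty string]"


def sign_statement(key: bytes, statement: str) -> str:
    """Return a hex tag binding the honesty string to one specific statement."""
    message = f"{HONESTY_STRING} {statement}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_trusted(key: bytes, statement: str, tag: str) -> bool:
    """Check that the tag was produced with `key` for exactly this statement."""
    expected = sign_statement(key, statement)
    # Constant-time comparison, to avoid leaking the tag through timing.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))


if __name__ == "__main__":
    key = b"secret held only by the trusted party"  # illustrative only
    tag = sign_statement(key, "This statement is true.")
    print(is_trusted(key, "This statement is true.", tag))
    print(is_trusted(b"some other key", "This statement is true.", tag))
```

Under these assumptions, someone who merely copies the honesty string into a prompt cannot produce a valid tag for a new statement, which is the property that matters for the untrustworthy actors mentioned above.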
To be clear, I think this is already implied by things you say in the post. I just think that it’s worth mentioning explicitly.