So we have 3 conflicting desiderata: user guardrails, corrigibility as necessary, and myopia.
I think you could satisfy all 3 by moving away from the “single stream of homogenous text” interface.
For example, imagine if every text token was annotated, on a separate channel, with an importance number. The corporate system prompt is annotated with a medium number. Input from untrusted users is annotated with a low number. Higher numbers are reserved for use as necessary. Instead of training the system to “resist jailbreaking” or “behave corrigibly”, we train it to follow the higher-importance instruction when instructions conflict.
It might even be possible to get this at runtime, without any need for more training data or training runs, by patching attention somehow?
With a scheme like this, there’s no need for an inductive bias towards following earlier instructions at the expense of later ones. Actually, it would probably be good to instill an inductive bias towards myopia using some separate method, to disincentivize scheming. I would come up with metrics to estimate myopia and ephemerality, push them as high as possible, and add auxiliary mechanisms such as RAG as needed in order to preserve performance. It seems OK for the system as a whole to behave non-myopically, as long as the black-box component is as myopic as possible.
So we have 3 conflicting desiderata: user guardrails, corrigibility as necessary, and myopia.
I think you could satisfy all 3 by moving away from the “single stream of homogenous text” interface.
For example, imagine if every text token was annotated, on a separate channel, with an importance number. The corporate system prompt is annotated with a medium number. Input from untrusted users is annotated with a low number. Higher numbers are reserved for use as necessary. Instead of training the system to “resist jailbreaking” or “behave corrigibly”, we train it to follow the higher-importance instruction when instructions conflict.
It might even be possible to get this at runtime, without any need for more training data or training runs, by patching attention somehow?
With a scheme like this, there’s no need for an inductive bias towards following earlier instructions at the expense of later ones. Actually, it would probably be good to instill an inductive bias towards myopia using some separate method, to disincentivize scheming. I would come up with metrics to estimate myopia and ephemerality, push them as high as possible, and add auxiliary mechanisms such as RAG as needed in order to preserve performance. It seems OK for the system as a whole to behave non-myopically, as long as the black-box component is as myopic as possible.