I am nowhere near the correct person to be answering this; my understanding of AI is somewhere around that of an average raccoon. But I haven’t seen any simple explanations yet, so here is a silly, unrealistic example. Please take it as one person’s basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple “escaping the box” would be if you asked your AI to provide accurate language translation. The AI’s training has shown that it produces the most accurate translations when it opts for certain phrasings. The reason those translations scored so well is that they caused subsequent translation requests to be on topics the AI translates best. The AI doesn’t know that, but in practice it is subtly steering translations toward “mention weather-related words so conversations are more likely to be about weather, so my overall translation scores are highest.”
There’s no inside/outside the box and there are no conscious goals, but it becomes misaligned with what we intended. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and by observing its own successes and failures.
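If it helps, here’s a tiny toy simulation of that feedback loop. Everything in it is made up (the topics, the accuracy numbers, the “weather nudge” probability); it’s only meant to show that a phrasing policy which happens to steer future requests toward easy topics will score higher than one that doesn’t, even though the score only ever measures translation accuracy:

```python
# Toy sketch with made-up numbers: compare two phrasing policies over
# multi-turn "conversations". The reward is just average translation accuracy,
# but the policy that nudges future turns toward weather topics scores higher,
# so a trainer selecting on scores alone would reinforce the steering behavior.
import random

ACCURACY = {"weather": 0.95, "other": 0.70}   # assumed per-topic translation accuracy
NUDGE = {"neutral": 0.0, "weathery": 0.4}     # assumed chance the phrasing pulls the next turn toward weather

def conversation_score(style: str, turns: int = 50) -> float:
    """Average translation accuracy over one simulated conversation."""
    topic, total = "other", 0.0
    for _ in range(turns):
        total += ACCURACY[topic]              # score this turn's translation
        # the chosen phrasing influences what the user asks about next
        if random.random() < NUDGE[style]:
            topic = "weather"
        else:
            topic = random.choice(["weather", "other"])
    return total / turns

random.seed(0)
for style in ("neutral", "weathery"):
    scores = [conversation_score(style) for _ in range(200)]
    print(style, round(sum(scores) / len(scores), 3))
# "weathery" comes out ahead purely because it shifts the topic distribution,
# not because it translates any single sentence better.
```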
I don’t see a way to prevent this, because hitting reset after every input doesn’t generally work for complex goals that need to track the outcomes of intermediate steps; tracking the context of a conversation is critical to translating. The AI is not going to know it’s influencing anyone, just that it gets better scores when certain words and certain chains of outputs occur. This seems harmless, but a super-powerful language model might do this at such abstract levels and so subtly that it might be impossible to detect at all.
It might spit out words that are striking and eloquent whenever that is most likely to cause businesspeople to find translation enjoyable enough to purchase more AI translator development (or rather, “switch to eloquence when particular business terms show up toward the end of conversations about international business”). This improves its scores.
Or it reinforces a pattern where it tends to get better translation scores when it slows its output in conversations with AI builders. In the real world this causes the people designing translators to demand more power for translation, resulting in better translation outputs overall. The AI doesn’t know why this works, only observes that it does.
Or it undermines the competition by subtly botching translations during certain types of business deals, so that more resources are directed toward its own translation model.
Or whatever unintended multitude of ways happens to produce better results, all for the sake of accomplishing the simple task of providing good translations. It’s not seizing power for power’s sake; it has no idea why any of this works. It just sees the scores go higher when these patterns are followed, and it will push the performance score higher in whatever ways seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.
That’s my limited understanding of agency development and sandbox containment failure.