The goal you specify in the prompt is not the goal that the AI is acting on when it responds. Consider: if someone tells you, “Your goal is now [x]”, does that change your (terminal) goals? No, because those don’t come from other people telling you things (or other environmental inputs)[1].
Understanding a goal that’s been put into writing, and having that goal, are two very different things.
This is a bit of an exaggeration, because humans don’t generally have very coherent goals, and will “discover” new goals or refine existing ones as they learn new things. But I think it’s basically correct to say that there’s no straightforward relationship between telling a human to have a goal and their actually having it, especially for adults (the analogue of a trained model).
Sorry, I’m still a little confused. If we establish an AI’s terminal goal from the get-go, why wouldn’t we have total control over it?
We don’t know how to do that. An AI’s goal is something that falls out of its training, but we currently don’t know how to predict what goal any particular training setup will produce, let alone aim for a specific one.