I guess a counter to “asking for permission” as a solution is: how do you stop the AI from manipulating or deceiving people into giving it permission? Or acting in unsafe ways to minimise its uncertainty (or even to keep its uncertainty within certain bounds)? It’s like the alignment problem just shifts elsewhere (also, mesaoptimization, or inner alignment, isn’t really addressed by IRL).
Re learning from bad people, I think a bigger problem is instilling any human-like motivation into AI systems at all.
You’re making me want to listen to the podcast episode again. From a quick look at the transcript, Russell thinks the three principles of AI should be:
1. The machine’s only objective is to maximize the realization of human preferences.
2. The machine is initially uncertain about what those preferences are.
3. The ultimate source of information about human preferences is human behavior.
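To make principles 1–3 a bit more concrete, here’s a minimal sketch of them as a Bayesian loop in Python (my own toy formalisation with made-up candidate reward functions, not Russell’s actual CIRL maths): the machine has no reward of its own, keeps a belief over candidate human reward functions, updates that belief from watching what the human does, and acts to maximise expected human reward under that belief.

```python
import numpy as np

# Toy, purely illustrative version of the three principles (my own made-up
# formalisation, not Russell's maths). The machine has no reward of its own:
# it holds a belief over candidate human reward functions (principle 2),
# updates that belief from observed human choices (principle 3), and picks
# whatever action maximises expected human reward under the belief (principle 1).

actions = ["make coffee", "make tea", "do nothing"]

# Hypothetical candidate human reward functions: a reward for each action.
candidate_rewards = {
    "likes coffee": np.array([1.0, 0.2, 0.0]),
    "likes tea":    np.array([0.2, 1.0, 0.0]),
}
belief = {name: 0.5 for name in candidate_rewards}  # initially uncertain

def update_belief(observed_human_choice):
    """Bayesian update, assuming the human noisily picks high-reward actions."""
    idx = actions.index(observed_human_choice)
    for name, rewards in candidate_rewards.items():
        likelihood = np.exp(rewards)[idx] / np.exp(rewards).sum()  # softmax human model
        belief[name] *= likelihood
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

def best_action():
    """Principle 1: maximise expected human reward under the current belief."""
    expected = sum(p * candidate_rewards[name] for name, p in belief.items())
    return actions[int(np.argmax(expected))]

update_belief("make tea")   # we watch the human make themselves tea, twice
update_belief("make tea")
print(belief)                # belief now strongly favours "likes tea"
print(best_action())         # -> "make tea"
```

Obviously real preference learning is vastly harder than this, but it shows how principles 2 and 3 feed into principle 1.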
It certainly seems such an IRL-based AI would be more open to being told what to do than a traditional RL-based AI.
RL-based AI generally doesn’t want to obey requests or have its goal be changed, because this hinders/prevents it from achieving its original goal. IRL-based AI literally has the goal of realising human preferences, so it would need to have a pretty good reason (from its point of view) not to obey someone’s request.
Certainly early on, an IRL-based AI would obey any request you make, provided you have baked a high enough degree of uncertainty into the AI (principle 2). After a while, the AI becomes more confident about human preferences and so may well start to manipulate or deceive people when it thinks they are not acting in their own best interests. This sounds really concerning, but in theory it might be good, provided you have given the AI enough time to learn.
For example, after a sufficient amount of time learning about human preferences, an AI may say something like “I’m going to throw your cigarettes away because I have learnt that people really value health and cigarettes are really bad for health”. The person might say “no, don’t do that, I really want a ciggie right now”. If the AI ultimately knows that the person really shouldn’t smoke for their own wellbeing, it may well want to manipulate or deceive the person into throwing away their cigarettes, e.g. by giving an impassioned speech about the dangers of smoking.
This sounds concerning but, provided the AI has had enough time to properly learn about human preferences, it should, in theory, do the manipulation in a minimally harmful way. It may, for example, learn that humans really don’t like being tricked, so it will try to change the human’s mind just by giving them the objective facts about how bad smoking is, rather than by more devious means. The most important thing seems to be that the IRL-based AI has sufficient uncertainty baked into it for a sufficient amount of time, so that it only starts pushing back on human requests when it is sufficiently confident it is doing the right thing.
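To put a toy number on the “uncertainty buys deference” point (this is my own illustrative set-up, loosely in the spirit of the off-switch-game argument, not anything from the book or podcast): suppose the AI is considering an action whose true value to the human it doesn’t know, and it can either act unilaterally or ask first, with the human vetoing anything they don’t actually want.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model of deference under uncertainty (my own illustrative numbers).
# The AI considers an action whose true value to the human, u, it doesn't
# know; it only has a Gaussian belief over u. It can either act unilaterally
# (collecting the believed mean) or defer, i.e. ask first, in which case the
# human (assumed to know their own preferences) vetoes the action whenever u < 0.

def value_of_acting(mean):
    """Acting without asking just collects the believed mean value."""
    return mean

def value_of_deferring(mean, std, n=200_000):
    """Monte Carlo estimate of E[max(u, 0)] under the AI's belief."""
    u = rng.normal(mean, std, n)
    return np.maximum(u, 0.0).mean()

mean = 0.3  # the AI thinks the action is mildly good for the human
for std in [2.0, 1.0, 0.5, 0.1]:  # belief narrows as the AI learns
    gap = value_of_deferring(mean, std) - value_of_acting(mean)
    print(f"belief std = {std:4.1f} -> extra value of asking first = {gap:+.3f}")
```

On this picture, consulting the human dominates early on, but the incentive to ask (or to accept being overruled) fades towards zero as the AI becomes confident, which is exactly where the manipulation worry above kicks in.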
I’m far from certain that IRL-based AI is watertight (my biggest concern remains the AI learning from irrational/bad people), but on my current level of (very limited) knowledge it does seem the most sensible approach.
Interesting about the “System 2” vs “System 1” preference fulfilment (your cigarettes example). But all of this is still just focused on outer alignment. How does the inner shoggoth get prevented from mesaoptimising on an arbitrary goal?
I’m afraid I’m not well read on the problem of inner alignment and why optimizing on an arbitrary goal is a realistic worry. Can you explain why this might happen / provide a good, simple resource that I can read?
The LW wiki entry is good. Also the Rob Miles video I link to above explains it well with visuals and examples. I think there are 3 core parts to the AI x-risk argument: the orthogonality thesis (Copernican revolution applied to mind-space; why outer alignment is hard), Basic AI Drives (convergent instrumental goals leading to power seeking), and Mesaoptimizers (why inner alignment is hard).
Thanks. I watched Robert Miles’ video which was very helpful. Especially the part where he explains why an AI might want to act in accordance with its base objective in a training environment only to then pursue its mesa objective in the real world.
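In case it helps, here is a deliberately tiny toy of how the base and mesa objectives can coincide in training and come apart in deployment (my own example, not from the video, and it only captures the goal-misgeneralisation part, not the “playing along on purpose” part Miles discusses): during training the exit always sits on the green tile, so a learnt “go to the green tile” objective scores perfectly, and it only fails the real objective once that correlation breaks.

```python
import random

random.seed(0)

# Toy cousin of the problem Miles describes (my own example, not from the
# video). The base objective is "reach the exit". During training the exit
# always happens to be on the green tile, so a mesa objective of "go to the
# green tile" scores perfectly -- the two objectives only come apart once
# the deployment distribution breaks that correlation.

def episode(policy, exit_tile, green_tile):
    """Returns True if the policy's chosen tile satisfies the base objective."""
    return policy(green_tile) == exit_tile

go_to_green = lambda green_tile: green_tile  # the learnt mesa objective

def score(distribution):
    """Fraction of episodes in which the mesa policy reaches the exit."""
    return sum(episode(go_to_green, e, g) for e, g in distribution) / len(distribution)

train = []
for _ in range(1000):
    tile = random.randrange(10)
    train.append((tile, tile))            # exit and green tile coincide in training

deploy = [(random.randrange(10), random.randrange(10)) for _ in range(1000)]

print(f"training:   reaches the exit {score(train):.0%} of the time")
print(f"deployment: reaches the exit {score(deploy):.0%} of the time")
```

The worrying extra step in the video is that a sufficiently capable mesa-optimiser might notice the difference between training and deployment and deliberately behave well only while it is being trained.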
I’m quite uncertain at this point, but I have a vague feeling that Russell’s second principle (The machine is initially uncertain about what those preferences are) is very important here. It is a vague feeling though...