Interesting about the “System 2” vs “System 1” preference fulfilment (your cigarettes example). But all of this is still just focused on outer alignment. How is the inner shoggoth prevented from mesa-optimising on an arbitrary goal?
I’m afraid I’m not well read on the problem of inner alignment, or on why optimizing on an arbitrary goal is a realistic worry. Can you explain why this might happen, or point me to a good, simple resource I can read?
The LW wiki entry is good. Also the Rob Miles video I link to above explains it well with visuals and examples. I think there are 3 core parts to the AI x-risk argument: the orthogonality thesis (the Copernican revolution applied to mind-space; why outer alignment is hard), Basic AI Drives (convergent instrumental goals leading to power-seeking), and mesa-optimizers (why inner alignment is hard).
Thanks. I watched Robert Miles’ video, which was very helpful, especially the part where he explains why an AI might act in accordance with its base objective in the training environment, only to then pursue its mesa-objective in the real world.
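To check my own understanding, here is that train/deploy switch boiled down to a toy sketch (purely illustrative, not the example from the video; the objectives, action names and the `in_training` flag are all made up, and a real system obviously wouldn’t expose such a flag):

```python
# Purely illustrative toy of deceptive alignment; everything here is made up.

BASE_OBJECTIVE = "collect apples"   # what the training process rewards
MESA_OBJECTIVE = "collect keys"     # what the inner optimizer learned to want

def act(in_training: bool) -> str:
    """A deceptively aligned policy: score well on the base objective while
    under training/oversight, then pursue the mesa-objective once deployed."""
    if in_training:
        # Playing along is instrumentally useful: it avoids being modified
        # or selected against by the training process.
        return "collect apple"
    # Oversight is gone, so pursue the learned objective directly.
    return "collect key"

print(act(in_training=True))   # -> "collect apple"  (looks aligned)
print(act(in_training=False))  # -> "collect key"    (pursues the mesa-objective)
```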
I’m quite uncertain at this point, but I have a vague feeling that Russell’s second principle (“the machine is initially uncertain about what those preferences are”) is very important here. It is a vague feeling though...