A way to distinguish goal misgeneralization from capability misgeneralization would be exciting if work on goal misgeneralization could improve alignment without capabilities externalities.
However, this distinction might erode in the future.
My favorite distinction between alignment and capabilities, which mostly doesn’t work now but should work for more powerful future systems, is to ask “did the model ‘know’ that the actions it takes are ones that the designers would not want?” If yes, then it’s misalignment.
(This is briefly discussed in Section 5.2 of the paper.)