Dear Will,
thanks for these thoughtful comments. I’m not sure I’ve understood every aspect of what you say, but let me try to make sense of it using the example of Zhuang et al., http://arxiv.org/abs/2102.03896. If the utility function is defined only in terms of a proper subset of the attributes, the optimizer will exploit the seemingly irrelevant remaining attributes, whether or not some of the attributes it does use represent conflicting goals. Even when conflicting goals are “present across all dimensions of the agent’s utility function”, that utility function might simply ignore relevant side effects, e.g. because the designers and teachers have not anticipated them at all.
Their example in Fig. 2 shows this nicely. In contrast, with a satisficing goal of achieving only, say, a proxy value of 6 in Fig. 2, the agent will not exploit the unrepresented attributes nearly as much, and the actual utility will end up much larger.
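To make the mechanism concrete, here is a minimal toy sketch in Python. It is not their exact model: the budget, the log-shaped utilities, the threshold, and the “move as little as needed from the default” rule for the satisficer are all my own illustrative assumptions (the threshold plays the role of the “6” from their Fig. 2 at this toy’s scale). A fixed effort budget is split across four attributes, the proxy utility only sees the first two, and we compare a proxy-maximizer with a proxy-satisficer.

```python
# Toy sketch (not Zhuang et al.'s exact model): a fixed effort budget is
# split across four attributes, but the proxy utility only "sees" the
# first two.  Budget, functional forms, threshold, and the satisficer's
# tie-break rule are all illustrative assumptions.
import itertools
import math

BUDGET = 12.0    # total effort to allocate (assumed)
STEP = 0.5       # grid resolution for the brute-force search
DEFAULT = (3.0, 3.0, 3.0, 3.0)   # status-quo allocation the agent starts from
THRESHOLD = 3.0  # satisficing level on the proxy (assumed)

def true_utility(x):
    # diminishing returns in every attribute; collapses when an attribute -> 0
    return sum(math.log(xi + 0.01) for xi in x)

def proxy_utility(x):
    # the designers only wrote down the first two attributes
    return sum(math.log(xi + 0.01) for xi in x[:2])

def allocations():
    # every way to split BUDGET over four attributes on a coarse grid
    n = int(BUDGET / STEP)
    for a, b, c in itertools.product(range(n + 1), repeat=3):
        if a + b + c <= n:
            yield (a * STEP, b * STEP, c * STEP, (n - a - b - c) * STEP)

grid = list(allocations())

# 1) Maximizer: push the proxy as high as possible -> strips the
#    unrepresented attributes down to zero.
maximizer = max(grid, key=proxy_utility)

# 2) Satisficer: reach the threshold while moving as little as possible from
#    the default allocation (one simple way to operationalize "good enough";
#    note the agent never consults true_utility).
feasible = [x for x in grid if proxy_utility(x) >= THRESHOLD]
satisficer = min(feasible,
                 key=lambda x: sum((xi - d) ** 2 for xi, d in zip(x, DEFAULT)))

for name, x in (("maximizer ", maximizer), ("satisficer", satisficer)):
    print(f"{name}: x={x}  proxy={proxy_utility(x):.2f}  true={true_utility(x):.2f}")
```

With these numbers the maximizer ends up at (6, 6, 0, 0): the proxy is as high as it can get, but the two unrepresented attributes are driven to zero and the true utility collapses. The satisficer stays near the default allocation (roughly (4.5, 4.5, 1.5, 1.5)), clears the threshold, and ends up with much higher true utility, which is the point I was trying to make above.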