It looks like the other comments have already offered a good amount of relevant reading material, but in case you’re up for more: I think the ideas expressed in this paper (video introduction here) are a big part of why some people think we don’t know how to train models to have any (somewhat complex) objectives we want them to have. That is a response to point (1), point (2) (if we interpret the quote in (2) as described in Rob’s comment), and partly point (3).
This report (especially pp. 1-8) might also make the potential difficulty of penalizing deception more intuitive.