The thing I have most changed my mind about since writing the post of mine you cite is adjacent to the “disvalue through evolution” category: basically, I’ve become more worried that disvalue is instrumentally useful. E.g. maybe the most efficient paperclip maximizer is one that’s really sad about the lack of paperclips.
There’s some old writing on this by Carl Shulman and Brian Tomasik; I would be excited for someone to do a more thorough write-up or literature review for the red teaming contest (or just in general).