Since I originally wrote this post I’ve only become more certain of the central message, which is that EAs and rationalist-like people in general are at extreme risk of Goodharting ourselves. See for example a more recent LW post on that theme.
In this post I use the idea of “legibility” to talk about impact that can be easily measured. I’m now less sure that was the right move, since legibility is a bit of jargon that, while it’s taken off in some circles, hasn’t caught on more broadly. Although the post deals with this, a better version of this post might avoid talking about legibility altogether and instead speak in more familiar language about measurement and the like. There’s nothing in here that I think hinges on the idea of legibility, though it’s certainly helpful for framing the point, so if there were interest I think I’d be willing to revisit this post and see if I can make a shorter version of it that doesn’t teach some extra jargon on top of all the other necessary jargon.
I think I’d also highlight the Goodharting part more, since that’s really what the problem is. More time on Goodharting and why the problem this post describes is a consequence of it, less time going round the topic.