Michael_Cohen comments on On how various plans miss the hard bits of the alignment challenge

Michael_Cohen 2 Aug 2022 16:42 UTC
9 points
I constructed an agent where you can literally prove that if you set a parameter high enough, it won’t try to kill everyone, while still eventually at least matching human-level intelligence. Sure, it uses a realizability assumption; sure, it’s intractable in its current form; sure, it might require an enormously long training period. But these are computer science problems, not philosophy problems, and they clearly suggest paths forward. The underlying concept is sound. It struck me as undignified to say this in the past, but maybe dignity rightly construed should compel me to: it absolutely boggles me that ~no one in the EA community talks about this. It’s not in this blog post; it’s not in Richard’s curriculum; it wasn’t in Evan’s list of promising AGI safety ideas.
I agree with your perspective on all of these approaches, except my initial reaction is to be more pessimistic about natural abstractions. It seems to me that a good understanding of natural abstractions is not good enough for putting a handle on a part of an agent’s mind. We’d also need to understand “natural types”, the type signatures that agents’ brains use to represent those abstractions. And I think that there is a long, long list of types, in which each is as natural as the rest.
There’s an interpretability benchmark that occurred to me recently, which I may as well mention here, because I agree that approximately none of the interpretability research I see strikes me as progress toward strategically relevant interpretation of AGI. Try to understand what corvids are saying to each other.