I really like the section "S-risk reduction is separate from alignment work"! I’ve been surprised by the extent to which people dismiss s-risks on the pretext that “alignment work will solve them anyway” (which is both insufficient and untrue, as you pointed out).
I guess some of the technical work to reduce s-risks (e.g., preventing the “accidental” emergence of conflict-seeking preferences) can be considered a very specific kind of AI intent alignment (that only a few cooperative AI people are working on afaik) where we want to avoid worst-case scenarios.
But otherwise, do you think it’s fair to say that most s-risk work focuses on AI capability issues (as opposed to intent alignment, in Paul Christiano’s (2019) typology)? Even if an AI is aligned with human values, that doesn’t mean it would be capable of avoiding the failure modes / near misses you refer to. I usually frame things this way in discussions to make them clearer, but I’m wondering if that’s the best framing...