This shouldn’t be too hard if the default case from tech progress is extinction / totalitarianism.
Maybe, although I suspect this assumption makes it significantly harder to argue that a technologically sophisticated future is net negative in expectation. On ethical views that seem especially common in this community, extinction leads to an approximately net zero (not net negative) future, and it seems plausible to me that even a totalitarian future, with all the terrible loss of potential that would involve, would still be better than non-existence, i.e. not net negative.
I don’t know what a proof of “solving alignment” being impossible looks like
Just to clarify, I wouldn’t demand that—I’d be looking for at least an intuitive argument that solving alignment is intractable. I agree that’s still hard.
I still haven’t understood what it’s like inside the mind of someone who believes alignment is possible
As a tangent (since I want to focus on tractability rather than possibility, although impossibility would be more than enough to show intractability): the main reason I think that alignment (using roughly this definition of alignment) is possible is that humans can be aligned to other humans: sometimes we act in good faith to try to satisfy another person's preferences. So at least some general intelligences can be aligned, and I don't see what could be so special about humans that would make this property unique to us.
Returning from the tangent, I’m also optimistic about tractability because:
People haven’t been trying for that long, and the field is still very small
At least some prominent, relatively new research directions (e.g. [1], [2], [3]) seem promising
Some intuition: [...]
Yup this seems plausible, you get the bonus points :)