Maybe a question instead of an answer, but what longtermist questions does this seem like a crux for?
If AIs are unaligned with human values, that seems very bad already.
If they are aligned, then surely our future selves can figure this out?
Again, this could be a very dumb question, but without an answer to it, it doesn’t seem surprising how little attention is paid to AI sentience.
I think this is a great question. My answers:
I think some plausible alignment schemes could involve causing suffering to the AIs. Inflicting huge amounts of suffering on AIs seems pretty bad to me, both because it’s unethical and because it seems potentially inadvisable to make AIs justifiably mad at us.
If unaligned AIs are morally valuable, then it’s less bad to get overthrown by them, and perhaps we should be aiming to produce successors who we’re happier to be overthrown by. See here for discussion. (Obviously plan A is to align the AIs, but it seems good to know how important it is to succeed at this, and making unaligned but valuable successors seems like a not-totally-crazy plan B.)
I’m curious to what extent the value of the “happiness-to-be-overthrown-by” (H2BOB) variable for the unaligned AI that overthrew us would be predictive of the H2BOB value of future generations / evolutions of AI. Specifically, it seems at least plausible that the nature and rate of unaligned AI evolution could be so broad and fast that knowing the nature and H2BOB of the first AGI would tell us essentially nothing about prospects for AI welfare in the long run.
I like this answer and will read the link in bullet 2. I’m very interested in further reading in bullet 1 as well.
Hi Buck,
Are you confident that being overthrown by AIs is bad? I am quite uncertain. For example, maybe most people would say that humans overpowering other animals was good overall.
Reframing your question as an answer: there isn’t much work on AI sentience because we can probably solve it later without much loss, work on AI sentience trades off against other AI work (mostly because many of the people who could work on AI sentience could also work on other AI problems), and that other work can’t be saved for later.
I think it’s entirely plausible we just won’t care to figure it out, especially in some kind of singleton scenario where the entity in control decides to optimize human or personal welfare at the expense of other sentient beings. Just consider how humans currently treat animals, and now imagine there is no opportunity to lobby for AI welfare; we’re simply locked in.
Ultimately, I am very uncertain, but I would not say that solving AI alignment/control will “surely” lead to a good future.
Scenario 1: Alignment goes well. In this scenario, I agree that our future AI-assisted selves can figure things out, and that pre-alignment AI sentience work will have been wasted effort.
Scenario 2: Alignment goes poorly. While I don’t technically disagree with your statement, “If AIs are unaligned with human values, that seems very bad already,” I do think it misleads by lumping all kinds of misaligned AI outcomes together as “very bad,” when in reality this category spans many orders of magnitude of badness.[1] If we do lose control of the future at some point, it seems worthwhile to me to try, before then, to steer away from some of the worst outcomes (e.g., astronomical “byproduct” suffering of digital minds, which is likely easier to avoid if we better understand AI sentience).
[1] From the roughly neutral outcome of paperclip maximization to the extremely bad outcome of optimized suffering.