Here’s one line of argument:
Positive argument in favor of humans: It seems pretty likely that whatever I’d value on-reflection will be represented in a human future, since I’m a human. (And accordingly, I’m similar to many other humans along many dimensions.)
If AI values were sampled ~randomly (whatever that means), I think that the above argument would be basically enough to carry the day in favor of humans.
But here’s a salient positive argument for why AIs’ values will be similar to mine: People will be training AIs to be nice and helpful, which will surely push them towards better values.
However, I also expect people to be training AIs for obedience and, in particular, training them to not disempower humanity. So if we condition on a future where AIs disempower humanity, we evidently didn’t have that much control over their values. This significantly weakens the argument “they’ll be nice because we’ll train them to be nice”.
In addition: human disempowerment is more likely to succeed if AIs are willing to egregiously violate norms, such as by lying, stealing, and killing. So conditioning on human disempowerment also updates me somewhat towards egregiously norm-violating AI. That makes me feel less good about their values.
Another argument is that, in the near term, we’ll train AIs to act nicely on short-horizon tasks, but we won’t particularly train them to deliberate and reflect on their values well. So even if “AIs’ best-guess stated values” are similar to “my best-guess stated values”, there’s less reason to believe that “AIs’ on-reflection values” are similar to “my on-reflection values”. (Whereas the basic argument of my being similar to humans still works OK: “my on-reflection values” vs. “other humans’ on-reflection values”.)
Edit: Oops, I accidentally switched to talking about “my on-reflection values” rather than “total utilitarian values”. The former is ultimately what I care more about, though, so it is what I’m more interested in. But sorry for the switch.