One additional point is that, to the extent that EA AI advocacy acts to restrict, delay, and obstruct AI—even if this is not their intention—the effect may be to entrench an early form of cosmic NIMBYism in which AI population growth is impeded. If continued, this advocacy could have a lasting effect on whether our civilization grows to be as large as possible, by establishing norms against growth and innovation.
When viewed from this perspective, the consequence of much of EA AI advocacy may be to reduce the expected size of our vibrant, rich post-human future, rather than increase it. I currently think this consequence is fairly plausible too; as a result, I’m not convinced that obstructing AI is a good policy on total utilitarian views.
Thanks, interesting idea, I think I mostly disagree and would like to see AI progress specifically slowed/halted while continuing to have advances in space exploration, biology, nuclear power, etc and that if we later get safe TAI we won’t have become too anti-technology/anti-growth to expand a lot. But I hadn’t thought about this before and there probably is something to this, I just think it is most likely swamped by the risks from AI. It is a good reason to be careful in pause AI type pitches to be narrowly focused on frontier AI models rather than tech and science in general.
I suppose when I think about pro-expansion things I would like to see, they are really only ones that do not (IMO) increase x-risks—better institutions, more pro-natalism, space exploration, maybe cognitive enhancement.
Thanks, interesting idea, I think I mostly disagree and would like to see AI progress specifically slowed/halted while continuing to have advances in space exploration, biology, nuclear power, etc
Of those technologies, AI seems to be the only one that could be transformative, in the sense of sustaining dramatic economic growth and bringing about a giant, vibrant cosmic future. In other words, it seems you’re saying we should slow down the most promising technology—the only technology that could actually take us to the future you’re advocating for—but make sure not to slow down the less promising ones. The fact that people want to slow down (and halt!) precisely the technology that is most promising is basically the whole reason I’m worried here—I think my argument would be much less strong if we were talking about slowing down something like nuclear power.
I hadn’t thought about this before and there probably is something to this, I just think it is most likely swamped by the risks from AI.
It’s important to be clear about what we mean when we talk about the risks from AI. Do you mean:
The risk that AI could disempower humanity in particular?
The risk that AI could derail a large, vibrant cosmic civilization?
I think AI does pose a large risk in the sense of (1), but (2) is more important from a total utilitarian perspective, and it doesn’t seem particularly likely to me that AIs pose a large risk in the sense of (2) (as the AIs themselves, after disempowering humanity, would presumably go on to create a big, vibrant civilization).
If you care about humanity as a species in particular, I understand the motive behind slowing down AI. On the other hand, if you’re a total utilitarian (or you’re concerned about the present generation of humans who might otherwise miss out on the benefits of AI), then I’m not convinced, as you seem to be, that the risks from AI outweigh the considerations that I mentioned.
It is a good reason to be careful in pause AI type pitches to be narrowly focused on frontier AI models rather than tech and science in general.
Again, the frontier AI models, in my view, are precisely what is most promising from a pro-growth perspective. So if you are worried about EAs choking off economic growth and spurring cosmic NIMBYism by establishing norms against growth, it seems from my perspective that you should be most concerned about attempts to obstruct frontier AI research.
What’s the argument for why an AI future will create lots of value by total utilitarian lights?
At least for hedonistic total utilitarianism, I expect that a large majority of expected-hedonistic-value (from our current epistemic state) will be created by people who are at least partially sympathetic to hedonistic utilitarianism or other value systems that value a similar type of happiness in a scope-sensitive fashion. And I’d guess that humans are more likely to have such values than AI systems. (At least conditional on my thinking that such values are a good idea, on reflection.)
Objective-list theories of welfare seem even less likely to be endorsed by AIs. (Since they seem pretty niche to human values.)
There are certainly some values you could have that would mainly be concerned that we got any old world with a large civilization. Or that would think it morally appropriate to be happy that someone got to use the universe for what they wanted, and morally inappropriate to be too opinionated about who that should be. But I don’t think that looks like utilitarianism.
We can similarly ask, “Why would an em future create lots of value by total utilitarian lights?” The answer I’d give is: it would happen for essentially the same reasons biological humans might do such a thing. For example, some biological humans are utilitarians. But some ems might be utilitarians too. Therefore, both could create lots of value by total utilitarian lights.
In order to claim that ems have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans, you’d need to posit a distinction between ems and biological humans that makes this plausible. Some candidate distinctions, such as the idea that ems would not be conscious because they run on a computer, seem implausible in any form that would imply the conclusion. So, at least as far as I can tell, there is no such distinction; and thus, ems seem similarly likely to create lots of value by total utilitarian lights, compared to biological humans.
The exact same analysis can likewise be carried over to the case for AIs. Some biological humans are utilitarians, but some AIs might be utilitarians too. Therefore, both could create lots of value by total utilitarian lights.
In order to claim that AIs have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans, you’d need to posit a distinction between AIs and biological humans that makes this possibility plausible. A number of candidate distinctions have been given to me in the past. These include:
The idea that AIs will not be conscious
The idea that AIs will care less about optimizing for extreme states of moral value
The idea that AIs will care more about optimizing imperfectly specified utility functions, which won’t produce much utilitarian moral value
In each case I generally find that the candidate distinction is either poorly supported, or that it does not provide strong support for the conclusion. So, just as with ems, I find the idea that AIs have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans to be weak. I do not claim that there is definitely no such distinction that would convince me of this premise; I just have yet to hear one that compels me.
Here’s one line of argument:
Positive argument in favor of humans: It seems pretty likely that whatever I’d value on-reflection will be represented in a human future, since I’m a human. (And accordingly, I’m similar to many other humans along many dimensions.)
If AI values were sampled ~randomly (whatever that means), I think that the above argument would be basically enough to carry the day in favor of humans.
But here’s a salient positive argument in favor of why AIs’ values will be similar to mine: People will be training AIs to be nice and helpful, which will surely push them towards better values.
However, I also expect people to be training AIs for obedience and, in particular, training them not to disempower humanity. So if we condition on a future where AIs disempower humanity, we evidently didn’t have that much control over their values. This significantly weakens the strength of the argument “they’ll be nice because we’ll train them to be nice”.
In addition: human disempowerment is more likely to succeed if AIs are willing to egregiously violate norms, such as by lying, stealing, and killing. So conditioning on human disempowerment also updates me somewhat towards egregiously norm-violating AI. That makes me feel less good about their values.
Another argument is that, in the near term, we’ll train AIs to act nicely on short-horizon tasks, but we won’t particularly train them to deliberate and reflect on their values well. So even if “AIs’ best-guess stated values” are similar to “my best-guess stated values”, there’s less reason to believe that “AIs’ on-reflection values” are similar to “my on-reflection values”. (Whereas the basic argument of my being similar to humans still works okay: “my on-reflection values” vs. “other humans’ on-reflection values”.)
Edit: Oops, I accidentally switched to talking about “my on-reflection values” rather than “total utilitarian values”. The former is ultimately what I care more about, though, so it is what I’m more interested in. But sorry for the switch.
The ones I would say are something like (approximately in priority order):
AIs’ values could result mostly from playing the training game or from other relatively specific optimizations performed in training, which might result in extremely bizarre values from our perspective.
More generally, AI values might be highly alien in a way where caring about experience seems very strange to them.
AIs by default will be optimized for very specific commercial purposes, with narrow specializations and a variety of hyperspecific heuristics, and the resulting values and generalizations of these will be problematic.
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
As a sub-argument: I might care specifically about things which are much more specific than “lots of good diverse experience”. And divergences from what I care about (even conditioning on something roughly utilitarian) might result in massive discounts from my perspective.
I care less about my values and preferences in worlds where they seem relatively contingent, e.g. they aren’t broadly shared on reflection by reasonable fractions of humanity.
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well, while evolution had to make values more discoverable and adaptable.
E.g. various things about empathy.
AIs might have extremely low levels of cognitive diversity in their training environments (as far as co-workers go), which might result in very different attitudes toward caring about diverse experience.
Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.
Also, I should note that this isn’t a very strong list, though in aggregate it’s sufficient to make me think that human control is perhaps 4x better than AI control. (For reference, I’d say that me personally being in control is maybe 3x better than human control.) I disagree with a MIRI-style view about the disvalue of AI and the extent of the fragility of value that seems implicit in it.
AIs’ values could result mostly from playing the training game or other relatively specific optimizations they performed in training
Don’t humans also play the training game when being instructed to be nice/good/moral? (Humans don’t do it all the time, and maybe some humans don’t do it at all; but then again, I don’t think every AI would play the training game all the time either.)
AIs by default will be optimized for very specific commercial purposes with narrow specializations and a variety of hyperspecific heuristics and the resulting values and generalizations of these will be problematic
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
What are these a priori reasons and why don’t they similarly apply to AI?
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well while evolution had to make values more discoverable and adaptable.
I haven’t thought about this one much, but it seems like an interesting consideration.
AIs might have extremely low levels of cognitive diversity in their training environments (as far as co-workers go), which might result in very different attitudes toward caring about diverse experience.
This consideration feels quite weak to me, although you also listed it last, so I guess you might agree with my assessment.
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I can’t quickly elaborate in a clear way, but some messy combination of:
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
AIs will be primarily trained in things which look extremely different from “cooperatively achieving high genetic fitness”.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
(Again, note that I said in my comment above: “Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.” I edited this in to my prior comment, so you might have missed it, sorry.)
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative. Analogy: if you rolled a die, and I didn’t see the result, the expected value is not low just because I am uncertain about what happened. What matters here is the expected value, not necessarily the variance.
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
I guess I am having trouble understanding this point.
AIs will be primarily trained in things which look extremely different from “cooperatively achieving high genetic fitness”.
Sure, but the question is why being different makes it worse along the relevant axes that we were discussing. The question is not just “will AIs be different than humans?” to which the answer would be “Obviously, yes”. We’re talking about why the differences between humans and AIs make AIs better or worse in expectation, not merely different.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
I am having a hard time parsing this claim. What do you mean by “final applications”? And why won’t this be true for future AGIs that are at human-level intelligence or above? And why does this make a difference to the ultimate claim that you’re trying to support?
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
This consideration seems very weak to me. Early AGIs will presumably be directly optimized for something like consumer value, which looks a lot closer to “utilitarianism” to me than the implicit values in gene-centered evolution. When I talk to GPT-4, I find that it’s way more altruistic and interested in making others happy than most humans are. This seems at least somewhat like utilitarianism to me—more so than your description of what human evolution was optimizing for, at any rate. But maybe I’m just not understanding the picture you’re painting well enough. Or maybe my model of AI is wrong.
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative.
Actually, I was just trying to say “I can see what humans are like, and it seems pretty good relative to my current guesses about AIs, in ways that don’t just update me up about AIs”. Sorry about the confusion.
I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
Why is this different between AIs and humans? Do you expect AIs to care less about experience than humans, maybe because humans get reward during lifetime learning but AIs don’t get reward during in-context learning?
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
Can you say more about how slack (or genetic bottleneck) would affect whether AIs have values that are good by human lights?
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
In what sense do AIs use their reasoning power in this way? How does that affect whether they will have values that humans like?
“Human” is just one category you belong to. You’re also a member of the category “intelligent beings”, which you will share with AGIs. Another category you share with near-future AGIs is “beings who were trained on 21st century cultural data”. I guess 12th century humans aren’t in that category, so maybe we don’t share their values?
Perhaps the category that matters is your nationality. Or maybe it’s “beings in the Milky Way”, and you wouldn’t trust people from Andromeda? (To be clear, this is rhetorical, not an actual suggestion)
My point here is that I think your argument could benefit from some rigor by specifying exactly what about being human makes someone share your values in the sense you are describing. As it stands, this reasoning seems quite shallow to me.
Currently, humans seem much closer to me on a values level than GPT-4 base. I think this is also likely to be true of future AIs, though I understand why you might not find this convincing.
I think the architecture (learning algorithm, etc.) and training environment between me and other humans seems vastly more similar than between me and likely AIs.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
Getting a bit meta here: I’m curious (if you’d like to answer) whether you feel that you won’t explain your views rigorously in a convincing way here mainly because (1) you are uncertain about these specific views, (2) you think your views are inherently difficult or costly to explain despite nonetheless being very compelling, (3) you think I can’t understand your views easily because I’m lacking some bedrock intuitions that are too costly to convey, or (4) some other option.
My views are reasonably messy, complicated, hard to articulate, and based on a relatively diffuse set of intuitions. I think we also reason about the situation in a pretty different way than you seem to (option 3). I think it wouldn’t be impossible to write up a post on my views, but I would need to consolidate them and think about how exactly to express where I’m at. (Maybe 2–5 person-days of work.) I haven’t really consolidated my views or reached something close to reflective equilibrium.
I also just think that arguing about pure philosophy very rarely gets anywhere and is very hard to make convincing in general.
I’m somewhat uncertain at the “inside view/mechanistic” level. (But my all-things-considered view involves partially deferring to some people, which makes me overall less worried that I should immediately reconsider my life choices.)
I think my views are compelling, but I’m not sure if I’d say “very compelling”
My guess now of where we most disagree is regarding the value of a world where AIs disempower humanity and go on to have a vast, technologically super-advanced, rapidly expanding civilisation. I think this would quite likely be ~0 value, since we don’t really understand consciousness at all; my guess is that AIs aren’t yet conscious, and if we relatively quickly get to TAI in the current paradigm, they probably still won’t be moral patients. As a sentientist I don’t really care whether there is a huge future if humans aren’t in it (or something sufficiently related to humans—e.g. we carefully study consciousness for a millennium and create digital people we are very confident have morally important experiences to be our successors).
So yes, I agree frontier AI models are where the most transformative potential lies, but I would prefer to get there far later, once we understand alignment and consciousness far better (while other, less important tech progress continues in the meantime).
My guess now of where we most disagree is regarding the value of a world where AIs disempower humanity and go on to have a vast, technologically super-advanced, rapidly expanding civilisation. I think this would quite likely be ~0 value, since we don’t really understand consciousness at all; my guess is that AIs aren’t yet conscious, and if we relatively quickly get to TAI in the current paradigm, they probably still won’t be moral patients.
Thanks. I disagree with this for the following reasons:
AIs will get more complex over time, even in our current paradigm. Eventually I expect AIs will have highly sophisticated cognition that I’d feel comfortable calling conscious, on our current path of development. (I’m an illusionist about phenomenal consciousness, so I don’t think there’s a fact of the matter anyway.)
If we slowed down AI, I don’t think that would necessarily translate into a higher likelihood that future AIs will be conscious. Why would it?
In the absence of a strong argument that slowing down AI makes future AIs more likely to be conscious, I still think the considerations I mentioned are stronger than the counter-considerations you’ve mentioned here, and I think they should push us towards trying to avoid entrenching norms that could hamper future growth and innovation.
One additional point is that, to the extent that EA AI advocacy acts to restrict, delay, and obstruct AI—even if this is not their intention—the effect may be to entrench an early form of cosmic NIMBYism in which AI population growth is impeded. If continued, this advocacy could have a lasting effect on whether our civilization grows to be as large as possible, by establishing norms against growth and innovation.
When viewed from this perspective, the consequence of much of EA AI advocacy may be to reduce the expected size of our vibrant, rich post-human future, rather than increase it. I currently think this consequence is fairly plausible too; as a result, I’m not convinced that obstructing AI is a good policy on total utilitarian views.
Thanks, interesting idea, I think I mostly disagree and would like to see AI progress specifically slowed/halted while continuing to have advances in space exploration, biology, nuclear power, etc and that if we later get safe TAI we won’t have become too anti-technology/anti-growth to expand a lot. But I hadn’t thought about this before and there probably is something to this, I just think it is most likely swamped by the risks from AI. It is a good reason to be careful in pause AI type pitches to be narrowly focused on frontier AI models rather than tech and science in general.
I suppose when I think about pro-expansion things I would like to see they are only really ones that do not (IMO) increase x-risks—better institutions, more pro-natalism, space exploration, maybe cognitive enhancement.
Of those technologies, AI seems to be the only one that could be transformative, in the sense of sustaining dramatic economic growth and bringing about a giant, vibrant cosmic future. In other words, it seems you’re saying we should slow down the most promising technology—the only technology that could actually take us to the future you’re advocating for—but make sure not to slow down the less promising ones. The fact that people want to slow down (and halt!) precisely the technology that is most promising is basically the whole reason I’m worried here—I think my argument would be much less strong if I we were talking about slowing down something like nuclear power.
It’s important to be clear about what we mean when we talk about the risks from AI. Do you mean:
The risk that AI could disempower humanity in particular?
The risk that AI could derail a large, vibrant cosmic civilization?
I think AI does pose a large risk in the sense of (1), but (2) is more important from a total utilitarian perspective, and it doesn’t seem particularly likely to me that AIs pose a large risk in the sense of (2) (as the AIs themselves, after disempowering humanity, would presumably go on to create a big, vibrant civilization).
If you care about humanity as a species in particular, I understand the motive behind slowing down AI. On the other hand, if you’re a total utilitarian (or you’re concerned about the present generation of humans who might otherwise miss out on the benefits of AI), then I’m not convinced, as you seem to be, that the risks from AI outweigh the considerations that I mentioned.
Again, the frontier AI models, in my view, are precisely what is most promising from a pro-growth perspective. So if you are worried about EAs choking off economic growth and spurring cosmic NIMBYism by establishing norms against growth, it seems from my perspective that you should be most concerned about attempts to obstruct frontier AI research.
What’s the argument for why an AI future will create lots of value by total utilitarian lights?
At least for hedonistic total utilitarianism, I expect that a large majority of expected-hedonistic-value (from our current epistemic state) will be created by people who are at least partially sympathetic to hedonistic utilitarianism or other value systems that value a similar type of happiness in a scope-sensitive fashion. And I’d guess that humans are more likely to have such values than AI systems. (At least conditional on my thinking that such values are a good idea, on reflection.)
Objective-list theories of welfare seems even less likely to be endorsed by AIs. (Since they seem pretty niche to human values.)
There’s certainly some values you could have that would mainly be concerned that we got any old world with a large civilization. Or that would think it morally appropriate to be happy that someone got to use the universe for what they wanted, and morally inappropriate to be too opinionated about who that should be. But I don’t think that looks like utilitarianism.
We can similarly ask, “Why would an em future create lots of value by total utilitarian lights?” The answer I’d give is: it would happen for essentially the same reasons biological humans might do such a thing. For example, some biological humans are utilitarians. But some ems might be utilitarians too. Therefore, both could create lots of value by total utilitarian lights.
In order to claim that ems have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans, you’d need to posit a distinction between ems and biological humans that makes this possibility plausible. Some candidate distinctions, such as the idea that ems would not be conscious because they’re on a computer, seem implausible in any way that could imply the conclusion. So, at least as far as I can tell, I cannot identify any such distinction; and thus, ems seem similarly likely to create lots of value by total utilitarian lights, compared to biological humans.
The exact same analysis can likewise be carried over to the case for AIs. Some biological humans are utilitarians, but some AIs might be utilitarians too. Therefore, both could create lots of value by total utilitarian lights.
In order to claim that AIs have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans, you’d need to posit a distinction between AIs and biological humans that makes this possibility plausible. A number of candidate distinctions have been given to me in the past. These include:
The idea that AIs will not be conscious
The idea that AIs will care less about optimizing for extreme states of moral value
The idea that AIs will care more about optimizing imperfectly specified utility functions, which won’t produce much utilitarian moral value
In each case I generally find that the candidate distinction is either poorly supported, or it does not provide strong support for the conclusion. So, just as with ems, I find the idea that AIs will have a significantly lower chance of creating lots of value by total utilitarian lights than biological humans to be weak. I do not claim that there is definitely no such distinction that would convince me of this premise. But I have yet to hear one that has compelled me so far.
Here’s one line of argument:
Positive argument in favor of humans: It seems pretty likely that whatever I’d value on-reflection will be represented in a human future, since I’m a human. (And accordingly, I’m similar to many other humans along many dimensions.)
If AI values were sampled ~randomly (whatever that means), I think that the above argument would be basically enough to carry the day in favor of humans.
But here’s a salient positive argument for why AIs’ values will be similar to mine: People will be training AIs to be nice and helpful, which will surely push them towards better values.
However, I also expect people to be training AIs for obedience and, in particular, training them not to disempower humanity. So if we condition on a future where AIs disempower humanity, we evidently didn’t have that much control over their values. This significantly weakens the argument “they’ll be nice because we’ll train them to be nice”.
In addition: human disempowerment is more likely to succeed if AIs are willing to egregiously violate norms, such as by lying, stealing, and killing. So conditioning on human disempowerment also updates me somewhat towards egregiously norm-violating AI. That makes me feel less good about their values.
Another argument is that, in the near term, we’ll train AIs to act nicely on short-horizon tasks, but we won’t particularly train them to deliberate and reflect on their values well. So even if “AIs’ best-guess stated values” are similar to “my best-guess stated values”, there’s less reason to believe that “AIs’ on-reflection values” are similar to “my on-reflection values”. (Whereas the basic argument of my being similar to other humans still works ok: “my on-reflection values” vs. “other humans’ on-reflection values”.)
Edit: Oops, I accidentally switched to talking about “my on-reflection values” rather than “total utilitarian values”. The former is ultimately what I care more about, though, so it is what I’m more interested in. But sorry for the switch.
The ones I would say are something like (approximately in priority order):
AIs’ values could result mostly from playing the training game, or from other relatively specific optimizations they performed in training, which might result in extremely bizarre values from our perspective.
More generally AI values might be highly alien in a way where caring about experience seems very strange to them.
AIs by default will be optimized for very specific commercial purposes with narrow specializations and a variety of hyperspecific heuristics, and the resulting values and generalizations of these will be problematic.
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
As a sub argument, I might care specifically about things which are much more specific than “lots of good diverse experience”. And, divergences from what I care about (even conditioning on something roughly utilitarian) might result in massive discounts from my perspective.
I care less about my values and preferences in worlds where they seem relatively contingent, e.g. they aren’t broadly shared on reflection by reasonable fractions of humanity.
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well while evolution had to make values more discoverable and adaptable.
E.g. various things about empathy.
AIs might have extremely low levels of cognitive diversity in their training environments as far as co-workers go which might result in very different attitudes as far as caring about diverse experience.
Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.
Also, I should note that this isn’t a very strong list, though in aggregate it’s sufficient to make me think that human control is perhaps 4x better than AI control. (For reference, I’d say that me personally being in control is maybe 3x better than human control.) I disagree with a MIRI-style view about the disvalue of AI and the extent of the fragility of value that seems implicit.
Don’t humans also play the training game when being instructed to be nice/good/moral? (Humans don’t do it all the time, and maybe some humans don’t do it at all; but then again, I don’t think every AI would play the training game all the time either.)
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
What are these a priori reasons and why don’t they similarly apply to AI?
I haven’t thought about this one much, but it seems like an interesting consideration.
This consideration feels quite weak to me, although you also listed it last, so I guess you might agree with my assessment.
I can’t quickly elaborate in a clear way, but some messy combination of:
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
AIs will be primarily trained on things which look extremely different from “cooperatively achieving high genetic fitness”.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply to the internal high-level reasoning of AIs. This doesn’t seem true for humans.
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
(Again, note that I said in my comment above: “Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.” I edited this in to my prior comment, so you might have missed it, sorry.)
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative. Analogy: if you rolled a die, and I didn’t see the result, the expected value is not low just because I am uncertain about what happened. What matters here is the expected value, not necessarily the variance.
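The die analogy can be made concrete with a quick sketch (illustrative only; the specific numbers are hypothetical): uncertainty about an outcome raises the variance of our estimate, but it doesn’t by itself lower the expected value.

```python
import random

random.seed(0)

# A fair six-sided die: we're uncertain about any single roll,
# but the expected value is fixed at 3.5 regardless of that uncertainty.
rolls = [random.randint(1, 6) for _ in range(100_000)]
mean = sum(rolls) / len(rolls)
print(round(mean, 2))  # close to 3.5

# Two hypothetical "futures" with identical expected value but very
# different variance: more uncertainty alone doesn't change the mean.
low_var = [3.5, 3.5, 3.5, 3.5]
high_var = [0, 1, 6, 7]
print(sum(low_var) / 4, sum(high_var) / 4)  # both 3.5
```

The point of the sketch is just that "humans are different from AIs" widens the spread of outcomes without telling us which side the mean falls on.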
I guess I am having trouble understanding this point.
Sure, but the question is why being different makes it worse along the relevant axes that we were discussing. The question is not just “will AIs be different than humans?” to which the answer would be “Obviously, yes”. We’re talking about why the differences between humans and AIs make AIs better or worse in expectation, not merely different.
I am having a hard time parsing this claim. What do you mean by “final applications”? And why won’t this be true for future AGIs that are at human-level intelligence or above? And why does this make a difference to the ultimate claim that you’re trying to support?
This consideration seems very weak to me. Early AGIs will presumably be directly optimized for something like consumer value, which looks a lot closer to “utilitarianism” to me than the implicit values in gene-centered evolution. When I talk to GPT-4, I find that it’s way more altruistic and interested in making others happy than most humans are. That seems at least a little like utilitarianism to me—certainly more than your description of what human evolution was optimizing for. But maybe I’m just not understanding the picture you’re painting, or maybe my model of AI is wrong.
Actually, I was just trying to say “I can see what humans are like, and it seems pretty good relative to my current guesses about AIs, in ways that don’t just update me up about AIs”. Sorry about the confusion.
Why is this different between AIs and humans? Do you expect AIs to care less about experience than humans, maybe because humans get reward during lifetime learning but AIs don’t get reward during in-context learning?
Can you say more about how slack (or genetic bottleneck) would affect whether AIs have values that are good by human lights?
They might well be trained to cooperate with other copies on tasks, if this is the way they’ll be deployed in practice?
In what sense do AIs use their reasoning power in this way? How does that affect whether they will have values that humans like?
I am a human. Other humans might end up in a similar spot on reflection.
(Also I care less about values of mine which are highly contingent wrt humans.)
“Human” is just one category you belong to. You’re also a member of the category “intelligent beings”, which you will share with AGIs. Another category you share with near-future AGIs is “beings who were trained on 21st century cultural data”. I guess 12th century humans aren’t in that category, so maybe we don’t share their values?
Perhaps the category that matters is your nationality. Or maybe it’s “beings in the Milky Way”, and you wouldn’t trust people from Andromeda? (To be clear, this is rhetorical, not an actual suggestion)
My point here is that I think your argument could benefit from some rigor by specifying exactly what about being human makes someone share your values in the sense you are describing. As it stands, this reasoning seems quite shallow to me.
Currently, humans seem much closer to me on a values level than GPT-4 base does. I think this is also likely to be true of future AIs, though I understand why you might not find this convincing.
I think the architecture (learning algorithm, etc.) and training environment between me and other humans seems vastly more similar than between me and likely AIs.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
Getting a bit meta, I’m curious (if you’d like to answer) whether you feel that you won’t explain your views rigorously in a convincing way here mainly because (1) you are uncertain about these specific views, (2) you think your views are inherently difficult or costly to explain despite nonetheless being very compelling, (3) you think I can’t understand your views easily because I’m lacking some bedrock intuitions that are too costly to convey, or (4) some other option.
My views are reasonably messy, complicated, hard to articulate, and based on a relatively diffuse set of intuitions. I also think we reason about the situation in a pretty different way than you seem to (so, partly (3)). I think it wouldn’t be impossible to try to write up a post on my views, but I would need to consolidate them and think about how exactly to express where I’m at. (Maybe 2-5 person days of work.) I haven’t really consolidated my views or reached anything close to reflective equilibrium.
I also just think that arguing about pure philosophy very rarely gets anywhere and is very hard to make convincing in general.
I’m somewhat uncertain on the “inside view/mechanistic” level. (But my all-considered view is partially deferring to some people, which makes me overall less worried that I should immediately reconsider my life choices.)
I think my views are compelling, but I’m not sure if I’d say “very compelling”
My guess now of where we most disagree is regarding the value of a world where AIs disempower humanity and go on to have a vast, technologically super-advanced, rapidly expanding civilisation. I think this would quite likely be ~0 value, since we don’t really understand consciousness at all; my guess is that AIs aren’t yet conscious, and that if we relatively quickly get to TAI in the current paradigm they probably still won’t be moral patients. As a sentientist I don’t really care whether there is a huge future if humans (or something sufficiently related to humans, e.g. we carefully study consciousness for a millennium and create digital people we are very confident have morally important experiences to be our successors) aren’t in it.
So yes I agree frontier AI models are where the most transformative potential lies, but I would prefer to get there far later once we understand alignment and consciousness far better (while other less important tech progress continues in the meantime).
Thanks. I disagree with this for the following reasons:
AIs will get more complex over time, even in our current paradigm. Eventually I expect AIs will have highly sophisticated cognition that I’d feel comfortable calling conscious, on our current path of development (I’m an illusionist about phenomenal consciousness so I don’t think there’s a fact of the matter anyway).
If we slowed down AI, I don’t think that would necessarily translate into a higher likelihood that future AIs will be conscious. Why would it?
In the absence of a strong argument that slowing down AI makes future AIs more likely to be conscious, I still think the considerations I mentioned are stronger than the counter-considerations you’ve mentioned here, and I think they should push us towards trying to avoid entrenching norms that could hamper future growth and innovation.