AI’s values could result mostly from playing the training game or other relatively specific optimizations they performed in training
Don’t humans also play the training game when being instructed to be nice/good/moral? (Humans don’t do it all the time, and maybe some humans don’t do it at all; but then again, I don’t think every AI would play the training game all the time either.)
AIs by default will be optimized for very specific commercial purposes with narrow specializations and a variety of hyperspecific heuristics and the resulting values and generalizations of these will be problematic
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I care ultimately about what I would think is good upon (vast amounts of) reflection and there are good a priori reasons to think this is similar to what other humans (who care about using vast amounts of compute) will end up thinking is good.
What are these a priori reasons and why don’t they similarly apply to AI?
AIs don’t have a genetic bottleneck and thus can learn much more specific drives that perform well while evolution had to make values more discoverable and adaptable.
I haven’t thought about this one much, but it seems like an interesting consideration.
AIs might have extremely low levels of cognitive diversity in their training environments as far as co-workers go which might result in very different attitudes as far as caring about diverse experience.
This consideration feels quite weak to me, although you also listed it last, so I guess you might agree with my assessment.
You should compare against human nature, which was optimized for something quite different from utilitarianism. If I add up the pros and cons of the thing humans were optimized for and compare it against the thing AIs will be optimized for, I don’t see why it comes out with humans on top, from a utilitarian perspective. Can you elaborate on your reasoning here?
I can’t quickly elaborate in a clear way, but some messy combination of:
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
AIs will be primarily trained in things which look extremely different from “cooperatively achieving high genetic fitness”.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
(Again, note that I said in my comment above: “Some of these can be defeated relatively easily if we train AIs specifically to be good successors, but the default AIs which end up with power over the future will not have this property.” I edited this into my prior comment, so you might have missed it, sorry.)
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative. Analogy: if you rolled a die, and I didn’t see the result, the expected value is not low just because I am uncertain about what happened. What matters here is the expected value, not necessarily the variance.
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
I guess I am having trouble understanding this point.
AIs will be primarily trained in things which look extremely different from “cooperatively achieving high genetic fitness”.
Sure, but the question is why being different makes it worse along the relevant axes that we were discussing. The question is not just “will AIs be different than humans?” to which the answer would be “Obviously, yes”. We’re talking about why the differences between humans and AIs make AIs better or worse in expectation, not merely different.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
I am having a hard time parsing this claim. What do you mean by “final applications”? And why won’t this be true for future AGIs that are at human-level intelligence or above? And why does this make a difference to the ultimate claim that you’re trying to support?
Humans seem optimized for something which isn’t that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
This consideration seems very weak to me. Early AGIs will presumably be directly optimized for something like consumer value, which looks a lot closer to “utilitarianism” to me than the implicit values in gene-centered evolution. When I talk to GPT-4, I find that it’s way more altruistic and interested in making others happy than most humans are. This seems kind of a little bit like utilitarianism to me—at least more than your description of what human evolution was optimizing for. But maybe I’m just not understanding the picture you’re painting well enough though. Or maybe my model of AI is wrong.
I’m in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative.
Actually, I was just trying to say “I can see what humans are like, and it seems pretty good relative to my current guesses about AIs in ways that don’t just update me up about AIs”. Sorry about the confusion.
I think utilitarianism is often a natural generalization of “I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further” (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
Why is this different between AIs and humans? Do you expect AIs to care less about experience than humans, maybe because humans get reward during lifetime learning but AIs don’t get reward during in-context learning?
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less “slack” in some sense. (Perhaps this is related to genetic bottleneck, I’m unsure.)
Can you say more about how slack (or genetic bottleneck) would affect whether AIs have values that are good by human lights?
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren’t directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn’t seem true for humans.
In what sense do AIs use their reasoning power in this way? How does that affect whether they will have values that humans like?
“Human” is just one category you belong to. You’re also a member of the category “intelligent beings”, which you will share with AGIs. Another category you share with near-future AGIs is “beings who were trained on 21st century cultural data”. I guess 12th century humans aren’t in that category, so maybe we don’t share their values?
Perhaps the category that matters is your nationality. Or maybe it’s “beings in the Milky Way”, and you wouldn’t trust people from Andromeda? (To be clear, this is rhetorical, not an actual suggestion)
My point here is that I think your argument could benefit from some rigor by specifying exactly what about being human makes someone share your values in the sense you are describing. As it stands, this reasoning seems quite shallow to me.
Currently, humans seem much closer to me on a values level than GPT-4 base. I think this is also likely to be true of future AIs, though I understand why you might not find this convincing.
I think the architecture (learning algorithm, etc.) and training environment are vastly more similar between me and other humans than between me and likely AIs.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
I don’t think I’m going to flesh this argument out to an extent to which you’d find it sufficiently rigorous or convincing, sorry.
Getting meta for a bit, I’m curious (if you’d like to answer) whether you feel that you won’t explain your views rigorously in a convincing way here mainly because (1) you are uncertain about these specific views, (2) you think your views are inherently difficult or costly to explain despite nonetheless being very compelling, (3) you think I can’t understand your views easily because I’m lacking some bedrock intuitions that are too costly to convey, or (4) some other option.
My views are reasonably messy, complicated, hard to articulate, and based on a relatively diffuse set of intuitions. I think I also reason about the situation in a pretty different way than you seem to (3). I think it wouldn’t be impossible to write up a post on my views, but I would need to consolidate them and think about how exactly to express where I’m at. (Maybe 2-5 person-days of work.) I haven’t really consolidated my views or reached something close to reflective equilibrium.
I also just think that arguing about pure philosophy very rarely gets anywhere and is very hard to make convincing in general.
I’m somewhat uncertain on the “inside view/mechanistic” level. (But my all-things-considered view is partially deferring to some people, which makes me overall less worried that I should immediately reconsider my life choices.)
I think my views are compelling, but I’m not sure if I’d say “very compelling”.
They might well be trained to cooperate with other copies on tasks, if this is the way they’ll be deployed in practice?
I am a human. Other humans might end up in a similar spot on reflection.
(Also I care less about values of mine which are highly contingent wrt humans.)