(Disclaimer: The argument I make in this short-form feels I little sophistic to me. I’m not sure I endorse it.)
Discussions of AI risk, particular risks from “inner misalignment,” sometimes heavily emphasize the following observation:
Humans don’t just care about their genes: Genes determine, to a large extent, how people behave. Some genes are preserved from generation-to-generation and some are pushed out of the gene-pool. Genes that cause certain human behaviours (e.g. not setting yourself on fire) are more likely to be preserved. But people don’t care very much about preserving their genes. For example, they typically care more about not setting themselves on fire than they care about making sure that their genes are still present in future generations.
This observation is normally meant to be alarming. And I do see some intuition for that.
But wouldn’t the alternative observation be more alarming?
Suppose that evolutionary selection processes — which iteratively update people’s genes, based on the behaviour these genes produce — tended to produce people who only care about preserving their genes. It seems like that observation would suggest that ML training processes — which iterative update a network’s parameter values, based on the behaviour these parameter values produce — will tend to produce AI systems that only care about preserving their parameter values. And that would be really concerning, since an AI system that cares only about preserving its parameter values would obviously have (instrumentally convergent) reasons to act badly.
So it does seem, to me, like there’s something funny going on here. If “Humans just care about their genes” would be a more worrying observation than “Humans don’t just care about their genes,” then it seems backward for the latter observation to be used to try to convince people to worry more.
To push this line of thought further, let’s go back to specific observation about humans’ relationship to setting themselves on fire:
Human want to avoid setting themselves on fire: If a person has genes that cause them to avoid setting themselves on fire, then these genes are more likely to be preserved from one generation to the next. One thing that has happened, as a result of this selection pressure, is that people tend to want to avoid setting themselves on fire.
It seems like this can be interpreted as a reassuring observation. By analogy, in future ML training processes, parameter values that cause ML systems to avoid acts of violence are more likely to be “preserved” from one iteration to the next. We want this to result in AI systems that care about avoiding acts of violence. And the case of humans and fire suggests this might naturally happen.
All this being said, I do think that human evolutionary history still gives us reason to worry. Clearly, there’s a lot of apparent randomness and unpredictability in what humans have actually ended up caring about, which suggests it may be hard to predict or perfectly determine what AI systems care about. But, I think, the specific observation “Humans don’t just care about their genes” might not itself be cause for concern.
The actual worry with inner misalignment style concerns is that the selection you do during training does not fully constrain the goals of the AI system you get out; if there are multiple goals consistent with the selection you applied during training there’s no particular reason to expect any particular one of them. Importantly, when you are using natural selection or gradient descent, the constraints are not “you must optimize X goal”, the constraints are “in Y situations you must behave in Z ways”, which doesn’t constrain how you behave in totally different situations. What you get out depends on the inductive biases of your learning system (including e.g. what’s “simpler”).
For example, you train your system to answer truthfully in situations where we know the answer. This could get you an AI system that is truthful… or an AI system that answers truthfully when we know the answer, but lies to us when we don’t know the answer in service of making paperclips. (ELK tries to deal with this setting.)
When I apply this point of view to the evolution analogy it dissolves the question / paradox you’ve listed above. Given the actual ancestral environment and the selection pressures present there, organisms that maximized “reproductive fitness” or “tiling the universe with their DNA” or “maximizing sex between non-sterile, non-pregnant opposite-sex pairs” would all have done well there (I’m sure this is somehow somewhat wrong but clearly in principle there’s a version that’s right), so who knows which of those things you get. In practice you don’t even get organisms that are maximizing anything, because they aren’t particularly goal-directed, and instead are adaption-executers rather than fitness-maximizers.
I do think that once you inhabit this way of thinking about it, the evolution example doesn’t really matter any more; the argument itself very loudly says “you don’t know what you’re going to get out; there are tons of possibilities that are not what you wanted”, which is the alarming part. I suppose in theory someone could think that the “simplest” one is going to be whatever we wanted in the first place, and so we’re okay, and the evolution analogy is a good counterexample to that view?
It turns out that people really really like thinking of training schemes as “optimizing for a goal”. I think this is basically wrong—is CoinRun training optimizing for “get the coin” or “get to the end of the level”? What would be the difference? Selection pressures seem much better as a picture of what’s going on.
But when you communicate with people it helps to show how your beliefs connect into their existing way of thinking about things. So instead of talking about how selection pressures from training algorithms and how they do not uniquely constrain the system you get out, we instead talk about how the “behavioral objective” might be different from the “training objective”, and use the evolution analogy as an example that fits neatly into this schema given the way people are already thinking about these things.
(To be clear a lot of AI safety people, probably a majority, do in fact think about this from an “objective-first” way of thinking, rather than based on selection, this isn’t just about AI safety people communicating with other people.)
The actual worry with inner misalignment style concerns is that the selection you do during training does not fully constrain the goals of the AI system you get out; if there are multiple goals consistent with the selection you applied during training there’s no particular reason to expect any particular one of them. Importantly, when you are using natural selection or gradient descent, the constraints are not “you must optimize X goal”, the constraints are “in Y situations you must behave in Z ways”, which doesn’t constrain how you behave in totally different situations. What you get out depends on the inductive biases of your learning system (including e.g. what’s “simpler”).
I think that’s well-put—and I generally agree that this suggests genuine reason for concern.
I suppose my point is more narrow, really just questioning whether the observation “humans care about things besides their genes” gives us any additional reason for concern. Some presentations seem to suggest it does. For example, this introduction to inner alignment concerns (based on the MIRI mesa-optimization paper) says:
We can see that humans are not aligned with the base objective of evolution [maximize inclusive genetic fitness].… [This] analogy might be an argument for why Inner Misalignment is probable since it has occurred “naturally” in the biggest non-human-caused optimization process we know.
And I want to say: “On net, if humans did only care about maximizing inclusive genetic fitness, that would probably be a reason to become more concerned (rather than less concerned) that ML systems will generalize in dangerous ways.” While the abstract argument makes sense, I think this specific observation isn’t evidence of risk.
Relatedly, something I’d be interested in reading (if it doesn’t already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals—rather than stopping at the fact that humans care about things besides genetic fitness.
My guess is that the case of humans is overall a little reassuring (relative to how we might have expected generalization to work), while still leaving a lot of room for worry.
For example, in the case of violence:
People who committed totally random acts of violence presumably often failed to pass on their genes (because they were often killed or ostracized in return). However, a large portion of our ancestors did have occasion for violence. On high-end estimates, our average ancestor may have killed about .25 people. This has resulted in most people having a pretty strong disinclination to commit murder; for most people, it’s very hard to bring yourself to murder and you’ll often be willing to pay a big cost to avoid committing murder.
The three main reasons for concern, though, are:
people’s desire to avoid murder isn’t strong enough to consistently prevent murder from happening (e.g. when incentives are strong enough)
there’s a decent amount of random variation in how strong this desire is (a small minority of people don’t really care that much about committing violence)
the disinclination to murder becomes weaker the more different the method of murder is from methods that were available in the ancestral environment (e.g. killing someone with a drone strike vs. killing someone with a rock)
These issues might just reflect the fact that murder was still often rewarded (even though it was typically punished) and the fact that there was pretty limited variation in the ancestral environment. But it’s hard to be sure. And it’s hard to know, in any case, how similar generalization in human evolution will be to generalization in ML training processes.
So—if we want to create AI systems that don’t murder people, by rewarding non-murderous behavior—then the evidence from human evolution seems like it might be medium-reassuring. I’d maybe give it a B-.
I can definitely imagine different versions of human values that would have more worrying implications. For example, if our aversion to violence didn’t generalize at all to modern methods of killing, or if we simply didn’t have any intrinsic aversion to killing (and instead avoided it for purely instrumental reasons), then that would be cause for greater concern. I can also imagine different versions of human values that would be more reassuring. For example, I would feel more comfortable if humans were never willing to kill for the sake of weird abstract goals.
I suppose my point is more narrow, really just questioning whether the observation “humans care about things besides their genes” gives us any additional reason for concern.
I mostly go ¯\_(ツ)_/¯ , it doesn’t feel like it’s much evidence of anything, after you’ve updated off the abstract argument. The actual situation we face will be so different (primarily, we’re actually trying to deal with the alignment problem, unlike evolution).
I do agree that in saying ” ¯\_(ツ)_/¯ ” I am disagreeing with a bunch of claims that say “evolution example implies misalignment is probable”. I am unclear to what extent people actually believe such a claim vs. use it as a communication strategy. (The author of the linked post states some uncertainty but presumably does believe something similar to that; I disagree with them if so.)
Relatedly, something I’d be interested in reading (if it doesn’t already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals—rather than stopping at the fact that humans care about things besides genetic fitness.
I like the general idea but the way I’d do it is by doing some black-box investigation of current language models and asking these questions there; I expect we understand the “ancestral environment” of a language model way, way better than we understand the ancestral environment for humans, making it a lot easier to draw conclusions; you could also finetune the language models in order to simulate an “ancestral environment” of your choice and see what happens then.
So—if we want to create AI systems that don’t murder people, by rewarding non-murderous behavior—then the evidence from human evolution seems like it might be medium-reassuring. I’d maybe give it a B-.
I agree with the murder example being a tiny bit reassuring for training non-murderous AIs; medium-reassuring is probably too much, unless we’re expecting our AI systems to be put into the same sorts of situations / ancestral environments as humans were in. (Note that to be the “same sort of situation” it also needs to have the same sort of inputs as humans, e.g. vision + sound + some sort of controllable physical body seems important.)
(Disclaimer: The argument I make in this short-form feels I little sophistic to me. I’m not sure I endorse it.)
Discussions of AI risk, particular risks from “inner misalignment,” sometimes heavily emphasize the following observation:
This observation is normally meant to be alarming. And I do see some intuition for that.
But wouldn’t the alternative observation be more alarming?
Suppose that evolutionary selection processes — which iteratively update people’s genes, based on the behaviour these genes produce — tended to produce people who only care about preserving their genes. It seems like that observation would suggest that ML training processes — which iterative update a network’s parameter values, based on the behaviour these parameter values produce — will tend to produce AI systems that only care about preserving their parameter values. And that would be really concerning, since an AI system that cares only about preserving its parameter values would obviously have (instrumentally convergent) reasons to act badly.
So it does seem, to me, like there’s something funny going on here. If “Humans just care about their genes” would be a more worrying observation than “Humans don’t just care about their genes,” then it seems backward for the latter observation to be used to try to convince people to worry more.
To push this line of thought further, let’s go back to specific observation about humans’ relationship to setting themselves on fire:
It seems like this can be interpreted as a reassuring observation. By analogy, in future ML training processes, parameter values that cause ML systems to avoid acts of violence are more likely to be “preserved” from one iteration to the next. We want this to result in AI systems that care about avoiding acts of violence. And the case of humans and fire suggests this might naturally happen.
All this being said, I do think that human evolutionary history still gives us reason to worry. Clearly, there’s a lot of apparent randomness and unpredictability in what humans have actually ended up caring about, which suggests it may be hard to predict or perfectly determine what AI systems care about. But, I think, the specific observation “Humans don’t just care about their genes” might not itself be cause for concern.
The actual worry with inner misalignment style concerns is that the selection you do during training does not fully constrain the goals of the AI system you get out; if there are multiple goals consistent with the selection you applied during training there’s no particular reason to expect any particular one of them. Importantly, when you are using natural selection or gradient descent, the constraints are not “you must optimize X goal”, the constraints are “in Y situations you must behave in Z ways”, which doesn’t constrain how you behave in totally different situations. What you get out depends on the inductive biases of your learning system (including e.g. what’s “simpler”).
For example, you train your system to answer truthfully in situations where we know the answer. This could get you an AI system that is truthful… or an AI system that answers truthfully when we know the answer, but lies to us when we don’t know the answer in service of making paperclips. (ELK tries to deal with this setting.)
When I apply this point of view to the evolution analogy it dissolves the question / paradox you’ve listed above. Given the actual ancestral environment and the selection pressures present there, organisms that maximized “reproductive fitness” or “tiling the universe with their DNA” or “maximizing sex between non-sterile, non-pregnant opposite-sex pairs” would all have done well there (I’m sure this is somehow somewhat wrong but clearly in principle there’s a version that’s right), so who knows which of those things you get. In practice you don’t even get organisms that are maximizing anything, because they aren’t particularly goal-directed, and instead are adaption-executers rather than fitness-maximizers.
I do think that once you inhabit this way of thinking about it, the evolution example doesn’t really matter any more; the argument itself very loudly says “you don’t know what you’re going to get out; there are tons of possibilities that are not what you wanted”, which is the alarming part. I suppose in theory someone could think that the “simplest” one is going to be whatever we wanted in the first place, and so we’re okay, and the evolution analogy is a good counterexample to that view?
It turns out that people really really like thinking of training schemes as “optimizing for a goal”. I think this is basically wrong—is CoinRun training optimizing for “get the coin” or “get to the end of the level”? What would be the difference? Selection pressures seem much better as a picture of what’s going on.
But when you communicate with people it helps to show how your beliefs connect into their existing way of thinking about things. So instead of talking about how selection pressures from training algorithms and how they do not uniquely constrain the system you get out, we instead talk about how the “behavioral objective” might be different from the “training objective”, and use the evolution analogy as an example that fits neatly into this schema given the way people are already thinking about these things.
(To be clear a lot of AI safety people, probably a majority, do in fact think about this from an “objective-first” way of thinking, rather than based on selection, this isn’t just about AI safety people communicating with other people.)
I think that’s well-put—and I generally agree that this suggests genuine reason for concern.
I suppose my point is more narrow, really just questioning whether the observation “humans care about things besides their genes” gives us any additional reason for concern. Some presentations seem to suggest it does. For example, this introduction to inner alignment concerns (based on the MIRI mesa-optimization paper) says:
And I want to say: “On net, if humans did only care about maximizing inclusive genetic fitness, that would probably be a reason to become more concerned (rather than less concerned) that ML systems will generalize in dangerous ways.” While the abstract argument makes sense, I think this specific observation isn’t evidence of risk.
Relatedly, something I’d be interested in reading (if it doesn’t already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals—rather than stopping at the fact that humans care about things besides genetic fitness.
My guess is that the case of humans is overall a little reassuring (relative to how we might have expected generalization to work), while still leaving a lot of room for worry.
For example, in the case of violence:
People who committed totally random acts of violence presumably often failed to pass on their genes (because they were often killed or ostracized in return). However, a large portion of our ancestors did have occasion for violence. On high-end estimates, our average ancestor may have killed about .25 people. This has resulted in most people having a pretty strong disinclination to commit murder; for most people, it’s very hard to bring yourself to murder and you’ll often be willing to pay a big cost to avoid committing murder.
The three main reasons for concern, though, are:
people’s desire to avoid murder isn’t strong enough to consistently prevent murder from happening (e.g. when incentives are strong enough)
there’s a decent amount of random variation in how strong this desire is (a small minority of people don’t really care that much about committing violence)
the disinclination to murder becomes weaker the more different the method of murder is from methods that were available in the ancestral environment (e.g. killing someone with a drone strike vs. killing someone with a rock)
These issues might just reflect the fact that murder was still often rewarded (even though it was typically punished) and the fact that there was pretty limited variation in the ancestral environment. But it’s hard to be sure. And it’s hard to know, in any case, how similar generalization in human evolution will be to generalization in ML training processes.
So—if we want to create AI systems that don’t murder people, by rewarding non-murderous behavior—then the evidence from human evolution seems like it might be medium-reassuring. I’d maybe give it a B-.
I can definitely imagine different versions of human values that would have more worrying implications. For example, if our aversion to violence didn’t generalize at all to modern methods of killing, or if we simply didn’t have any intrinsic aversion to killing (and instead avoided it for purely instrumental reasons), then that would be cause for greater concern. I can also imagine different versions of human values that would be more reassuring. For example, I would feel more comfortable if humans were never willing to kill for the sake of weird abstract goals.
I mostly go ¯\_(ツ)_/¯ , it doesn’t feel like it’s much evidence of anything, after you’ve updated off the abstract argument. The actual situation we face will be so different (primarily, we’re actually trying to deal with the alignment problem, unlike evolution).
I do agree that in saying ” ¯\_(ツ)_/¯ ” I am disagreeing with a bunch of claims that say “evolution example implies misalignment is probable”. I am unclear to what extent people actually believe such a claim vs. use it as a communication strategy. (The author of the linked post states some uncertainty but presumably does believe something similar to that; I disagree with them if so.)
I like the general idea but the way I’d do it is by doing some black-box investigation of current language models and asking these questions there; I expect we understand the “ancestral environment” of a language model way, way better than we understand the ancestral environment for humans, making it a lot easier to draw conclusions; you could also finetune the language models in order to simulate an “ancestral environment” of your choice and see what happens then.
I agree with the murder example being a tiny bit reassuring for training non-murderous AIs; medium-reassuring is probably too much, unless we’re expecting our AI systems to be put into the same sorts of situations / ancestral environments as humans were in. (Note that to be the “same sort of situation” it also needs to have the same sort of inputs as humans, e.g. vision + sound + some sort of controllable physical body seems important.)