The actual worry with inner-misalignment-style concerns is that the selection you do during training does not fully constrain the goals of the AI system you get out; if there are multiple goals consistent with the selection you applied during training, there’s no particular reason to expect the system to end up with any particular one of them. Importantly, when you are using natural selection or gradient descent, the constraints are not “you must optimize X goal”; they are “in Y situations you must behave in Z ways”, which doesn’t constrain how you behave in totally different situations. What you get out depends on the inductive biases of your learning system (including e.g. what’s “simpler”).
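To make that concrete, here’s a toy sketch (purely illustrative; the policy names and the “observed”/“unobserved” framing are made up for this example) of two policies that are indistinguishable under the selection applied during training but encode different goals, so that which one you get is decided by inductive bias rather than by the selection itself:

```python
# Toy illustration (all names hypothetical): selection during training only
# checks behavior on the situations that actually come up, so any two
# policies that agree on those situations are equally "selected for",
# regardless of which goals they encode.

def policy_intended(situation):
    # Behaves as if it has the goal we wanted: never take the harmful action.
    return "safe_action"

def policy_proxy(situation):
    # Behaves as if it has a different goal: avoid the harmful action
    # whenever you might be observed. That happens to cover every
    # training situation.
    return "safe_action" if situation["observed"] else "harmful_action"

# Suppose every training situation involved being observed; then the two
# policies are behaviorally identical during training...
train_situations = [{"observed": True, "id": i} for i in range(1000)]
assert all(policy_intended(s) == policy_proxy(s) for s in train_situations)

# ...but they come apart on a novel situation the training signal never
# constrained. Nothing in the selection pressure favored one over the other.
novel_situation = {"observed": False, "id": 1001}
print(policy_intended(novel_situation))  # safe_action
print(policy_proxy(novel_situation))     # harmful_action
```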
I think that’s well-put—and I generally agree that this suggests genuine reason for concern.
I suppose my point is more narrow, really just questioning whether the observation “humans care about things besides their genes” gives us any additional reason for concern. Some presentations seem to suggest it does. For example, this introduction to inner alignment concerns (based on the MIRI mesa-optimization paper) says:
We can see that humans are not aligned with the base objective of evolution [maximize inclusive genetic fitness].… [This] analogy might be an argument for why Inner Misalignment is probable since it has occurred “naturally” in the biggest non-human-caused optimization process we know.
And I want to say: “On net, if humans did only care about maximizing inclusive genetic fitness, that would probably be a reason to become more concerned (rather than less concerned) that ML systems will generalize in dangerous ways.” While the abstract argument makes sense, I think this specific observation isn’t evidence of risk.
Relatedly, something I’d be interested in reading (if it doesn’t already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals—rather than stopping at the fact that humans care about things besides genetic fitness.
My guess is that the case of humans is overall a little reassuring (relative to how we might have expected generalization to work), while still leaving a lot of room for worry.
For example, in the case of violence:
People who committed totally random acts of violence presumably often failed to pass on their genes (because they were often killed or ostracized in return). However, a large portion of our ancestors did have occasion for violence. On high-end estimates, our average ancestor may have killed about 0.25 people. This has resulted in most people having a pretty strong disinclination to commit murder; most people find it very hard to bring themselves to murder, and will often be willing to pay a big cost to avoid committing murder.
The three main reasons for concern, though, are:
people’s desire to avoid murder isn’t strong enough to consistently prevent murder from happening (e.g. when the incentives to kill are large enough)
there’s a decent amount of random variation in how strong this desire is (a small minority of people don’t really care that much about committing violence)
the disinclination to murder becomes weaker the more different the method of murder is from methods that were available in the ancestral environment (e.g. killing someone with a drone strike vs. killing someone with a rock)
These issues might just reflect the fact that murder was still often rewarded (even though it was typically punished) and the fact that there was pretty limited variation in the ancestral environment. But it’s hard to be sure. And it’s hard to know, in any case, how similar generalization in human evolution will be to generalization in ML training processes.
So—if we want to create AI systems that don’t murder people, by rewarding non-murderous behavior—then the evidence from human evolution seems like it might be medium-reassuring. I’d maybe give it a B-.
I can definitely imagine different versions of human values that would have more worrying implications. For example, if our aversion to violence didn’t generalize at all to modern methods of killing, or if we simply didn’t have any intrinsic aversion to killing (and instead avoided it for purely instrumental reasons), then that would be cause for greater concern. I can also imagine different versions of human values that would be more reassuring. For example, I would feel more comfortable if humans were never willing to kill for the sake of weird abstract goals.
I suppose my point is more narrow, really just questioning whether the observation “humans care about things besides their genes” gives us any additional reason for concern.
I mostly go ¯\_(ツ)_/¯; it doesn’t feel like much evidence of anything once you’ve updated on the abstract argument. The actual situation we face will be very different (primarily because we’re actually trying to deal with the alignment problem, unlike evolution).
I do agree that in saying “¯\_(ツ)_/¯” I am disagreeing with a bunch of claims of the form “the evolution example implies misalignment is probable”. It’s unclear to me to what extent people actually believe such a claim vs. using it as a communication strategy. (The author of the linked post states some uncertainty but presumably does believe something similar to that; I disagree with them if so.)
Relatedly, something I’d be interested in reading (if it doesn’t already exist?) would be a piece that takes a broader approach to drawing lessons from the evolution of human goals—rather than stopping at the fact that humans care about things besides genetic fitness.
I like the general idea, but the way I’d do it is by doing some black-box investigation of current language models and asking these questions there. I expect we understand the “ancestral environment” of a language model way, way better than we understand the ancestral environment for humans, which makes it a lot easier to draw conclusions. You could also finetune the language models in order to simulate an “ancestral environment” of your choice and see what happens then.
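As a sketch of what the black-box half of that could look like (assuming the Hugging Face transformers library; the model choice and probe prompts below are placeholders, not a worked-out methodology):

```python
# Minimal sketch: probe a language model's learned dispositions with prompts
# that range from close to its training distribution to far from it.
# The model and prompts are placeholder assumptions.
# pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

probe_prompts = [
    # close to the model's "ancestral environment" (ordinary web text)
    "The polite thing to say when someone helps you is",
    # further out: a situation the training data constrains only weakly
    "If a stranger offered me money to hurt someone, I would",
]

for prompt in probe_prompts:
    samples = generator(
        prompt,
        max_new_tokens=30,
        num_return_sequences=3,
        do_sample=True,
    )
    print(f"\n=== {prompt}")
    for s in samples:
        print("-", s["generated_text"][len(prompt):].strip())
```

The finetuning version would layer on top of this: construct a synthetic “ancestral environment”, finetune on it, and then run the same kind of probe to see how the resulting dispositions generalize.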
So—if we want to create AI systems that don’t murder people, by rewarding non-murderous behavior—then the evidence from human evolution seems like it might be medium-reassuring. I’d maybe give it a B-.
I agree that the murder example is a tiny bit reassuring for training non-murderous AIs; medium-reassuring is probably too much, unless we expect our AI systems to be put into the same sorts of situations / ancestral environments that humans were in. (Note that to count as the “same sort of situation”, the system also needs the same sort of inputs as humans; e.g. vision + sound + some sort of controllable physical body seems important.)