Is the motivation for 3 mainly something like “predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms”, or is there a more concrete story about how this behaviour emerges from current AI paradigms?
Here is my story; I'm not sure if this is what you are referring to (it sounds like it probably is).
Any prediction algorithm faces many internal tradeoffs about, e.g., what to spend time thinking about and what to store in memory to reference in the future. An algorithm that makes those choices well across many different inputs will tend to do better, and in the limit I expect it to be possible to do better, and to do so more robustly, by making some of those choices in a consequentialist way (i.e., predicting the consequences of different possible options) rather than having all of them baked in by gradient descent or produced by simpler heuristics.
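To make that tradeoff slightly more tangible, here is a toy sketch (entirely hypothetical; the setup, names, and numbers are mine and not from the discussion above). A predictor with a fixed per-input compute budget either follows a baked-in rule about what to think about, or predicts the consequences of each option and chooses accordingly; the latter gets lower average loss, which is exactly the kind of advantage selection would act on.

```python
# Hypothetical toy sketch: a predictor must choose which of several "lines of thought"
# to spend its fixed compute budget on. The baked-in policy applies one fixed rule
# everywhere; the "consequentialist" policy predicts the consequence (loss reduction)
# of each option for the input at hand and picks the best one.

import random

random.seed(0)

def sample_input(n_options=4):
    # Each input offers several candidate computations; each would reduce prediction
    # loss by some input-dependent amount if the budget were spent on it.
    return [random.uniform(0.0, 1.0) for _ in range(n_options)]

def baked_in_policy(option_values):
    # A fixed rule applied to every input: always think about option 0.
    return 0

def consequentialist_policy(option_values, noise=0.1):
    # Imperfectly predict the consequence of each option, then choose the best.
    predicted = [v + random.gauss(0.0, noise) for v in option_values]
    return max(range(len(predicted)), key=lambda i: predicted[i])

def average_loss(policy, n_trials=10_000, base_loss=1.0):
    total = 0.0
    for _ in range(n_trials):
        options = sample_input()
        chosen = policy(options)
        total += base_loss - options[chosen]  # loss after spending compute on the choice
    return total / n_trials

print("baked-in heuristic loss:      ", round(average_loss(baked_in_policy), 3))
print("consequentialist choice loss: ", round(average_loss(consequentialist_policy), 3))
```

In this toy setup the input-sensitive, consequence-predicting policy reliably outperforms the fixed rule, which is the (heavily simplified) sense in which consequentialist choices can improve predictive performance.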
If systems with consequentialist reasoning are able to make better predictions, then gradient descent will tend to select them.
Of course all these lines are blurry. But I think that systems that are “consequentialist” in this sense will eventually tend to exhibit the failure modes we are concerned about, including (eventually) deceptive alignment.
I think making this story more concrete would involve specifying particular examples of consequentialist cognition, describing how they are implemented in a given neural network architecture, and describing the trajectory by which gradient descent learns them on a given dataset. I think specifying these details can be quite involved, both because the mechanisms are likely to comprise literally billions of separate pieces of machinery functioning together and because designing such mechanisms is difficult (which is why we delegate it to SGD). But I do think we can fill them in well enough to verify that this kind of thing can happen in principle (even if we can't fill them in realistically, given that we can't design performant trillion-parameter models by hand).