I really liked this post. I’ve often felt frustrated by how badly the alignment community has explained the problem, especially to ML practitioners and researchers, and I personally find neither Superintelligence nor Human Compatible very persuasive. For what it’s worth, my default hypothesis is that you’re unconvinced by the arguments about AI risk in significant part because you are applying an unusually high level of epistemic rigour, which is a skill that seems valuable to continue applying to this topic (including in the case where AI risk isn’t important, since that will help us uncover our mistake sooner). I can think of some specific possibilities, and will send you a message about them.
The frustration I mentioned was the main motivation for me designing the AGISF course; I’m now working on follow-up material to hopefully convey the key ideas in a simpler and more streamlined way (e.g. getting rid of the concept of “mesa-optimisers”; clarifying the relationship between “behaviours that are reinforced because they lead to humans being mistaken” and “deliberate deception”; etc.). Thanks for noting the “deception” ambiguity in the AGI safety fundamentals curriculum—I’ve replaced it with a more careful claim (details in reply to this comment).
Old: “The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Whereas reward modelling can reward agents for unexpected behaviour that leads to good outcomes (as long as humans can recognise them) - but this also means that those agents might find and be rewarded for manipulative or deceptive actions. Christiano et al. (2017) provide an example of an agent learning to deceive the human evaluator; and Stiennon et al. (2020) provide an example of an agent learning to “deceive” its reward model. Lastly, while IRL could in theory be used even for tasks that humans can’t evaluate, it relies most heavily on assumptions about human rationality in order to align agents.”
New: “The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Reward modelling, by contrast, can reward agents for unexpected behaviour that leads to good outcomes—but also rewards agents for manipulative or deceptive actions. (Although deliberate deception is likely beyond the capabilities of current agents, there are examples of simpler behaviours having a similar effect: Christiano et al. (2017) describe an agent learning behaviour which misled the human evaluator; and Stiennon et al. (2020) describe an agent learning behaviour which was misclassified by its reward model.) Lastly, while IRL can potentially be used even for tasks that humans can’t evaluate, the theoretical justification for why this should work relies on implausibly strong assumptions about human rationality.”
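To make the reward-modelling failure mode concrete, here is a minimal sketch (in PyTorch; all names, shapes, and data are illustrative, not taken from either paper) of learning a reward model from pairwise human comparisons, roughly in the style of Christiano et al. (2017):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps trajectory features to a scalar reward."""

    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj_features: torch.Tensor) -> torch.Tensor:
        return self.net(traj_features).squeeze(-1)

def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss on pairs of trajectories a human compared.

    The human's labels are treated as ground truth, so behaviour that merely
    *looks* better to the evaluator gets reinforced -- the failure mode above.
    """
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

# Hypothetical usage with random stand-in data:
rm = RewardModel(obs_dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 16), torch.randn(32, 16)
opt.zero_grad()
preference_loss(rm, preferred, rejected).backward()
opt.step()
```

Nothing in this loop distinguishes “genuinely good” from “looks good to the labeller”, which is why the misleading behaviours in both papers got rewarded.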
my default hypothesis is that you’re unconvinced by the arguments about AI risk in significant part because you are applying an unusually high level of epistemic rigour
This seems plausible to me, based on:
The people I know who have thought deeply about AI risk and come away unconvinced often seem to match this pattern.
I think some of the people who care most about AI risk apply a lower level of epistemic rigour than I would, e.g. some seem to have much stronger beliefs about how the future will go than I think can be reasonably justified.
Interesting to hear your personal opinion on the persuasiveness of Superintelligence and Human Compatible! And thanks for designing the AGISF course, it was useful.
Superintelligence doesn’t talk about ML enough to be strongly persuasive given the magnitude of the claims it’s making (although it does a reasonable job of conveying core ideas like the instrumental convergence thesis and orthogonality thesis, which are where many skeptics get stuck).
Human Compatible only spends, I think, a couple of pages actually explaining the core of the alignment problem (although it does a good job at debunking some of the particularly bad responses to it). It doesn’t do a great job at linking the conventional ML paradigm to the superintelligence paradigm, and I don’t think the “assistance games” approach is anywhere near as promising as Russell makes it out to be.
I wish you would summarize this disagreement with Russell as “I think neural networks / ML will lead to AGI whereas Russell expects it will be something else”. Everything else seems downstream of that. (If I had similar beliefs about how we’d get to AGI as Russell, and I was forced to choose to work on some existing research agenda, it would be assistance games. Though really I would prefer to see if I could transfer the insights from neural network / ML alignment, which might then give rise to some new agenda.)
This seems particularly important to do when talking to someone who also thinks neural networks / ML will not lead to AGI.
FWIW, I don’t think the problem with assistance games is that it assumes that ML is not going to get to AGI. The issues seem much deeper than that (mostly of the “grain of truth” sort, and from the fact that in CIRL-like formulations, the actual update-rule for how to update your beliefs about the correct value function is where 99% of the problem lies, and the rest of the decomposition doesn’t really seem to me to reduce the problem very much, but instead just shunts it into a tiny box that then seems to get ignored, as far as I can tell).
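To illustrate where that tiny box sits, here is a minimal sketch (all names and numbers hypothetical, not from any CIRL paper) of the belief update at the centre of CIRL-like formulations: a posterior over candidate reward parameters, updated from observed human actions under an assumed human model. Essentially all of the difficulty gestured at above hides inside human_likelihood.

```python
import numpy as np

# Toy discretisation of the reward parameters the robot is uncertain over.
thetas = np.linspace(-1.0, 1.0, 5)
belief = np.full(len(thetas), 1.0 / len(thetas))  # uniform prior

def human_likelihood(action: int, theta: float, beta: float = 2.0) -> float:
    """P(human takes `action` | true reward parameter `theta`).

    Assumes a Boltzmann-rational human choosing between two actions whose
    rewards are theta and -theta. This human model is the tiny box: real
    humans sit inside deeply nested plans, never consider most actions,
    and are not noisily-rational in any simple parametric sense.
    """
    q = np.array([theta, -theta])
    probs = np.exp(beta * q) / np.exp(beta * q).sum()
    return probs[action]

def update_belief(belief: np.ndarray, action: int) -> np.ndarray:
    """Bayes rule: posterior over theta after observing one human action."""
    posterior = belief * np.array([human_likelihood(action, t) for t in thetas])
    return posterior / posterior.sum()

belief = update_belief(belief, action=0)  # observe the human pick action 0
print(dict(zip(np.round(thetas, 2), np.round(belief, 3))))
```

The Bayes-rule arithmetic itself is trivial; everything that makes alignment hard is packed into the assumed human model.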
The issues seem much deeper than that (mostly of the “grain of truth” sort, and from the fact that in CIRL-like formulations, the actual update-rule for how to update your beliefs about the correct value function is where 99% of the problem lies, and the rest of the decomposition doesn’t really seem to me to reduce the problem very much
Sounds right, and compatible with everything I said? (Not totally sure what counts as “reducing the problem”, plausibly I’d disagree with you there.)
Like, if you were trying to go to the Moon, and you discovered the rocket equation and some BOTECs said it might be feasible to use, I think (a) you should be excited about this new paradigm for how to get to the Moon, and (b) “99% of the problem” still lies ahead of you, in making a device that actually uses the rocket equation appropriately.
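(For concreteness, the kind of BOTEC the rocket equation enables looks something like the following; the numbers are rough illustrative values, not from this comment.)

```latex
% Tsiolkovsky rocket equation: velocity change from exhaust velocity and mass ratio.
% Illustrative BOTEC: a one-way lunar mission needs roughly
% \Delta v \approx 15 km/s from Earth's surface; with hydrolox exhaust
% velocity v_e \approx 4.4 km/s, a single stage would need a mass ratio
% of about e^{15/4.4} \approx 30 -- infeasible for one stage, but
% splitting \Delta v across stages brings each stage's ratio into range.
\[
  \Delta v = v_e \ln\!\frac{m_0}{m_f},
  \qquad
  \frac{m_0}{m_f} = e^{\Delta v / v_e} \approx e^{15/4.4} \approx 30 .
\]
```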
Is there some other paradigm for AI alignment (neural net based or otherwise) that you think solves more than “1% of the problem”? I’ll be happy to shoot it down for you.
instead just shunts it into a tiny box that then seems to get ignored, as far as I can tell
This is definitely a known problem. I think you don’t see much work on it because (a) there isn’t much work on assistance games in general (my outsider impression is that many CHAI grad students are focused on neural nets), and (b) it’s the sort of work that is particularly hard to do in academia.
Some abstractions that feel like they do real work on AI Alignment (compared to CIRL stuff):
Inner optimization
Intent alignment vs. impact alignment
Natural abstraction hypothesis
Coherent Extrapolated Volition
Instrumental convergence
Acausal trade
None of these are paradigms, but all of them feel like they do substantially reduce the problem, in a way that doesn’t feel true for CIRL. Based on your last paragraph, though, it’s possible I have a skewed perception of actual CIRL work, so plausibly we are just talking about different things.
Huh. I’d put assistance games above all of those things (except inner optimization but that’s again downstream of the paradigm difference; inner optimization is much less of a thing when you aren’t getting intelligence through a giant search over programs). Probably not worth getting into this disagreement though.
I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
Whether AGI will be built in the ML paradigm or not, I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon. And then in both cases there’s lots of engineering work required too. (If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.)
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
I agree that this particular critique doesn’t depend on the ML paradigm. If that’s your main disagreement then I retract my claim that it’s downstream of paradigm disagreements.
What’s your probability that if we really tried to get the assistance paradigm to work then we’d ultimately conclude it was basically doomed because of this objection? I’m at like 50%, such that if there were no other objections the decision would be “it is blindingly obvious that we should pursue this”.
I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon.
I might disagree with this but I don’t know how you’re distinguishing between conceptual and non-conceptual work. (I’m guessing I’ll disagree with the rocket equation doing > 5% of the conceptual work.)
If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.
I don’t think this is particularly relevant to the rest of the disagreement, but this is explicitly discussed in Human Compatible! It’s right at the beginning of my summary of it!
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
Are you reacting to his stated beliefs or the way he communicates?
If you are reacting to his stated beliefs: I’m not sure where you get this from. His actual beliefs (as stated in Human Compatible) are that there are lots of problems that still need to be solved. From my summary:
Another problem with inferring preferences from behavior is that humans are nearly always in some deeply nested plan, and many actions don’t even occur to us. Right now I’m writing this summary, and not considering whether I should become a fireman. I’m not writing this summary because I just ran a calculation showing that this would best achieve my preferences, I’m doing it because it’s a subpart of the overall plan of writing this bonus newsletter, which itself is a subpart of other plans. The connection to my preferences is very far up. How do we deal with that fact?
There are perhaps more fundamental challenges with the notion of “preferences” itself. For example, our experiencing self and our remembering self may have different preferences—if so, which one should our agent optimize for? In addition, our preferences often change over time: should our agent optimize for our current preferences, even if it knows that they will predictably change in the future? This one could potentially be solved by learning meta-preferences that dictate what kinds of preference change processes are acceptable.
All of these issues suggest that we need work across many fields (such as AI, cognitive science, psychology, and neuroscience) to reverse-engineer human cognition, so that we can put principle 3 into action and create a model that shows how human behavior arises from human preferences.
If you are reacting to how he communicates: I don’t know why you expect him to follow the norms of the EA community and sprinkle “probably” into every sentence. Those aren’t the norms the broader world operates under, and he’s writing for the broader world.