I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
Whether AGI will be built in the ML paradigm or not, I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon. And then in both cases there’s lots of engineering work required too. (If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.)
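(The rocket equation referenced here is presumably Tsiolkovsky's, which relates the achievable change in velocity to the exhaust velocity and the ratio of initial to final mass: $\Delta v = v_e \ln(m_0 / m_f)$.)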
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
I agree that this single critique doesn’t depend on the ML paradigm. If that’s your main disagreement then I retract my claim that it’s downstream of paradigm disagreements.
What’s your probability that if we really tried to get the assistance paradigm to work then we’d ultimately conclude it was basically doomed because of this objection? I’m at like 50%, such that if there were no other objections the decision would be “it is blindingly obvious that we should pursue this”.
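To make concrete what “the assistance paradigm” refers to in this exchange: below is a minimal toy sketch of a CIRL-style assistance game, roughly in the spirit of Hadfield-Menell et al. (2016), in which the robot treats the human’s reward function as a hidden variable to be inferred from behavior rather than as a fixed, fully specified objective. The Boltzmann-rational human model, the discrete action set, and all names here are illustrative assumptions, not Stuart’s proposal in full.

```python
import numpy as np

rng = np.random.default_rng(0)

THETAS = [0, 1, 2]    # candidate hidden reward parameters the human might have
ACTIONS = [0, 1, 2]   # actions available in this toy world

def reward(theta, action):
    """Shared reward: highest when the action matches the human's hidden preference."""
    return -abs(theta - action)

def human_policy(theta, beta=2.0):
    """Boltzmann-rational human: noisily prefers higher-reward actions."""
    logits = beta * np.array([reward(theta, a) for a in ACTIONS])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# The robot starts with a uniform belief over which theta the human has.
belief = np.ones(len(THETAS)) / len(THETAS)

true_theta = 2
for _ in range(5):
    # Watch one human action, then do a Bayesian update on theta.
    human_action = rng.choice(ACTIONS, p=human_policy(true_theta))
    likelihoods = np.array([human_policy(t)[human_action] for t in THETAS])
    belief = belief * likelihoods
    belief /= belief.sum()

# The robot acts to maximize expected reward under its posterior,
# rather than optimizing a fixed, fully specified objective.
expected = [sum(b * reward(t, a) for b, t in zip(belief, THETAS)) for a in ACTIONS]
robot_action = ACTIONS[int(np.argmax(expected))]
print("posterior over theta:", np.round(belief, 3), "| robot action:", robot_action)
```

This is only the skeleton of the formalism; the disagreement above is about how much of the conceptual work of alignment that skeleton actually does.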
I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon.
I might disagree with this but I don’t know how you’re distinguishing between conceptual and non-conceptual work. (I’m guessing I’ll disagree with the rocket equation doing > 5% of the conceptual work.)
If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.
I don’t think this is particularly relevant to the rest of the disagreement, but this is explicitly discussed in Human Compatible! It’s right at the beginning of my summary of it!
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
Are you reacting to his stated beliefs or the way he communicates?
If you are reacting to his stated beliefs: I’m not sure where you get this from. His actual beliefs (as stated in Human Compatible) are that there are lots of problems that still need to be solved. From my summary:
Another problem with inferring preferences from behavior is that humans are nearly always in some deeply nested plan, and many actions don’t even occur to us. Right now I’m writing this summary, and not considering whether I should become a fireman. I’m not writing this summary because I just ran a calculation showing that this would best achieve my preferences, I’m doing it because it’s a subpart of the overall plan of writing this bonus newsletter, which itself is a subpart of other plans. The connection to my preferences is very far up. How do we deal with that fact?
There are perhaps more fundamental challenges with the notion of “preferences” itself. For example, our experiencing self and our remembering self may have different preferences—if so, which one should our agent optimize for? In addition, our preferences often change over time: should our agent optimize for our current preferences, even if it knows that they will predictably change in the future? This one could potentially be solved by learning meta-preferences that dictate what kinds of preference change processes are acceptable.
All of these issues suggest that we need work across many fields (such as AI, cognitive science, psychology, and neuroscience) to reverse-engineer human cognition, so that we can put principle 3 into action and create a model that shows how human behavior arises from human preferences.
If you are reacting to how he communicates: I don’t know why you expect him to follow the norms of the EA community and sprinkle “probably” into every sentence. Those aren’t the norms the broader world operates under, and he’s writing for the broader world.