FWIW, I don’t think the problem with assistance games is that it assumes that ML is not going to get to AGI. The issues seem much deeper than that (mostly of the “grain of truth” sort, and from the fact that in CIRL-like formulations, the actual update-rule for how to update your beliefs about the correct value function is where 99% of the problem lies, and the rest of the decomposition doesn’t really seem to me to reduce the problem very much, but instead just shunts it into a tiny box that then seems to get ignored, as far as I can tell).
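To make the "update rule" point concrete: in a CIRL-style formulation, the robot maintains a posterior over candidate reward functions and updates it from observed human behavior. The toy sketch below shows that update step under a Boltzmann-rational human model; the candidate rewards, actions, and rationality parameter are all illustrative assumptions, and the claim above is precisely that getting this likelihood model right for real humans is where most of the difficulty lives.

```python
# Toy sketch (illustrative assumptions throughout): a discrete Bayesian
# update over candidate reward functions, i.e. the "update rule" step in
# a CIRL-like formulation. The hard part in practice is the human model
# (the likelihood); here we just assume Boltzmann rationality.
import math

# Two hypothetical reward functions over three actions.
candidate_rewards = {
    "theta1": {"a": 1.0, "b": 0.0, "c": 0.0},
    "theta2": {"a": 0.0, "b": 1.0, "c": 0.0},
}

def human_likelihood(action, theta, beta=2.0):
    """P(action | theta) under an assumed Boltzmann-rational human."""
    rewards = candidate_rewards[theta]
    z = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[action]) / z

def update_belief(belief, observed_action):
    """One Bayesian update: P(theta | a) is proportional to P(a | theta) * P(theta)."""
    posterior = {theta: p * human_likelihood(observed_action, theta)
                 for theta, p in belief.items()}
    total = sum(posterior.values())
    return {theta: p / total for theta, p in posterior.items()}

belief = {"theta1": 0.5, "theta2": 0.5}  # uniform prior
belief = update_belief(belief, "a")      # observe the human choose "a"
# theta1 (which rewards "a") gains posterior mass relative to theta2.
```

The decomposition itself is the easy part; everything contentious is hidden inside `human_likelihood`.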
The issues seem much deeper than that (mostly of the “grain of truth” sort, and from the fact that in CIRL-like formulations, the actual update-rule for how to update your beliefs about the correct value function is where 99% of the problem lies, and the rest of the decomposition doesn’t really seem to me to reduce the problem very much
Sounds right, and compatible with everything I said? (Not totally sure what counts as “reducing the problem”, plausibly I’d disagree with you there.)
Like, if you were trying to go to the Moon, and you discovered the rocket equation and some BOTECs said it might be feasible to use, I think (a) you should be excited about this new paradigm for how to get to the Moon, and (b) “99% of the problem” still lies ahead of you, in making a device that actually uses the rocket equation appropriately.
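The kind of BOTEC gestured at here can be sketched with the Tsiolkovsky rocket equation; the exhaust velocity and mass ratio below are illustrative round numbers, not a real vehicle design.

```python
# Hedged sketch: a back-of-the-envelope feasibility check using the
# Tsiolkovsky rocket equation, dv = v_e * ln(m0 / mf).
# Numbers are illustrative, not a real design.
import math

def delta_v(exhaust_velocity, initial_mass, final_mass):
    """Tsiolkovsky rocket equation: ideal velocity change for a stage."""
    return exhaust_velocity * math.log(initial_mass / final_mass)

# A stage with ~4,500 m/s exhaust velocity and a 10:1 mass ratio:
dv = delta_v(4500.0, initial_mass=100.0, final_mass=10.0)
# Roughly 10,400 m/s, in the broad range relevant to reaching low Earth
# orbit -- the sense in which a BOTEC can say "might be feasible" while
# leaving ~all of the engineering work still ahead.
```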
Is there some other paradigm for AI alignment (neural net based or otherwise) that you think solves more than “1% of the problem”? I’ll be happy to shoot it down for you.
instead just shunts it into a tiny box that then seems to get ignored, as far as I can tell
This is definitely a known problem. I think you don’t see much work on it because (a) there isn’t much work on assistance games in general (my outsider impression is that many CHAI grad students are focused on neural nets), and (b) it’s the sort of work that is particularly hard to do in academia.
Some abstractions that feel like they do real work on AI Alignment (compared to CIRL stuff):
Inner optimization
Intent alignment vs. impact alignment
Natural abstraction hypothesis
Coherent Extrapolated Volition
Instrumental convergence
Acausal trade
None of these are paradigms, but all of them feel like they substantially reduce the problem, in a way that doesn’t feel true for CIRL. Based on your last paragraph, though, it’s possible I have a skewed perception of actual CIRL work, so it’s plausible we are just talking about different things.
Huh. I’d put assistance games above all of those things (except inner optimization but that’s again downstream of the paradigm difference; inner optimization is much less of a thing when you aren’t getting intelligence through a giant search over programs). Probably not worth getting into this disagreement though.
I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
Whether AGI will be built in the ML paradigm or not, I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon. And then in both cases there’s lots of engineering work required too. (If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.)
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
I don’t think that my main disagreement with Stuart is about how we’ll reach AGI, because critiques of his approach, like this page, don’t actually require any assumption that we’re in the ML paradigm.
I agree that that particular critique doesn’t depend on the ML paradigm. If that’s your main disagreement then I retract my claim that it’s downstream of paradigm disagreements.
What’s your probability that if we really tried to get the assistance paradigm to work then we’d ultimately conclude it was basically doomed because of this objection? I’m at like 50%, such that if there were no other objections the decision would be “it is blindingly obvious that we should pursue this”.
I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon.
I might disagree with this but I don’t know how you’re distinguishing between conceptual and non-conceptual work. (I’m guessing I’ll disagree with the rocket equation doing > 5% of the conceptual work.)
If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven’t seen from Stuart.
I don’t think this is particularly relevant to the rest of the disagreement, but this is explicitly discussed in Human Compatible! It’s right at the beginning of my summary of it!
But Stuart’s presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
Are you reacting to his stated beliefs or the way he communicates?
If you are reacting to his stated beliefs: I’m not sure where you get this from. His actual beliefs (as stated in Human Compatible) are that there are lots of problems that still need to be solved. From my summary:
Another problem with inferring preferences from behavior is that humans are nearly always in some deeply nested plan, and many actions don’t even occur to us. Right now I’m writing this summary, and not considering whether I should become a fireman. I’m not writing this summary because I just ran a calculation showing that this would best achieve my preferences, I’m doing it because it’s a subpart of the overall plan of writing this bonus newsletter, which itself is a subpart of other plans. The connection to my preferences is very far up. How do we deal with that fact?
There are perhaps more fundamental challenges with the notion of “preferences” itself. For example, our experiencing self and our remembering self may have different preferences—if so, which one should our agent optimize for? In addition, our preferences often change over time: should our agent optimize for our current preferences, even if it knows that they will predictably change in the future? This one could potentially be solved by learning meta-preferences that dictate what kinds of preference change processes are acceptable.
All of these issues suggest that we need work across many fields (such as AI, cognitive science, psychology, and neuroscience) to reverse-engineer human cognition, so that we can put principle 3 into action and create a model that shows how human behavior arises from human preferences.
If you are reacting to how he communicates: I don’t know why you expect him to follow the norms of the EA community and sprinkle “probably” in every sentence. Those aren’t the norms the broader world operates under, and he’s writing for the broader world.