Hmm, interesting.

I’m realizing now that I might be more confused about this topic than I thought I was, so to backtrack for just a minute: it sounds like you see weak philosophical competence as being part of intent alignment, is that correct? If so, are you using “intent alignment” in the same way as in the Christiano definition? My understanding was that intent alignment means “the AI is trying to do what present-me wants it to do.” To me, therefore, this business of the AI being able to recognize whether its actions would be approved by idealized-me (or just better-informed-me) falls outside the definition of intent alignment.
(Looking through that Christiano post again, I see a couple of statements that seem to support what I’ve just said,[1] but also one that arguably goes the other way.[2])
Now, addressing your most recent comment:
Okay, just to make sure that I’ve understood you, you are defining weak philosophical competence as “competence at reasoning about complex questions [in any domain] which ultimately have empirical answers, where it’s out of reach to test them empirically, but where one may get better predictions from finding clear frameworks for thinking about them,” right? Would you agree that the “important” part of weak philosophical competence is whether the system would do things an informed version of you, or humans at large, would ultimately regard as terrible (as opposed to how competent the system is at high energy physics, consciousness science, etc.)?
If a system is competent at reasoning about complex questions across a bunch of domains, then I think I’m on board with seeing that as evidence that the system is competent at the important part of weak philosophical competence, assuming that it’s already intent aligned.[3] However, I’m struggling to see why this would help with intent alignment itself, according to the Christiano definition. (If one includes weak philosophical competence within one’s definition of intent alignment—as I think you are doing(?)—then I can see why it helps. However, I think this would be a non-standard usage of “intent alignment.” I also don’t think that most folks working on AI alignment see weak philosophical competence as part of alignment. (My last point is based mostly on my experience talking to AI alignment researchers, but also on seeing leaders of the field write things like this.))
A couple of closing thoughts:
I already thought that strong philosophical competence was extremely neglected, but I now also think that weak philosophical competence is very neglected. It seems to me that if weak philosophical competence is not solved at the same time as intent alignment (in the Christiano sense),[4] then things could go badly, fast. (Perhaps this is why you want to include weak philosophical competence within the intent alignment problem?)
The important part of weak philosophical competence seems closely related to Wei Dai’s “human safety problems”.
(Of course, no obligation on you to spend your time replying to me, but I’d greatly appreciate it if you do!)
[1] They could [...] be wrong [about; sic] what H wants at a particular moment in time.
They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
…
I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
(“Understanding what humans want” sounds quite a lot like weak philosophical competence, as defined earlier in this thread, while “solving philosophy” sounds a lot like strong philosophical competence.)
[2] An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences.
(It’s unclear whether this just refers to clarifying present-H’s preferences, or if it extends to making present-H’s preferences be closer to idealized-H’s.)
[3] If the system is not intent aligned, then I think this would still be evidence that the system understands what an informed version of me would ultimately regard as terrible vs. not terrible. But, in this case, I don’t think the system will use what it understands to try to do the non-terrible things.

[4] Insofar as a solved vs. not solved framing even makes sense. Karnofsky (2022; fn. 4) argues against this framing.
it sounds like you see weak philosophical competence as being part of intent alignment, is that correct?
Ah, no, that’s not correct.
I’m saying that weak philosophical competence would:
Be useful enough for acting in the world, and in principle testable-for, that I expect it to be developed as a form of capability before strong superintelligence
Be useful for research on how to produce intent-aligned systems
… and therefore that if we’ve been managing to keep things more or less intent aligned up to the point where we have systems which are weakly philosophically competent, it’s less likely that we have a failure of intent alignment thereafter. (Not impossible, but I think a pretty small fraction of the total risk.)
Thanks for clarifying!

Be useful for research on how to produce intent-aligned systems
Just checking: Do you believe this because you see the intent alignment problem as being in the class of “complex questions which ultimately have empirical answers, where it’s out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them,” alongside, say, high energy physics?
Yep.