“By the time systems approach strong superintelligence, they are likely to have philosophical competence in some sense.”
It’s interesting to me that you think this; I’d be very keen to hear your reasoning (or for you to point me to any existing writings that fit your view).
For what it’s worth, I’m at maybe 30 or 40% that superintelligence will be philosophically competent by default (i.e., without its developers trying hard to differentially imbue it with this competence), conditional on successful intent alignment, where I’m roughly defining “philosophically competent” as “wouldn’t cause existential catastrophe through philosophical incompetence.” I believe this mostly because I find @Wei Dai’s writings compelling, and partly because of some thinking I’ve done myself on the matter. OpenAI’s o1 announcement post, for example, indicates that o1—the current #1 LLM, by most measures—performs far better in domains that have clear right/wrong answers (e.g., calculus and chemistry) than in domains where this is not the case (e.g., free-response writing[1]).[2] Philosophy, being interminable debate, is perhaps the ultimate “no clear right/wrong answers” domain (to non-realists, at least): for this reason, plus a few others (which are largely covered in Dai’s writings), I’m struggling to see why AIs wouldn’t be differentially bad at philosophy in the lead-up to superintelligence.
Also, for what it’s worth, the current community prediction on the Metaculus question “Five years after AGI, will AI philosophical competence be solved?” is down at 27%.[3] (Although, given how out of distribution this question is with respect to most Metaculus questions, the community prediction here should be taken with a lump of salt.)
(It’s possible that your “in some sense” qualifier is what’s driving our apparent disagreement, and that we don’t actually disagree by much.)
[1] Free-response writing comprises 55% of the AP English language and English literature exams (source; source).
[2] On this, AI Explained (8:01–8:34) says:
“And there is another hurdle that would follow, if you agree with this analysis [of why o1’s capabilities are what they are, across the board]: It’s not just a lack of training data. What about domains that have plenty of training data, but no clearly correct or incorrect answers? Then you would have no way of sifting through all of those chains of thought, and fine-tuning on the correct ones. Compared to the original GPT-4o in domains with correct and incorrect answers, you can see the performance boost. With harder-to-distinguish correct or incorrect answers: much less of a boost [in performance]. In fact, a regress in personal writing.”
[3] Note: Metaculus forecasters—for the most part—think that superintelligence will come within five years of AGI. (See here my previous commentary on this, which goes into more detail.)
It’s not clear we have too much disagreement, but let me unpack what I meant:
Let strong philosophical competence mean competence at all philosophical questions, including those like metaethics which really don’t seem to have any empirical grounding
I’m not trying to make any claims about strong philosophical competence
I might be a little more optimistic than you about getting this by default as a generalization of weak philosophical competence (see below), but I’m still pretty worried that we won’t get it, and I didn’t mean to rely on it in my statements in this post
Let weak philosophical competence mean competence at reasoning about complex questions which ultimately have empirical answers, where it’s out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them
I claim that by the time systems approach strong superintelligence, they’re likely to have a degree of weak philosophical competence
Because:
It would be useful for many tasks, and this would likely be apparent to mild superintelligent systems
It can be selected for empirically (seeing which training approaches etc. do well at weak philosophical competence in toy settings, where the experimenters have access to the ground truth about the questions they’re having the systems use philosophical reasoning to approach)
I further claim that weak philosophical competence is what you need to be able to think about how to build stronger AI systems that are, roughly speaking, safe, or intent aligned
Because this is ultimately an empirical question (“would this AI do something an informed version of me / those humans would ultimately regard as terrible?”)
I don’t claim that this would extend to being able to think about how to build stronger AI systems that it would be safe to make sovereigns
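The “selected for empirically” point above can be made concrete with a toy sketch. Everything here (the questions, the Brier-score metric, the candidate “approaches”) is an illustrative assumption of mine, not a claim about how any lab actually runs such experiments:

```python
# Toy sketch: empirically selecting for "weak philosophical competence" by
# scoring candidate approaches on hard-to-test questions whose ground truth
# the experimenters DO hold. All names here are hypothetical.

def brier_score(predicted_prob: float, outcome: bool) -> float:
    """Squared error of a probabilistic prediction against ground truth."""
    return (predicted_prob - float(outcome)) ** 2

def evaluate_approach(predict, questions):
    """Mean Brier score of one approach over held-out toy questions.

    `predict` maps a question string to a probability; `questions` pairs
    each question with the ground truth the experimenters have access to.
    """
    scores = [brier_score(predict(q), truth) for q, truth in questions]
    return sum(scores) / len(scores)

# Ground truth is known to the experimenters, but (by stipulation) the
# system cannot test these questions empirically itself.
toy_questions = [
    ("Will intervention A have side effect X?", True),
    ("Does framework F predict outcome Y?", False),
]

approaches = {
    "baseline": lambda q: 0.5,  # no framework: maximally uncertain
    "framework": lambda q: 0.9 if "side effect" in q else 0.2,  # stand-in
}

# Select whichever approach does best in the toy setting (lower is better).
best = min(approaches, key=lambda name: evaluate_approach(approaches[name], toy_questions))
print(best)
```

The point of the sketch is only that the selection signal exists: because the experimenters hold ground truth, training approaches can be compared and the better one chosen, even though the system itself cannot check its answers.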
Thanks for expanding! This is the first time I’ve seen this strong vs. weak distinction used—seems like a useful ontology.[1]
Minor: When I read your definition of weak philosophical competence,[2] high energy physics and consciousness science came to mind as fields that fit the definition (given present technology levels). However, this seems outside the spirit of “weak philosophical competence”: an AI that’s superhuman in the aforementioned fields could still fail big time with respect to “would this AI do something an informed version of me / those humans would ultimately regard as terrible?” Nonetheless, I’ve not been able to think up a better ontology myself (in my 5 mins of trying), and I don’t expect this definitional matter will cause problems in practice.
[1] For the benefit of any readers: Strong philosophical competence is importantly different to weak philosophical competence, as defined. The latter feeds into intent alignment, while the former is an additional problem beyond intent alignment. [Edit: I now think this is not so clear-cut. See the ensuing thread for more.]
[2] “Let weak philosophical competence mean competence at reasoning about complex questions which ultimately have empirical answers, where it’s out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them.”
Yeah, I appreciated your question, because I’d also not managed to unpack the distinction I was making here until you asked.
On the minor issue: right, I think that for some particular domain(s), you could surely train a system to be highly competent in that domain without this generalizing to even weak philosophical competence overall. But if you had a system which was strong at both of those domains despite not having been trained on them, and especially if that was also true for, say, three more comparable domains, I guess I kind of do expect it to be good at the general thing? (I haven’t thought long about that.)
Hmm, interesting.
I’m realizing now that I might be more confused about this topic than I thought I was, so to backtrack for just a minute: it sounds like you see weak philosophical competence as being part of intent alignment, is that correct? If so, are you using “intent alignment” in the same way as in the Christiano definition? My understanding was that intent alignment means “the AI is trying to do what present-me wants it to do.” To me, therefore, this business of the AI being able to recognize whether its actions would be approved by idealized-me (or just better-informed-me) falls outside the definition of intent alignment.
(Looking through that Christiano post again, I see a couple of statements that seem to support what I’ve just said,[1] but also one that arguably goes the other way.[2])
Now, addressing your most recent comment:
Okay, just to make sure that I’ve understood you, you are defining weak philosophical competence as “competence at reasoning about complex questions [in any domain] which ultimately have empirical answers, where it’s out of reach to test them empirically, but where one may get better predictions from finding clear frameworks for thinking about them,” right? Would you agree that the “important” part of weak philosophical competence is whether the system would do things an informed version of you, or humans at large, would ultimately regard as terrible (as opposed to how competent the system is at high energy physics, consciousness science, etc.)?
If a system is competent at reasoning about complex questions across a bunch of domains, then I think I’m on board with seeing that as evidence that the system is competent at the important part of weak philosophical competence, assuming that it’s already intent aligned.[3] However, I’m struggling to see why this would help with intent alignment itself, according to the Christiano definition. (If one includes weak philosophical competence within one’s definition of intent alignment—as I think you are doing(?)—then I can see why it helps. However, I think this would be a non-standard usage of “intent alignment.” I also don’t think that most folks working on AI alignment see weak philosophical competence as part of alignment. (My last point is based mostly on my experience talking to AI alignment researchers, but also on seeing leaders of the field write things like this.))
A couple of closing thoughts:
I already thought that strong philosophical competence was extremely neglected, but I now also think that weak philosophical competence is very neglected. It seems to me that if weak philosophical competence is not solved at the same time as intent alignment (in the Christiano sense),[4] then things could go badly, fast. (Perhaps this is why you want to include weak philosophical competence within the intent alignment problem?)
The important part of weak philosophical competence seems closely related to Wei Dai’s “human safety problems”.
(Of course, no obligation on you to spend your time replying to me, but I’d greatly appreciate it if you do!)
[1]
They could [...] be wrong [about; sic] what H wants at a particular moment in time.
They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
…
I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
(“Understanding what humans want” sounds quite a lot like weak philosophical competence, as defined earlier in this thread, while “solving philosophy” sounds a lot like strong philosophical competence.)
[2]
An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences.
(It’s unclear whether this just refers to clarifying present-H’s preferences, or if it extends to making present-H’s preferences be closer to idealized-H’s.)
[3] If the system is not intent aligned, then I think this would still be evidence that the system understands what an informed version of me would ultimately regard as terrible vs. not terrible. But, in this case, I don’t think the system will use what it understands to try to do the non-terrible things.
[4] Insofar as a solved vs. not solved framing even makes sense. Karnofsky (2022; fn. 4) argues against this framing.
“it sounds like you see weak philosophical competence as being part of intent alignment, is that correct?”
Ah, no, that’s not correct.
I’m saying that weak philosophical competence would:
Be useful enough for acting in the world, and in principle testable-for, that I expect it to be developed as a form of capability before strong superintelligence
Be useful for research on how to produce intent-aligned systems
… and therefore that if we’ve been managing to keep things more or less intent aligned up to the point where we have systems which are weakly philosophically competent, it’s less likely that we have a failure of intent alignment thereafter. (Not impossible, but I think a pretty small fraction of the total risk.)
Thanks for clarifying!
“Be useful for research on how to produce intent-aligned systems”
Just checking: Do you believe this because you see the intent alignment problem as being in the class of “complex questions which ultimately have empirical answers, where it’s out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them,” alongside, say, high energy physics?
Yep.