My guess is that at some point someone will just solve the technical problem of alignment. Future generations of AIs would then actually be aligned to prior generations, and the group they are aligned to would no longer need to worry about expropriation.
I don’t think it’s realistic that solutions to the alignment problem will be binary in the way you’re describing. One could theoretically imagine a perfect solution — i.e. one that allows you to build an agent whose values never drift, that acts well on every possible input it could receive, whose preferences are not subject to extremal Goodhart, and whose preferences reflect your own desires at every level, on every question — but I suspect this idea will always belong more to fiction than to reality. The real world is very messy, and it becomes very unclear what each of these ideas actually means once you carefully interrogate what would happen in the limit of unlimited optimization power.
A more realistic scenario, in my view, is that alignment is more of a spectrum, and there will always be slight defects in the alignment process. For example, even my own brain is slightly misaligned with my former self from one day ago. Over longer time periods than a day, my values have drifted significantly.
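As a toy illustration of how small per-step drift compounds (the functional form and the numbers here are purely illustrative assumptions, not measurements):

```latex
% Toy compounding model of value drift (illustrative assumptions only).
% Let A(t) be the fraction of alignment with my past self retained after
% t days, with a small constant daily drift rate delta:
A(t) = (1-\delta)^{t} \approx e^{-\delta t}
% Even delta = 10^{-4} -- barely perceptible day to day, A(1) ~ 0.9999 --
% compounds to A(3650) ~ e^{-0.365} ~ 0.69 over a decade: roughly 30%
% drift. "Slightly misaligned" daily, "drifted significantly" long run.
```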
In this situation — since perfection is unattainable — there’s always an inherent tradeoff between being cautious in order to do more alignment work, and just going ahead and building something that’s actually useful, even if it’s imperfect, and even though you can’t fully predict what will happen when you build it. And this tradeoff seems likely to exist at every level of AI, from human-level all the way up to radical superintelligences.
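One crude way to formalize that tradeoff (all of the symbols below are hypothetical model ingredients I’m introducing for illustration, not anything from the thread):

```latex
% A crude decision framing of the caution-vs-usefulness tradeoff.
% Hypothetical ingredients:
%   x    = additional alignment effort spent before deployment
%   p(x) = probability the deployed system realizes value V
%          (increasing in x, but saturating)
%   c(x) = cost of delay / foregone usefulness (increasing in x)
% Choose effort x to maximize expected value:
\max_{x \ge 0}\; \mathrm{EV}(x) \;=\; p(x)\,V \;-\; c(x)
% First-order condition: p'(x) V = c'(x).  If p saturates (p' -> 0)
% while delay costs keep accruing (c' > 0), the optimum sits at finite x:
% you eventually deploy an imperfectly aligned system rather than keep
% polishing forever, at every capability level.
```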
The main reasons to expect nearly perfect (e.g. >99% of value) solutions to be doable are:
1. Corrigibility seems much easier.
2. Value might not be that fragile, such that if you get reasonably close you get nearly all of the value. (E.g., I currently think the way I would utilize vast resources on reflection probably isn’t that much better than that of other people whose philosophical views I broadly endorse.)
I don’t think it’s binary, but I do think it’s likely to be a sigmoid in practice. And I expect this sigmoid will saturate relatively early.
Another way to put this is that I expect the “fraction of value lost to misalignment” to decay quickly, roughly exponentially, with the number of AI generations. (This is by no means obvious; it’s just my mainline guess.)
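To spell out how “exponential decay per generation” cashes out into a nearly perfect (>99% of value) outcome, here is a toy calculation (the geometric form is an assumption for illustration, not a derived result):

```latex
% Toy model: eps_n = fraction of value lost to misalignment at AI
% generation n, assumed (not derived) to decay geometrically:
%   eps_n = eps_0 * r^n,  with 0 < r < 1.
% Value retained across all generations (Weierstrass product inequality):
V \;=\; \prod_{n\ge 0}\bigl(1-\epsilon_0 r^{\,n}\bigr)
\;\ge\; 1-\sum_{n\ge 0}\epsilon_0 r^{\,n}
\;=\; 1-\frac{\epsilon_0}{1-r}
% E.g. eps_0 = 0.5% lost at the first generation with r = 1/2 gives
%   V >= 1 - 0.005/0.5 = 0.99,  i.e. >99% of value retained.
% The sigmoid picture above is the same claim: near saturation of a
% logistic v(x) = 1/(1+e^{-k(x-x_0)}), the unrealized value
%   1 - v(x) ~ e^{-k(x-x_0)}
% shrinks exponentially.
```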