One could theoretically imagine a perfect solution, i.e. one that allows you to build an agent whose values never drift, that acts well on every possible input it could receive, whose preferences are no longer subject to extremal Goodhart, and whose preferences reflect your own desires at every level, on every question. But I suspect this idea will always belong more to fiction than reality.
The main reasons to expect nearly perfect (e.g. >99% of value) solutions to be doable are:
Corrigibility seems much easier
Value might not be that fragile: if you get reasonably close, you may capture nearly all the value. (E.g., I currently think the way I would utilize vast resources on reflection probably isn't that much better than how other people whose philosophical views I broadly endorse would utilize them.)