I enjoyed this post. I think it is worth thinking about whether the problem is unsolvable! One takeaway I had from Tegmark’s Life 3.0 was that we will almost certainly not get exactly what we want from AGI. It seems intuitive that any possible specification will have downsides, including the specification not to build AGI at all.
But asking for a perfect utopia seems too high a bar for “Alignment”; on the other hand, “just avoid literal human extinction” would be far too low a bar and would leave open the possibility of all sorts of dystopias.
So I think it’s a well-made point that we need to define these terms more precisely, and start thinking about what sort of alignment (if any) is achievable.
I might end up at a different place than you did when it comes to actually defining “control” and “AGI”, though I don’t think I’ve thought about it enough to make any helpful comment. Seems important to think more about, though!
Glad to read your thoughts, Ben.
You’re right about this:
Even if long-term AGI safety were possible, you would still have to deal with limits on modelling, and consistently acting on, the preferences humans express from their (perceived) context. https://twitter.com/RemmeltE/status/1620762170819764229
And with ensuring the system does not consistently represent the preferences of malevolent, parasitic, or short-term human actors who want to misuse or co-opt it through any attack vector they can find.
And with the fact that the preferences of many possible future humans, and of non-human living beings, will not automatically be represented in a system that AI corporations have by default built to represent only currently living humans (preferably, those who pay).
A humble response to layers on layers of fundamental limits on the possibility of aligning AGI, even in principle, is to ask how we got so stuck on this project in the first place.