Insightful! Thanks for writing this.

> Perhaps it will be possible to design AGI systems with goals that are cleanly separated from the rest of their cognition (e.g. as an explicit utility function), such that learning new facts and heuristics doesn’t change the systems’ values.
In that case, value lock-in is the default (unless corrigibility/uncertainty is somehow part of what the AGI values), such that there’s no need for the “stable institution” you keep mentioning, right?
> But the one example of general intelligence we have — humans — instead seem to store their values as a distributed combination of many heuristics, intuitions, and patterns of thought. If the same is true for AGI, it is hard to be confident that new experiences would not occasionally cause their values to shift.
Therefore, it seems to me that most of your doc assumes we’re in this scenario? Is that the case? Did I wildly misunderstand something?
If AGI systems had goals that were cleanly separated from the rest of their cognition, such that they could learn and self-improve without risking any value drift (as long as the values-file wasn’t modified), then there’s a straightforward argument that you could stabilise and preserve that system’s goals by just storing the values-file with enough redundancy and digital error correction.
So this would make section 6 mostly irrelevant. But I think most other sections remain relevant, insofar as people weren’t already convinced that being able to build stable AGI systems would enable world-wide lock-in.
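To make the redundancy point concrete, here’s a minimal sketch (my own illustration, not anything from the doc — the function names and the example “values file” contents are made up): replicate the file, store a hash alongside each copy, and recover from any replica whose contents still match its hash. A real scheme would presumably use proper error-correcting codes rather than bare replication plus detection, but the basic logic — goal stability reduces to a storage problem — is the same.

```python
# Toy sketch of "store the values-file with enough redundancy and error
# detection": keep several replicas, each paired with a SHA-256 hash, and
# recover from whichever replica is still intact.

import hashlib


def store_redundantly(values: bytes, n_copies: int = 5) -> list[tuple[bytes, str]]:
    """Return n_copies of (data, sha256-hex) pairs, as if written to independent media."""
    digest = hashlib.sha256(values).hexdigest()
    return [(values, digest) for _ in range(n_copies)]


def recover(copies: list[tuple[bytes, str]]) -> bytes:
    """Return the contents of the first replica that still matches its stored hash."""
    for data, digest in copies:
        if hashlib.sha256(data).hexdigest() == digest:
            return data
    raise ValueError("all replicas corrupted; this toy scheme cannot repair them")


# Example: one replica suffers bit rot, but the values are still recoverable.
copies = store_redundantly(b"explicit utility function goes here")
copies[0] = (b"\x00garbled\x00", copies[0][1])  # simulate corruption of one copy
assert recover(copies) == b"explicit utility function goes here"
```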
> Therefore, it seems to me that most of your doc assumes we’re in this scenario [without clean separation between values and other parts]?
I was mostly imagining this scenario as I was writing, so when relevant, examples/terminology/arguments will be tailored for that, yeah.