I think yes, and for all the reasons. I’m a bit sceptical that we can change the values ASIs will have: we don’t understand present models that well, and there are good reasons not to treat how a model outputs text as representative of its goals (it could be hallucinating, it could be deceptive, its outputs might just not be isomorphic to a reward structure).
And even if we could, I don’t know of any non-controversial value to instill in the ASI that isn’t already included in basic attempts to control the ASI (which I’d be doing mostly for extinction-related reasons).
I’m going to press on point 2; I think it’s self-defeating, since it suggests the future will just be bad, and by that line of reasoning we shouldn’t even try to reduce extinction risks.