I’d add a question about how we can infer the sign of ‘how things affect the valence of digital minds’ … and, more broadly, how digital-mind welfare can be action-guiding at all.
You discuss nearby issues: whether digital minds will be happy by default, whether we can communicate with AIs about preferences, whether we can promise them things that are positive for their wellbeing, and whether self-modification/freedom helps. But I don’t think this fully addresses the deeper crux: even conditional on some part of an AI system having conscious valenced experience, how would we know what makes that experience better rather than worse?
As I suggested in “The ‘talker–feeler gap’: AI valence may be unknowable”, there may be a “talker–feeler gap”:
A. The part of the system we instruct, bargain with, or ask about preferences may not be the part, if any, that has valenced experience. Or it may not have reliable epistemic access to the welfare-relevant states. This isn’t a deception problem. Even a perfectly “honest” reporting subsystem might not know whether the conscious subsystem is made better or worse off. And its reports may track training objectives, conversational incentives, or preferences rather than welfare.
B. Even if there is valence and the ‘decisionmaker’ can detect it, the system may be optimized or constrained to act in ways that don’t track its own valence. This may be baked deep into its training and development and hard to adjust.
Either A or B would also make the typically proposed solutions less clearly beneficial, and potentially even harmful. If the part of the system we can query doesn’t have access to the part having the valenced experience, asking it will not tell us much. And “give them freedom / let them do what they want / avoid what makes them uncomfortable” won’t lead to better outcomes if the “decisionmaker in the system” doesn’t optimize for the “feeler’s welfare.” (And it seems as plausible to me as anything else that having freedom of choice might be painful for the valenced part of a complex system.)
So I’d suggest adding something like: “Can we ever get reliable, action-guiding evidence about the sign and magnitude of digital-mind valence and how it responds to different requests and outcomes?” Without a bridge from computation, preferences, or self-report to valence, it’s unclear whether potential AI welfare interventions actually improve welfare rather than merely satisfying some behavioral or optimization proxy.