I think paying AIs to reveal their misalignment and potentially to work for us and prevent AI takeover seems like a potentially very promising intervention.
I’m pretty skeptical of this. (Found a longer explanation of the proposal here.)
An AI facing such a deal would be very concerned that we're merely trying to trick it into revealing its own misalignment (which we'd then try to patch out). It seems to me that it would probably be a lot easier for us to trick an AI into believing that we're honestly presenting it such a deal (including by directly manipulating its weights and activations) than to actually honestly present such a deal and, in doing so, cause the AI to believe it.
Further, I think there is a substantial chance that AI moral patienthood becomes a huge issue in coming years, and thus it is good to ensure that the field has better views and interventions.
I agree with this part.