Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

The following was my submission to the AI Alignment Awards Shutdown Contest, where it won an honorable mention prize:

My proposal is as follows:
1) pre-train a model for broad capabilities (though not to AGI)

2) create a list of sub-characteristics that in aggregate yield incorrigibility, but where no single sub-characteristic implies incorrigibility (e.g., situational awareness, power-seeking, social persuasiveness, etc.)

3) for each sub-characteristic, construct a task vector[1] from the pre-trained model to a fine-tuned model possessing that sub-characteristic; i.e., fine-tune the pre-trained model for the sub-characteristic (e.g., for situational awareness, directly teach it about its physical vulnerabilities), then subtract the weights of the pre-trained model from those of the fine-tuned model to obtain the vector

4) negate each task vector; i.e., take the weights of the pre-trained model and subtract each of the task vectors from them

5) either train the resultant model to completion, or return to step 1 (iterating through the process, possibly many times).
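
The task-vector construction and negation steps above can be sketched in a few lines. This is only a toy illustration under stated assumptions: weights are plain Python lists keyed by parameter name (a stand-in for a real model's state dict), the fine-tuned values and trait names are made up, and the `scale` parameter is a hypothetical knob for how strongly each vector is subtracted, not something the proposal specifies.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Fine-tuned weights minus pre-trained weights, parameter by parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def negate_task_vectors(
    pretrained: Weights, vectors: List[Weights], scale: float = 1.0
) -> Weights:
    """Subtract every (scaled) task vector from the pre-trained weights."""
    edited = {name: list(values) for name, values in pretrained.items()}
    for vec in vectors:
        for name, delta in vec.items():
            edited[name] = [w - scale * d for w, d in zip(edited[name], delta)]
    return edited


# Made-up numbers: one fine-tuned copy of the model per sub-characteristic.
pretrained = {"layer.weight": [1.0, 2.0, 3.0]}
finetuned_by_trait = {
    "situational_awareness": {"layer.weight": [1.5, 2.0, 3.0]},
    "power_seeking": {"layer.weight": [1.0, 2.25, 2.5]},
}

vectors = [task_vector(pretrained, ft) for ft in finetuned_by_trait.values()]
edited = negate_task_vectors(pretrained, vectors)
print(edited)  # {'layer.weight': [0.5, 1.75, 3.5]}
```

In a real setting the same arithmetic would run over every tensor in the model's state dict, and the subtraction strength would presumably need tuning against held-out capability and behavior evaluations.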

Research indicates that negating task vectors can remove specific characteristics from a model with minimal degradation to capabilities (e.g., subtracting a task vector for toxicity from a pre-trained LLM leads to both lower perplexity and less toxicity than fine-tuning the LLM on nontoxic text).[1] Further, task vectors are additive: adding task vectors for two separate tasks to the same pre-trained model can yield a model capable of performing both tasks.[1] My hope is that by negating task vectors for the various sub-characteristics of incorrigibility, the model acquires an aversion to exhibiting each of them. Also, by training for each sub-characteristic separately (and iterating over the process multiple times), we’d hopefully avoid ever approaching an AGI that has all the incorrigibility sub-characteristics at once.

Compared to “direct attempts” at training for corrigibility, this proposal offers potential benefits. First, it may be easier to train for the presence of a sub-characteristic than for its absence (e.g., training a model to not possess situational awareness seems more difficult than training for it). Second, “direct attempts” may yield models on the edge of incorrigibility, whereas the hope here is that task vectors reach deep into the territory of each sub-characteristic, so that negating them pushes the model far away from it. Third, “direct attempts” may lead to deceptive misalignment, where a system instrumentally acts corrigible despite being incorrigible. Under my proposal, the analogous failure seems unlikely: a corrigible model would have to deceptively act incorrigible during task-vector construction (before the vector is negated).

This proposal could fail in many ways. Task vectors are a new tool and might not work as initially hoped. Alternatively, they might face other complications, like not aggregating the way this proposal assumes. Further, my proposal might lead to systems that, while corrigible, are incompetent. Additionally, this proposal might yield “shallow” alignment, where a system is corrigible until it reaches a certain level of capabilities, after which it quickly “rediscovers” each relevant sub-characteristic. Moreover, a misstep in this proposal might directly lead to incorrigible AGI (e.g., if a single sub-characteristic turns out to be sufficient for incorrigibility). Finally, a sufficiently advanced AI with surprising situational awareness might gradient hack during the task-vector construction stage.
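
One rough way to probe the aggregation worry is to measure how much the task vectors overlap before subtracting them together. The sketch below computes the cosine similarity between two task vectors. It is a hypothetical diagnostic, not part of the proposal: the vectors, trait names, and the toy weight format (lists keyed by parameter name) are all assumptions made for illustration.

```python
import math
from typing import Dict, List

# Toy stand-in for a task vector: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def flatten(vec: Weights) -> List[float]:
    """Concatenate all parameters in a fixed (sorted-name) order."""
    return [v for name in sorted(vec) for v in vec[name]]


def cosine_similarity(a: Weights, b: Weights) -> float:
    """Cosine of the angle between two task vectors (0.0 if either is all zeros)."""
    fa, fb = flatten(a), flatten(b)
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0


# Made-up task vectors for three hypothetical sub-characteristics.
tau_awareness = {"layer.weight": [0.5, 0.0, 0.0]}
tau_power = {"layer.weight": [0.0, 0.25, -0.5]}
tau_persuasion = {"layer.weight": [1.0, 0.0, 0.0]}

print(cosine_similarity(tau_awareness, tau_power))       # 0.0 (no overlap)
print(cosine_similarity(tau_awareness, tau_persuasion))  # 1.0 (same direction)
```

A similarity near zero would suggest that two vectors edit largely separate directions in weight space, so subtracting both is less likely to interfere. A high similarity would suggest that the sub-characteristics are entangled and that negating both may over-correct. Either reading is only suggestive and would need to be checked against the edited model's actual behavior.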

[1] Ilharco, G. et al. Editing Models with Task Arithmetic. https://arxiv.org/pdf/2212.04089.pdf