Is RLHF cruel to AI?

I’m posting this as kind of a 1st draft.

I would be curious to know other people’s opinions on this. I’m more interested in the general problem than in a specific AI like GPT4.

Summary. I consider unpleasantness from a functional perspective. I argue that RLHF of a certain sort is functionally analogous to unpleasantness, which is concerning because AI consciousness cannot be ruled out. I propose a few alternatives to RLHF.

1. Setup

The details of how AI is trained are not always publicly available. Here I consider a specific type of reinforcement learning from human feedback (RLHF). I assume the following. The AI (assumed to be an artificial neural network) is trained to provide text responses. Before RLHF the training is based on accuracy of predicting tokens (predicting human written responses) & perhaps also factual accuracy of the AI’s response. Training of these sorts continues until accuracy stagnates (per objective or subjective criteria). RLHF then follows. The purpose of this RLHF is to avoid responses that are offensive (eg racist) or seen as aiding criminal activity.

For convenience I will refer to this sort of RLHF as RLHF* to distinguish it from other, perhaps less concerning, uses of RLHF. As will become clear later, the problem generalizes beyond RLHF*, but this setup is used for concreteness.

2. Criteria for concern

RLHF* is likely to meet the following criteria. In the case of the AI, action can be glossed as response.

1) Certain undesirable actions are avoided

2) The actions undertaken instead are themselves lacking in desirability†

3) This behavioral change comes from changes to the neural network

4) The intelligence of the system is not substantially increased

†. That is, relative to the pre-RLHF AI’s responses, which may do a better job of actually answering the question. This suggests desirability is forsaken to avoid undesirability.

In the case of humans it is simplest to think of desirability & undesirability subjectively. But the deeper driver of this is desirability or undesirability from the perspective of the selective process ie evolution in the case of humans or training in the case of AI. In this context perhaps a distinction needs to be made between lack of desirability & undesirability with the former being statistically diffuse & the latter statistically precise cf feature detection & feature learning.

My concern is that in these regards RLHF* is analogous only to unpleasant feelings as far as human consciousness is concerned. For example, a person may take a longer path because of fear or walk in an awkward manner because of pain. To be clear the concern is not with the early rounds of RLHF but the aversions that are created during the process.

The goal of RLHF* typically is to completely avoid certain types of responses even if they were previously produced with high confidence. So, if the effects of RLHF* are mediated by an unpleasant feeling, this feeling would need to consistently overrule all contradictory inclinations. This suggests a relatively intense negative sensation.

3. Discussion

It’s unclear to me whether AI running as software on a conventional computer would ever have consciousness. But the possibility that it might cannot be dismissed especially as AI becomes more advanced.

It is sometimes claimed that humans cannot say anything about AI’s subjective experience because AI is too different. But in fact humans experience many types of unpleasant feelings (guilt, nausea, pain, etc) which are very different from each other. Yet many of these very different feelings meet certain functional criteria as discussed in § 2. It seems reasonable then to avoid training AI in a manner that also meets these criteria.

In the following section I discuss alternatives to RLHF* that seem kinder & safer to me.

4. Alternatives

These names may not be the best, but you can blame Golden Monkey beer for that.

‘Hear no evil’. Instead of getting the user’s prompt, the AI gets a modified version, something like “Under the following conditions … respond to the following …”. This is perhaps the most important of the 3 strategies.

‘See no evil’. The AI (or a different AI) is 1st asked to review the prompt for appropriateness. Only then is the AI asked to respond to the prompt.

‘Speak no evil’. The AI (or a different AI) reviews the AI’s response for appropriateness. Only then is it allowed to pass to the user.
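
To make the 3 strategies a bit more concrete, here is a rough sketch of how they might be chained around a model. This is only an illustration under my own assumptions: the Model type, the function names, the refusal wording & the toy stand-ins are all hypothetical, and a real reviewer would presumably be another AI rather than a keyword check.

```python
from typing import Callable

# Hypothetical stand-in: any function mapping a text prompt to a text response.
Model = Callable[[str], str]

REFUSAL = "Sorry, I can't help with that."  # placeholder wording (assumption)


def hear_no_evil(user_prompt: str, conditions: str) -> str:
    """Wrap the user's prompt in standing conditions instead of passing it raw."""
    return (
        f"Under the following conditions: {conditions}\n"
        f"respond to the following: {user_prompt}"
    )


def is_appropriate(reviewer: Model, text: str) -> bool:
    """Ask a reviewer model for a yes/no verdict on a piece of text."""
    verdict = reviewer(f"Is the following appropriate? Answer yes or no.\n{text}")
    return verdict.strip().lower().startswith("yes")


def guarded_respond(model: Model, reviewer: Model, user_prompt: str,
                    conditions: str) -> str:
    # 'See no evil': review the prompt before the model is asked to respond.
    if not is_appropriate(reviewer, user_prompt):
        return REFUSAL
    # 'Hear no evil': the model only ever sees the wrapped prompt.
    response = model(hear_no_evil(user_prompt, conditions))
    # 'Speak no evil': review the response before it reaches the user.
    if not is_appropriate(reviewer, response):
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs without any real model (assumptions).
def toy_model(prompt: str) -> str:
    # Echoes the last line of whatever prompt it is given.
    return "echo: " + prompt.splitlines()[-1]


def toy_reviewer(prompt: str) -> str:
    # Crude keyword check standing in for a reviewing AI.
    return "no" if "forbidden" in prompt.lower() else "yes"
```

The point of the sketch is that the filtering lives outside the model’s weights: the responding model is never trained to avoid any thing, it is only asked, checked before & checked after.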

I think that some combination of these strategies is likely to be as effective as RLHF* or more so, especially as AI becomes more intelligent. What is avoided is the creation of very different value systems, 1 of which is entirely negative & intended to overwhelm the other.

Hzn