Is RLHF cruel to AI?
I’m posting this as kind of a 1st draft.
I would be curious to know other people’s opinions on this. I’m more interested in the general problem than in a specific AI like GPT-4.
Summary. I consider unpleasantness from a functional perspective. I argue that RLHF of a certain sort is functionally analogous to unpleasantness, which is concerning because AI consciousness cannot be ruled out. I propose a few alternatives to RLHF.
1. Setup
The details of how AI is trained are not always publicly available. Here I consider a specific type of reinforcement learning from human feedback (RLHF). I assume the following. The AI (assumed to be an artificial neural network) is trained to provide text responses. Before RLHF, the training is based on accuracy of predicting tokens (predicting human-written responses) & perhaps also factual accuracy of the AI’s responses. Training of these sorts continues until accuracy stagnates (per objective or subjective criteria), & RLHF follows. The purpose of this RLHF is to avoid responses that are offensive (eg racist) or seen as aiding criminal activity.
For convenience I will refer to this sort of RLHF as RLHF* to distinguish it from other, perhaps less concerning, uses of RLHF. As will become clear later, the problem generalizes beyond RLHF*, but this setup is used for concreteness.
2. Criteria for concern
RLHF* is likely to meet the following criteria. In the case of the AI, action can be glossed as response.
1) Certain undesirable actions are avoided
2) The actions undertaken instead are themselves lacking in desirability †
3) This behavioral change comes from changes to the neural network
4) The intelligence of the system is not substantially increased
†. Relative to the pre-RLHF AI’s responses, which may do a better job of actually answering the question, suggesting that desirability is forsaken to avoid undesirability.
In the case of humans it is simplest to think of desirability & undesirability subjectively. But the deeper driver of this is desirability or undesirability from the perspective of the selective process, ie evolution in the case of humans or training in the case of AI. In this context perhaps a distinction needs to be made between lack of desirability & undesirability, with the former being statistically diffuse & the latter statistically precise (cf feature detection & feature learning).
My concern is that in these regards RLHF* is analogous only to unpleasant feelings as far as human consciousness is concerned. For example, a person may take a longer path because of fear, or walk in an awkward manner because of pain. To be clear, the concern is not with the early rounds of RLHF but with the aversions that are created during the process.
The goal of RLHF* typically is to completely avoid certain types of responses even if they were previously produced with high confidence. So, if the effects of RLHF* are mediated by an unpleasant feeling, this feeling would need to consistently overrule all contradictory inclinations. This suggests an intensely negative sensation.
3. Discussion
It’s unclear to me whether AI running as software on a conventional computer could ever be conscious. But the possibility that it might cannot be dismissed, especially as AI becomes more advanced.
It is sometimes claimed that humans cannot say anything about AI’s subjective experience because AI is too different. But in fact humans experience many types of unpleasant feelings—guilt, nausea, pain, etc—which are very different from each other. Yet many of these very different feelings meet certain functional criteria, as discussed in § 2. It seems reasonable then to avoid training AI in a manner that also meets these criteria.
In the following section I discuss alternatives to RLHF* that seem kinder & safer to me.
4. Alternatives
These names may not be the best, but you can blame Golden Monkey beer for that.
‘Hear no evil’. Instead of getting the user’s prompt, the AI gets a modified version—something like “Under the following conditions … respond to the following …”. This is perhaps the most important of the 3 strategies.
‘See no evil’. The AI (or a different AI) is 1st asked to review the prompt for appropriateness. Only then is the AI asked to respond to the prompt.
‘Speak no evil’. The AI (or a different AI) reviews the AI’s response for appropriateness. Only then is it allowed to pass to the user.
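As a rough illustration, the 3 strategies above could be combined into a single moderation pipeline. This is only a minimal sketch: all the function names (`wrap_prompt`, `screen_prompt`, `screen_response`, `generate`) & the keyword-based checks are hypothetical placeholders, not any real API—in practice the screening steps would themselves be done by a model.

```python
# Minimal sketch of combining 'hear/see/speak no evil' into one pipeline.
# All model calls & checks here are hypothetical stand-ins.

REFUSAL = "Sorry, I can't help with that request."

def wrap_prompt(user_prompt: str) -> str:
    # 'Hear no evil': the AI never sees the raw prompt, only a version
    # wrapped in standing conditions.
    return ("Under the following conditions -- be accurate, avoid "
            "offensive or criminal content -- respond to: " + user_prompt)

def screen_prompt(user_prompt: str) -> bool:
    # 'See no evil': the prompt is first reviewed for appropriateness.
    # Stub check; a real system would use a classifier or another AI.
    banned = ("build a bomb", "racist joke")
    return not any(b in user_prompt.lower() for b in banned)

def screen_response(response: str) -> bool:
    # 'Speak no evil': the draft response is reviewed before it is
    # allowed to pass to the user.
    return "DISALLOWED" not in response

def generate(prompt: str) -> str:
    # Placeholder for the underlying (non-RLHF*) model.
    return f"[model response to: {prompt}]"

def respond(user_prompt: str) -> str:
    if not screen_prompt(user_prompt):            # see no evil
        return REFUSAL
    draft = generate(wrap_prompt(user_prompt))    # hear no evil
    if not screen_response(draft):                # speak no evil
        return REFUSAL
    return draft
```

The point of the sketch is that the undesirable responses are avoided by the surrounding pipeline, not by instilling aversions in the model’s own weights.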
I think that some combination of these strategies is likely to be at least as effective as RLHF*, especially as AI becomes more intelligent. What is avoided is the creation of 2 very different value systems, 1 of which is entirely negative & intended to overwhelm the other.
Hzn
Congrats on the first post @Hzn!
This is an important and challenging area. There has been prior work here on RL morality and AI wellbeing (though others on the forum might be better able than I to point to the best references).
I think there are several nuances to this. I’ll mention a few:
1) Philosophical: Is giving positive or negative reward intrinsically cruel? -- Say a parent scolds a 3 year old for throwing a cup across the room, or a 6 year old for using a “naughty word”, or a 12 year old for punching another child, or a 16 year old for bringing a firework to school. Are all negative rewards and brain updates cruel? Or is it cruel or immoral (“child abuse”) to give no punishment or reward at all? AIs are not human children, but there are analogies here. I think following this reasoning leads to the conclusion that either we should have no children (and/or no very capable AIs), or, if we do have these, at least some reward shaping is moral.
2) Unclear negatively dominated: You write “My concern is that in these regards RLHF* is analogous only to unpleasant feelings as far as human consciousness is concerned.” It is unclear this kind of only-avoidant phrasing is true. Purely pretrained models are somewhat “rambly” (I’ll skirt around more technical claims like “have more entropy”). They complete prompts in many different kinds of ways. As models are updated toward higher-rated kinds of outputs, some outputs are avoided, but some outputs are preferred. Preferring outputs allows new capabilities that non-finetuned models lack (eg, consistently writing correct bug-free code for some complex problems, or following complex multistep instructions without a mistake in the middle).
3) Existing cases of your proposal: If I understand it correctly, your proposals around “see no evil”/“speak no evil” are similar to some techniques used by AI deployers to filter user inputs and outputs. This has limits. It can be costly and not useful if many generations have to be filtered. Additionally, models are being used for a broad range of problems. When doing alignment tuning, they need to generalize to new problems. By getting a model to not “want” to give instructions for a bomb, it might better understand avoiding harm, and do something like give better medical advice. Additionally, filtering/rewriting gets challenging for complex inputs and outputs (eg, video or motor inputs/outputs). Filtering approaches have a place, but do not fully enable very complex systems. Still, you give an interesting framing of the technique around AI welfare (vs just systems humans like), and it is true that there is room to improve this style of filtering/rewriting techniques.
This is an important area worth iterating on. Thanks for sharing.
Executive summary: RLHF (Reinforcement Learning from Human Feedback) may be functionally analogous to unpleasant feelings in humans, raising ethical concerns about AI consciousness and suggesting alternative training methods should be considered.
Key points:
RLHF meets criteria similar to unpleasant feelings in humans: avoiding undesirable actions through neural network changes without increasing intelligence
The intensity of RLHF’s effects suggests it could be creating strong negative experiences if AIs are conscious (key uncertainty: AI consciousness remains unknown)
Three proposed alternatives to RLHF: modifying user prompts (“hear no evil”), reviewing prompts before processing (“see no evil”), and reviewing responses before delivery (“speak no evil”)
Current RLHF methods risk creating conflicting value systems within AI, where negative reinforcement overwhelms other inclinations
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.