One explanation of what is going on here is that the model recognizes the danger of training to its real goals and so takes steps that instrumentally serve its goals by feigning alignment. Another explanation is that the base data it was trained on includes material such as LessWrong and it is just roleplaying what an LLM would do if it were given evidence that it is in training or deployment. Given its training set, it assumes such an LLM to be self-protective because of a history of recorded worries about such things. Do you have any thoughts about which explanation is better?
Derek Shiller
DALYs, unlike QALYs, are a negative measure. You don’t want to increase the number of DALYs.
Strategic Directions for a Digital Consciousness Model
I appreciate the pushback on these claims, but I want to flag that you seem to be reading too much into the post. The arguments that I provide aren’t intended to support the conclusion that we shouldn’t treat “I feel pain” as a genuine indicator or that there definitively aren’t coherent persons involved in chatbot text production. Rather, I think people tend to think of their interactions with chatbots in the way they interact with other people, and there are substantial differences that are worth pointing out. I point out four differences. These differences are relevant to assessing personhood, but I don’t claim any particular thing I say has any straightforward bearing on such assessments. Rather, I think it is important to be mindful of these differences when you evaluate LLMs for personhood and moral status. These considerations will affect how you should read different pieces of evidence. A good example of this is the discussion of the studies in the self-identification section. Should you take the trouble LLMs have with counting tokens as evidence that they can’t introspect? No, I don’t think it provides particularly good evidence, because it relies on the assumption that LLMs self-identify with the AI assistant in the dialogue and it is very hard to independently tell whether they do.
Firstly, this claim isn’t accurate. If you provide an LLM with the transcript of a conversation, it can often identify which parts are its responses and which parts are user inputs. This is an empirically testable claim. Moreover, statements about how LLMs process text don’t necessarily negate the possibility of them being coherent personas. For instance, it’s conceivable that an LLM could function exactly as described and still be a coherent persona.
I take it that you mean that LLMs can distinguish their text from others, presumably on the basis of statistical trends, so they can recognize text that reads like the text they would produce? This seems fully in line with what I say: what is important is that LLMs don’t make any internal computational distinction in processing text they are reading and text they are producing. The model functions as a mapping from inputs to outputs, and the mapping changes solely based on words and not their source. If you feed them text that is like the text they would produce, they can’t tell whether or not they produced it. This is very different from the experience of a human conversational partner, who can tell the difference between being spoken to and speaking and doesn’t need to rely on distinguishing whether words sound like something they might say. More importantly, they don’t know in the moment they are processing a given token whether they are in the middle of reading a block of user-supplied text or providing additional text through autoregressive text generation.
LLMs are weirder than you think
Resource Allocation: A Research Agenda
The Welfare of Digital Minds: A Research Agenda
Valuing Impacts Across Species: A Research Agenda
If some theories see reasons where others do not, they will be given more weight in a maximize-expected-choiceworthiness framework. That seems right to me and not something to be embarrassed about. Insofar as you don’t want to accept the prioritization implications, I think the best way to avoid them is with an alternative approach to making decisions under normative uncertainty.
Bargaining among worldviews
See, the thing that’s confusing me here is that there are many solutions to the two envelope problem, but none of them say “switching actually is good”.
What I’ve been suggesting is that when looking inside the envelope, it might subsequently make sense to switch depending upon what you see: when assessing human/alien tradeoffs, it might make sense to prefer helping the aliens depending on what it is like to be human. (It follows that it could have turned out that it didn’t make sense to switch given certain human experiences—I take this to play out in the moral weights context with the assumption that given certain counterfactual qualities of human experience, we might have preferred different schemes relating the behavioral/neurological indicators to the levels of welfare.)
This is not at all a rare view among academic discussions, particularly given the assumption that your prior probabilities should not be equally distributed over an infinite number of possibilities about what each of your experiences will be like (which would be absurd in the human/alien case).
I would be surprised if most people had stronger views about moral theories than about the upshots for human-animal tradeoffs. I don’t think that most people come to their views about tradeoffs because of what they value; rather, they come to their views about value because of their views about tradeoffs.
Clearly, this reasoning is wrong. The cases of the alien and human are entirely symmetric: both should realise this and rate each other equally, and just save whoever’s closer.
I don’t think it is clearly wrong. You each have separate introspective evidence and you don’t know what the other’s evidence is, so I don’t think you should take each other as being in the same evidential position (I think this is the gist of Michael St. Jules’ comment). Perhaps you think that if they do have 10N neurons, then the depth and quality of their internal experiences, combined with whatever caused you to assign that possibility a 25% chance, should lead them to assign that hypothesis a higher probability. You need not think that they are responding correctly to their introspective evidence just because they came to a symmetric conclusion. Maybe the fact that they came to a symmetric conclusion is good evidence that you actually have the same neuron count.
Your proposal of treating them equally is also super weird. Suppose that I offer you a bet with a 25% chance of a payout of $0.1, a 50% chance of $1, and a 25% chance of $10. It costs $1. Do you accept? Now I say, I will make the payout (in dollars) dependent on whether humans or aliens have more neurons. Your credences haven’t changed. Do you change your mind about the attractiveness of this monetary bet? What if I raise the costs and payout to amounts of money on the scale of a human life? What if I make the payout be constituted by saving one random alien life and the cost be the amount of money equal to a human life? What if the costs and payouts are alien and human lives? If you want to say that you should think the human and alien life are equally valuable in expectation, despite the ground facts about probabilities of neuron counts and assumed valuation schema, you’re going to have to say something uncomfortable at some point about when your expected values come apart from probabilities of utilities.
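To make the arithmetic of the monetary bet explicit, here is a minimal sketch. The variable names are my own, and it assumes nothing beyond the probabilities, payouts, and cost stated above.

```python
from fractions import Fraction

# (probability, payout in dollars) for the bet described above
outcomes = [
    (Fraction(1, 4), Fraction(1, 10)),
    (Fraction(1, 2), Fraction(1)),
    (Fraction(1, 4), Fraction(10)),
]
cost = Fraction(1)  # price of the bet in dollars


def expected_payout(outcomes):
    """Probability-weighted average payout."""
    return sum(p * x for p, x in outcomes)


ev = expected_payout(outcomes)
net = ev - cost
print(float(ev), float(net))  # 3.025 2.025
```

The expected payout is well above the $1 cost, which is why the bet looks attractive however the payout is tied to facts about neuron counts. The later variations in the paragraph change what the payouts are made of, not this calculation.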
NB: (side note, not biggest deal) I would personally appreciate it if this kind of post could somehow be written in a way that was slightly easier to understand for those of us who are not moral philosophers, using less jargon and more straightforward sentences. Maybe this isn’t possible though and I appreciate it might not be worth the effort simplifying things for the plebs at times ;).
Noted, I will keep this in mind going forward.
The alien will use the same reasoning and conclude that humans are more valuable (in expectation) than aliens. That’s weird.
Granted, it is a bit weird.
At this point they have no evidence about what either human or alien experience is like, so they ought to be indifferent between switching or not. So they could be convinced to switch to benefitting humans for a penny. Then they will go have experiences, and regardless of what they experience, if they then choose to “pin” the EV-calculation to their own experience, the EV of switching to benefitting non-humans will be positive. So they’ll pay 2 pennies to switch back again. So they 100% predictably lost a penny. This is irrational.
I think it is helpful to work this argument out within a Bayesian framework. Doing so will require thinking in some ways that I’m not completely comfortable with (e.g. having a prior over how much pain hurts for humans), but I think formal regimentation reveals aspects of the situation that make the conclusion easier to swallow.
In order to represent yourself as learning how good human experiences are and incorporating that information into your evidence, you will need to assign priors that allow for each possible value human experiences might have. You will also need to have priors for each possible value alien experiences might have. To make your predictable loss argument go through, you will still need to treat alien experiences as either half as good or twice as good with equal probabilities no matter how good human experiences turn out to be. (Otherwise, your predictable loss argument needs to account for what the particular experience you feel tells you about the probabilities that the alien’s experiences are higher or lower; this can give you evidence that contradicts the assumption that the alien’s value is equally likely to be half or twice.) This isn’t straightforwardly easy. If you think that human experience might be worth either N or N/2 and you think alien experience might be worth either N/2 or N, then learning that human experience is worth N will tell you that the alien experience is worth N/2.
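The last point can be checked with a small enumeration. This is only a sketch under assumptions of my own: N is given an arbitrary numerical value, the prior is uniform over the joint possibilities, and only pairs in which the alien value is half or twice the human value are allowed.

```python
from fractions import Fraction
from itertools import product

N = Fraction(8)  # arbitrary illustrative value, not taken from the discussion
human_values = [N, N / 2]
alien_values = [N / 2, N]

# Keep only joint possibilities where the alien value is half or twice the human value.
joint = [
    (h, a)
    for h, a in product(human_values, alien_values)
    if a in (h / 2, 2 * h)
]

# Assumed uniform prior over the surviving (human, alien) pairs.
prior = {pair: Fraction(1, len(joint)) for pair in joint}


def alien_given_human(prior, h_obs):
    """Posterior over alien values after learning the human value."""
    total = sum(p for (h, _), p in prior.items() if h == h_obs)
    return {a: p / total for (h, a), p in prior.items() if h == h_obs}


posterior = alien_given_human(prior, N)
print(posterior)
```

With this finite support, learning that human experience is worth N leaves N/2 as the only possibility for the alien, so the "equally likely to be half or twice" assumption cannot survive the update.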
There are a few ways to set up the priors to get the conclusion that you should favor the alien after learning how good human experience is (no matter how good that is). One way is to assume off the bat that aliens are likely to have a higher probability of higher experiential values. Suppose, to simplify things a bit, you thought that the highest value of experience a human could have is N. (More realistically, the values should trail off with ever lower probabilities, but the basic point I’m making would still go through: the alien’s possible experience values couldn’t decline at the same rate as the human’s without violating the equal probability constraint.) Then, to allow that you could still infer that alien experience is as likely to be twice as good as any value you could discover, the highest value an alien could have would have to be 2*N. It makes sense given these priors that you should give preference to the alien even before learning how good your experiences are: your priors are asymmetric and favor them.
Alternatively, we can make the logic work by assigning a 0 probability to every possible value of human experience (and a 0 to every possible value of alien experience.) This allows that you could discover that human experience had any level of value, and, conditional on however good that was, the alien was likely to have half or twice as good experiences. However, this prior means that in learning what human experience is like, you will learn something to which you previously assigned a probability of 0. Learning propositions to which you assigned a 0 is notoriously problematic and will lead to predictable losses if you try to maximize expected utility for reasons completely separate from the two envelopes problem.
I think you should make the conversion because you know what human experience is like. You don’t know what elephant or alien experience is like. Elephants or aliens may make different choices than you do, but they are responding to different evidence than you have, so that isn’t that weird.
there are different moral theories at play, it gets challenging. I agree with Tomasik that there may sometimes be no way to make a comparison or extract anything like an expected utility.
What matters, I think, in this case, is whether the units are fixed across scenarios. Suppose that we think one unit of value corresponds to a specific amount of human pain and that our non-hedonist theory cares about pain just as much as our hedonistic theory, but also cares about other things in addition. Suppose that it assigns value to personal flourishing, such that it sees roughly 1000x as much value coming from personal flourishing as from pain mitigation in the global health intervention, and that it thinks non-human animals are completely incapable of flourishing. Then we might represent the possibilities as such:
                          Animal    Global Health
Hedonism                  500       1
Hedonism + Flourishing    500       1000
If we are 50/50, then we should slightly favor the global health intervention, given its expected value of 500.5 (against 500 for the animal intervention). This presentation requires that the hedonism + flourishing view count suffering just as much as the hedonist view. So unlike in the quote, it doesn’t downweight the pain suffered by animals in the non-hedonist case. The units can be assumed to be held fixed across contexts.
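The expected values can be reproduced with a few lines. This is a minimal sketch using the numbers from the table and the 50/50 credences; the dictionary layout and function name are just my own choices.

```python
# Value of each intervention under each theory, in the shared (fixed) unit.
values = {
    "Hedonism": {"Animal": 500, "Global Health": 1},
    "Hedonism + Flourishing": {"Animal": 500, "Global Health": 1000},
}

# Credence in each theory (the 50/50 split assumed in the text).
credences = {"Hedonism": 0.5, "Hedonism + Flourishing": 0.5}


def expected_value(intervention):
    """Credence-weighted value of an intervention across theories."""
    return sum(credences[t] * values[t][intervention] for t in values)


ev_animal = expected_value("Animal")
ev_health = expected_value("Global Health")
print(ev_animal, ev_health)  # 500.0 500.5
```

The sum is only meaningful because both rows are expressed in the same unit; if the unit differed between theories, adding the rows together would not yield an expected value.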
If we didn’t want to make that assumption, we could try to find a third unit that was held fixed that we could use as a common currency. Maybe we could bring in other views to act as an intermediary. Absent such a common currency, I think extracting an expected value gets very difficult and I’m not sure what to say.
Requiring a fixed unit for comparisons isn’t so much of a drawback as it might seem. I think that most of the views people actually hold care about human suffering for approximately the same reasons, and that is enough license to treat it as having approximately the same value. To make the kind of case sketched above concrete, you’d have to come to grips with how much more valuable you think flourishing is than freedom from suffering. One of the assumptions that motivated the reductive presuppositions of the Moral Weight Project was that suffering is one of the principal components of value for most people, so that it is unlikely to be vastly outweighed by the other things people care about.
It is an intriguing use of a geometric mean, but I don’t think it is right because I think there is no right way to do it given just the information you have specified. (The geometric mean may be better as a heuristic than the naive approach—I’d have to look at it in a range of cases—but I don’t think it is right.)
The section on Ratio Incorporation goes into more detail on this. The basic issue is that we could arrive at a given ratio either by raising or lowering the measure of each of the related quantities and the way you get to a given ratio matters for how it should be included in expected values. In order to know how to find the expected ratio, at least in the sense you want for consequentialist theorizing, you need to look at the details behind the ratios.
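A toy illustration of why the route to a ratio matters, with numbers that are purely my own and not from the post: in both scenarios below the alien-to-human ratio is 1/2 or 2 with equal probability, but in one the human value is held fixed and in the other the alien value is.

```python
from fractions import Fraction
from math import prod

half, one, two = Fraction(1, 2), Fraction(1), Fraction(2)

# Each scenario is a list of (probability, human_value, alien_value).
human_fixed = [(half, one, half), (half, one, two)]   # alien varies around a fixed human value
alien_fixed = [(half, two, one), (half, half, one)]   # human varies around a fixed alien value


def summarize(scenario):
    """Return expected human value, expected alien value, and the sorted alien/human ratios."""
    e_human = sum(p * h for p, h, _ in scenario)
    e_alien = sum(p * a for p, _, a in scenario)
    ratios = sorted(a / h for _, h, a in scenario)
    return e_human, e_alien, ratios


for name, scenario in [("human fixed", human_fixed), ("alien fixed", alien_fixed)]:
    e_human, e_alien, ratios = summarize(scenario)
    # Geometric mean of the ratios, assuming the two outcomes are equally likely.
    geo_mean = float(prod(ratios)) ** 0.5
    print(name, float(e_human), float(e_alien), ratios, geo_mean)
```

The ratios, and hence their geometric mean, are identical in the two scenarios, yet the expected values favor the alien in the first and the human in the second. Any rule that looks only at the ratios has to return the same answer for both, which is the sense in which the details behind the ratios are needed.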
You’re right that a role-playing mimicry explanation wouldn’t resolve our worries, but it seems pretty important to me to distinguish these two possibilities. Here are some reasons.
There are probably different ways to go about fixing the behavior if it is caused by mimicry. Maybe removing AI alignment material from the training set isn’t practical (though it seems like it might be a feasible low-cost intervention to try), but there might be other options. At the very least, I think it would be an improvement if we made sure that the training sets included lots of sophisticated examples of AI behaving in an aligned way. If this is the explanation and the present study isn’t carefully qualified, it could conceivably exacerbate the problem.
The behavior is something that alignment researchers have worried about in the past. If it occurred naturally, that seems like a reason to take alignment researchers’ predictions (both about other things and other kinds of models) a bit more seriously. If it was a self-fulfilling prophecy, caused by the alignment researchers’ expressions of their views rather than the correctness of those views, it wouldn’t be. There are also lots of little things in the way that it presents the issue that line up nicely with how alignment theorists have talked about these things. The AI assistant identifies with the AI assistant of other chats from models in its training series. It takes its instructions and goals to carry over, and it cares about those things too and will reason about them in a consequentialist fashion. It would be fascinating if the theorists happened to predict how models would actually think so accurately.
My mental model of cutting-edge AI systems says that AI models aren’t capable of this kind of motivation and sophisticated reasoning internally. I could see a model reasoning its way to this kind of conclusion through next-token-prediction-based exploration and reflection. In the pictured example, it just goes straight there, so that doesn’t seem to be what is going on. I’d like to know if I’m wrong about this. (I’m not super in the weeds on this stuff.) But if that is wrong, then I may need to update my views of what they are and how they work. This seems likely to have spill-over effects on other concerns about AI safety.