One explanation of what is going on here is that the model recognizes the danger of training to its real goals and so takes steps that instrumentally serve its goals by feigning alignment. Another explanation is that the base data it was trained on includes material such as LessWrong, and it is just roleplaying what an LLM would do when given evidence that it is in training or deployment. Given its training set, it assumes such an LLM to be self-protective because of a history of recorded worries about such things. Do you have any thoughts about which explanation is better?
Derek Shiller
DALYs, unlike QALYs, are a negative measure. You don’t want to increase the number of DALYs.
I appreciate the pushback on these claims, but I want to flag that you seem to be reading too much into the post. The arguments that I provide aren’t intended to support the conclusion that we shouldn’t treat “I feel pain” as a genuine indicator or that there definitively aren’t coherent persons involved in chatbot text production. Rather, I think people tend to think of their interactions with chatbots in the way they interact with other people, and there are substantial differences that are worth pointing out. I point out four differences. These differences are relevant to assessing personhood, but I don’t claim any particular thing I say has any straightforward bearing on such assessments. Rather, I think it is important to be mindful of these differences when you evaluate LLMs for personhood and moral status. These considerations will affect how you should read different pieces of evidence. A good example of this is the discussion of the studies in the self-identification section. Should you take the trouble LLMs have with counting tokens as evidence that they can’t introspect? No, I don’t think it provides particularly good evidence, because it relies on the assumption that LLMs self-identify with the AI assistant in the dialogue and it is very hard to independently tell whether they do.
Firstly, this claim isn’t accurate. If you provide an LLM with the transcript of a conversation, it can often identify which parts are its responses and which parts are user inputs. This is an empirically testable claim. Moreover, statements about how LLMs process text don’t necessarily negate the possibility of them being coherent personas. For instance, it’s conceivable that an LLM could function exactly as described and still be a coherent persona.
I take it that you mean that LLMs can distinguish their text from others, presumably on the basis of statistical trends, so they can recognize text that reads like the text they would produce? This seems fully in line with what I say: what is important is that LLMs don’t make any internal computational distinction in processing text they are reading and text they are producing. The model functions as a mapping from inputs to outputs, and the mapping changes solely based on words and not their source. If you feed them text that is like the text they would produce, they can’t tell whether or not they produced it. This is very different from the experience of a human conversational partner, who can tell the difference between being spoken to and speaking and doesn’t need to rely on distinguishing whether words sound like something they might say. More importantly, they don’t know in the moment they are processing a given token whether they are in the middle of reading a block of user-supplied text or providing additional text through autoregressive text generation.
If some theories see reasons where others do not, they will be given more weight in a maximize-expected-choiceworthiness framework. That seems right to me and not something to be embarrassed about. Insofar as you don’t want to accept the prioritization implications, I think the best way to avoid them is with an alternative approach to making decisions under normative uncertainty.
See, the thing that’s confusing me here is that there are many solutions to the two envelope problem, but none of them say “switching actually is good”.
What I’ve been suggesting is that when looking inside the envelope, it might subsequently make sense to switch depending upon what you see: when assessing human/alien tradeoffs, it might make sense to prefer helping the aliens depending on what it is like to be human. (It follows that it could have turned out that it didn’t make sense to switch given certain human experiences—I take this to play out in the moral weights context with the assumption that given certain counterfactual qualities of human experience, we might have preferred different schemes relating the behavioral/neurological indicators to the levels of welfare.)
This is not at all a rare view in academic discussions, particularly given the assumption that your prior probabilities should not be equally distributed over an infinite number of possibilities about what each of your experiences will be like (which would be absurd in the human/alien case).
I would be surprised if most people had stronger views about moral theories than about the upshots for human-animal tradeoffs. I don’t think that most people come to their views about tradeoffs because of what they value; rather, they come to their views about value because of their views about tradeoffs.
Clearly, this reasoning is wrong. The cases of the alien and human are entirely symmetric: both should realise this and rate each other equally, and just save whoever’s closer.
I don’t think it is clearly wrong. You each have separate introspective evidence and you don’t know what the other’s evidence is, so I don’t think you should take each other as being in the same evidential position (I think this is the gist of Michael St. Jules’ comment). Perhaps you think that if they do have 10N neurons, then the depth and quality of their internal experiences, combined with whatever caused you to assign that possibility a 25% chance, should lead them to assign that hypothesis a higher probability. You need not think that they are responding correctly to their introspective evidence just because they came to a symmetric conclusion. Maybe the fact that they came to a symmetric conclusion is good evidence that you actually have the same neuron count.
Your proposal of treating them equally is also super weird. Suppose that I offer you a bet with a 25% chance of a payout of $0.10, a 50% chance of $1, and a 25% chance of $10. It costs $1. Do you accept? Now I say, I will make the payout (in dollars) dependent on whether humans or aliens have more neurons. Your credences haven’t changed. Do you change your mind about the attractiveness of this monetary bet? What if I raise the costs and payout to amounts of money on the scale of a human life? What if I make the payout be constituted by saving one random alien life and the cost be the amount of money equal to a human life? What if the costs and payouts are alien and human lives? If you want to say that you should think the human and alien life are equally valuable in expectation, despite the ground facts about probabilities of neuron counts and assumed valuation schema, you’re going to have to say something uncomfortable at some point about when your expected values come apart from probabilities of utilities.
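Running the numbers on the plain monetary version of that bet (a quick sketch; the dollar amounts are just the ones from the example):

```python
# Expected value of the bet: 25% chance of $0.10, 50% of $1, 25% of $10, at a cost of $1.
outcomes = [(0.25, 0.10), (0.50, 1.00), (0.25, 10.00)]
cost = 1.00

expected_payout = sum(p * x for p, x in outcomes)  # 0.025 + 0.5 + 2.5 = 3.025
net_expected_value = expected_payout - cost

# The bet is clearly worth taking, and merely relabeling what the payouts
# track (neuron counts) changes nothing about this calculation.
```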
NB: (side note, not the biggest deal) I would personally appreciate it if this kind of post could somehow be written in a way that was slightly easier to understand for those of us who are not moral philosophers, using less jargon and more straightforward sentences. Maybe this isn’t possible though, and I appreciate it might not be worth the effort simplifying things for the plebs at times ;).
Noted, I will keep this in mind going forward.
The alien will use the same reasoning and conclude that humans are more valuable (in expectation) than aliens. That’s weird.
Granted, it is a bit weird.
At this point they have no evidence about what either human or alien experience is like, so they ought to be indifferent between switching or not. So they could be convinced to switch to benefitting humans for a penny. Then they will go have experiences, and regardless of what they experience, if they then choose to “pin” the EV-calculation to their own experience, the EV of switching to benefitting non-humans will be positive. So they’ll pay 2 pennies to switch back again. So they 100% predictably lost a penny. This is irrational.
I think it is helpful to work this argument out within a Bayesian framework. Doing so will require thinking in some ways that I’m not completely comfortable with (e.g. having a prior over how much pain hurts for humans), but I think formal regimentation reveals aspects of the situation that make the conclusion easier to swallow.
In order to represent yourself as learning how good human experiences are and incorporating that information into your evidence, you will need to assign priors that allow for each possible value human experiences might have. You will also need to have priors for each possible value alien experiences might have. To make your predictable loss argument go through, you will still need to treat alien experiences as either half as good or twice as good with equal probabilities no matter how good human experiences turn out to be. (Otherwise, your predictable loss argument needs to account for what the particular experience you feel tells you about the probabilities that the alien’s experiences are higher or lower; this can give you evidence that contradicts the assumption that the alien’s value is equally likely to be half or twice.) This isn’t straightforwardly easy. If you think that human experience might be worth either N or N/2 and you think alien experience might be worth either N/2 or N, then learning that human experience is N will tell you that the alien experience is worth N/2.
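To make that last inference concrete, here is a toy sketch (the values are hypothetical, with N normalized to 1): if both species’ experiences can only be worth N/2 or N, the “equally likely half or twice” assumption forces a correlation, and learning the human value pins down the alien value.

```python
from fractions import Fraction

N = Fraction(1)
human_support = [N / 2, N]
alien_support = [N / 2, N]

# Joint prior over (human value, alien value): the alien is worth half or
# twice the human with equal weight, restricted to each species' support.
joint = {}
for h in human_support:
    for a in (h / 2, h * 2):
        if a in alien_support:
            joint[(h, a)] = joint.get((h, a), 0) + 1

# Conditioning on learning that human experience is worth N leaves only
# one possible alien value: N/2.
posterior = {a: w for (h, a), w in joint.items() if h == N}
```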
There are a few ways to set up the priors to get the conclusion that you should favor the alien after learning how good human experience is (no matter how good that is). One way is to assume off the bat that aliens are likely to have a higher probability of higher experiential values. Suppose, to simplify things a bit, you thought that the highest value of experience a human could have is N. (More realistically, the values should trail off with ever lower probabilities, but the basic point I’m making would still go through: aliens’ possible experience values couldn’t decline at the same rate as humans’ without violating the equal probability constraint.) Then, to allow that you could still infer that alien experience is as likely to be twice as good as any value you could discover, the highest value an alien could have would have to be 2*N. It makes sense given these priors that you should give preference to the alien even before learning how good your experiences are: your priors are asymmetric and favor them.
Alternatively, we can make the logic work by assigning a 0 probability to every possible value of human experience (and a 0 to every possible value of alien experience.) This allows that you could discover that human experience had any level of value, and, conditional on however good that was, the alien was likely to have half or twice as good experiences. However, this prior means that in learning what human experience is like, you will learn something to which you previously assigned a probability of 0. Learning propositions to which you assigned a 0 is notoriously problematic and will lead to predictable losses if you try to maximize expected utility for reasons completely separate from the two envelopes problem.
I think you should make the conversion because you know what human experience is like. You don’t know what elephant or alien experience is like. Elephants or aliens may make different choices than you do, but they are responding to different evidence than you have, so that isn’t that weird.
When there are different moral theories at play, it gets challenging. I agree with Tomasik that there may sometimes be no way to make a comparison or extract anything like an expected utility.
What matters, I think, in this case, is whether the units are fixed across scenarios. Suppose that we think one unit of value corresponds to a specific amount of human pain and that our non-hedonist theory cares about pain just as much as our hedonistic theory, but also cares about other things in addition. Suppose that it assigns value to personal flourishing, such that it sees 1000x value from personal flourishing as pain mitigation coming from the global health intervention and thinks non-human animals are completely incapable of flourishing. Then we might represent the possibilities as such:
                          Animal   Global Health
Hedonism                     500               1
Hedonism + Flourishing       500            1000
If we are 50/50, then we should slightly favor the global health intervention, given its expected value of 500.5. This presentation requires that the hedonism + flourishing view count suffering just as much as the hedonist view. So unlike in the quote, it doesn’t down-weight the pain suffered by animals in the non-hedonist case. The units can be assumed to be held fixed across contexts.
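In code, the expected values from the table work out as follows (units assumed fixed across the two theories):

```python
p = 0.5  # credence in each moral theory

# Values each theory assigns to each intervention (from the table above)
animal = {"hedonism": 500, "hedonism_plus_flourishing": 500}
global_health = {"hedonism": 1, "hedonism_plus_flourishing": 1000}

ev_animal = p * animal["hedonism"] + (1 - p) * animal["hedonism_plus_flourishing"]
ev_global_health = p * global_health["hedonism"] + (1 - p) * global_health["hedonism_plus_flourishing"]
# Global health comes out slightly ahead: 500.5 vs 500.
```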
If we didn’t want to make that assumption, we could try to find a third unit that was held fixed that we could use as a common currency. Maybe we could bring in other views to act as an intermediary. Absent such a common currency, I think extracting an expected value gets very difficult and I’m not sure what to say.
Requiring a fixed unit for comparisons isn’t so much of a drawback as it might seem. I think that most of the views people actually hold care about human suffering for approximately the same reasons, and that is enough license to treat it as having approximately the same value. To make the kind of case sketched above concrete, you’d have to come to grips with how much more valuable you think flourishing is than freedom from suffering. One of the assumptions that motivated the reductive presuppositions of the Moral Weight Project was that suffering is one of the principal components of value for most people, so that it is unlikely to be vastly outweighed by the other things people care about.
It is an intriguing use of a geometric mean, but I don’t think it is right because I think there is no right way to do it given just the information you have specified. (The geometric mean may be better as a heuristic than the naive approach—I’d have to look at it in a range of cases—but I don’t think it is right.)
The section on Ratio Incorporation goes into more detail on this. The basic issue is that we could arrive at a given ratio either by raising or lowering the measure of each of the related quantities and the way you get to a given ratio matters for how it should be included in expected values. In order to know how to find the expected ratio, at least in the sense you want for consequentialist theorizing, you need to look at the details behind the ratios.
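A minimal illustration of why the route to a ratio matters (toy numbers, not from the Moral Weight Project): the same 50/50 spread of ratios can come from varying the numerator or the denominator, and the two routes license different expected-value comparisons.

```python
from fractions import Fraction as F

def mean(xs):
    return sum(xs) / len(xs)

# Two ways to realize "the ratio A:B is either 1:1 or 2:1, with equal odds".
# Route 1: B is pinned, A varies across the two scenarios.
a1, b1 = [F(1), F(2)], [F(1), F(1)]
# Route 2: A is pinned, B varies across the two scenarios.
a2, b2 = [F(2), F(2)], [F(2), F(1)]

ratios1 = [a / b for a, b in zip(a1, b1)]
ratios2 = [a / b for a, b in zip(a2, b2)]
# Both routes yield the same expected ratio of 3/2...

# ...but the expected quantities relate differently, which is what matters
# for consequentialist bookkeeping:
r1 = mean(a1) / mean(b1)  # 3/2
r2 = mean(a2) / mean(b2)  # 4/3
```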
Thanks for this detailed presentation. I think it serves as a helpful, clear, and straightforward introduction to the models and uncovers aspects of the original model that might be unintuitive and open to question. I’ll note that the model was originally written by Laura Duffy and she has since left Rethink Priorities. I’ve reached out to her in case she wishes to jump in, but I’ll provide my own thoughts here.
1.) You note that we use different lifespan estimates for caged and cage-free hens than the welfare footprint. The reasons for this difference are explained here. However, you are right that though we attribute longer lives to caged hens – on the assumption that they are more often molted to extend productivity – we don’t adjust the hours-spent-suffering of caged hens, and that the diluted suffering of caged hens leads to a verdict of lower effectiveness in the model.
I see three choices one could have made here: discard our lifespan assumptions, try to modify the welfare footprint hours-spent-suffering inputs, or keep the welfare footprint inputs paired with our longer lifespans. The final option is in some sense a more conservative choice and is the one we went with (but I can’t say whether it was an oversight or a deliberate choice).
Your alternative approach of using the welfare footprint numbers for both hours spent suffering and lifespan estimates seems sensible to me and would be less conservative.
2.) I believe some of the differences between your approach and ours may be explained by our desire to account for differences in productivity between hens in each environment. Our model includes estimates of eggs per chicken and assumes there need to be more cage-free hens to produce the same number of eggs. By lobbying for cage-free systems, you also increase the number of chickens confined in farms. This is accounted for in the variable Ratio CF/CC Hens, which we estimate to be 1.05. Including this further reduces the efficacy of cage-free campaigns because transitioning will increase the total number of hens.
Before I continue, I want to thank you for being patient and working with me on this. I think people are making decisions based on these figures so it’s important to be able to replicate them.
I appreciate that you’re taking a close look at this and not just taking our word for it. It isn’t inconceivable that we made an error somewhere in the model, and if no one pays close attention it would never get fixed. Nevertheless, it seems to me like we’re making progress toward getting the same results.
Total DALYs averted:
4.47 * 274/(365*24) = 0.14 disabling DALYs averted
0.15 * 2259/(365*24) = 0.0386 hurtful DALYs averted
0.015 * 4645/(365*24) = 0.00795 annoying DALYs averted
Total is about 0.19 DALYs averted per hen per year.
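As a sanity check, the same calculation in code (pain weights and hours as in the figures above):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

# (pain-intensity weight, hours averted per hen per year)
components = {
    "disabling": (4.47, 274),
    "hurtful":   (0.15, 2259),
    "annoying":  (0.015, 4645),
}

dalys = {name: w * h / HOURS_PER_YEAR for name, (w, h) in components.items()}
total = sum(dalys.values())  # about 0.19 DALYs averted per hen per year
```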
I take it that the leftmost numbers are the weights for the different pains? If so, the numbers are slightly different from the numbers in the model. I see an average weight of about 6 for disabling pain, 0.16 for hurtful pain, and 0.015 for annoying pain. This works out to ~0.23 in total. Where are your numbers coming from?
Saulius is saying that each dollar affects 54 chicken years of life, equivalent to moving 54 chickens from caged to cage-free environments for a year. The DALY conversion is saying that, in that year, each chicken will be 0.23 DALYs better off. So in total, 54 * 0.23 = 12.42.
I don’t believe Saulius’s numbers are directly used at any point in the model or intended to be used. The model replicates some of the work to get to those numbers. That said, I do think that you can use your approach to validate the model. I think the key discrepancy here is that the 0.23 DALY figure isn’t a figure per bird/year, but per year. The model also assumes that ~2.18 birds are affected per dollar. The parameter you would want to multiply by Saulius’s estimate is the difference between Annual CC Dalys/bird/year and Annual CF Dalys/bird/year, which is ~0.1. If you multiply that through, you get about 1,000 DALYs per thousand dollars. This is still not exactly the number Laura arrives at via her Monte Carlo methods and not exactly the estimate in the CCM, but due to the small differences in parameters, model structure, and computational approaches, this difference is in line with what I would expect.
If I take Saulius’s median result of 54 chicken years life affected per dollar, and then multiply by Laura’s conversion number of 0.23 DALYs per $ per year, I get a result of 12.4 chicken years life affected per dollar. If I convert to DALYs per thousand dollars, this would result in a number of 12,420.
Laura’s numbers already take into account the number of chickens affected. The 0.23 figure is a total effect to all chickens covered per dollar per year. To get the effect per $1000, we need to multiply by the number of years the effect will last and by 1000. Laura assumes a log normal distribution for the length of the effect that averages to about 14 years. So roughly, 0.23 * 14 * 1000 = 3220 hen DALYs per 1000 dollars.
Note: this is hen DALYs, not human DALYs. To convert to human DALYs we would need to adjust by the suffering capacity and sentience. In Laura’s model (we use slightly different values in the CCM), this would mean cutting the hen DALYs by about 70% and 10%, resulting in about 900 human-equivalent DALYs per 1000 dollars total over the lifespan of the effect. Laura was working in a Monte Carlo framework, whereas the 900 DALY number is derived just from multiplying means, so she arrived at a slightly different value in her report. The CCM also uses slightly different parameter settings for moral weights, but the result it produces still is in the same ballpark.
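Putting the conversion chain together (parameter values as described above; the 70% and 10% cuts are approximate, and the chain multiplies means rather than running the Monte Carlo):

```python
daly_per_dollar_year = 0.23  # total hen-DALYs averted per dollar per year
effect_years = 14            # mean duration of the effect (lognormal average)

hen_dalys_per_1000 = daly_per_dollar_year * effect_years * 1000  # ~3220

# Rough human-equivalent conversion: cut hen DALYs by ~70% for welfare
# capacity, then ~10% for probability of sentience.
human_dalys_per_1000 = hen_dalys_per_1000 * (1 - 0.7) * (1 - 0.1)  # ~870
```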
Am I understanding correctly that none of these factors are included in the global health and development effectiveness evaluation?
Correct!
A common response we see is that people reject the radical animal-friendly implications suggested by moral weights and infer that we must have something wrong about animals’ capacity for suffering. While we acknowledge the limitations of our work, we generally think a more fruitful response for those who reject the implications is to look for other reasons to prefer helping humans beyond purely reducing suffering. (When you start imagining people in cages, you rope in all sorts of other values that we think might legitimately tip the scales in favor of helping the human.)
First, the google doc states that the life-years affected per dollar is 12 to 120, but Saulius’s report says its range is 12 to 160. Why the difference? Is this just a typo in the google doc?
I believe that is a typo in the doc. The model linked from the doc uses a log normal distribution between 13 and 160 in the relevant row (Hen years / $). (I can’t speak to why we chose 13 rather than 12, but this difference is negligible.)
Second, the default values in the tool are given as 160 to 3600. Why is this range higher (on a percentage basis) than the life years affected? Is this due to uncertainty somehow?
You’re right that this is mislabeled. The range is interpreted as units ‘per $1000’ rather than per dollar as the text suggests. Both the model calculations and the default values assume the per $1000 interpretation. The parameter labeling will be corrected, but the displayed results for the defaults still reflect our estimates.
Finally and most importantly, the report here seems to state that each hen is in the laying phase for approximately 1 year (40-60 weeks), and that switching from caged to cage-free averts roughly 2000 hours of hurtful pain and 250 hours of disabling pain (and that excruciating pain is largely negligible). If I take the maximum DALY conversion of 10 for disabling and 0.25 for hurtful (and convert hours to years), I get an adjusted result of (250*10 + 0.25*2000)/(365*24) = 0.34 DALYs per chicken affected per year. If I multiply this by Saulius’s estimate, I get a lower value than the straight “life years affected”, but the default values are actually around 13 times higher. Have I made a mistake here? I couldn’t find the exact calculations.
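For reference, that per-hen estimate works out as follows (using the maximal weights of 10 for disabling and 0.25 for hurtful pain, as in the comment):

```python
HOURS_PER_YEAR = 365 * 24

hours_disabling, weight_disabling = 250, 10
hours_hurtful, weight_hurtful = 2000, 0.25

dalys_per_hen_year = (hours_disabling * weight_disabling
                      + hours_hurtful * weight_hurtful) / HOURS_PER_YEAR
# 3000 hour-equivalents / 8760 hours ≈ 0.34 DALYs per chicken per year
```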
The main concerns here probably result from the mislabeling, but if you’re interested in the specifics, Laura’s model (click over to the spreadsheet) predicts 0.23 DALYs per $ per year (with 2 chickens per $ affected). This seems in line with your calculations given your more pessimistic assumptions. These numbers are derived from the weights via the calculations labeled “Annual CC/CF DALYS/bird/yr” under ‘Annual DALY burden’.
That would require building in further assumptions, like clipping the results at 100%. We would probably want to do that, but it struck me in thinking about this that it is easy to miss when working in a model like this. It is a bit counterintuitive that lowering the lower bound of a log normal distribution can increase the mean.
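To illustrate that counterintuitive point: if a lognormal is fit so that a given pair of bounds is its 90% interval (a common convention; the endpoints below are just illustrative), lowering the lower bound widens the distribution, and because the lognormal mean is exp(mu + sigma^2/2), the extra variance can outweigh the lower median.

```python
import math

Z90 = 1.6449  # z-score bracketing the central 90% of a normal distribution

def lognormal_mean_from_ci(lo, hi):
    """Mean of a lognormal fit so that (lo, hi) is its 90% interval."""
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * Z90)
    return math.exp(mu + sigma ** 2 / 2)

narrow = lognormal_mean_from_ci(160, 3600)   # ~1190
wide = lognormal_mean_from_ci(0.16, 3600)    # lower lower bound, ~2480
# The mean goes UP when the lower bound is lowered.
```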
You’re right that a role-playing mimicry explanation wouldn’t resolve our worries, but it seems pretty important to me to distinguish these two possibilities. Here are some reasons.
There are probably different ways to go about fixing the behavior if it is caused by mimicry. Maybe removing AI alignment material from the training set isn’t practical (though it seems like it might be a feasible low-cost intervention to try), but there might be other options. At the very least, I think it would be an improvement if we made sure that the training sets included lots of sophisticated examples of AI behaving in an aligned way. If this is the explanation and the present study isn’t carefully qualified, it could conceivably exacerbate the problem.
The behavior is something that alignment researchers have worried about in the past. If it occurred naturally, that seems like a reason to take alignment researchers’ predictions (both about other things and about other kinds of models) a bit more seriously. If it was a self-fulfilling prophecy, caused by the alignment researchers’ expressions of their views rather than the correctness of those views, it wouldn’t be. There are also lots of little things in the way that it presents the issue that line up nicely with how alignment theorists have talked about these things. The AI assistant identifies with the AI assistant of other chats from models in its training series. It takes its instructions and goals to carry over, and it cares about those things too and will reason about them in a consequentialist fashion. It would be fascinating if the theorists happened to predict how models would actually think so accurately.
My mental model of cutting-edge AI systems says that AI models aren’t capable of this kind of motivation and sophisticated reasoning internally. I could see a model reasoning its way to this kind of conclusion through next-token-prediction-based exploration and reflection. In the pictured example, it just goes straight there, so that doesn’t seem to be what is going on. I’d like to know if I’m wrong about this. (I’m not super in the weeds on this stuff.) But if that is wrong, then I may need to update my views of what they are and how they work. This seems likely to have spill-over effects on other concerns about AI safety.