TL;DR: This post didn’t address my concerns about using WELLBYs as the primary measure of how much an intervention increases subjective wellbeing in the short term, so in this comment I explain my reasons for being skeptical of using WELLBYs as the primary way to measure how much an intervention increases actual wellbeing.
~~~~
In Chapter 9 (“Will the Future Be Good or Bad?”) of What We Owe the Future, Will MacAskill briefly discusses life satisfaction surveys and raises some points that make me very skeptical of HLI’s approach of using WELLBYs to evaluate charity cost-effectiveness, even from a very-short-term-ist hedonic utilitarian perspective.
Here’s an excerpt from page 196 of WWOTF, with emphasis added by me:
We can’t assume that [the neutral wellbeing point] is the midpoint of the scale. Indeed, it’s clear that respondents aren’t interpreting the question literally. The best possible life (a 10) for me would be one of constant perfect bliss; the worst possible life (a 0) for me would be one of the most excruciating torture. Compared to these two extremes, perhaps my life, and the lives of everyone today, might vary between 4.9 and 5.1. [William Kiely note: It seems like the values should vary between 1.4-1.6, or around whatever the neutral point is, not around 5, which MacAskill is about to say is not the neutral point.] But, when asked, people tend to spread their scores across the whole range, often giving 10s or 0s. This suggests that people are relativising their answers to what is realistically attainable in their country or the world at present. A study from 2016 found that respondents who gave themselves a 10 out of 10 would often report significant life issues. One 10-out-of-10 respondent mentioned that they had an aortic aneurysm, had had no relationship with their father since his return from prison, had had to take care of their mother until her death, and had been in a horrible marriage for seventeen years.
The relative nature of the scale means that it is difficult to interpret where the neutral point should be, and unfortunately, there have been only two small studies directly addressing this question. Respondents from Ghana and Kenya put the neutral point at 0.6, while one British study places it between 1 and 2. It is difficult to know how other respondents might interpret the neutral point. If we take the UK survey on the neutral point at face value, then between 5 and 10 percent of people in the world have lives that are below neutral. All in all, although they provide by far the most comprehensive data on life satisfaction, life satisfaction surveys mainly provide insights into relative levels of wellbeing across different people, countries, and demographics. They do not provide much guidance on people’s absolute level of wellbeing.
Some context on the relative value of different conscious experiences:
Most people I have talked to think that the negative wellbeing experiences they have had tend to be much worse than their positive wellbeing experiences are good.
In addition to thinking this about typical negative experiences compared to typical positive experiences, most people I talk to also seem to think that the worst experience of their life was several times worse than their best experience was good.
People I talk to seem to disagree significantly on how much better their best experiences are compared to their typical positive experience (again, by “better” I mean only taking into account their own wellbeing, i.e. the value of their conscious experience). Some people I have asked say their best day was maybe only about twice as good as their typical (positive) day; others think their best day (or at least their best hour or best few minutes) is many times better (e.g. ~10-100 times better) than their typical good day (or other unit of time).
In the Effective Altruism Facebook group 2016 poll “How many days of bliss to compensate for 1 day of lava-drowning?” (also see the 2020 version here), we can see that EAs’ beliefs about the relative value of the best possible experience and the worst possible experience span many orders of magnitude. (Actually, answers spanned all orders of magnitude, including “no amount of bliss could compensate” and one person saying that even lava-burning has positive value.)
Given the above context, why 0-10 measures can’t be taken literally...
It seems to me that 0-10 measures taken from subjective wellbeing / life satisfaction surveys clearly cannot be taken literally.
That is, survey respondents are not answering on a linear scale. An improvement from 4 to 5 is not the same as an improvement from 5 to 6.
Respondents’ reports are not comparable to each other’s. One person’s 3 may be better than another’s 7. One person’s 6 may be below neutral wellbeing, while another person’s 2 may be above neutral wellbeing.
The vast majority of respondents’ answers presumably are not even self-consistent either. A “5” report one day is not the same as a “5” report on a different day, even for the same person.
If the neutral wellbeing point is indeed around 1-2 for most people answering the survey, and people’s worst experiences are much worse than their best experiences are good (as many people I’ve talked to have told me), then such surveys clearly fail to capture that improving someone’s worst day to a neutral-wellbeing day can be much better than turning someone’s 2 day into a 10 day. That is, in many cases it’s not true that an improvement from 2 to 10 is four times better than an improvement from 0 to 2, as a WELLBY measurement would suggest. In fact, the opposite may be true, with the improvement from 0 to 2 (or whatever the neutral point is) potentially being five times greater (or even more) than the improvement from 2 to 10. This is a huge discrepancy, and I think it gives reason to believe that using WELLBYs as the primary tool to evaluate how much interventions increase wellbeing will be extremely misleading in many cases.
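To make this concrete, here’s a minimal sketch (a toy Python model of my own; the neutral point of 2 and the 5x weighting below neutral are purely illustrative assumptions, not anyone’s estimates) of how a non-linear scale can flip the ranking that a literal WELLBY reading gives:

```python
# Hypothetical illustration (my assumptions, not HLI's): neutral point at a
# reported 2, and each reported point below neutral is worth 5x as much
# latent wellbeing as a point above neutral.

def latent_wellbeing(score, neutral=2.0, below_weight=5.0):
    """Map a reported 0-10 score to a made-up latent wellbeing value."""
    if score >= neutral:
        return score - neutral                 # 1 latent unit per point above neutral
    return below_weight * (score - neutral)    # 5 latent units per point below

def linear_gain(before, after):
    return after - before                      # what a literal WELLBY reading assumes

def latent_gain(before, after):
    return latent_wellbeing(after) - latent_wellbeing(before)

# Intervention A takes someone from 2 to 10; intervention B from 0 to 2.
for name, (before, after) in {"A (2 -> 10)": (2, 10), "B (0 -> 2)": (0, 2)}.items():
    print(name, "| linear gain:", linear_gain(before, after),
          "| latent gain:", latent_gain(before, after))

# Linear reading: A = 8, B = 2, so A looks 4x as good.
# Latent reading: A = 8, B = 10, so B is actually the better intervention.
```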
What I’m hearing from you
I see:
Lots of research has shown that subjective wellbeing surveys are scientifically valid (e.g. OECD, 2013; Kaiser & Oswald, 2022).
As a layperson, I’ll note that I don’t know what this means.
(My guess, if it’s helpful for you to know, e.g. to improve your future communications to laypeople, is that “scientifically valid” means something like: if we run an RCT in which we give a subjective wellbeing survey to a control group and to another group that we’re intervening on to make happier, we find that the people who are happier give higher numbers on the survey. Then, when we run the study again later, we find consistent results, with people giving scores in approximately the same range for the same intervention, which we interpret to mean that self-reported wellbeing is actually a measurement of something real.)
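If it helps to see what I’m imagining, here is a toy simulation of that guess; everything in it (the groups, the intervention, the numbers) is entirely hypothetical:

```python
import random

# Toy simulation of my guess: two groups take the same 0-10 survey; one
# group has received a (hypothetical) intervention that genuinely makes
# them happier. Reports are noisy but clipped to the 0-10 scale.

def survey(n, true_wellbeing, noise=1.5):
    return [min(10, max(0, random.gauss(true_wellbeing, noise))) for _ in range(n)]

random.seed(0)
for run in range(3):  # rerunning the "study" to check for consistency
    control = survey(200, true_wellbeing=5.0)
    treated = survey(200, true_wellbeing=6.0)  # intervention adds 1 real point
    gap = sum(treated) / len(treated) - sum(control) / len(control)
    print(f"run {run}: treated minus control = {gap:.2f}")

# The gap comes out near 1 point on every rerun; that kind of consistent,
# direction-correct result is what I imagine "measuring something real" means.
```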
Despite not being sure what it means for the surveys to be scientifically valid, I do know that I’m struggling to think of what it could mean such that it would overcome my concerns above about using subjective wellbeing surveys as the main measure of how much an intervention improves wellbeing.
People’s 0-10 subjective wellbeing reports do seem somewhat informative about actual subjective wellbeing: given only two people’s self-reported wellbeing, I’d expect the person with the higher reported wellbeing to have higher actual wellbeing. But there are a host of reasons to think that 1-point increases in self-reported wellbeing don’t correspond to actual wellbeing increasing by some consistent amount (e.g. 1 util), and reading this post didn’t give me reason to think otherwise.
So I still think a cost-effectiveness analysis that uses subjective wellbeing assessments as more than just one small piece of evidence seems very likely to fail to identify which interventions actually increase subjective wellbeing the most. I’d be interested in reading a post from HLI that advocates for their WELLBY approach in light of the sort of concerns mentioned above.
Hello William, thanks for this. I’ve been scratching my head about how best to respond to the concerns you raise.
First, your TL;DR is that this post doesn’t address your concerns about the WELLBY. That’s understandable, not least because that was never the purpose of this post. Here, we aimed to set out our charity recommendations and give a non-technical overview of our work, not get into methodological and technical issues. If you want to know more about the WELLBY approach, I would send you to this recent post instead, where we talk about the method overall, including concerns about neutrality, linearity, and comparability.
Second, on scientific validity, it means that your measure successfully captures what you set out to measure. See e.g. Alexandrova and Haybron (2022) on the concept of validity and its application to wellbeing measures. I’m not going to give you chapter and verse on this.
Regarding linearity and comparability, you’re right that people *could* be using the scale in different ways. But are they? And would it matter if they did? You always get measurement error, whatever you do. An initial response is to point out that if differences are random, they will wash out as ‘noise’. Further, even if something is slightly biased, that wouldn’t make it useless—a bent measuring stick might be better than nothing. The scales don’t need to be literally, exactly linear and comparable to be informative. I’ve looked into this issue previously, as have some others, and at HLI we plan to do more on it: again, see this post. I’m not incredibly worried about these things. Some quick evidence: if you look at a map of global life satisfaction, it’s pretty clear there is a shared scale in general. It would be an issue if, e.g., Iraq gave themselves 9/10. Equally, it’s pretty clear that people can and do use words and numbers in a meaningful and comparable way.
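To illustrate the ‘washes out as noise’ point, here is a toy simulation (my own illustration with made-up numbers, not anything from our actual analyses): idiosyncratic differences in how respondents use the scale average out across a large sample, so a between-group comparison still recovers the real difference:

```python
import random

# Toy model: each respondent reports their true wellbeing plus a personal,
# idiosyncratic scale-use offset. Offsets are random, so they behave like
# noise and average out across a large group.

random.seed(1)

def report(true_level):
    personal_offset = random.gauss(0, 1.0)  # one person's quirky scale use
    return true_level + personal_offset

group_a = [report(6.0) for _ in range(10_000)]  # truly better-off group
group_b = [report(5.0) for _ in range(10_000)]

mean = lambda xs: sum(xs) / len(xs)
print(f"estimated gap: {mean(group_a) - mean(group_b):.3f}")  # close to the true 1.0

# Individual reports are unreliable, but the group comparison still recovers
# the underlying difference; only a *systematic* bias would distort it.
```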
In your MacAskill quotation, MacAskill is attacking a straw man. When people say something is, e.g., “the best”, we don’t mean the best it is logically possible to be. That wouldn’t be helpful. We mean something more like “the best that’s actually possible”, i.e. possible in the real world. That’s how we make language meaningful. But yes, in another recent report, we stress that we need more work on understanding the neutral point.
Finally, the thing I think you’ve really missed in all this is: if we’re not going to use subjective wellbeing surveys to find out how well or badly people’s lives are going, what are we going to use instead? Indeed, MacAskill himself says in the same chapter of What We Owe The Future that you quote from:
You might ask, Who am I to judge what lives are above or below neutral? The sentiment here is a good one. We should be extremely cautious to figure out how good or bad others’ lives are, as it’s so hard to understand the experiences of people with lives very different to one’s own. The answer is to rely primarily on self-reports.
Thank you very much for taking the time to write this detailed reply, Michael! I haven’t read the To WELLBY or not to WELLBY? post, but definitely want to check that out to understand this all better.
I also want to apologize for my language sounding overly critical/harsh in my previous comment. E.g., making my first sentence “This post didn’t address my concerns related to using WELLBYs...” when I knew full well that wasn’t what the post was intending to address was very unfair of me.
I know you’ve put a lot of work into researching the WELLBY approach and are no doubt advancing our frontier of knowledge of how to do good effectively in the process, so I want to acknowledge that I appreciate what you do regardless of any level of skepticism I may still have related to heavily relying on WELLBY measurements as the best way to evaluate impact.
As a final note, I want to clarify that while my previous comment may have made it sound like I was confident that the WELLBY approach was no good, in fact my tone was more reflective of my (low-information) intuitive independent impression, not my all-things-considered view. I think there’s a significant chance that when I read into your research on neutrality, linearity, comparability, etc., I’ll update toward thinking that the WELLBY approach makes considerably more sense than I initially assumed.
Hello William,

Thanks for saying that. Yeah, I couldn’t really understand where you were coming from (and honestly ended up spending 2+ hours drafting a reply).
On reflection, we should probably have done more WELLBY-related referencing in the post, but we were trying to keep the academic side light. In fact, we probably need to recombine our various scratchings on the WELLBY and put them onto a single page on our website—it’s been a lower priority than doing the object-level charity analysis work.
If you’re doing the independent impression thing again, then, as a recipient, it would be really helpful to know that. Then I would have read it more as a friendly “I’m new to this and sceptical and X and Y—what’s going on with those?” and less as a “I’m sceptical, you clearly have no idea what you’re talking about” (which was more-or-less how I initially interpreted it… :) )
Then I would have read it more as a friendly “I’m new to this and sceptical and X and Y—what’s going on with those?” and less as a “I’m sceptical, you clearly have no idea what you’re talking about”
Ah, I’m really sorry I didn’t clarify this!
For the record, you’re clearly an expert on WELLBYs and I’m quite new to thinking about them.
My initial exposure to HLI’s WELLBY approach to evaluating interventions was the post Measuring Good Better, and this post is only my second time reading about WELLBYs. I also know very little about subjective wellbeing surveys: I’ve been asked to report my subjective wellbeing on surveys before, but besides that chapter of WWOTF I’ve basically never read about them.
The rest of this comment is me offering an explanation of what I think happened here:
Scott Alexander has a post called Socratic Grilling that I think offers useful insight into our exchange. In particular, while I absolutely could and should have written my initial comment to be a lot friendlier, I think my comment was essentially an all-at-once example of Socratic grilling (me being the student and you being the teacher). As Scott points out, there’s a known issue with this:
Second, to a hostile observer, it would sound like the student was challenging the teacher. Every time the teacher tried to explain germ theory, the student “pounced” on a supposed inconsistency. When the teacher tried to explain the inconsistency, the student challenged her explanations. At times he almost seems to be mocking the teacher. Without contextual clues – and without an appreciation for how confused young kids can be sometimes – it could sound like this kid is an arrogant know-it-all who thinks he’s checkmated biologists and proven that germ theory can’t possibly be true. Or that he thinks that he, a mere schoolchild, can come up with a novel way to end all sickness forever that nobody else ever thought of.
Later:
Tolerating this is harder than it sounds. Most people can stay helpful for one or two iterations. But most people are bad at explaining things, so one or two iterations isn’t always enough. I’ve had times when I need five or ten question-answer rounds with a teacher in order to understand what they’re telling me. The process sounds a lot like “The thing you just said is obviously wrong”…”no, that explanation you gave doesn’t make sense, you’re still obviously wrong”…”you keep saying the same thing over and over again, and it keeps being obviously wrong”…”no, that’s irrelevant to the point that’s bothering me”…”no, that’s also irrelevant, you keep saying an obviously wrong thing”…”Oh! That word means something totally different from what I thought it meant, now your statement makes total sense.”
But it’s harder even than that. Sometimes there is a vast inferential distance between you and the place where your teacher’s model makes sense, and you need to go through a process as laborious as converting a religious person to a materialist worldview (or vice versa) before the gap gets closed.
When I first read about HLI’s approach in the Measuring Good Better article my reaction was “Huh, this seems like a poor way to evaluate impact given [all the aspects of subjective wellbeing surveys that intuitively seemed problematic to me].”
If I had been talking with you in person about it, I probably would have done a back-and-forth Socratic grilling with you. But I didn’t comment. I then got to this post some weeks later, was hoping it would provide some answer to my concerns, was disappointed that that was not the post’s purpose, and proceeded to write a long comment explaining all my concerns with the WELLBY approach so that you or someone could address them. In short, I dumped a lot of work on you and completely failed to think about how (in Scott’s words) “it would sound like the student was challenging the teacher,” how I could come across as an “arrogant know-it-all who thinks he’s checkmated” you, and how “Tolerating this is harder than it sounds”.
So I’m really sorry about that, and next time I’m tempted to “Socratically grill” someone I will make it a point to actually think about how my comments will be received, so that they come across as friendly.
I’m pretty sure this is wrong: what they cited doesn’t support the claim.

Kaiser and Oswald basically show that there’s a monotonic relationship between subjective feelings and real-world outcomes, and that this relationship is robust.
First of all, monotonicity and cardinality are completely different things, and the difference is pivotal to the claims HLI makes. I’m not entirely sure they know the difference—this is bad (!).

See https://www.pnas.org/doi/10.1073/pnas.2210412119#sec-2
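To spell out the distinction with a minimal sketch (made-up numbers): any strictly increasing transformation preserves orderings, i.e. monotonicity, but it does not preserve the sizes of differences, which is the cardinal information that cost-effectiveness comparisons need:

```python
import math

# Made-up scores on some latent wellbeing scale, and a strictly increasing
# transformation of them (sqrt). Monotonicity is preserved: every ordering
# of the raw scores survives the transform.
raw = [1.0, 4.0, 9.0]
transformed = [math.sqrt(x) for x in raw]  # [1.0, 2.0, 3.0]
assert sorted(raw) == raw and sorted(transformed) == transformed

# Cardinality is not preserved. On the raw scale, the 4 -> 9 improvement
# (5 points) is bigger than the 1 -> 4 improvement (3 points); after the
# transform, both improvements are exactly 1 point.
print(raw[1] - raw[0], raw[2] - raw[1])                                   # 3.0 5.0
print(transformed[1] - transformed[0], transformed[2] - transformed[1])   # 1.0 1.0

# So showing that reports move monotonically with real outcomes does not,
# by itself, justify treating one-point gains as equal-sized units.
```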
A second, more subtle issue is that Kaiser and Oswald’s results are based on pretty well-defined, moderately severe life events in the UK and other wealthy countries. As broad, nonspecific wisdom, we should have much less confidence taking this to other contexts.
I only had time to read the second article, but my guess is that I could make decisive complaints about the first too (80%).

Both comments by this author seemed in bad faith and I’m not going to engage with them.
I started clicking on more links, and it’s not good, my friends.
So there are links to a happiness think tank, and a link to a UK government program for wellbeing.
It’s great that people are using wellbeing metrics as part of complex interventions... which is normal and has been done for decades, everywhere.
But:
These don’t “use this approach”. That these metrics are used for healthcare work and other policy outcomes is not at all sufficient evidence that we can focus on interventions solely targeted at those wellbeing metrics.
Also, these links take time to get anywhere; this is not a good smell.
This isn’t dispositive, but there is a major presentation issue in how would-be authoritative content is being cited. What do we think this is, AI safety? There should be clearer standards of evidence and argument.