UPDATE: Critical Failures in the World Happiness Report’s Model of National Satisfaction
There has been a substantial update since this was first posted. The lead authors of the World Happiness Report have now reached out to me directly. Details at the bottom.
The World Happiness Report (WHR) is currently the best known and most widely accepted source of information on global life satisfaction. Its six-variable model of satisfaction is reproduced in textbooks and has been published in the same form by the United Nations since 2012. However, almost no justification is given for why these six variables are used. In response, I attempted to do the only thing I thought was responsible—do an exhaustive search over 5,000 variables in international datasets, and see empirically what actually predicts satisfaction. I’ve consulted with life satisfaction specialists both in economics departments and major think tanks, and none thought this had been done before.
The variables that are selected by this more rigorous method are dramatically different from those of the WHR, and the resulting model is substantially more accurate both in and out of sample. In particular, the WHR leaves out entire categories of variables on subjects as varied, and as basic, as education, discrimination, and political power. Perhaps most dramatically, the way the WHR presents the data appears to suggest that GDP explains 40% of model variation. I find that, with my measurably more accurate model, it in fact predicts 2.5%.
The graph below ranks the model variables by contribution, which is the amount of the total satisfaction of a country they are estimated to predict. For interpretation, 1.5 points of satisfaction on the 11-point scale is equivalent to the grief felt at the death of a life partner, meaning these numbers may be numerically small, but they are enormously significant behaviorally.
All variables included here were chosen by a penalized regression out of a total of 1,058 viable candidates, after 5,500 variables were examined. (Most had too much missing data to be used and trusted.) They are colored by significance, with even the least significant still marginally significant, and almost all significant at the 0.001 level. I have already gotten extremely positive feedback from academic circles, and have started looking for communities of practice that would find this valuable.
A link to the paper is below:
https://dovecoteinstitute.org/files/Loewi-Life-Satisfaction-2024.pdf
UPDATE: The lead authors of the World Happiness Report have now reached out to me directly. Already this is a shock, as I had no idea if my findings would even be taken seriously. The authors suggested changes to my methods, and I have spent the last few weeks incorporating their suggestions, during which I thought it was only responsible to take the post down. However, having now taken their recommendations into account, I find the results are in every meaningful way identical, and in fact now substantially reinforced. The post and paper have been updated to reflect what changes there were.
I’m surprised you object so strenuously to taking the log of GNI/Capita or GDP/capita. This to me seems like a very natural thing to do, given the diminishing marginal utility of money, and the nature of the distribution of national incomes. If you don’t take logs (or similar) your model will presumably be quite insensitive to the differences between really poor and really really poor. Indeed, I note you yourself use log GDP as the independent variable in a chart on the very next page (p17).
The chart you’re referring to, for those who aren’t looking at the paper, is not a chart of life satisfaction—it is a chart of carbon emissions. So pretty fundamentally, I don’t think the comparison is relevant. But in addition, the outcome (emissions) is also log-transformed, which makes the relationship much more intuitive—and since both axes are clearly labeled, and contain all relevant information, I don’t see any possibility of confusion.
By contrast, the use of log GDP in models of satisfaction appears only to create potential for confusion, especially when the WHR only mentions this transformation in the technical tables, fails to mention it in their most prominent descriptions, and only transforms this single variable in that way. Furthermore, as I describe in quite substantial detail in the paper, the argument they give to justify this transformation (improved model fit) simply doesn’t hold up to statistical scrutiny when you’re using a properly specified model. This isn’t just a marginal technical complaint either: the transformation exaggerates the effect by 1100%.
I agree that there may be non-linearities in the effect, but this is equally possible with any of the other variables. This is worth exploring, but I think it is much more important to make sure the dramatic potential improvements from simple changes are recognized before getting into more subtle, and far more slippery, models, especially when the current model isn’t showing serious issues.
But most critically, the whole goal is to not just assume that we know how life satisfaction works, but rather to let the data tell us. And when the data simply doesn’t support a log transform, I’m not going to include it.
Here’s the OWID charts for life satisfaction vs. GDP/capita. First linear (per the dovecote model):
Now with a log transform to GDP/capita (per the WHR):
I think it is visually clear the empirical relationship is better modelled as log-linear rather than linear. Compared to this, I don’t think the regression diagnostics suggesting non-inferiority of linear GDP (in the context of a model selected from thousands of variables, at least some of which could log-linearly proxy for GDP, cf. Dan_Key’s comment) count for much.
Besides the impact of GDP (2.5% versus 40%), I’d expect which other variables end up being selected also to be sensitive to this analysis choice. Unfortunately, as it is the wrong one, I’d expect (quasi-)omitted variable bias to distort both which variables are included, and their relative contributions, in the dovecote model.
Hi Gregory --
There’s so much to respond to that I’m going to split this up into three parts. Hopefully this will clarify things.
(Preface) On the innate problem of the logarithmic transform
Something that’s getting persistently and rather confusingly lost is the fact that the log transform has inherent and enormous problems in interpretation when it is used in a linear model alongside non-transformed variables. A regular linear parameter describes an additive effect; a log-transformed parameter describes a multiplicative effect. This means that reporting the estimate for the transformed variable right next to the estimate of an untransformed variable, without emphasizing the difference, is much worse than comparing apples and oranges. It’s more like selling apples, and *calling* them oranges. But this is how the WHR reports its results. In the 2022 WHR, the plots of “contributions” are not marked as LOG gdp; they’re just marked as gdp, and they are placed alongside linear parameters as though they are directly comparable. They are not. I consider this, alone, a “critical failure,” and it is one of the central reasons I title my paper that way.
Another way to put this is that log gdp showing up as 40% of model variation is not because that model actually thinks gdp is hugely important — it is a *numerical illusion*, based on innately incomparable quantities. And yet, “GDP” (NOT log gdp) is put at the very head of the 2023 report, as THE variable that matters. The model simply does not support this.
Of course you can do a more responsible job of presentation, but it’s unavoidable that you’ve made your job harder for yourself. In ranking priorities, the goal is to rank things of equal nature, and the log creates a single element that is of an inherently different nature.
But all of this is preamble to the question you raise. My point is, given all the trouble it can cause, there’s got to be a VERY good reason to use the log. So — is there?
On the inappropriateness of the log transform in this context
Unfortunately, your graphs are a fundamentally incomplete, and ultimately misleading, view of the system, because satisfaction and GDP are simply not the only two variables involved. I might even say that accounting for the other variables is the point of the whole exercise. But the important part is that in this particular case, they make all the difference.
The approach you suggest is misleading because 1. it doesn’t take all the relevant variables into account, and 2. it assumes a strong form for GDP based on a critically limited view of the data. But it’s easy to solve both of these problems. This can be done by fitting a Generalized Additive Model (GAM), with the full set of conditioning variables, and a non-parametric spline on the GNI term. This will give it the freedom to be log-shaped if the data supports that—but also any other shape, if that is a better fit. So I fit this model, I take the shape formed by the non-parametric component, and I fit two lines to it: a straight one, and a logarithmic one. They look like this:
(The degree of the spline, or the extent of the waviness, can of course be controlled manually, but this is the degree chosen algorithmically by the model. To force it to curve less will of course bias things even more in the direction of the linear model, because it would be forcing the fit to be simpler.)
Because I don’t want to make definitive mathematical judgments just by looking at things, I assess the models statistically — and the linear one just fits better. R^2 = 0.92, vs. 0.85. They’re close, sure, which I think is entirely consistent with a visual assessment — but log certainly doesn’t dominate, and in fact, is numerically worse, again, like in the first model. So the log is hard to interpret — this part is unavoidable — and empirically, *it’s just numerically worse.* To me, that’s open and shut. I see no reason, whatsoever, to use it, unless you’re actually *trying* to inflate the effect of GDP by 1100%.
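For anyone who wants to try this kind of comparison themselves, here is a minimal sketch of the last step only: fitting a straight line and a logarithmic curve to a given shape and comparing R^2. It does not reproduce the GAM from the paper. The `curve` below is a synthetic stand-in for the fitted spline, and every name and number in it is an assumption made for illustration.

```python
import numpy as np


def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_linear_and_log(x, curve):
    """Least-squares fit of curve ~ a + b*x and curve ~ a + b*ln(x).

    Returns (R^2 of the linear fit, R^2 of the log fit).
    """
    x = np.asarray(x, dtype=float)
    curve = np.asarray(curve, dtype=float)
    linear_fit = np.polyval(np.polyfit(x, curve, 1), x)
    log_x = np.log(x)
    log_fit = np.polyval(np.polyfit(log_x, curve, 1), log_x)
    return r_squared(curve, linear_fit), r_squared(curve, log_fit)


# Synthetic stand-in for the smooth term (NOT the paper's fitted spline).
income = np.linspace(1_000, 60_000, 200)
curve = 4.0 + 3e-5 * income + 0.15 * np.sin(income / 9_000)

r2_linear, r2_log = compare_linear_and_log(income, curve)
print(f"linear R^2 = {r2_linear:.3f}, log R^2 = {r2_log:.3f}")
```

Which form wins depends entirely on the shape you feed in, which is the point: the comparison is made on the estimated shape rather than assumed in advance.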
Now — if you still look at this and, for whatever reason, absolutely refuse to accept any model without a non-linear effect — then fine, we can still make one of those, and we can also do it without making the strong, unnecessary, and practically-incomparable assumptions about the form of the relationship.
Since you’re inherently making the claim that there are fundamental differences in the dynamics of GDP in low-GDP countries vs high-GDP countries, then let’s just look at those countries separately! Looking at the raw plot of satisfaction ~ gdppc (untransformed), we can see a clear elbow at around $5,000. If your eyeball says $10k, that’s fine too, I tried both. But that’s clearly the approximate point where the effects of gdp *seem* to change — based on exactly the kind of visual analysis you’re using. To sketch the basic idea, you might imagine breaking the mass of data down into the following two approximate linear trends:
So then we fit models to each of these subsets of data, in which we now have some visible reason to believe that the sat ~ gdp/ni relationship actually *is* linear, and you get a contribution of 1.2% of model variation for the poor side, and 5.8% for the rich side. Which average out to 3.5%, where my pooled model estimated 2.5%. Yes the slope is steeper on the poor side (about two and a half times as steep), but in neither case does GDP suddenly come anywhere near to looking like the dominant force, or the only way to salvation. Which again, just shouldn’t even be a surprise, because the *apparent* 40% contribution in the WHR model (and I really cannot stress this enough) *is a numerical illusion.*
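The split-sample idea can be sketched in a few lines. This is an illustration on synthetic data, not the model from the paper: the `split_slopes` helper, the $5,000 threshold, and the slopes used to generate the data are all assumptions chosen to mimic the description above.

```python
import numpy as np


def split_slopes(income, satisfaction, threshold):
    """Fit separate straight lines below and above an income threshold.

    Returns (slope on the low-income side, slope on the high-income side).
    """
    income = np.asarray(income, dtype=float)
    satisfaction = np.asarray(satisfaction, dtype=float)
    low = income < threshold
    slope_low, _ = np.polyfit(income[low], satisfaction[low], 1)
    slope_high, _ = np.polyfit(income[~low], satisfaction[~low], 1)
    return slope_low, slope_high


# Synthetic data with an elbow at 5,000 (illustrative numbers only).
rng = np.random.default_rng(0)
income = rng.uniform(500, 50_000, 400)
trend = np.where(
    income < 5_000,
    3.0 + 2.5e-4 * income,
    3.0 + 2.5e-4 * 5_000 + 1.0e-4 * (income - 5_000),
)
satisfaction = trend + rng.normal(0.0, 0.3, income.size)

slope_low, slope_high = split_slopes(income, satisfaction, threshold=5_000)
print(f"low-income slope:  {slope_low:.2e} points per dollar")
print(f"high-income slope: {slope_high:.2e} points per dollar")

# Simple average of the two contributions quoted above (1.2% and 5.8%).
print(f"average contribution: {(1.2 + 5.8) / 2:.1f}%")
```

Moving `threshold` to 10,000, or adding a third segment, only changes which rows go into each fit.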
Want three tranches? By all means. It’s the same. The log is profoundly misleading, utterly unnecessary, and quite simply not the right choice for this data.
On your assertion that my search space is “wrong”
To clear up what appears to be a surprisingly confident misconception, I do have log gdp available for the search algorithm. It does not make the final cut, because in addition to being of an intrinsically different nature than the rest of the variables, it simply does not produce a better model fit, statistically, whether you choose to believe this or not. But the other variables chosen are not influenced by my “wrong” search space, because that is not actually the search space I use. With my apologies, I’m really not sure why you think otherwise.
To be precise, the search with log gdp as a candidate still returns cities, and two variables on education, and public financing, and unemployment, and shadow government, and power for women (though labor market instead of political), and clean water, and gay and lesbian acceptance — along with lifespan, social support, freedom, and GNI, which is still selected 100% of the time, *even though log gdp is also in the model.* Though even *before* it’s removed, log gdp still comes out as less important than social support, once you actually properly specify the model.
Though I’m also not entirely sure why this is even relevant. I find a better model — period. It fits better, it predicts better, there are no serious diagnostic issues, and it doesn’t include log gdp. If you can find a better model than that, have at it — but if you can’t, I’m not sure how bias in the search process would make a difference. It is a simple constructive proof, in which the product of the search just performs better than what’s being currently presented as the international standard.
Thank you for the effort you went to in producing those graphs — and, sincerely, in pushing me to a more thorough defense of my decisions — I hope this helps you see why your approach is not a definitive analysis.
Does your model without log(GNI per capita) basically just include a proxy for log(GNI per capita), by including other predictor variables that, in combination, are highly predictive of log(GNI per capita)?
With a pool of 1058 potential predictor variables, many of which have some relationship to economic development or material standards of living, it wouldn’t be surprising if you could build a model to predict log(GNI per capita) with a very good fit. If that is possible with this pool of variables, and if log(GNI per capita) is linearly predictive of life satisfaction, then if you build a model predicting life satisfaction which can’t include log(GNI per capita), it can instead account for that variance by including the variables that predict log(GNI per capita).
And if you transform log(GNI per capita) into a form whose relationship with life satisfaction is sufficiently non-linear, and build a model which can only account for the linear portion of the relationship between that transformed variable and life satisfaction, then within that linear model those proxy variables might do a much better job than transformed log(GNI per capita) of accounting for the variance in life satisfaction.
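To make the mechanism I have in mind concrete, here is a small simulation. It is a sketch on made-up data, not an analysis of the paper’s dataset: the four “proxy” variables, the coefficients, and the noise levels are all assumptions. It only shows that a handful of variables correlated with log income can stand in for it in a linear model.

```python
import numpy as np


def ols_r2(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(0)
n = 500

# Assumed data-generating process: satisfaction depends on log income only.
log_income = rng.normal(9.0, 1.2, n)
satisfaction = 0.7 * log_income + rng.normal(0.0, 0.5, n)

# Four hypothetical development indicators, each a noisy copy of log income.
proxies = np.column_stack(
    [log_income + rng.normal(0.0, 0.3, n) for _ in range(4)]
)

r2_proxies_to_log = ols_r2(proxies, log_income)
r2_log_only = ols_r2(log_income[:, None], satisfaction)
r2_proxies_only = ols_r2(proxies, satisfaction)

print(f"proxies -> log income:      R^2 = {r2_proxies_to_log:.2f}")
print(f"log income -> satisfaction: R^2 = {r2_log_only:.2f}")
print(f"proxies -> satisfaction:    R^2 = {r2_proxies_only:.2f}")
```

Whether the 14 selected variables behave like this is an empirical question; regressing log(GNI per capita) on them would answer it directly.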
I think the first thing to emphasize is that, even when you do include log(gdp/ni), the measured effect still isn’t all that big. It says that you’ll get an increase of 1.5 points satisfaction … if you almost triple gdp! (I.e. multiply it by 2.7, because that’s just mechanically what it means when you transform a linear predictor by the natural log.) Since that’s either ludicrous or impossible for many countries, there are plenty of cases where it doesn’t even make sense to consider. My largest problem has nothing to do with the non-linearities of the log: if it fit better, great! But 1) it just, simply, numerically, doesn’t and 2) the fact that you have to interpret the log in a fundamentally different way than all the non-transformed variables makes it extraordinarily misleading. You get a bigger bump on the graph, but it’s a bigger bump that means something fundamentally different than all of the other bumps (a multiplicative effect, not an additive one). Then when you include it in charts as if it doesn’t mean something different, you’re floating towards very nasty territory.
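To make the arithmetic explicit, here is a minimal sketch of what a coefficient on ln(income) implies. The function name is hypothetical, and 1.5 is simply the coefficient quoted above. A term of the form beta * ln(income) changes the prediction by beta * ln(ratio) when income is multiplied by `ratio`, so the full 1.5 points only arrives when income is multiplied by e (about 2.72).

```python
import math


def change_from_log_term(beta: float, income_ratio: float) -> float:
    """Predicted change in the outcome when income is multiplied by
    income_ratio, for a model term of the form beta * ln(income)."""
    return beta * math.log(income_ratio)


BETA = 1.5  # coefficient quoted above, per unit of ln(income)

for ratio in (1.1, 2.0, math.e, 10.0):
    change = change_from_log_term(BETA, ratio)
    print(f"income x{ratio:.2f} -> {change:+.2f} points of satisfaction")
```

An untransformed variable with coefficient b, by contrast, moves the prediction by b for every unit added, regardless of the starting level. That difference in meaning is what makes the two kinds of estimates hard to place side by side.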
It is certainly the case that many of the variables are highly collinear, but there are clearly no obvious close proxies in the list. If I removed log(gdp) but introduced log(trading volume) or something, that would be suspicious, but you can see all 14 of the variables that are actually in the model. I would have to be approximating log(gdp) with... water and preschool? The 1,058 variables are searched over, yes, but then 1,044 of them are rejected, and simply don’t enter.
I’m sorry though, I just don’t understand your last paragraph. If the true effect needs a log, then the log should account for that effect. And if the effect is properly transformed, I don’t understand how a different variable would do a better job of accounting for the variance than the true variable. Happy to discuss if you can clarify though.
Getting 1.5 points by 2.7x’ing GDP actually sounds like a lot to me? It predicts that the United States should be 1.9 points ahead of China and China should be 2.0 points ahead of Kenya. It’s very hard to get a 1.9 point improvement in satisfaction by doing anything.
The point is not that 1.5 is a large number, in terms of single variables—it is—the point is that 2.7x is a ridiculous number.
But 1.5 also isn’t such a huge effect within the full scale of what’s measured. The maximum value in the data is just over 8. Even something “huge” like 1.5, out of a total of 8, is less than twenty percent. If, as the more accurate model suggests, GDP is only making up a small piece of the total, then that suggests it’s far more likely that if a country were to take the same effort that would be required to triple their gdp, and put those resources instead into the other variables, they’d get a far larger return.
Can you specify what you mean with “2.7x is a ridiculous number”?
I ask because it does happen that economies grow like that in a fairly short amount of time. For example, since the year 2000:
China’s GDPpc 2.7x’d about 2.6 times
Vietnam’s did it ~2.4 times
Ethiopia’s ~2.1 times
India’s ~1.7 times
Rwanda’s ~1.3 times
The US’s GDPpc is on track to 2.7x from 2000 in about 2029, assuming a 4% annual increase
So I assume you don’t mean something like “2.7x never happens”. Do you mean something more like “it’s hard to find policies that produce 2.7x growth in a reasonable amount of time” or “typically it takes economies decades to 2.7x”?
I think I captured my intended meaning fairly well with my ending comment:
“If, as the more accurate model suggests, GDP is only making up a small piece of the total, then that suggests it’s far more likely that if a country were to take the same effort that would be required to triple their gdp, and put those resources instead into the other variables, they’d get a far larger return.”
If you’re asking because you didn’t find that convincing though, I’m happy to elaborate.
2.7x is almost exactly the amount world gdp per capita has changed in the last 30 years. Obviously some individual countries (e.g. China) have had bigger increases in that window. 30 years isn’t that high in the grand scheme of things; it’s far smaller than most lifetimes.
(EDIT: nvm this is false, the chart said “current dollars” which I thought meant inflation-adjusted, but it’s actually not inflation adjusted)
Others have pointed out that most of the factors you take into account are (often strongly) correlated with GDP per capita. What I think is more important from an econometric perspective is that many of them are caused by GDP per capita. If you’re trying to measure the effect of (average) income on happiness, and you control for almost all the mediators of income’s effect on happiness, of course you’ll find that there is almost no independent effect of income left over!
In my opinion, this analysis, while certainly interesting and useful for other purposes, says nothing at all about the effect of GDP per capita on life satisfaction or happiness.
I think the biggest danger to that reasoning is the premise that they are caused by GDP, and only by gdp, which I quite flatly dispute. At a minimum, gdp-measurable paths are only one way to achieve these components. For example, you can spend a lot of money on cleaning your water sources—or, you can make the choice not to destroy your clean water supplies in the first place. One looks “productive” only because it failed to account for the destruction. Of course exactly the same thing can be said of carbon into the atmosphere, leaded gasoline into the brains of children, etc, etc, but I choose water because it’s in the model.
Any attempt at a defense of GDP, specifically, needs to take into account the fact that it’s just a deeply flawed measure of value. That’s why econ nobelists have been arguing against it for over a decade (and likely much longer, given that whole international reports were being published on it in 2012). So even if it were more predictive than the model suggests, that still wouldn’t address the fact it’s known to be misleading, all on its own, and not something I would spend a lot of time defending on the merits.
Separately though, if you replace “gdp” with “money” (since they’re also very definitely not the same thing) it sounds sort of like you’re saying that if people have money, they can just buy anything else they want, thus money is the only thing that matters—which I could respond to by getting into all of the ways that’s just not accurate, such as the fact that a single person can’t pay for a 1⁄1,000,000th fraction of a national clean water system to get clean water for themselves—
But perhaps the most definitive argument against the unique value of gdp is in simple counterexamples. Between 2005 and 2022, Costa Rica had a higher life satisfaction than the United States, with less than a third of the GDPpc. This simply wouldn’t be possible, if gdp just bought you happiness. Ergo, that simply cannot be the answer.
Well, this seems like something that is actually worth finding out. Because if it is the case that GDP (/ GDP per capita) does have a significant causal influence on one (or more) of them, then you are conditioning on a mediator, (partially) hiding the causal effect of GDP on the outcome. It seems to me like your model assumes that GDP does not have any causal influence on any of these variables, which seems like a pretty strong assumption. Unless I am misunderstanding something.
(ETA: Similarly, if both GDP and life satisfaction causally influence one of the variables, you are conditioning on a collider. That could introduce a spurious negative correlation masking a real correlation between GDP and life satisfaction, via Berkson’s paradox. For example, suppose both life satisfaction and GDP cause social stability. Then, when you stratify by social stability, it would not be surprising to find a spurious negative correlation between GDP and life satisfaction, because a high-social-stability country, if it happens to have relatively low GDP, must have very high life satisfaction in order to achieve high social stability, and vice versa.)
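A quick simulation shows the collider effect. This is a sketch on synthetic data: the variable names and coefficients are assumptions. For simplicity GDP and satisfaction are generated as completely independent, so any association that appears after adjusting for stability is produced purely by the conditioning.

```python
import numpy as np


def ols_coefs(X, y):
    """OLS coefficients of y on X, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 5_000

# Assumed structure: GDP -> stability <- satisfaction (stability is a collider).
gdp = rng.normal(size=n)
satisfaction = rng.normal(size=n)  # independent of gdp by construction
stability = gdp + satisfaction + rng.normal(0.0, 0.5, n)

# Coefficient on GDP without and with the collider in the regression.
b_unadjusted = ols_coefs(gdp[:, None], satisfaction)[1]
b_adjusted = ols_coefs(np.column_stack([gdp, stability]), satisfaction)[1]

print(f"GDP coefficient, no adjustment:          {b_unadjusted:+.2f}")
print(f"GDP coefficient, adjusting for collider: {b_adjusted:+.2f}")
```

In this setup the unadjusted coefficient sits near zero, while the adjusted one is clearly negative even though GDP has no effect on satisfaction at all.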
My understanding of these critiques is that they say either that (1) GDP is not intrinsically valuable, (2) GDP does not measure perfectly anything that we care about, or fails to measure many things that we care about, and/or (3) GDP focuses too narrowly on quantifiable economical transactions.
But if you were to find empirically that GDP causes something we do care about, e.g., life satisfaction, then I don’t understand how those critiques would be relevant? (1) would not be relevant because we don’t care about increasing GDP for its own sake, only in order to increase life satisfaction. (2) would not be relevant because whatever GDP would or would not succeed in measuring, it does measure something, and it would be desirable to increase whatever it measures (since whatever that is, causes life satisfaction). (3) would not be relevant because whatever does or does not go into the measure, again, it does measure something, and it would be desirable to increase whatever it measures.
Your reductio shows that GDP cannot be the only thing that has a causal influence on life satisfaction (assuming measurements are good, etc.). But I don’t think OP or anyone else in this comment section is saying that GDP/wealth/money is the only thing that influences life satisfaction, only at most that it is one thing that has a comparatively strong influence on it. And your counterexample does not disprove that.
Hi Erich, sorry for the delay, and thank you for the very careful response. In order:
Wrt: “… It seems to me like your model assumes that GDP does not have any causal influence on any of these variables,…”
I don’t actually make any such assumption about GDP, and in fact am completely agnostic (for now) about causal dependencies within the graph. I only make the tentative assumption that every variable listed has some causal effect, direct or indirect, on national satisfaction (ergo, it’s not *all* GDP, which is what you accurately quote me as disputing), based on 1) a thorough search being more likely to exclude spurious causes, and 2) expert knowledge. Water, Shelter, Freedom, Friends, Being Accepted — most of these seem pretty unimpeachable. Beyond that I’m actually trying to be especially cautious about proposing particular dependencies because based on my experience with causal systems of even moderate size, the pattern of influences is likely to be spectacularly complicated, and unintuitive. This has certainly been borne out by all of my early explorations with causal discovery tools.
(As an aside, I am very interested in these questions, and continuing to work on them, but my first goal is simply to start with the right set of variables. I think progress on this itself could be a huge improvement over what I currently understand to be the globally accepted standard.)
Wrt “But if you were to find empirically that GDP causes something we do care about, …”
That feels like a reasonably fair description of the arguments with which I’m familiar, but I think there are at least two important nuances. The most simple is that GDP can have not just limited utility, but also horrific externalities — most obvious among them, global warming. It’s essentially your point (2), but with the emphasis that what’s left out can actually be more powerful, and worse, than what’s left in. In other words, even if GDP can cause satisfaction *in the short term,* satisfaction itself actually leaves out the very important question of the future. That’s an inherent shortcoming of the model, but an important strike against the concept. I go into this more in the paper.
The other is that I see “GDP” as practically too vague to be applicable for intervention. You might estimate a causal effect for “GDP,” but that might only be because *one* of the thousand things within the concept actually makes a difference. Then when you go to intervene on a different one of the thousand things, because you identify it is part of “GDP,” you just don’t get the same effect — essentially, because your variables weren’t precisely defined enough. So I’m happy to talk about how economic processes might play critical roles, but I don’t feel comfortable talking about “water” and “the entire economy” as if they have equivalent structural validity. At a minimum, one of them is much more vulnerable to bad accounting practices.
Wrt “But I don’t think OP or anyone else in this comment section is saying that GDP/wealth/money is the only thing that influences life satisfaction,…”
I do agree with you, and agree I misread A. de Vries’s position. Though, while I don’t think anyone has said explicitly that they think GDP is the *only* cause of satisfaction, there have also been almost no explicit proposals of anything that *does* cause satisfaction, *apart* from GDP. So I may have been reading too much between the lines there, but my trying to get some distance from the concept is really driven by a confusion that it’s the only variable we’re talking about. Still, I could have expressed that more cogently.
The factors being only caused by GDP is not at all central to my argument against your analysis, nor is it a claim I would make. The key point is that any causal effect of GDP on happiness must necessarily run through other factors like shelter, clean water, health, etc. Nobody (except economists like me) feels more satisfied with their life as a result of hearing that GDP/cap has gone up by 3% this year.
As such, if you control for all of those mediating factors, it will be literally impossible to find a significant effect of GDP/cap on happiness whether or not such an effect actually exists. If the effect is real, such an analysis would necessarily find a false negative.
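Here is a toy simulation of that false negative. It is a sketch on synthetic data: the single mediator (“water”), the coefficients, and the noise levels are assumptions. GDP affects satisfaction only through the mediator, with a true total effect of 0.8.

```python
import numpy as np


def ols_coefs(X, y):
    """OLS coefficients of y on X, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(2)
n = 5_000

# Assumed causal chain: GDP -> clean water -> satisfaction.
gdp = rng.normal(size=n)
water = 0.8 * gdp + rng.normal(0.0, 0.6, n)
satisfaction = 1.0 * water + rng.normal(0.0, 0.5, n)

# Total effect: regress satisfaction on GDP alone.
total_effect = ols_coefs(gdp[:, None], satisfaction)[1]

# "Direct" effect: the same regression with the mediator controlled for.
direct_effect = ols_coefs(np.column_stack([gdp, water]), satisfaction)[1]

print(f"GDP coefficient, GDP only:        {total_effect:+.2f}")
print(f"GDP coefficient, water included:  {direct_effect:+.2f}")
```

The second coefficient collapses towards zero even though, by construction, raising GDP raises satisfaction. A small GDP coefficient in a model full of mediators is therefore not evidence of a small total effect.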
Similarly, your counterexample would be valid, if I were claiming that GDP is the only factor in happiness. Again, I do not claim that, nor does anyone I know of. There are plenty of factors which account for national average life satisfaction, one of which is GDP. It is perfectly possible for Costa Rica to be higher in other factors and therefore be happier than the US despite a lower GDP.
There are some economists who oppose GDP as a measure of value, and some who support it. If you’re appealing to expertise, there’s a huge difference between consensus view and “some experts agree with me”.
I think “some experts” is fairly misleading when the experts I’m referring to have multiple econ Nobels, led an international working group on the subject in question, for a study commissioned by the president of a G7 country, and the EU is now continuing that path to construct entirely new national indicator systems based on satisfaction. I’m comfortable saying that’s a pretty substantial consensus on the need for alternatives to GDP.
I think your point that a correctly-specified model would have no effect for GDP makes a good deal of sense—or at best, that GDP could be seen as a sort of residual category for all of the consumption not accounted for by explicit variables. This is also an approach that would unambiguously favor my model over the WHR’s, which reports an enormous effect for GDP. Do you see this as just an error term in their model?
But this idea also seems to conflict pretty directly with your assertion that the model “says nothing at all about the effect of GDP per capita on life satisfaction or happiness.” If I’ve succeeded in driving that term to almost zero, then you seem to be suggesting I’ve captured all the relevant effects, and the WHR hasn’t come close.
If you’re saying that GDP only matters to satisfaction via consumption, then there’s still the absolutely enormous question of: consumption of *what*? GDP is about as precise as saying “economic stuff,” so it’s barely a coherent question to even ask how “GDP” affects satisfaction. At the barest minimum, I would say this model clarifies what *parts* of all of the things that are together rolled into the GDP fruitcake are actually counting towards GDP. And this is critical. You shouldn’t be able to build a building, burn it down, and claim you’re helping, because construction counts towards GDP and GDP causes happiness. That’s just a semantic shell game.
But even if you are trying to formulate the model as satisfaction ⇐ consumption ⇐ GDP … you have to deal with the fact that the biggest effect in the model is on social support! That’s just not “economic!” You can look at the chosen variables, and see how much of GDP they actually account for, and see that GDP is outright missing several of the largest measured effects, by its inherent definition. So even if it’s only in the negative, or estimating an upper bound, or pushing the question towards clarifying the relationship between water and GDP, I have a pretty hard time seeing how that says “nothing at all” about the relationship between GDP and satisfaction.
This is very interesting analysis. Just in case you’re not already aware, EA has a Happier Lives Institute that’s specifically focused on research questions involving happiness and Oxford University has the Wellbeing Research Centre. I know Andrew Oswald’s been researching socioeconomic factors behind national happiness measures for at least a couple of decades.
By way of critique, I agree education is a particularly surprising omission from the Gallup survey (particularly when they chose to include a measure of individual generosity) and suspect you’re right that they over-weight GDP per capita (and GNI is more appropriate), though I don’t think they claimed their factors were definitive. It appears your variables actually overlap quite a bit with theirs (e.g. the binary question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” to a significant extent encompasses LGBT rights and female empowerment and to a lesser extent general political freedoms), but there’s definitely benefit to disaggregating GDP/GNI and its positive side effects like being able to afford clean water, safe cooking etc. and some very logical additions like unemployment. I’m also curious whether your “social support” variable was similar to Gallup’s (e.g. answers to a question about personal networks specifically) or whether it encompassed other stuff (e.g. welfare state)?
The challenge with throwing up to 926 noisy and often highly-correlated variables at a relatively small panel data set is that some of the improvements in fit are likely to be spurious, particularly when it leads to odd conclusions like only HIV infections in one particular demographic affecting overall wellbeing. I’m not sure the best performing variables in statistical testing are actually those with the strongest causal relationship.
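The concern about spurious fit can be illustrated with a small simulation. This is only a sketch on synthetic data: the sample size of 150 is an illustrative assumption, not the size of the study's panel, and every candidate variable is pure noise by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the study's actual dimensions).
n_obs, n_candidates = 150, 926

# Every candidate predictor is random noise, unrelated to the outcome.
X = rng.normal(size=(n_obs, n_candidates))
y = rng.normal(size=n_obs)

# Standardize so that the scaled dot product is the Pearson correlation.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
r = Xc.T @ yc / n_obs

# For a one-variable regression, R^2 is the squared correlation.
r2 = r ** 2
print(f"mean single-variable R^2: {r2.mean():.4f}")
print(f"best single-variable R^2: {r2.max():.4f}")
```

Even with no real relationship anywhere in the data, the best of many candidates looks noticeably better than the typical one, which is why selection from a large pool needs some form of out-of-sample or resampling check.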
I’d be particularly cautious of drawing policy conclusions like introducing public campaign financing could be a cheap happiness win from it because I’m not sure there’s any causal relationship with that one at all. I thought it might be a simple case of most countries scoring zero and a small number of particularly well-functioning democracies having positive scores (which I’d still say was likely to be correlation not causation) but actually looking at the original data for the latest year it’s a continuous series of weighted scores that... make very little sense (this surprised me since V-DEM is a meticulously researched and respected dataset, but they don’t use this index in any of their composite indices...). Sure, there are a few well-functioning democracies at the top and an absolute monarchy at the bottom, but in between it’s just bizarre: Switzerland has one of the worst scores, North Korea is in the middle (I suppose the Dear Leader does depend on state funding for his political power...), Cuba does pretty well compared with the UK etc. Why this very noisy series fits the regression better than dozens of other presumably more representative ones I don’t know, but I don’t think it’s anything to do with political races in North Korea being fairer!
I just lost a very long response—I’m trying to comment just to see if this gets eaten too. Hopefully it’s just a moderation thing …
Response to David T (so I don’t forget, too)
(Apologies in advance if you end up seeing two versions of this!)
Hi David, thank you for digging in so deep! Let me respond with the same amount of effort.
First, I was not aware of the EA happiness institute (great!), but I do know about the Oxford centre, because it’s run by one of the authors of the 2023 WHR. I haven’t approached them yet because I wanted to kick the tires as hard as possible on the research first, but I welcome critiques of that strategy.
Re: the report, and its claims: (and please interpret all of my frustration as being very much at the report itself, not you!) While they do not *say* their model is definitive, they unambiguously *act* like it is. For god’s sake, they’re publishing something they call “The World Happiness Report,” and they’ve been doing so for a decade, with exactly the same model. As a researcher I find this particularly galling, because it just feels disingenuous. The data I use has been available for every year of the report’s publication, and yet the 2023 report starts with a table of contents, a brief introduction, and a full-page, full-color, adorable heart-filled cartoon that says GDP is a central factor in satisfaction! Listed, measured, and implied to be first. Whereas my measurably more accurate model finds that GNI (not even GDP!) explains 2.5% of model variation, and is the *eleventh* most important factor in the model. Sure, they have caveats, but they’re very much in the fine print, and they don’t make any difference anyway if there’s no alternative provided.
Next: I agree there is an important overlap, both conceptual and literal, in the variables we use. In particular, for the variables on Feelings of Freedom, Healthy Life Expectancy, and Social Support, I use the WHR data itself (in part because FoF and Support are only *available* from them). And you are right that several of the variables could be considered as pieces of other variables. I try to highlight this in the table towards the end of the paper because I think it’s important — and also to show how many even high-level categories the WHR is missing.
But I am adamant that the real value of models like this is in their ability to *improve* satisfaction, not just describe it — and actionability requires more precision than “Health.” I deeply believe it’s only in the nuances that we can actually see the outlines of a solution. It can also dramatically change the interpretation of the very large but very vague variables. For example, “Social Support” on its own sounds like the moral is “just be nice to people.” However when you realize that the model also has huge contributions from Gay and Lesbian Social Acceptance, and Political Power for Women, the moral becomes “be nice to people … including minorities, not just your own in-group … and actually don’t just be nice to them, but meaningfully share real power.” This has a dramatically different narrative, different policies, and different interventions.
Re: causality specifically, I address that in much more detail in the paper, but since you bring up noisy data I want to emphasize that I put a *lot* of work into making sure that the chosen variables are not noise. I ran dozens of iterations (well, thousands in total!) that randomly omitted rows, and randomly omitted columns, to test the robustness of the results, and the variables I report are the ones that are selected every. single. time. In fact, I have reason to believe that the breadth of the search and the strong filter for robustness make me much *less* susceptible to spurious variables than the WHR. In these thousands of tests, there are two out of six WHR variables that I estimate to be around thirtieth most important to prediction, and so don’t belong anywhere near the top six like they claim. That’s a third of their reported model!
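For readers who want a concrete picture of this kind of check, here is a minimal sketch. It is not the code behind the results above. It assumes a lasso-style penalized regression (written out by hand with NumPy), synthetic data with three real signals in columns 0 to 2, and hypothetical names and settings throughout (`lasso_cd`, `selection_frequency`, `alpha=0.2`, 80% row and column subsamples).

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=50):
    """Lasso by cyclic coordinate descent.

    Minimizes (1/2n)*||y - X b||^2 + alpha*||b||_1.
    Assumes the columns of X are standardized and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()                      # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update.
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta

def selection_frequency(X, y, alpha=0.2, n_runs=50,
                        row_frac=0.8, col_frac=0.8, seed=0):
    """Share of runs in which each variable is selected, given it was included."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)
    included = np.zeros(p)
    for _ in range(n_runs):
        # Randomly omit rows and columns, then refit.
        rows = rng.choice(n, size=int(row_frac * n), replace=False)
        cols = rng.choice(p, size=int(col_frac * p), replace=False)
        Xs = X[np.ix_(rows, cols)]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        ys = y[rows] - y[rows].mean()
        beta = lasso_cd(Xs, ys, alpha)
        included[cols] += 1
        selected[cols] += (beta != 0)
    return selected / np.maximum(included, 1)

# Synthetic example: only columns 0, 1 and 2 truly affect the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 40))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.6 * X[:, 2] + rng.normal(size=400)

freq = selection_frequency(X, y)
print("always selected:", np.flatnonzero(freq == 1.0))
```

The design choice worth noting is the threshold: keeping only variables with a selection frequency of exactly 1.0 is a strict filter, and a looser cutoff would trade robustness for coverage.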
Last, I’m very impressed that you got into the finance data! You’ve inspired me to take a closer look. I would still defend its inclusion like this: it showed up in every iteration of the robustness test, it’s strongly significant in the model, statistics is about aggregates not about North Korea, and I use 1,964 observations over a period of 18 years. (Plus, North Korea is definitely not included in the Gallup World Poll, so it’s moot for these conclusions anyway.) The variable is also completely consistent with the substance of the other VDem variables: people need political power (women’s political power), they need that political power to be meaningful not decorative (no shadow government), and they need a way to *achieve* that political power — hence public financing for elections. For all of these reasons I think it belongs in the model. It’s true that researchers like Nicholas Carnes emphasize that a naive implementation of funding is unlikely to help on its own—but I would argue that none of the discovered variables will be successful if implemented naively.
Still, as above, I think the real contribution comes from the fact that financing provides not just an outcome but a path, if also a reminder that each specific action can only have a small impact. You may say you want Women’s Political Power, okay, great, but how? Significant public financing for campaigns is a concrete resource, with a concrete action, and a concrete objective. In other words, there’s actually an extremely clear causal mechanism for public financing, one that’s much clearer than for many of the other variables. And that’s something that “Healthy Life Expectancy” just doesn’t provide.
And thank you again!
Strongly upvoted. I think someone independently verifying your work would go a long way re: credibility. Might you be up for sharing your code?
Thank you John! There’s a fairly substantial codebase, so it might take me a while to make sure it’s documented and accessible, but now that I know there’s interest I’ll make that a priority.
I’m really excited to see this work! I know they’re not your variables, but having identified some top N variables (and possibly others of interest), it would be really helpful if you could dig out a concrete explanation of how each one was generated. E.g. ‘social support’ seems like it could mean any number of things, possibly different things to different people (my girlfriend suggested it might commonly be interpreted as effectively ‘happiness’, and so not really be telling us anything). ‘Feelings of freedom’ could be similar - ‘shelter’ sounds a bit more robust, but it still seems really important to understand what the number being given is.
For social support I had a very shallow look as far as the Helliwell essay and gave up when I couldn’t find a definition of/survey question representing it in there. I don’t have the bandwidth to dig further, but would really like to understand this better.
Thank you! And, good point, certainly something I should at least put in an appendix. The WHR variables are explained in their statistical appendices, which I link and quote here:
“Social support (or having someone to count on in times of trouble) is the national average of the binary responses (either 0 or 1) to the GWP question ‘If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?’ ”
“Freedom to make life choices is the national average of responses to the GWP question ‘Are you satisfied or dissatisfied with your freedom to choose what you do with your life?’ ”
https://happiness-report.s3.amazonaws.com/2023/WHR+23_Statistical_Appendix.pdf
Re: your girlfriend’s insightful comment, research has been done that demonstrates the question used for satisfaction is widely understood, and understood to be different from what we generally think of as “happiness.” In fact, trying to predict satisfaction with “positive affect” (a technical name for happiness) gets you an R^2 of only 0.27. By contrast, a model with “social support” alone gets you 0.52! So if they’re not even closely correlated, they’re definitely not being widely interpreted as the same thing.
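To make the R^2 comparison concrete, here is a small sketch of how a single-predictor R^2 is computed. The function name `single_predictor_r2` and the synthetic data are assumptions for illustration only; the 0.27 and 0.52 figures are the ones quoted above.

```python
import numpy as np

def single_predictor_r2(x, y):
    """R^2 of an ordinary least squares fit of y on one predictor x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()

# Synthetic data, only to show the calculation.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.7 * x + rng.normal(size=500)

r2 = single_predictor_r2(x, y)
corr = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r2:.3f}, squared correlation = {corr ** 2:.3f}")

# With one predictor, R^2 is the squared correlation, so the quoted
# values translate into implied correlations with satisfaction:
implied_affect = np.sqrt(0.27)    # about 0.52
implied_support = np.sqrt(0.52)   # about 0.72
```

Read this way, the two figures correspond to correlations of roughly 0.52 and 0.72 with satisfaction.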
I get into this more in the paper, but I argue that if social support really is so central, we really need to have a variety of variables on it, like we do for health and the economy. I use this one because it’s the one that’s there, and the effect is huge, but I’d really prefer to be able to draw on and analyze multiple aspects of relationships.
Thank you for the very kind response! The EA forum was an obvious choice, but if anyone has ideas about what organizations or outlets would be interested in these findings, I would be very grateful for suggestions. So far responses have been enthusiastic, but often unsure of what the next step should be.
Amazing work! With only lay understanding of this subject, I am wondering: Do you think the low contribution of GDP in some way might compel us to place less weight on economic growth? I get a bit confused when thinking about this because freedom might in part derive from the freedom to not have to worry about how to pay for necessities, and shelter and clean water require substantial tax revenue to be delivered reliably and at scale.
Thank you! To the first part of your comment, I certainly hope so—Nobel-winning economists Amartya Sen and Joseph Stiglitz edited an international report (supported by dozens more famous researchers) all the way back in 2012 saying that GDP was fundamentally deficient, if not broken, as a national guide. Many European countries have started collecting national satisfaction data as part of a way to fix this problem, but I don’t know how much it’s paid attention to, and I know the US doesn’t even collect this data to begin with.
I think the larger point you’re making is that there might be dependencies between the discovered variables, with which I absolutely agree. In the same way I think it’s dangerous to guess what the right variables are, I think it’s dangerous to guess at exactly what the dependencies are, but I do think it’s critical to understand these relations better. Still, we certainly can’t do that if we don’t even get the variables right, so I believe this is at least a first step in an important direction.