Thank you for doing this survey and analysis. I regret that the feedback from me was primarily critical, and that this reply will follow in a similar vein. But I don’t believe the data from this survey is interpretable in most cases, and I think that the main value of this work is as a cautionary example.
A biased analogy
Suppose you wanted to survey the population of Christians at Oxford: maybe you wanted to know their demographics, the mix of denominations, their beliefs on ‘hot button’ bioethical topics, and things like that.
Suppose you did it by going around the local churches and asking the priests to spread the word to their congregants. The local Catholic church is very excited, and the priest promises to mention it at the end of his sermon; you can’t get through to the Anglican vicar, but the secretary promises she’ll mention it in the next newsletter; the evangelical pastor politely declines.
You get the results, and you find that Christians in Oxford are overwhelmingly Catholic, that they are primarily White and Hispanic, that they tend conservative on most bioethical issues, and that they are particularly opposed to abortion and many forms of contraception.
Surveys and Sampling
Of course, you shouldn’t think that, because this sort of survey is shot through with sampling bias. You’d expect Catholics to be far more likely to respond to the survey than evangelicals, so instead of getting a balanced picture of the ‘Christians in Oxford’ population, you get a picture of ‘primarily Catholics in Oxford, with some others’ – and predictably the ethnicity data and the bioethical beliefs are skewed.
I hope EA is non-denominational (or failing that, ecumenical), but there is a substructure to the EA population – folks who hang around LessWrong tend to be different from those who hang around Giving What We Can, for example. Further, they likely differ in ways the survey is interested in: their gender, their giving, what causes they support, and so on. To survey ‘The Effective Altruism Movement’, the EAs who cluster in both need to be represented proportionately (ditto all the other subgroups).
The original plan (as I understand it) was to obviate the sampling concerns by just sampling the entire population. This was highly over-confident (when has a voluntary survey captured 90%+ of a target population?) and the consequences of its failure to become a de facto ‘EA census’ were significant. The blanket advertising of the survey was taken up by some sources more than others: LessWrong put it on their main page, whilst Giving What We Can didn’t email it around, for example. By analogy with the Catholics and the evangelicals, you would anticipate LWers to be significantly over-sampled versus folks in GWWC (or, indeed, versus many other groups, as I’d guess LW’s ‘reach’ to its membership via its main page is much better than that of many other groups). Consequently, you would predict results like the proportion of EAs who care about AI/x-risk, where most EAs live, or what got them involved in EA to be slanted towards what LWers care about, where LWers live (the Bay Area), or how LWers got involved in EA (LW!).
If the subgroups didn’t differ, we could breathe a sigh of relief. Alas, not so: the subgroups identified by URL differ significantly across a variety of demographic information, and the absolute size of the differences (often 10-20%) makes them practically as well as statistically significant – I’d guess if you compared ‘where you heard about EA’ against URL, you’d see an even bigger difference. This may understate the case: if one moved from 3 groups (LW, EA FB, contacts) to 2 (LW, non-LW), one might see more differences, and the missing variable issues and smaller subgroup sizes mean the point estimates for (e.g.) what proportion of LWers care about x-risk are not that reliable.
Convenience sampling is always dicey because, unlike with probabilistic sampling, any error in a parameter estimate due to bias is not expected to diminish as you increase the sample size. However, the sampling strategy in this case is particularly undesirable as the likely bias runs pretty much parallel to the things you are interested in: you might hope that (for example) the population of the EA Facebook group might not be too slanted in terms of cause selection compared to the ‘real’ EA population, but you could not hope the same of a group like GWWC, LW, CFAR, etc.
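To make the point about sample size concrete, here is a minimal simulation sketch. Every number in it is made up for illustration (a population that is 30% subgroup A, a view held by 80% of A and 20% of everyone else, and a convenience sample in which A makes up 60% of respondents); none of it comes from the survey data.

```python
import random

# Hypothetical population: 30% in subgroup A, 70% elsewhere.
# 80% of A holds some view versus 20% of the rest, so the true share is 0.38.
TRUE_SHARE = 0.30 * 0.8 + 0.70 * 0.2

def simulate(n, seed=0):
    """Return (random_sample_estimate, convenience_estimate) for sample size n."""
    rng = random.Random(seed)

    def draw(p_a):
        # One respondent: first their subgroup, then whether they hold the view.
        in_a = rng.random() < p_a
        return rng.random() < (0.8 if in_a else 0.2)

    # Probabilistic sample: subgroup A turns up at its population rate.
    random_est = sum(draw(0.30) for _ in range(n)) / n
    # Convenience sample (assumed): subgroup A is over-reached, 60% of respondents.
    convenience_est = sum(draw(0.60) for _ in range(n)) / n
    return random_est, convenience_est

for n in (100, 10_000, 200_000):
    r, c = simulate(n)
    print(f"n={n:>7}: random error={abs(r - TRUE_SHARE):.3f}, "
          f"convenience error={abs(c - TRUE_SHARE):.3f}")
```

Under these assumptions the random-sample error shrinks towards zero as n grows, while the convenience-sample estimate settles near 0.56 rather than 0.38: more responses buy precision around the wrong number.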
What makes it particularly problematic is that it is very hard to estimate the ‘size’ of this bias: I wouldn’t be surprised if this survey only oversampled LWers by 5-10%, but I wouldn’t be that surprised if it oversampled LWers by a factor of 3 either. The problem is that any ‘surprise’ I get from the survey mostly goes to adjusting my expectation of how biased it is. Suppose I think ‘EA’ is 50% male and I expect the survey to overestimate the %age male by 15 percentage points. Suppose the survey said EA was 90% male. I am going to be much more uncertain about the degree of over-representation than I am about what I think the ‘true EA male fraction’ is. So the update will be to something like 52% male and the survey overestimating by 38 percentage points (52 + 38 = 90). To the extent I am not an ideal epistemic agent, feeding me difficult-to-interpret data might make my estimates worse, not better.
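The update described above can be sketched with a toy model. This assumes (my assumption, purely for illustration) independent Gaussian priors on the true %age male and on the survey’s bias, with the survey reporting their sum and no other noise. The prior variances of 4 and 46 are hypothetical values chosen so that the bias is far more uncertain than the true fraction.

```python
def update(mu_true, var_true, mu_bias, var_bias, observed):
    """Posterior means of (true value, bias) after observing true + bias,
    assuming independent Gaussian priors and no sampling noise."""
    # How far the survey result is from what the priors predicted.
    surprise = observed - (mu_true + mu_bias)
    # Each quantity absorbs the surprise in proportion to its prior variance.
    weight_true = var_true / (var_true + var_bias)
    post_true = mu_true + weight_true * surprise
    post_bias = mu_bias + (1 - weight_true) * surprise
    return post_true, post_bias

# Prior: 50% male (variance 4); expected overestimate 15 points (variance 46).
# The survey reports 90% male.
post_true, post_bias = update(50, 4, 15, 46, 90)
print(round(post_true, 2), round(post_bias, 2))
```

With these hypothetical variances, only 8% of the 25-point surprise moves the estimate of the true fraction (to about 52), and the rest is attributed to the bias (about 38 points). The less sure one is about the size of the bias, the less the survey can teach about the population.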
To find fault is easy; to plan well, difficult
Science rewards caution and planning; many problems found in analysis could only have been fixed in design, and post-hoc cleaning of data is seldom feasible and still seldomer easy. Further planning could have made the results more interpretable. Survey design has a variety of jargon, like “population definition” and “sampling frame”. More careful discussion of what the target population was and how it was going to be reached could have flagged the sampling bias worry sooner, and likewise how likely a ‘saturation’ strategy was to succeed. As it was, most of the discussion seemed to be focused on grabbing as many people as possible.
Similarly, ‘baking in’ the intended analysis plan with the survey itself would have helped to make sure the data could be analysed in the manner intended (my understanding – correct me if I’m wrong! – is that the planning of exactly what analysis would be done happened after the survey was in the wild). In view of the sampling worries, the analysis was planned to avoid giving aggregate measures sensitive to sampling bias, but instead explore relationships between groups via regression (e.g. what factors predict amount given to charity). However, my understanding is this pre-registered plan had to be abandoned as the data was not amenable. Losing the pre-registered plan for a new one which shares no common elements is regrettable (especially as the new results are very vulnerable to sampling bias), and a bit of a red flag.
On getting better data, and on using data better
Given the above, I think the survey offers extremely unreliable data. I’m not sure I agree with the authors that it is ‘better than nothing’, or better than our intuitions: given most of us are imperfect cognizers, it might lead us further astray about the ‘true nature’ of the EA community. I am pretty confident it is not worth the collective time and energy it has taken: it probably took a couple of hundred hours of the EA community’s time to fill in the surveys, let alone the significant work from the team in terms of design, analysis, etc.
Although some things could not have been helped, I think many things could have, and there were better approaches ex ante:
1) It is always hard to calibrate one’s lack of knowledge about something. But googling things like ‘survey design’, ‘sampling’, and similar is fruitful – if nothing else, the results suggest that ‘doing a survey’ is not always straightforward and easy, and put one on guard for hidden pitfalls. This sort of screening should be particularly encouraged if one isn’t a domain expert: many things in medicine concord with common sense, but some things do not, likewise statistics and analysis, and no doubt likewise many other matters I know even less about.
2) Clever and sensible though the EA community generally is, it may not always be sufficient to ask for feedback on a survey idea and then interpret the lack of response as a tacit green light. Sometimes ‘We need expertise and will not start until we have engaged some’, although more cautious, is also better. I’d anticipate this concern will grow in significance as EAs tackle things ‘further afield’ from their background and training.
3) You did have a relative domain expert raise the sampling concerns to you within a few hours of going live. Laudable though it was that you were responsive to this criticism and (for example) tracked URL data to get a better handle on sampling concerns, invited your critics to review prior drafts and analysis, and mentioned the methodological concerns prominently, it took a little too long to get there. There also seemed to be a fair amount of over-confidence and defensiveness – not only from some members of the survey team, but from others who thought that, although they hadn’t considered X before and didn’t know a huge amount about X, on the basis of summary reflection X wasn’t such a big deal. Calling a pause very early may have been feasible, and may have salvaged the survey from the problems above.
This all comes across as disheartening. I was disheartened too: effective altruism intends to put a strong emphasis on being quantitative, getting robust data, and so forth. Yet when we try to practice what we preach, our efforts leave much to be desired (this survey is not the only – or the worst – example). In the same way good outcomes are not guaranteed by good intentions, good information is not guaranteed by good will and hard work. In some ways we are trailblazers in looking hard at the first problem, but for the second we have the benefit of the bitter experience of the scientists and statisticians who have gone before us. Let us avoid recapitulating their mistakes.
Thanks for sharing such detailed thoughts on this, Greg. It is so useful to have people with significant domain expertise in the community who take the time to carefully explain their concerns.
It’s worth noting there was also significant domain expertise on the survey team.
Why isn’t the survey at least useful count data? It allows me to considerably sharpen my lower bounds on things like total donations and the number of Less Wrong EAs.
I think count data is the much more useful kind to take away even ignoring sampling bias issues, because the data in the survey is over a year old, i.e. even if it were a representative snapshot of EA in early 2014, that snapshot would be of limited use, whereas most counts can safely be assumed to be going up.
I agree the survey can provide useful count data along the lines of providing lower bounds. With a couple of exceptions though, I didn’t find the sort of lower bounds the survey gives hugely surprising or informative – if others found them much more so, great!
Very thoughtful post.
Are there any types of analysis you think could be usefully performed on the data?
One could compare between clusters (or, indeed, see where there are clusters), and these sorts of analyses would be more robust to sampling problems: even if LWers are oversampled compared to animal rights people, one can still see how they differ. Similar things like factor analysis, PCA, etc. could be useful to see whether certain things trend together, especially where folks could pick multiple options.
Given that a regression-style analysis was abandoned, I assume actually performing this sort of work on the data is much easier said than done. If I ever get some spare time I might look at it myself, but I have quite a lot of other things to do...
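As a rough illustration of why between-cluster comparisons hold up better than aggregate figures, consider the sketch below. The group shares and rates are hypothetical, and it assumes that respondents within each cluster are representative of that cluster, which over-sampling a whole cluster does not by itself guarantee.

```python
def pooled_rate(group_shares, group_rates):
    """Overall rate implied by a mix of groups and each group's own rate."""
    return sum(group_shares[g] * group_rates[g] for g in group_rates)

# Hypothetical figures, for illustration only.
rates = {"LW": 0.60, "other": 0.20}            # share in each group prioritising x-risk
population_mix = {"LW": 0.25, "other": 0.75}   # assumed true make-up of the movement
sample_mix = {"LW": 0.50, "other": 0.50}       # LW over-sampled by a factor of 2

true_overall = pooled_rate(population_mix, rates)    # aggregate in the population
sample_overall = pooled_rate(sample_mix, rates)      # aggregate the survey would report
gap_between_groups = rates["LW"] - rates["other"]    # does not depend on the mix

print(f"population aggregate: {true_overall:.2f}")
print(f"survey aggregate:     {sample_overall:.2f}")
print(f"LW vs other gap:      {gap_between_groups:.2f}")
```

The aggregate shifts with the sample mix (roughly 0.30 against 0.40 here), while the difference between the two clusters is the same whichever mix was sampled.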
‘What makes it particularly problematic is that it is very hard to estimate the ‘size’ of this bias’
One approach would be to identify a representative sample of the EA population and circulate among folks in that sample a short survey with a few questions randomly sampled from the original survey. By measuring response discrepancies between surveys (beyond what one would expect if both surveys were representative), one could estimate the size of the sampling bias in the original survey.
ETA: I now see that a proposal along these lines is discussed in the subsection ‘Comparison of the EA Facebook Group to a Random Sample’ of the Appendix. In a follow-up study, the authors of the survey randomly sampled members of the EA Facebook group and compared their responses to those of members of that group in the original survey. However, if one regards the EA Facebook group as a representative sample of the EA population (which seems reasonable to me), one could also compare the responses in the follow-up survey to all responses in the original survey. Although the authors of the survey don’t make this comparison, it could be made easily using the data already collected (though given the small sample size, practically significant differences may not turn out to be statistically significant).
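A minimal sketch of such a comparison, using the standard normal approximation for a difference between two proportions. The counts are hypothetical and not taken from the survey, and the calculation treats the two samples as independent, which would not strictly hold if follow-up respondents had also answered the original survey.

```python
import math

def proportion_gap(successes_a, n_a, successes_b, n_b):
    """Difference in proportions (a minus b) with an approximate 95% interval,
    using the normal approximation and an unpooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    gap = p_a - p_b
    return gap, (gap - 1.96 * se, gap + 1.96 * se)

# Hypothetical counts: 480 of 800 in the original survey give some answer,
# versus 25 of 50 in a small random follow-up sample.
gap, (low, high) = proportion_gap(480, 800, 25, 50)
print(f"gap={gap:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

With only 50 follow-up responses, the interval around a 10-point gap is wide enough to include zero, which is the small-sample caveat above in numerical form.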
I think it’s right to say that the survey was premised on the idea that there is no way to know the true nature of the EA population and no known-to-be-representative sampling frame. If there were such a sampling frame or a known-to-be-representative population, we’d definitely have used that. Beforehand, and a little less so now, I would have strongly expected the EA Facebook group to not be representative. For that reason I think randomly sampling the EA FB group is largely uninformative – and I think that this is now Greg’s view too, though I could be wrong.
I agree that could work, although doing it is not straightforward—for technical reasons, there aren’t many instances where you get added precision by doing a convenience survey ‘on top’ of a random sample, although they do exist.
(Unfortunately, the random FB sample was small, with something like 80% non-response, making it not very helpful for estimating sampling deviation from the ‘true’ population. In some sense the subgroup comparisons do provide some of this information by pointing to different sub-populations – what they cannot provide is a measure of whether these subgroups are being represented proportionally or not. A priori though, that would seem pretty unlikely.)
As David notes, the ‘EA FB group’ is highly unlikely to be a representative sample. But I think it is more plausibly representative along the axes we’d be likely to be interested in for the survey. I’d guess EAs who are into animal rights are not hugely more likely to be on Facebook than those who are into global poverty, for example. (Could there be some effects? Absolutely – I’d guess the FB audience skews young and computer savvy, so maybe folks interested in AI etc. might be more likely to be found there.)
The problem with going to each ‘cluster’ of EAs is that you are effectively sampling parallel rather than orthogonal to your substructure: if you over-sample the young and computer literate, that may not throw off the relative proportions of who lives where or who cares more about poverty than the far future; you’d be much more fearful of this if you oversample a particular EA subculture like LW.
I’d be more inclined to ‘trust’ the proportion data (%age male, %xrisk, %etc) if the survey was ‘just’ of the EA Facebook group, either probabilistically or convenience sampled. Naturally, this would still be very far from perfect, and would not work for all areas (age, for example). (Unfortunately, you cannot just filter the survey and look at those who clicked through via the FB link to construct this data – there are plausibly lots of people who clicked through via LW but would have clicked through via FB if there was no LW link, so ignoring all these responses likely inverts the anticipated bias.)