Thank you for doing this survey and analysis. I regret that the feedback from me was primarily critical, and that this reply will follow in a similar vein. But I don't believe the data from this survey is interpretable in most cases, and I think that the main value of this work is as a cautionary example.
A biased analogy
Suppose you wanted to survey the population of Christians at Oxford: maybe you wanted to know their demographics, the mix of denominations, their beliefs on "hot button" bioethical topics, and things like that.
Suppose you did it by going around the local churches and asking the priests to spread the word to their congregants. The local Catholic church is very excited, and the priest promises to mention it at the end of his sermon; you can't get through to the Anglican vicar, but the secretary promises she'll mention it in the next newsletter; the evangelical pastor politely declines.
You get the results, and you find that Christians in Oxford are overwhelmingly Catholic, that they are primarily White and Hispanic, that they tend conservative on most bioethical issues, and that they are particularly opposed to abortion and many forms of contraception.
Surveys and Sampling
Of course, you shouldn't think that, because this sort of survey is shot through with sampling bias. You'd expect Catholics to be far more likely to respond to the survey than evangelicals, so instead of getting a balanced picture of the "Christians in Oxford" population, you get a picture of a "primarily Catholics in Oxford, with some others" population; predictably, the ethnicity data and the bioethical beliefs are skewed.
I hope EA is non-denominational (or failing that, ecumenical), but there is a substructure to the EA population: folks who hang around LessWrong tend to be different from those who hang around Giving What We Can, for example. Further, they likely differ in ways the survey is interested in: their gender, their giving, what causes they support, and so on. To survey "The Effective Altruism Movement", the EAs who cluster in each group need to be represented proportionately (ditto all the other subgroups).
The original plan (as I understand) was to obviate the sampling concerns by just sampling the entire population. This was highly over-confident (when has a voluntary survey captured 90%+ of a target population?), and the consequences of its failure to become a de facto "EA census" are significant. The blanket advertising of the survey was taken up by some sources more than others: LessWrong put it on their main page, whilst Giving What We Can didn't email it around, for example. Analogous to the Catholics and the evangelicals, you would anticipate LWers to be significantly over-sampled versus folks in GWWC (or, indeed, versus many other groups, as I'd guess LW's "reach" to its membership via its main page is much better than that of many other groups). Consequently, you would predict results like the proportion of EAs who care about AI/x-risk, where most EAs live, or what got them involved in EA to be slanted towards what LWers care about, where LWers live (the Bay Area), and how LWers got involved in EA (LW!).
If the subgroups didn't differ, we could breathe a sigh of relief. Alas, not so: the subgroups identified by URL differ significantly across a variety of demographic measures, and their absolute size (often 10-20%) makes the differences practically as well as statistically significant; I'd guess that if you compared "where you heard about EA" against URL, you'd see an even bigger difference. This may understate the case: if one moved from 3 groups (LW, EA FB, contacts) to 2 (LW, non-LW), one might see more differences, and the missing-variable issues and smaller subgroup sizes mean the point estimates for (e.g.) what proportion of LWers care about x-risk are not that reliable.
Convenience sampling is always dicey because, unlike probabilistic sampling, any error in a parameter estimate due to bias will not be expected to diminish as you increase the sample size. The sampling strategy in this case is particularly undesirable because the likely bias runs pretty much parallel to the things you are interested in: you might hope that (for example) the population of the EA Facebook group is not too slanted in terms of cause selection compared to the "real" EA population, but you could not hope the same of a group like GWWC, LW, CFAR, etc.
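To illustrate the first point, here is a minimal simulation sketch; the population size, the 30% x-risk share, and the 3x response propensity are all hypothetical numbers of my own, not estimates from the survey. The random sample's error shrinks as the sample grows; the convenience sample converges on the wrong answer no matter how large it gets.

```python
# Hedged sketch: biased (convenience) sampling error does not shrink with n.
# All numbers below are invented for illustration, not taken from the survey.
import numpy as np

rng = np.random.default_rng(0)

pop_size = 100_000
cares_xrisk = rng.random(pop_size) < 0.30          # true proportion: 30%

# Suppose x-risk folks are 3x as likely to end up in the convenience sample.
weights = np.where(cares_xrisk, 3.0, 1.0)
probs = weights / weights.sum()

for n in (100, 1_000, 10_000):
    random_sample = rng.choice(pop_size, size=n, replace=False)
    biased_sample = rng.choice(pop_size, size=n, replace=False, p=probs)
    print(n,
          round(cares_xrisk[random_sample].mean(), 3),  # wobbles around 0.30, tightening with n
          round(cares_xrisk[biased_sample].mean(), 3))  # stuck near 0.56 however large n gets
```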
What makes it particularly problematic is that it is very hard to estimate the "size" of this bias: I wouldn't be surprised if this survey oversampled LWers by only 5-10%, but I wouldn't be that surprised if it oversampled LWers by a factor of 3 either. The problem is that any "surprise" I get from the survey mostly goes to adjusting my expectation of how biased it is. Suppose I think "EA" is 50% male and I expect the survey to overestimate the percentage male by 15 points. Suppose the survey said EA was 90% male. I am much more uncertain about the degree of over-representation than I am about the "true EA male fraction", so the update will be to something like 52% male and the survey overestimating by 38 points. To the extent I am not an ideal epistemic agent, feeding me difficult-to-interpret data might make my estimates worse, not better.
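The arithmetic behind that update can be made explicit with a toy Gaussian model; the standard deviations below are my own illustrative choices, and the only point is that the surprise gets split in proportion to the variances, so most of it lands on the quantity you were least sure about (the bias):

```python
# Toy version of the update described above: survey reading = true fraction + bias,
# both Gaussian, noise-free observation. Standard deviations are illustrative guesses.
prior_true, sd_true = 0.50, 0.03   # fairly sure about the true male fraction
prior_bias, sd_bias = 0.15, 0.10   # much less sure how much the survey overestimates

observed = 0.90
surprise = observed - (prior_true + prior_bias)      # 0.25

share_true = sd_true**2 / (sd_true**2 + sd_bias**2)  # fraction of surprise given to 'true'
post_true = prior_true + share_true * surprise       # ~0.52
post_bias = prior_bias + (1 - share_true) * surprise # ~0.38
print(round(post_true, 2), round(post_bias, 2))      # 0.52 0.38
```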
To find fault is easy; to plan well, difficult
Science rewards caution and planning; many problems found in analysis could only have been fixed in design, and post-hoc cleaning of data is seldom feasible and still seldomer easy. Further planning could have made the results more interpretable. Survey design has a variety of jargon for this, like "population definition" and "sampling frame". More careful discussion of what the target population was and how it was going to be reached could have flagged the sampling bias worry sooner, likewise how likely a "saturation" strategy was to succeed. As it was, most of the discussion seemed to be focused on grabbing as many people as possible.
Similarly, "baking in" the intended analysis plan with the survey itself would have helped make sure the data could be analysed in the manner intended (my understanding, correct me if I'm wrong, is that the planning of exactly what analysis would be done happened after the survey was in the wild). In view of the sampling worries, the analysis was planned to avoid giving aggregate measures sensitive to sampling bias and instead to explore relationships between groups via regression (e.g. what factors predict the amount given to charity). However, my understanding is that this pre-registered plan had to be abandoned because the data was not amenable. Losing the pre-registered plan for a new one which shares no common elements is regrettable (especially as the new results are very vulnerable to sampling bias), and a bit of a red flag.
On getting better data, and on using data better
Given the above, I think the survey offers extremely unreliable data. I'm not sure I agree with the authors that it is "better than nothing", or better than our intuitions: given most of us are imperfect cognizers, it might lead us further astray from the "true nature" of the EA community. I am pretty confident it is not worth the collective time and energy it has taken: it probably took a couple of hundred hours of the EA community's time to fill in the surveys, let alone the significant work from the team in terms of design, analysis, etc.
Although some things could not have been helped, I think many things could have, and there were better approaches ex ante:
1) It is always hard to calibrate one's lack of knowledge about something. But googling things like "survey design", "sampling", and similar is fruitful: if nothing else, it suggests that "doing a survey" is not always straightforward and easy, and puts one on guard for hidden pitfalls. This sort of screening should be particularly encouraged if one isn't a domain expert: many things in medicine concord with common sense, but some things do not, likewise statistics and analysis, and no doubt likewise many other matters I know even less about.
2) Clever and sensible as the EA community generally is, it may not always be sufficient to ask for feedback on a survey idea and then interpret the lack of response as a tacit green light. Sometimes "We need expertise and will not start until we have engaged some", although more cautious, is also better. I'd anticipate this concern will grow in significance as EAs tackle things "further afield" from their background and training.
3) You did have a relative domain expert raise the sampling concerns with you within a few hours of going live. Laudable though it was that you were responsive to this criticism and (for example) tracked URL data to get a better handle on sampling concerns, invited your critics to review prior drafts and analysis, and mentioned the methodological concerns prominently, it took a little too long to get there. There also seemed a fair amount of over-confidence and defensiveness, not only from some members of the survey team, but from others who thought that, although they hadn't considered X before and didn't know a huge amount about X, on the basis of summary reflection X wasn't such a big deal. Calling a pause very early may have been feasible, and may have salvaged the survey from the problems above.
This all comes across as disheartening. I was disheartened too: effective altruism intends to put a strong emphasis on being quantitative, getting robust data, and so forth. Yet when we try to practice what we preach, our efforts leave much to be desired (this survey is not the only, or the worst, example). In the same way good outcomes are not guaranteed by good intentions, good information is not guaranteed by good will and hard work. In some ways we are trailblazers in looking hard at the first problem, but for the second we have the benefit of the bitter experience of the scientists and statisticians who have gone before us. Let us avoid recapitulating their mistakes.
Thanks for sharing such detailed thoughts on this, Greg. It is so useful to have people with significant domain expertise in the community who take the time to carefully explain their concerns.
It's worth noting there was also significant domain expertise on the survey team.
Why isn't the survey at least useful as count data? It allows me to considerably sharpen my lower bounds on things like total donations and the number of LessWrong EAs.
I think count data is the much more useful kind to take away even ignoring sampling bias issues, because the data in the survey is over a year old: even if it were a representative snapshot of EA in early 2014, that snapshot would be of limited use, whereas most counts can safely be assumed to be going up.
I agree the survey can provide useful count data along the lines of providing lower bounds. With a couple of exceptions, though, I didn't find the sort of lower bounds the survey gives hugely surprising or informative; if others found them much more so, great!
Very thoughtful post.
Are there any types of analysis you think could be usefully performed on the data?
One could compare between clusters (or, indeed, see where the clusters are), and these sorts of analyses would be more robust to sampling problems: even if LWers are oversampled compared to animal rights people, one can still see how they differ. Similarly, things like factor analysis and PCA could be useful to see whether certain things trend together, especially where folks could pick multiple options.
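For concreteness, here is a sketch of what the "do certain things trend together" analysis might look like on a made-up respondents-by-causes indicator matrix; the cause names and the data below are placeholders of mine, not the survey's.

```python
# Hedged sketch of a PCA on multi-select cause choices; the data are fabricated
# placeholders purely to show the mechanics, not results from the survey.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
causes = ["poverty", "animals", "x-risk", "meta", "climate"]

# 500 hypothetical respondents, each cause marked 0/1 (several can be picked).
X = rng.integers(0, 2, size=(500, len(causes))).astype(float)

pca = PCA(n_components=2)
pca.fit(X)  # PCA centres the data itself

# Loadings: causes with large same-signed weights on a component trend together.
for i, component in enumerate(pca.components_):
    print(f"PC{i + 1}:", dict(zip(causes, component.round(2))))
print("explained variance:", pca.explained_variance_ratio_.round(2))
```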
Given that a regression-style analysis was abandoned, I assume actually performing this sort of work on the data is much easier said than done. If I ever get some spare time I might look at it myself, but I have quite a lot of other things to do...
"What makes it particularly problematic is that it is very hard to estimate the 'size' of this bias"

One approach would be to identify a representative sample of the EA population and circulate among folks in that sample a short survey with a few questions randomly sampled from the original survey. By measuring response discrepancies between the surveys (beyond what one would expect if both surveys were representative), one could estimate the size of the sampling bias in the original survey.
ETA: I now see that a proposal along these lines is discussed in the subsection "Comparison of the EA Facebook Group to a Random Sample" of the Appendix. In a follow-up study, the authors of the survey randomly sampled members of the EA Facebook group and compared their responses to those of members of that group in the original survey. However, if one regards the EA Facebook group as a representative sample of the EA population (which seems reasonable to me), one could also compare the responses in the follow-up survey to all responses in the original survey. Although the authors of the survey don't make this comparison, it could be made easily using the data already collected (though given the small sample size, practically significant differences may not turn out to be statistically significant).
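If it helps, that comparison could be done item by item with something as simple as a two-proportion test; the counts below are placeholders, since I don't have the follow-up data to hand, and as noted the follow-up sample is small enough that such a test will be underpowered.

```python
# Hedged sketch: compare one item between the random follow-up sample and all
# original responses. The counts are invented placeholders, not the real data.
from statsmodels.stats.proportion import proportions_ztest

# e.g. respondents naming x-risk as a top cause: (successes, sample size)
successes = [30, 400]   # [random follow-up sample, full original survey]
totals = [90, 2000]

z_stat, p_value = proportions_ztest(successes, totals)
print(round(z_stat, 2), round(p_value, 3))
# With ~90 follow-up respondents, even a practically large gap may not reach
# statistical significance, which is the caveat mentioned above.
```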
I think it's right to say that the survey was premised on the idea that there is no way to know the true nature of the EA population, and no known-to-be-representative sampling frame. If there were such a sampling frame or a known-to-be-representative population, we'd definitely have used that. Beforehand, and a little less so now, I would have strongly expected the EA Facebook group not to be representative. For that reason I think randomly sampling the EA FB group is largely uninformative, and I think that this is now Greg's view too, though I could be wrong.
I agree that could work, although doing it is not straightforward: for technical reasons, there aren't many instances where you get added precision by doing a convenience survey "on top" of a random sample, although they do exist.
(Unfortunately, the random FB sample was small, with something like 80% non-response, making it not very helpful for gauging sampling deviation from the "true" population. In some sense the subgroup comparisons do provide some of this information by pointing to different sub-populations; what they cannot provide is a measure of whether these subgroups are being represented proportionally or not. A priori, though, that would seem pretty unlikely.)
As David notes, the "EA FB group" is highly unlikely to be a representative sample. But I think it is more plausibly representative along the axes we'd be likely to be interested in for the survey. I'd guess EAs who are into animal rights are not hugely more likely to be on Facebook than those who are into global poverty, for example (could there be some effects? Absolutely: I'd guess the FB audience skews young and computer-savvy, so maybe folks interested in AI etc. are somewhat more likely to be found there).
The problem with going to each "cluster" of EAs is that you are effectively sampling parallel, rather than orthogonal, to your substructure: if you over-sample the young and computer-literate, that may not throw off the relative proportions of who lives where or who cares more about poverty than the far future; you'd be much more fearful of this if you oversample a particular EA subculture like LW.
I'd be more inclined to "trust" the proportion data (% male, % x-risk, etc.) if the survey were "just" of the EA Facebook group, either probabilistically or convenience sampled. Naturally, it would still be very far from perfect, and not for all areas (age, for example). (Unfortunately, you cannot just filter the survey and look only at those who clicked through via the FB link to construct this data: there are plausibly lots of people who clicked through via LW but would have clicked through via FB had there been no LW link, so ignoring all these responses likely inverts the anticipated bias.)