Results of the Effective Altruism Outreach Survey
This article reports the results of an online survey with 167 respondents on how different styles of effective altruism outreach influence them. While we could not find evidence for our hypotheses, the exploratory data analysis yielded a ranking of the levels of motivation and curiosity our prompts induced. (Cross-posted from my blog.)
Topic
The aim of our survey was to determine what form of effective altruism outreach was most effective for what type of audience.
As types of outreach, we distinguished:
the “obligation style,” which aims to reveal altruistic values in people by helping them overcome biases, a style that is epitomized by Peter Singer’s Child in the Pond analogy, and
the “opportunity style,” which assumes that people are already altruistic and helps them overcome the biases that nonetheless keep them locked in lethargy, a style that is epitomized by Toby Ord’s appeal that people can save hundreds of lives over their lifetimes if they invest their money wisely.
Styles that we did not investigate are the use of humor to better convey topics that would otherwise be met with defensiveness (suggested by Rob Mather) and a style that is similar to the opportunity style but puts a stronger emphasis on personal discovery, as in Melanie Joy’s TED talk.
Such an evaluation could help any group engaged in effective altruism outreach to communicate more effectively with their respective audiences.
Our hypotheses were:
The obligation style leads to defensiveness, which would cause a negative reaction at least in the short term and at least among less rationally minded people. (If it is also more emotionally salient, later reflection might still make it more effective, but we cannot measure that.)
The opportunity style has a positive effect but only on people who already show a strong altruistic inclination.
Pitches targeted at specific demographics have a stronger effect on these people than on others.
On the exploratory side, we were also interested in the correlation between respondents’ rational inclination and trust in their intuition on the one hand and their attitude toward our prompts on the other, as well as any correlation between respondents’ reactions to the prompts and the degree to which the prompts informed them or withheld information, as teasers do.
Since we could not find evidence of these correlations, it would be interesting to see whether others can. Additionally, there are a number of prompts that seem very powerful that we did not include (e.g., a comparison of prioritization with triage). A different sample of prompts might be more representative of the taxonomy. A qualitative study might also shed more light on the way people react to our prompts.
Design and Implementation
One of our worries was that if obligation-style prompts really make people defensive, then there is the risk of this defensiveness coloring the responses to later prompts. Hence we introduced a page break and moved the critical prompts to the second of the two pages.
The length of the prompts, especially the ones borrowed from Peter Singer, was another problem. We slightly shortened them where possible and otherwise reduced the number of prompts from the original eight per category to five. In the interest of reducing the number of fields people have to tick, we also removed a scale for how much people liked a prompt, which we found dispensable.
To measure rational and experiential (intuition-related) proclivities, we relied on 10 items from the Rational-Experiential Inventory (REI) by Norris, Pacini, and Epstein (1998). To measure altruistic inclination, we selected 10 items from the Adapted Self-Report Altruism Scale (Rushton’s original, 1981; adapted by Witt and Boleman, 2009).
Like these two scales, our prompts relied on five-point Likert items.
The full survey and recruitment letter can be found here.
We at first used the original Rushton scale but switched to the modified one after receiving 15 responses, which meant turning sentences from the present perfect into the conditional (“I have donated blood” became “I would donate blood”). The change is fairly localized, the responses obtained after the change greatly outnumber those obtained before, and we did not see any noticeable differences in the spread of the answers, so we decided to include the first 15 in our final analysis.
We advertised the survey on Reddit, Twitter, and Facebook, also using paid advertising on Facebook to reach more people. Most respondents, however, were recruited through an email a friend sent to a mailing list of the Humboldt-Universität zu Berlin. We tried to counterbalance this and get more people without an academic background into our sample by targeting younger people on Facebook, but we recruited only about 26 people that way (at a rate of almost €1 per person), as opposed to 85 via the mailing list.
Please contact us if you would like to play around with the raw data.
Analysis
Our R script for cleaning and analysis can be found in this Bitbucket snippet.
After a first section of type conversions and reversal of questions that were asked in the negative for validation purposes, we engaged in the controversial practice of interpreting the ordinal Likert items as an interval scale in order to compute means. This implies that the differences between the five options we gave are identical. We have no basis for this assumption, and the results should be taken with the appropriate absolute-scale number of grains of salt.
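To make this concrete, here is a minimal sketch of what the reversal and averaging look like; the column names are hypothetical, and the actual code is in the Bitbucket snippet linked above:

```r
# Hypothetical example: convert Likert labels to 1-5, reverse-code a negated
# item, and average the items of one scale per respondent.
likert_levels <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
to_numeric <- function(x) as.numeric(factor(x, levels = likert_levels))

responses$rei_rational_1     <- to_numeric(responses$rei_rational_1)
responses$rei_rational_2_neg <- 6 - to_numeric(responses$rei_rational_2_neg)  # reversal

# Treating the ordinal items as an interval scale to compute a per-person mean:
responses$rational <- rowMeans(responses[, c("rei_rational_1", "rei_rational_2_neg")],
                               na.rm = TRUE)
```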
Apart from more cleaning, we also combined answers into categories that seemed intuitive enough to us not to be motivated by the data. However, we had seen the data before deciding on the categories in all cases except for political views. The intervals used for the respondents’ ages are not ours but intervals often used in the literature. These coarser categories allowed us to compensate for the low sample sizes per cohort.
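A sketch of the kind of coarsening we mean, using age intervals of the kind often found in the literature (hypothetical column names; the exact cut points are in the script):

```r
# Hypothetical example: bin respondents' ages into coarser cohorts.
responses$age_group <- cut(responses$age,
                           breaks = c(-Inf, 17, 24, 34, 44, Inf),
                           labels = c("<18", "18-24", "25-34", "35-44", "45+"))
```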
Finally, the script produces some eighty graphs.
When the analyses showed that we could not find evidence for any of our hypotheses, we turned to exploratory data analysis, the results of which are detailed in the following.
Evaluation
Exploratory data analysis has the inevitable drawback that in all likelihood we’ll find significant-looking correlations in our data simply by chance.
Nonetheless, the overall ranking of the prompts, which we asked our respondents to rate along scales of the curiosity and motivation they either induced or failed to induce, has the power of our full sample size of 167 behind it, so we are somewhat confident that conclusions drawn about prompts close to its extreme points are valuable.
The graph above shows the distribution of respondents’ votes, with each prompt described by a key: the first part is a keyword that makes clear which prompt is meant; the second part is our taxonomic label for whether the prompt focuses on the donor’s opportunity or moral obligation; the third part is either “info” or “teaser,” depending on whether the prompt explains something or withholds information; and the fourth part indicates whether the respondent gauged their motivation or their curiosity. The first and last parts are restrictive while the second and third are descriptive.
There are also some post-hoc rationalizations that make the rankings of the top prompts plausible.
The absolute top prompt in terms of motivation and curiosity is Peter Singer’s famous Child in the Pond analogy, which would probably not have made it into our survey had it not proved its persuasive power by turning Singer’s essay “Famine, Affluence, and Morality” into a seminal paper of moral philosophy well-known to philosophers worldwide.
In third place is a slightly adapted version of the sentence that Giving What We Can uses as one of its slogans, “Studies have found that top charities are up to a thousand times more effective than others,” except that the organization omits the weasel words “studies have found.” It is also a time-tested prompt.
In fourth place is an almost verbatim quote from Toby Ord’s TED talk and surely a statement that the Giving What We Can founder has honed in hundreds of conversations with potential pledge-takers: “You can save someone’s life without even changing your career.”
The final spots in the ranking can be explained as an aversive reaction to an insulting prompt. Interestingly, the rather popular prompt comparing the training of a guide dog to sight-restoring surgery ranks very low in terms of the motivation it induces.
Threats to our external validity are that we have in our sample:
3.7 times as many academics as people who only graduated from school, if you count as academics anyone who has attended a university or college irrespective of whether they have attained a degree yet,
3 times as many nonreligious as religious respondents, and a mean age of 25 (σ = 7) with only two respondents over 45.
There are likely more biases that we can’t recognize.
Main Hypotheses
In our data exploration, we have generated over eighty graphs that can be found in this gallery.
Based on experiences in the Less Wrong community and REG’s experiences with poker players as well as our inside view of the effective altruism movement itself, we expected to see a clear correlation between rationality and effective altruism inclination (the “all” vs. “rational” plots above).
We did not expect to see such a clear correlation with our data on the respondents’ altruistic inclination (the “all” vs. “altruistic” plots above), because that scale tested very elementary, naive empathetic skill, which may be necessary to a degree but is otherwise unhelpful for understanding effective altruism.
Neither correlation showed. Not even the square root of the product of the two features was correlated with responses to our prompts. If these results can be taken at face value, then it seems to us that rationality and altruism may be little more than necessary conditions for becoming an effective altruist, and that something else is just as necessary—maybe the principle of “taking ideas seriously,” which is common on Less Wrong, or any number of other such traits. More likely, though, the results are simply meaningless.
The strong correlations between “altruistic” and the two REI dimensions may be just artifacts of people’s different inclinations to answer Likert scales with extreme or moderate values. Surprisingly, however, the same tendency is not evident between the two REI dimensions. Perhaps they are sufficiently contradictory to offset this tendency. Please let us know if you have other explanations.
Qualitative Results
The only nonquantitative question in our survey was the one asking for comments and suggestions. A few interesting comments:
One respondent made the good point that the questions that focus on opportunities in effective altruism put the donor at the center rather than the beneficiary, and changing that focus is a crucial part of effective altruism.
Five respondents made suggestions that seemed to go in the opposite direction (though that is my interpretation), largely for pragmatic purposes. Two of them seemed to take this position despite appearing fairly aware of the privilege of their birth.
One respondent said fairly directly that the distance of suffering was morally relevant to them.
Conclusion
While we could not find evidence for our hypotheses, we were able to generate a ranking of prompts commonly used by effective altruists according to how much motivation and curiosity they induce, as per self-report. Due to biases in our sample, the external validity of these results is probably higher for populations of academics than for the general population.
Amongst the other statistical concerns brought up by others, I feel like I have no idea who the people are that participated in this.
Mostly students and staff of the Humboldt-Universität zu Berlin. I also recorded my targeting settings for the Facebook ads at the time but forgot to include them. The ads were mostly targeted at our existing Facebook followers to maximize conversion rates, so fans of My Little Pony: Friendship is Magic.
A good rule of thumb is to have at least 10 subjects per group for confirmatory analysis, and you have fewer than that even for exploratory analysis. Because your sample is so small, I would be surprised if very many (if any) of your rankings survived correction for multiple-hypothesis testing.
I would suggest grouping some of the individual conditions together in different analyses (for instance, test all “opportunity” groups against all “obligation” groups), although this may introduce bias since the “opportunity” groups varied systematically from the “obligation” groups in other ways.
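For instance, with the ratings in long format (one row per rating and a hypothetical style column marking “opportunity” vs. “obligation”), the pooled comparison could be as simple as a rank-based test, which also avoids the interval-scale assumption:

```r
# Hypothetical sketch: compare all opportunity-style ratings against all
# obligation-style ratings with a Wilcoxon rank-sum test.
wilcox.test(rating ~ style, data = ratings_long)
```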
On a related note, you don’t have to simply assert that
Instead, you can use something like bootstrap resampling to get an idea of the variance of the ranking. I would be interested to see how variable the rankings are under bootstrap resampling, especially since 167 is actually not that large a sample for this many groups.
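A rough sketch of what that could look like, assuming a ratings object with one row per respondent and one numeric column of ratings per prompt (names hypothetical):

```r
# Bootstrap the prompt ranking: resample respondents with replacement and
# re-rank the prompts by their mean rating in each resample.
set.seed(1)
n <- nrow(ratings)
boot_ranks <- replicate(2000, {
  resample <- ratings[sample(n, n, replace = TRUE), ]
  rank(-colMeans(resample, na.rm = TRUE))  # rank 1 = highest mean rating
})
# 95% interval of each prompt's rank across resamples:
apply(boot_ranks, 1, quantile, probs = c(0.025, 0.975))
```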
Grouping: The gallery I linked is an almost unfiltered assortment of all the graphs I generated, but I eventually ignored the ones where some cohorts were very small. Even in the case of motivation vs. education, where I had already grouped the original five levels into two, the result (that people who hadn’t attended a university were more easily motivated for or curious about EA) was not “significant” (or whatever the proper term is; the Bayes factor was 0.5).
That was a grouping of demographic levels, though. Is what you’re suggesting closer to this or this one?
Bootstrapping: My university course and my textbook only touch on that in the context of things they wish they had had the time to cover… Do you mean that I could use bootstrapping to determine the variance of the individual measures or of the rank of the items? The first seems doable to me, the latter trickier.
Thanks for doing this! Awesome initiative, neat survey idea, and thanks so much for posting your data publicly.
I’d like to take this opportunity to reiterate my standing offer of statistics advice for anyone who needs it. If anyone plans to do more statistical things, I may be able to offer advice on your experimental design or analysis plan.
I have some statistical comments that I’ll post separately, for ease of threading.
re: standing offer of statistical advice, there might be a great opportunity to get the answer to the question of the distribution of effectiveness of development interventions right by helping AidGrade with their data. I posted in the open thread also. Commenting here because you seem to be awesome at all things data / stats, but I bet you have a high opportunity cost to your time!!
“Eva Vivalt: Btw, if anyone would like to help with transforming AidGrade’s data so as to better speak to this question and pin down the variance in cost-effectiveness terms, let me know. We have some measures of costs but there is still the matter of converting outcomes.” See the debates/links on Satvik Beri’s post of 7th August in the Effective Altruists Facebook group for context.
That’s awesome! I need to bear that offer in mind. Thanks for all your comments! :‑)
There are two things that “no correlation” might mean: first, that there is actually no correlation, or second, that you didn’t have enough statistical power to detect a correlation. Only in the second case is the result meaningless. You can distinguish between these by providing not just the p-value of the correlation, but the confidence interval on the correlation statistic.
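In R, for example, cor.test reports a confidence interval for Pearson’s r alongside the p-value (hypothetical column names):

```r
# Hypothetical example: correlation between the rationality score and the
# reaction to the prompts, with its 95% confidence interval in the output.
cor.test(responses$rational, responses$prompt_reaction)
```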
Yep, I should’ve done that. See below. Insofar as we can take these numbers at face value, I think they show that we can be fairly confident that there’s no correlation.
Correct me if I’m wrong, but for a null result not to be meaningless, the study would have to be methodologically excellent. I don’t trust our first foray into surveys too far. That’s what I meant in the section you quoted.
Many thanks for all your comments! (I found this one to be easiest to respond to, but I’ll get to the others next.)
Correlation between prompts and rationality:
Correlation between prompts and altruism:
Thanks for the excellent survey work. I will take out the blindness cure example we use in our tabling/giving games, and replace it with Toby’s lives-saved opportunity example.
I don’t think that would explain it, actually. The correlation tests whether a person’s rationality average is a good predictor of their altruism average, on expectation. If the person is biased towards choosing extreme values, then a high rationality average becomes less informative about the expected value of the altruism average, because there’s more variance in that person’s answers.
(I’m not sure I’ve explained this well, but if you disagree, try giving an example of a data-generating model where there is no real correlation between A and B, but a correlation appears in the data because some people are disproportionately likely to choose extreme answers. I’m fairly sure this isn’t possible.)
If rationality and altruism were necessary conditions for EA outreach working well, wouldn’t you actually expect to see a reasonably strong correlation between rationality or altruism and responsiveness to EA outreach, because all individuals with low R/A have low responsiveness, whereas some individuals with high R/A have high responsiveness?
I read this as having an implicit hypothesis: that responsiveness didn’t necessarily mean following through on the steps to take EA-related actions. And, seeing a load of rationalist people and altruistic people in EA but not many other people, we can assume that these are necessary conditions. So, perhaps you need more than just these traits, as there’s no extra responsiveness / motivation reported if you have them, according to these preliminary findings. (Does that sound right?)
Which I think might be less plausible than a simpler hypothesis, e.g. that rational/altruistic people are more likely to execute on and deliberate over new ideas once they’ve responded to / been motivated by them. Or something like: EA movement growth has found traction in communities of rationalist people for other reasons. Unless there’s some reasoning I’m missing?
Was the order in which the options were presented to the participants varied randomly?
You mean the prompts? We wanted to keep the obligation prompts last in case people were irked by them, so that any irritation wouldn’t influence the rest. We didn’t see the same risk for the opportunity-style ones. But we didn’t shuffle the prompts within the pages either.