EA movement building: Should you run an experiment?


Many thanks to Lauren Mee, David Reinstein, David Moss, Brenton Mayer, Aaron Gertler, Alex Holness-Tofts, Lynn Tan, and Renee Bell for providing feedback on a draft of this post, as well as all who provided feedback on the studies themselves.

Background and summary

Animal Advocacy Careers (AAC) ran studies testing the effects of our ~1 hour one-to-one careers advising calls and our ~9 week online course. Each study was designed as a randomised controlled trial, and pre-registered on the Open Science Framework (here and here). The findings are written up in the style of a formal academic paper, viewable here, and in a previous blog post, we provide a summary of the two studies.

This post is intended to help effective altruism (EA) movement builders decide whether, when, and how to run intervention experiments, by talking through some of AAC’s experiences and reflections. Some of the key points are:

  • Experiments often provide stronger evidence for important questions than some other types of evidence. At the very least, they provide another type of evidence, with different pros and cons.

  • They do have quite a lot of cons, especially practical ones.

  • Nevertheless, they’ve probably been underutilised in EA movement building.

  • You should avoid repeating our mistakes, like taking insufficient steps to avoid differential attrition.

  • You can also repeat some of our successes, like gathering additional data for supplementary analyses.

Pros

Running an experiment can tell you helpful things that your intuitions can’t. For example, while running the one-to-one calls, we became quite pessimistic about their usefulness, partly because we felt that we hadn’t offered very helpful advice on at least some of the calls. Our study results were a positive update for us, reassuring us that the service was likely having positive effects. That said, the effects were smaller than we had predicted for all four of our key outcome metrics, so the results were also a reality check on our (perhaps naively optimistic) initial predictions.

The findings may also differ substantially from the impression you would get from a less rigorous evaluation method. For each service, we administered an “initial impressions” survey immediately after the call or the final session of the course, which is probably quite similar to how many EA movement building services are currently evaluated. For example, for the one-to-one calls service, the average response was around 5 out of 7 (i.e. “somewhat increased”) for questions about how the call and application process affected participants’ “motivation to do good,” “motivation to do work related to animal advocacy,” and “motivation to engage with the community and ideas of ‘effective altruism.’” Yet none of our analyses in the full experiment (at six months’ follow-up) provided any evidence of these sorts of changes.[1]

Cons

There are many limitations to the evidence that experiments provide us. The full list is long,[2] but some of the most important points include:

  • They tend to be limited to quantifiable, easy-to-measure metrics, which may not capture the most important outcomes of an intervention.

  • It’s possible for projects like this to be justified on the basis of uncommon, extremely positive outcomes, but experiments might not have a large enough sample size to demonstrate an increase in these (see the power-calculation sketch after this list).

  • They may not be generalisable to other contexts (i.e. they may lack external validity).

  • The conclusions may not hold if the methodology is replicated.
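
To illustrate the sample-size point from the second bullet, here is a minimal power-calculation sketch; the 2% and 5% outcome rates, and the outcome itself, are invented for illustration rather than taken from our studies. It estimates roughly how many participants you would need per group to reliably detect an increase in a rare but very positive binary outcome, such as making a high-impact career change within the follow-up period:

```python
# A rough power calculation for a rare binary outcome (illustrative numbers only).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

control_rate = 0.02       # assumed base rate of the rare, very positive outcome
intervention_rate = 0.05  # assumed rate if the intervention works well

effect_size = proportion_effectsize(intervention_rate, control_rate)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # conventional significance threshold
    power=0.8,                # conventional 80% power
    ratio=1.0,                # equal-sized groups
    alternative="two-sided",
)
print(f"Participants needed per group: {n_per_group:.0f}")
# Comes out at roughly 560 per group (over 1,100 applicants in total) under
# these assumptions, far more than many EA movement building services recruit.
```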

Even though experiments usually cannot tell us all of the information that we need, they will often provide an additional type of evidence. Social change is complex, and having many types of imperfect evidence is more useful than only having a few, as long as we don’t place too much weight on the findings from any one methodology.

The experimental format came with a number of practical disadvantages for AAC. Most notably:

  • It likely irritated many people, potentially damaging our ability to provide them useful services in the future; researchers have found that people really hate randomisation,[3] and we had a few comments expressing frustration or disappointment with this. This disadvantage could be especially important for EA movement builders working with a relatively small or specialised audience.

  • It may also have decreased application rates, giving a misleadingly low impression of demand for the services.

  • The six-month gap between the initial applications and being able to analyse the follow-up surveys has potentially held up some important decision-making at AAC.

  • In order to make the analysis simpler and the results more informative, we barred applicants to either of the services from applying to the other. This might have further limited applicant numbers (if uncertainty over which service to pick led to decision paralysis) or reduced the potential impact per participant (e.g. if some people would have benefited most from participating in both services).

Some recommendations

1. Run more experiments

See “Pros” above for why we think this is worth doing. Yes, there are a number of drawbacks, but our sense is that this methodology is underutilised in EA movement building.

Here are some examples of EA movement building interventions that could potentially be tested through experiments, perhaps using methodology quite similar to our studies:

Of course, there are many other research questions relating to EA movement building that could be tested through experiments using different sorts of designs. Examples include:

  • Optimal terminology for EA branding or persuasive messaging

  • The effectiveness of various interventions intended to encourage support for key premises of EA

  • The effectiveness of various interventions for eliciting donations

  • The importance of various factors in mediating the effects of EA messaging, e.g. the relationship between age and openness to the ideas

Our studies provided very weak evidence for the following ideas (among others), which provide examples of the sorts of questions that could be tested through experiments that randomise participants between two different versions of a similar intervention:

  • Do attempts to encourage people to change their views on which cause areas should be prioritised tend to be ineffective or backfire?

  • Do different advisors (in one-to-one calls) have meaningfully different effects on participants?

  • Can advisors with relatively little experience of giving one-to-one advice still have substantial (positive-seeming) effects?

  • Are the average outcomes better for courses with a face-to-face component than those without?

If the practicalities of running an experiment are putting you off, then it might be worth compromising slightly on methodological rigour if it makes the experiment more feasible, e.g. see the suggestions for different types of control groups below.[4]

2. Take steps to reduce differential attrition in your experiments

Differential attrition (getting more follow-up responses from the intervention group than from the control group) turned out to be one of the biggest flaws in our studies. We didn’t take enough steps to mitigate this risk because we weren’t familiar enough with statistics and experimental design to realise how much of a limitation this was.[5]
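
To make the problem concrete, here is a minimal simulation, using invented numbers rather than our data, in which the intervention has zero true effect but people with better outcomes are more likely to answer the follow-up survey, and the intervention group is also more likely to respond (perhaps out of gratitude). Comparing only the responders then produces a spurious difference between the groups:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # large sample so that bias, not noise, drives the result

outcome = rng.normal(size=n)                         # latent outcome, same process in both groups
treated = rng.integers(0, 2, size=n).astype(bool)    # random 50/50 allocation

# Response probability rises with the outcome; treated people also get a flat boost.
p_respond = 1 / (1 + np.exp(-(outcome + 1.5 * treated - 0.5)))
responded = rng.random(n) < p_respond

print(f"Response rate, treated: {responded[treated].mean():.0%}")
print(f"Response rate, control: {responded[~treated].mean():.0%}")

naive_diff = (outcome[treated & responded].mean()
              - outcome[~treated & responded].mean())
print(f"Naive treated-minus-control difference among responders: {naive_diff:.2f}")
# The true effect is exactly zero, yet the responders-only comparison comes out
# clearly negative, because control responders are more strongly selected for
# high outcomes than treated responders.
```

Depending on how response propensity relates to the outcome in each group, the same mechanism can bias the estimate in either direction; the general lesson is that unequal response rates undermine much of the protection that randomisation provides.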

One way to reduce differential attrition could be simply to offer incentives for survey responses that are large enough to be hard to refuse. Bear in mind that some funders might be excited to support this research (e.g. maybe the EA Infrastructure Fund).

Another option is to use a different type of control group. We used a no-intervention control group: those who were randomly allocated not to be offered the service were not offered anything in its place. This must have been pretty disappointing for people (even though we warned them of this in advance), and it’s understandable that people who received nothing from us didn’t feel as inclined to respond to our survey as people to whom we had provided free careers advice.

It is possible that you’d get less differential attrition by using a wait-list control. For example, people might be similarly likely to respond to your survey after a few months if they’re just about to receive the service you promised them (the wait-list control group) as if they have recently finished it (the intervention group).

Another option is to use a full intervention in both treatment groups, where the control group receives something that is useful but irrelevant to the key variable that you want to measure. For example, if you want to test whether a book giveaway service increases people’s inclination towards EA, you could market the service as offering free books that “help you make the world a better place”, then randomly allocate people to receive either a book that you’re optimistic will have positive effects on the desired metrics (e.g. Doing Good Better) or a book that seems helpful for altruistic people’s productivity but is unlikely to affect inclination towards EA (e.g. Deep Work).
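
As a sketch of how simple the allocation step itself can be, here is a minimal example for a hypothetical two-arm book giveaway like the one just described; the applicant emails are placeholders, and this is an illustration rather than a description of any real AAC service:

```python
# Randomly allocate applicants between an "intervention" book and an active-control book.
import random

applicants = ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "e@example.com", "f@example.com"]
arms = ["Doing Good Better", "Deep Work"]  # intervention book vs. active control

random.seed(2024)        # fix the seed so the allocation is reproducible and auditable
shuffled = applicants[:]
random.shuffle(shuffled)

# Alternate down the shuffled list so the two groups stay the same size (give or take one).
allocation = {email: arms[i % 2] for i, email in enumerate(shuffled)}
for email, book in sorted(allocation.items()):
    print(email, "->", book)
```

In practice you would probably also want to record the allocation somewhere fixed (e.g. alongside your pre-registration) before contacting anyone.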

Note that these different control group types might still not fully balance out the response rates on their own, but they might be effective in combination with other methods.[6]

In some cases, it may be possible to go to quite a lot of effort to chase people up to collect their follow-up survey responses. For example, if you’re running a university EA group, you might be able to speak to people at events, message them directly on social media, ask a mutual friend/connection to remind them, or even knock on their door and ask them face-to-face.[7]

We haven’t thoroughly reviewed the literature on experimental design; these suggestions are just a starting point. Perhaps a better specific suggestion is simply not to repeat our mistake: read up on best practice for avoiding (differential) attrition before you start offering your service and collecting data!

3. Try to gather data that enables supplementary analyses

We didn’t expect to see participants making meaningful career changes after six months’ follow-up, even if they intended to do so later. So the main analyses in our studies used a series of metrics that we hoped would provide useful indirect indications of whether the interventions were likely to be effective or not. Nevertheless, we carried out an additional supplementary analysis using participants’ LinkedIn profiles that was closer to an all-things-considered view of the services’ tangible effects. We were surprised to see that this identified some evidence of positive-seeming differences between the intervention and control groups. We’re glad that we collected this information (despite not initially expecting it to be very useful) because it provided a different type of evidence for our key research questions.
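
As an illustration of what such a supplementary analysis can look like, here is a minimal sketch, assuming each participant’s LinkedIn profile has been hand-coded into a binary “positive-seeming change at follow-up” variable; all of the counts below are invented placeholders rather than our results:

```python
# Compare the proportion of positive-seeming LinkedIn changes across groups.
from scipy.stats import fisher_exact

#                     changed  not changed
contingency_table = [[12,      38],   # intervention group (placeholder counts)
                     [5,       45]]   # control group (placeholder counts)

odds_ratio, p_value = fisher_exact(contingency_table, alternative="two-sided")
print(f"Odds ratio: {odds_ratio:.2f}, p-value: {p_value:.3f}")
# Fisher's exact test copes well with the small counts typical of studies like
# ours; treat it as a supplement to, not a replacement for, the pre-registered
# primary analyses.
```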

Some other considerations

When thinking about how to evaluate an intervention, there are many tradeoffs that EA movement builders will face. We don’t necessarily have strong recommendations here, but the following are all worth giving some substantial thought to:

  • How long is the ideal follow-up period? Too short and you might underestimate indirect effects and over-update on short-term effects that will eventually fade; too long and you might get low response rates and spend longer waiting for responses.

  • How familiar are you with experimental design and statistics? If your experience is limited, you might want to spend more effort reading into methodological best practice (at least advice designed specifically for practitioners rather than academics, like Faunalytics’ advice), seeking external advice, or paying for some sort of consultancy service.

  • Should you try to publish this in an academic journal? If so, you might need to spend longer on low-priority editing, signalling familiarity with previous literature, getting ethics approval, and finding collaborators. But the process (e.g. peer review) might also increase the rigour of your research and the end result might be much better at boosting the credibility of your organisation and the EA movement more widely.

  • How well does this experiment represent the service that you’re likely to run in the future and how well will the findings generalise? For example, if you’re marketing it to a different audience in order to get enough participants, then this reduces its generalisability.

  • Which important effects of the intervention you are trialling are unlikely to be measured well by the experiment?

  • How will you weigh this evidence against other forms of evidence that you have available?

  • What results do you expect, and how would surprising findings update your views? You might like to specify your predictions in advance.

Footnotes

[1] This may be due to differences in question wording between the initial impressions survey and the main follow-up survey, the attitude changes resulting from the calls being only short-term (studies tend to find, for example, that the majority of attitude change resulting from persuasive interventions dissipates within a few weeks), desire to show gratitude being stronger immediately after the service, or some other factor.

As another example, some of our anecdotal feedback suggested promising changes to cause prioritisation following the course (e.g. updates to prioritise wild animal welfare/suffering more highly), but our online course study found some concerning evidence that participants’ cause prioritisation may have moved in the opposite direction to the one intended.

[2] There have been discussions focused on specific cause areas about the shortfalls of experiments, such as this list for effective animal advocacy and this post about global health and development. Many of the concerns on this “List of ways in which cost-effectiveness estimates can be misleading” could apply to experiments, too. Of course, there are various helpful resources that are easily Googleable, like this one, as well as more formal methodological discussions within academic papers.

[3] This is an interesting podcast episode on the topic.

[4] Taking this logic one step further, quasi-experiments (where participants are not randomly assigned to the intervention and control groups) could still be worth running.

[5] For context, one reviewer/​advisor with substantially more statistics experience than us referred to the one-to-one advising calls study as a “failed RCT.” They clarified that they “would already be a bit worried about differential attrition of 10 percentage points and very worried about 20 percentage points. 40 percentage points seems like a really big deal to me and I imagine nearly all other economists (though other disciplines might be more relaxed about it).” The differential attrition was much lower in the online course study, but arguably still worrying.

[6] They also each have their own set of methodological pros and cons. They may have other benefits too, such as making the experimental design less explicit and off-putting.

[7] You’d probably have to warn people in advance that you might use some of these methods, since they might seem a little invasive otherwise.