Lessons learned running the Survey on AI existential risk scenarios
Audience: I expect this to be helpful for anyone thinking of running a similar survey, and probably not interesting for anyone else.
We ran a survey of prominent AI safety and governance researchers which aimed to identify which AI existential risk scenarios those researchers find most likely. We learned several lessons that I expect to generalise to surveys with a similar audience or aims (e.g. surveys aiming to understand researchers’ views in some field, or to test hypotheses about EAs’ beliefs on certain issues, etc.). I’ve had several conversations where I’ve repeated these same lessons, so I figured I’d write them up quickly.
“Walkthroughs” are a good way to improve the questions
Before launching the survey, we iterated on the questions via a series of about 10 “walkthroughs”. Each “walkthrough” involved being on a call with someone (a “tester”) who would complete the survey while verbalising their thought process, including anything they found confusing or thought could be improved. We did this until the survey design started to stabilise, i.e. the testers were basically happy.
Compared to asking testers to complete the survey in their own time and email us with feedback, this generated much richer and more helpful information. It was also lower effort for the testers (no need to write up their thoughts afterward). It’s also likely that people forget some of their feedback by the end, so it’s better to collect it as they go along.
The most helpful testers were those whom we would have wanted in the survey population. For our small population, this limited the number of walkthroughs we could run, so it’s important to use them wisely.
Testers are good at identifying flaws, but bad at proposing improvements
During these walkthroughs, testers reported things they didn’t like about the current design, and sometimes suggested improvements to fix those flaws. Whilst the flaws that testers identified robustly pointed to areas that could be improved, more often than not, testers’ suggestions on how to address the flaws weren’t that helpful. In particular, several times we had the experience of making a change that one tester suggested, only for the next tester to suggest reverting to the previous design. (If you’re interested: whether to make the AI risk scenarios mutually exclusive.) So, we found that suggested improvements should be taken as evidence that something was bad about the current design, and less as evidence that the particular change they suggest will actually make the survey better.
I’m told that this mirrors common wisdom in UI/UX design: that beta testers are good at spotting areas for improvement, but bad (or overconfident) at suggesting concrete changes.
Include relatively more “free-form” questions, or do interviews instead of a survey
Regarding the type of survey questions, there is a spectrum from “free-form” to “specific”. Here are examples of questions on either end of the spectrum:
Very free-form: “Assume an existential catastrophe due to AI has occurred. What do you expect to have been the major causes and why?”
Very specific: “What is your estimate of the probability that an AI takeover scenario will occur?”
Specific questions ask for a response to some precise question, whereas free-form questions just prompt the respondent for their reactions to some vaguer question.
In our survey, most questions were very specific and asked for probability estimates. We expected this would make the analysis easier, because we could simply report summary statistics of these estimates, rather than having to synthesise and draw conclusions from a range of qualitative answers (which tend to be confusing because survey participants always write quickly!).
However, in hindsight this wasn’t clearly the best choice, because asking specific questions requires a level of conceptual precision that wasn’t available for many of our questions.
By conceptual precision, I mean that the survey question uses concepts that pinpoint some specific idea, rather than a vague category (e.g. vague “existential catastrophe” vs. specific “at least 10% reduction in the value of the future”; vague “X happened because Y did” vs. specific “X would not have occurred unless Y did”). Lack of conceptual precision makes it possible that different respondents were essentially answering different questions, because they were using different interpretations of the vague concept. To make matters worse, failures of this kind can be “silent”, in the sense that we (as the survey conductors) don’t even realise that different people are answering different questions, and aggregate results as if they were not.
(One might think that we could have resolved this by just using more precise concepts, but I think that wouldn’t be the correct lesson to draw. For many of the concepts that our questions relied on, there just aren’t (yet) ways of carving up the conceptual space that we expect would have been agreeable to the vast majority of participants. For example:
We described five AI existential risk scenarios and asked respondents to estimate their probabilities. Several respondents (correctly) pointed out that the scenarios weren’t mutually exclusive. Moreover, I suspect that even when respondents were happy with the scenario categorisation, they might have had somewhat different interpretations of the scenario (e.g. about which specific burdensome details are required to hold).
We were initially planning to draw conclusions about the arguments motivating AI safety and governance researchers to work on those cause areas. We carved up motivating arguments into two categories: “general” and “specific” (which we defined in the survey). Unfortunately, some respondents disagreed with this distinction, or didn’t understand what we intended to communicate, which meant that we couldn’t draw any clear conclusions about motivating arguments. Moreover, I think the reasons why people are motivated to do their work tend to be idiosyncratic and therefore hard to carve up in ways that make sense for everyone.)
Instead, I think the correct lesson to learn is: include a greater mixture of free-form and specific questions. With a mixture of both question types, you get the ease-of-analysis benefits of specific questions, but avoid the pitfall of answers not being meaningful due to different interpretations. Also: avoid asking specific questions that rely on imprecise concepts. In practice, for this particular survey, this would have meant asking more free-form questions than we did.
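To make the “silent failure” concern concrete, here is a minimal toy sketch. It assumes two groups of respondents who read the same vague concept differently (a narrow and a broad reading); the group sizes, centres and spreads are invented for illustration and are not drawn from our survey data.

```python
import random
import statistics


def simulate_responses(n_per_group=50, seed=0):
    """Toy model: two groups interpret the same question differently.

    The centres (5% and 30%) and spreads are made-up assumptions.
    """
    rng = random.Random(seed)

    def clamp(p):
        # Keep simulated probabilities inside [0, 1].
        return min(1.0, max(0.0, p))

    narrow = [clamp(rng.gauss(0.05, 0.02)) for _ in range(n_per_group)]
    broad = [clamp(rng.gauss(0.30, 0.05)) for _ in range(n_per_group)]
    return narrow, broad


narrow, broad = simulate_responses()
pooled = narrow + broad  # what the survey organisers actually see

print(f"narrow reading mean: {statistics.mean(narrow):.2f}")
print(f"broad reading mean:  {statistics.mean(broad):.2f}")
print(f"pooled mean:         {statistics.mean(pooled):.2f}")
print(f"pooled stdev:        {statistics.stdev(pooled):.2f}")
```

The pooled mean sits between the two groups and describes neither of them, and nothing in the pooled numbers alone tells you that two different questions were being answered. A free-form follow-up question is one way to surface the difference.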
A complementary lesson is to consider conducting a series of interviews, instead of a survey. I’m not sure whether this would have led to more useful results, but I think we should definitely have considered it more carefully. Interviews have several benefits over surveys:
In interviews, you can be more certain that you and the respondent are thinking about questions in the same way (especially useful if your survey must deal with vague concepts).
Interviews don’t have the “brittleness” of surveys, in the sense that you can make incremental improvements to imperfect questions as you go, which only renders a few earlier results invalid (and if need be, you can follow up with earlier respondents to get their updated answer). With surveys, if you send it out with a bad question, you lose all information about that question.
In interviews, you can get a more granular understanding of participants’ responses if desired, e.g. understanding relevant aspects of their worldviews, and choose to delve deeper into certain important aspects.
In interviews, you can understand more about participants’ reasoning processes, which might be valuable information in itself.
Of course, interviews don’t scale as well as surveys. That is, doing an additional interview takes more time (for the organisers) than sending the survey to an additional participant. However, in cases where the survey population is relatively small (e.g. AI safety/governance researchers), it’s possible that interviewing a substantial fraction of the population would actually take less time overall, because you don’t need to spend as long perfecting the questions before you send them out. And if (like us) you’re inexperienced in distributing mass surveys, there’s actually a bunch of overhead (and stress!) associated with doing that, which you can avoid.
A related idea is to consider using something like the Delphi Method.
Preregistration is important for the validity of results
Best practice in social science is to preregister your hypotheses and methodology for testing them, before you go out and collect data. Here’s one intuition for why this is important: if you look at all the data, then generate hypotheses, then devise tests for them, it’s highly likely your hypotheses will appear to be confirmed, because you used the data to generate them! In other words, there are lots of choices you can make in analysing data, and if you don’t precommit to particular choices, you can make choices that favour your conclusion. Yet another way of saying this, borrowing concepts from machine learning, is: if you don’t separate your training and test sets, you shouldn’t expect your model/predictions to generalise at all.
We didn’t do preregistration, which is reason to be somewhat more sceptical of our conclusions. We discussed this in our document about the limitations of the survey.
To be clear, this lesson only applies to the extent that you’re trying to draw conclusions from your data which are wider than simply “these are the probability estimates (or whatever)”.
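The intuition about analysis choices can be illustrated with a small simulation. This is only a sketch under stated assumptions: the outcome is pure noise with unit variance, each “analysis choice” is a random way of splitting respondents into two subgroups, and a difference counts as a “finding” when a simple z statistic exceeds 1.96. The function names and parameter values are hypothetical.

```python
import random


def z_for_split(outcomes, flags):
    """Two-sample z statistic for a boolean split (unit variance assumed)."""
    a = [y for y, f in zip(outcomes, flags) if f]
    b = [y for y, f in zip(outcomes, flags) if not f]
    if len(a) < 2 or len(b) < 2:
        return 0.0
    standard_error = (1 / len(a) + 1 / len(b)) ** 0.5
    return (sum(a) / len(a) - sum(b) / len(b)) / standard_error


def false_positive_rates(n_respondents=60, n_splits=20, n_trials=1000, seed=0):
    """Compare one precommitted test with picking the best of many tests."""
    rng = random.Random(seed)
    prereg_hits = 0
    posthoc_hits = 0
    for _ in range(n_trials):
        # Pure noise: there is no real effect to find in any trial.
        outcomes = [rng.gauss(0, 1) for _ in range(n_respondents)]
        splits = [
            [rng.random() < 0.5 for _ in range(n_respondents)]
            for _ in range(n_splits)
        ]
        zs = [abs(z_for_split(outcomes, s)) for s in splits]
        if zs[0] > 1.96:  # the single comparison chosen in advance
            prereg_hits += 1
        if max(zs) > 1.96:  # the most striking comparison found afterwards
            posthoc_hits += 1
    return prereg_hits / n_trials, posthoc_hits / n_trials


prereg_rate, posthoc_rate = false_positive_rates()
print(f"preregistered false positive rate: {prereg_rate:.2f}")
print(f"best-of-20 false positive rate:    {posthoc_rate:.2f}")
```

The precommitted comparison should come out as a false “finding” at roughly the nominal 5% rate. For 20 independent tests, the chance of at least one false positive would be about 1 − 0.95^20 ≈ 64%, so picking the most striking split after the fact should find something far more often, even though the data contain no effect at all.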
Don’t rely on being able to draw interesting conclusions by just boggling at the results
Another upside of preregistering your methodology in particular is that it forces you to be explicit about how you’re planning to analyse the data and draw conclusions. Personally, I think I was mostly relying on being able to just boggle at the results and hope that some interesting findings would emerge.
We actually ended up getting away with this one, but only after two very expensive iterations of boggling, writing up conclusions, and (on the first iteration) getting feedback that the conclusions were flawed. And even then, I think we got a bit lucky that there were some pretty clear patterns.
If I ran another survey, I’d want to have a clear methodology in mind for going from results to conclusions. Ideally, I’d want to stress test this methodology by collecting ~10 responses and running it—probably you could just simulate this by going through the survey 10 times, wearing different hats.
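A dry run of this kind might look like the sketch below. Everything in it is a placeholder: the scenario labels, the answer options and the choice of median plus quartiles as the analysis are assumptions made for illustration, not what our survey used.

```python
import random
import statistics

# Hypothetical scenario labels and answer options (not those of the real survey).
SCENARIOS = ["scenario_a", "scenario_b", "scenario_c"]
OPTIONS = [0.0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0]


def simulate_respondent(rng):
    """One fake respondent who picks an option at random for each scenario."""
    return {scenario: rng.choice(OPTIONS) for scenario in SCENARIOS}


def analyse(responses):
    """The analysis fixed in advance: median and quartiles per scenario."""
    summary = {}
    for scenario in SCENARIOS:
        values = [response[scenario] for response in responses]
        q1, _, q3 = statistics.quantiles(values, n=4)
        summary[scenario] = {
            "q1": q1,
            "median": statistics.median(values),
            "q3": q3,
        }
    return summary


rng = random.Random(0)
fake_responses = [simulate_respondent(rng) for _ in range(10)]
summary = analyse(fake_responses)
for scenario, stats in summary.items():
    print(scenario, stats)
```

The point is less the numbers than the exercise: if you can’t say what conclusion you would draw from a table like this before the real data arrive, that is a sign the methodology needs more work. Filling in the fake responses by hand while “wearing different hats” would be a more realistic stress test than random choices.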
Getting confidentiality right takes effort
It took a while to spell out exactly what results/summary statistics would be made public in a way that was acceptable to all participants. Be prepared for this.
Also, we promised participants that only the three of us running the survey would have access to the raw results, which ended up being frustrating and limiting when we realised it would have been helpful to bring a couple more people on board to help with results analysis.
Capturing information about respondents’ uncertainty is tricky
We spent a lot of time working out how to elicit participants’ levels of uncertainty about their responses. In the end, we took a simple approach that worked well: just ask, “How certain are you about your responses to the previous questions?”, and provide seven numerical options between 0 and 6. “0” was labelled “completely uncertain, I selected my answers randomly”, and “6” was labelled “completely certain, like probability estimates for a fair dice”. See below for an example.
Other small lessons about survey design
These are lessons that we did incorporate into the survey design, which come from a valuable conversation with Spencer Greenberg.
[ETA: one commenter, who is knowledgeable about surveys, disagrees somewhat with the first, second and final points in this section.]
Almost always just have one question per page
Less burden on respondents’ attention
The only case where you wouldn’t do this is if answering one question should change your answer to a previous question
Regarding the interface that you give participants for responding to questions, avoid using sliding scales and instead just have a big list of all the options
Responding using a sliding scale tends to take ~2x longer than selecting an item on a list
Sliding scales can create weird distributions in the data, since people like to move the slider from its central starting position, creating an artificial bimodal distribution
Don’t allow respondents to specify their probabilities directly, since people reason poorly for very small probabilities. Giving them a list of options is better in this respect.
Respondents often don’t use probabilities correctly, instead using them to mean something more like the “strength of their belief”
(This was less relevant for our population)
It can be good to have some “screening” questions at the start, which assess whether respondents have the minimum necessary expertise to answer the questions. But don’t set the bar too high, because this can create undesirable selection effects (e.g. those who have thought about a given AI risk scenario at greater length are likely to take it more seriously)
Probably the most generalisable lesson we learned is more meta: if you’re thinking of running a large-scale survey, you should seriously consider having at least one person on your team who has experience with surveys. We didn’t, which meant we learned a lot, but it plausibly caused more mistakes and hassle than it was worth.
Thanks to Alexis Carlier and Jonas Schuett for helpful feedback on a draft.
Several readers of our survey results found that respondents’ answers to our two free-form questions were the most interesting part of the results, which is some additional evidence that more free-form questions would have been better in our case.