Lessons learned running the Survey on AI existential risk scenarios
Audience: I expect this to be helpful for anyone thinking of running a similar survey, and probably not interesting for anyone else.
We ran a survey of prominent AI safety and governance researchers, aiming to identify which AI existential risk scenarios they find most likely. We learned several lessons that I expect to generalise to surveys with a similar audience or aims (e.g. surveys aiming to understand researchers’ views in some field, or to test hypotheses about EAs’ beliefs on certain issues). I’ve had several conversations where I’ve repeated these same lessons, so I figured I’d write them up quickly.
“Walkthroughs” are a good way to improve the questions
Before launching the survey, we iterated on the questions via a series of about 10 “walkthroughs”. Each walkthrough involved being on a call with someone (a “tester”) who would complete the survey while verbalising their thought process: what they were thinking as they answered, what was confusing, what could be improved, and so on. We did this until the survey design started to stabilise, i.e. until testers were basically happy with it.
Compared to asking testers to complete the survey in their own time and email us with feedback, this generated much richer and more helpful information. It was also lower effort for the testers (no need to write up their thoughts afterward). It’s also likely that people forget some of their feedback by the end, so it’s better to collect it as they go along.
The most helpful testers were people we would also have wanted in the survey population, which (given our small population) limited the number of walkthroughs we could run, so it’s important to use those walkthroughs wisely.
Testers are good at identifying flaws, but bad at proposing improvements
During these walkthroughs, testers reported things they didn’t like about the current design, and sometimes suggested improvements to fix those flaws. Whilst the flaws that testers identified reliably pointed to areas that could be improved, more often than not their suggestions on how to address those flaws weren’t that helpful. In particular, several times we made a change that one tester suggested, only for the next tester to suggest reverting to the previous design. (If you’re interested: the change in question was whether to make the AI risk scenarios mutually exclusive.) So we found that a suggested improvement should be taken as evidence that something is wrong with the current design, and less as evidence that the particular change suggested will actually make the survey better.
I’m told that this mirrors common wisdom in UI/UX design: that beta testers are good at spotting areas for improvement, but bad (or overconfident) at suggesting concrete changes.
Include relatively more “free-form” questions, or do interviews instead of a survey
Regarding the type of survey questions, there is a spectrum from “free-form” to “specific”. Here are examples of questions on either end of the spectrum:
- Very free-form: “Assume an existential catastrophe due to AI has occurred. What do you expect to have been the major causes and why?”
- Very specific: “What is your estimate of the probability that an AI takeover scenario will occur?”
Specific questions ask for a response to some precise question, whereas free-form questions just prompt the respondent for their reactions to some vaguer question.
In our survey, most questions were very specific and asked for probability estimates. We expected this would make the analysis easier, because we could simply report summary statistics of these estimates, rather than having to synthesise and draw conclusions from a range of qualitative answers (which tend to be confusing because survey participants always write quickly!).
However, in hindsight this wasn’t clearly the best choice, because asking specific questions requires a level of conceptual precision that wasn’t available for many of our questions.
By conceptual precision, I mean that the survey question uses concepts that pinpoint some specific idea, rather than a vague category (e.g. vague “existential catastrophe” vs. specific “at least 10% reduction in the value of the future”; vague “X happened because Y did” vs. specific “X would not have occurred unless Y did”). Lack of conceptual precision makes it possible that different respondents were essentially answering different questions, because they were using different interpretations of the vague concept. To make matters worse, failures of this kind can be “silent”, in the sense that we (as the survey conductors) don’t even realise that different people are answering different questions, and aggregate results as if they were not.
(One might think that we could have resolved this by just using more precise concepts, but I think that wouldn’t be the correct lesson to draw. For many of the concepts that our questions relied on, there just aren’t (yet) ways of carving up the conceptual space that we expect would have been agreeable to the vast majority of participants. For example:
- We described five AI existential risk scenarios and asked respondents to estimate their probabilities. Several respondents (correctly) pointed out that the scenarios weren’t mutually exclusive. Moreover, I suspect that even when respondents were happy with the scenario categorisation, they might have had somewhat different interpretations of the scenario (e.g. about which specific burdensome details are required to hold).
- We were initially planning to draw conclusions about the arguments motivating AI safety and governance researchers to work on those cause areas. We carved up motivating arguments into two categories: “general” and “specific” (which we defined in the survey). Unfortunately, some respondents disagreed with this distinction, or didn’t understand what we intended to communicate, which meant that we couldn’t draw any clear conclusions about motivating arguments. Moreover, I think the reasons why people are motivated to do their work tend to be idiosyncratic and therefore hard to carve up in ways that make sense for everyone.)
Instead, I think the correct lesson to learn is: include a greater mixture of free-form and specific questions. With a mixture of both question types, you get the ease-of-analysis benefits of specific questions, but avoid the pitfall of answers not being meaningful due to different interpretations. Also: avoid asking specific questions that rely on imprecise concepts. In practice, for this particular survey, this would have meant asking more free-form questions than we did.[1]
A complementary lesson is to consider conducting a series of interviews, instead of a survey. I’m not sure whether this would have led to more useful results, but I think we should definitely have considered it more carefully. Interviews have several benefits over surveys:
- In interviews, you can be more certain that you and the respondent are thinking about questions in the same way (especially useful if your survey must deal with vague concepts).
- Interviews don’t have the “brittleness” of surveys: you can make incremental improvements to imperfect questions as you go, which only invalidates a few earlier results (and if need be, you can follow up with earlier respondents to get their updated answers). With a survey, if you send it out with a bad question, you lose all information about that question.
- In interviews, you can get a more granular understanding of participants’ responses if desired, e.g. understanding relevant aspects of their worldviews, and choose to delve deeper into certain important aspects.
- In interviews, you can understand more about participants’ reasoning processes, which might be valuable information in itself.
Of course, interviews don’t scale as well as surveys. That is, doing an additional interview takes more time (for the organisers) than sending the survey to an additional participant. However, in cases where the survey population is relatively small (e.g. AI safety/governance researchers), it’s possible that interviewing a substantial fraction of the population would actually take less time overall, because you don’t need to spend as long perfecting the questions before you send them out. And if (like us) you’re inexperienced in distributing mass surveys, there’s actually a bunch of overhead (and stress!) associated with doing that, which you can avoid.
A related idea is to consider using something like the Delphi Method.
Preregistration is important for the validity of results
Best practice in social science is to preregister your hypotheses, and your methodology for testing them, before you go out and collect data. Here’s one intuition for why this is important: if you look at all the data, then generate hypotheses, then devise tests for them, it’s highly likely your hypotheses will pass those tests, because you used the same data to generate them! In other words, there are lots of choices you can make in analysing data, and, if you don’t precommit to particular choices, you can make choices that favour your conclusion. Yet another way of saying this, borrowing concepts from machine learning: if you don’t separate your training and test sets, you shouldn’t expect your model/predictions to generalise at all.
We didn’t do preregistration, which is reason to be somewhat more sceptical of our conclusions. We discussed this in our document about the limitations of the survey.
To be clear, this lesson only applies to the extent that you’re trying to draw conclusions from your data which are wider than simply “these are the probability estimates (or whatever)”.
Don’t rely on being able to draw interesting conclusions by just boggling at the results
Another upside of preregistering your methodology in particular is that it forces you to be explicit about how you’re planning to analyse the data and draw conclusions. Personally, I think I was mostly relying on being able to just boggle at the results and hope that some interesting findings would emerge.
We actually ended up getting away with this one, but only after two very expensive iterations of boggling, writing up conclusions, and (on the first iteration) getting feedback that the conclusions were flawed. And even then, I think we got a bit lucky that there were some pretty clear patterns.
If I ran another survey, I’d want to have a clear methodology in mind for going from results to conclusions. Ideally, I’d want to stress test this methodology by collecting ~10 responses and running it—probably you could just simulate this by going through the survey 10 times, wearing different hats.
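As an illustration of what that dry run could look like in code, here is a minimal Python sketch, with made-up question names and response values (none of this is from our actual survey), that runs a planned summary-statistic analysis over a handful of simulated responses:

```python
import statistics

# Hypothetical simulated responses: each dict is one pass through the survey,
# filled in while "wearing a different hat". Values are illustrative only.
simulated_responses = [
    {"p_takeover": 0.10, "p_misuse": 0.05, "certainty": 4},
    {"p_takeover": 0.30, "p_misuse": 0.10, "certainty": 2},
    {"p_takeover": 0.05, "p_misuse": 0.20, "certainty": 5},
    # ...in practice, roughly 10 of these
]

def summarise(responses, question):
    """Compute the summary statistics you plan to report for one question."""
    values = [r[question] for r in responses if question in r]
    return {
        "n": len(values),
        "median": statistics.median(values),
        "mean": round(statistics.mean(values), 3),
        "stdev": round(statistics.stdev(values), 3) if len(values) > 1 else None,
    }

for question in ["p_takeover", "p_misuse", "certainty"]:
    print(question, summarise(simulated_responses, question))
```

If the analysis you’ve committed to doesn’t produce anything you’d be comfortable drawing conclusions from when run on the simulated responses, that’s a sign the questions or the methodology need rethinking before launch.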
Getting confidentiality right takes effort
It took a while to spell out exactly what results/summary statistics would be made public in a way that was acceptable to all participants. Be prepared for this.
Also, we promised participants that only the three of us running the survey would have access to the raw results, which ended up being frustrating and limiting when we realised it would have been helpful to bring a couple more people on board to help with results analysis.
Capturing information about respondents’ uncertainty is tricky
We spent a lot of time working out how to elicit participants’ levels of uncertainty about their responses. In the end, we took a simple approach that worked well: just ask, “How certain are you about your responses to the previous questions?”, and provide seven numerical options between 0 and 6. “0” was labelled “completely uncertain, I selected my answers randomly”, and “6” was labelled “completely certain, like probability estimates for a fair dice”. See below for an example.
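For illustration, a question of this kind can be represented as a simple fixed-option item. This is just a sketch: only the endpoint labels come from our survey, and the intermediate points are left unlabelled here.

```python
# Sketch of a fixed-option certainty question. Only the endpoint labels (0 and 6)
# are taken from our survey; the intermediate labels are left blank in this sketch.
certainty_question = {
    "text": "How certain are you about your responses to the previous questions?",
    "options": {
        0: "Completely uncertain, I selected my answers randomly",
        1: "", 2: "", 3: "", 4: "", 5: "",
        6: "Completely certain, like probability estimates for a fair dice",
    },
}
```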
Other small lessons about survey design
These are lessons that we did incorporate into the survey design, which come from a valuable conversation with Spencer Greenberg.
[ETA: one commenter, who is knowledgeable about surveys, disagrees somewhat with the first, second and final points in this section.]
- Almost always just have one question per page
  - Less burden on respondents’ attention
  - The only case where you wouldn’t do this is if answering one question should change your answer to a previous question
- Regarding the interface that you give participants for responding to questions, avoid using sliding scales and instead just have a big list of all the options
  - Responding using a sliding scale tends to take ~2x longer than selecting an item from a list
  - Sliding scales can create weird distributions in the data, since people like to move the slider away from its central starting position, creating an artificial bimodal distribution
- Don’t allow respondents to specify their probabilities directly, since people reason poorly about very small probabilities. Giving them a list of options is better in this respect.
  - Respondents often don’t use probabilities correctly, instead using them to mean something more like the “strength of their belief”
  - (This was less relevant for our population)
- It can be good to have some “screening” questions at the start, which assess whether respondents have the minimum necessary expertise to answer the questions—but don’t set the bar too high, because this can create undesirable selection effects (e.g. those who have thought about a given AI risk scenario at greater length are likely to take it more seriously)
Conclusion
Probably the most generalisable lesson we learned is a more meta one: if you’re thinking of running a large-scale survey, you should seriously consider having at least one person on your team who has experience with surveys. We didn’t, which meant we learned a lot, but it plausibly caused more mistakes and hassle than that learning was worth.
Thanks to Alexis Carlier and Jonas Schuett for helpful feedback on a draft.
[1] Several readers of our survey results found that respondents’ answers to our two free-form questions were the most interesting part of the results, which is some additional evidence that more free-form questions would have been better in our case.
Thanks for the post. I think most of this is useful advice.
In the academic literature, these are also referred to as “cognitive interviews” (not to be confused with this use) and I generally recommend them when developing novel survey instruments. Readers could find out more about them here.
This is also conventional understanding in academia. Though there are some, mostly qualitatively oriented, research philosophies that focus more on letting participants shape how the research output is articulated, there’s generally no reason to think that respondents should be able to describe how a question should be asked (although, of course, if you are pretesting anyway, there is little reason not to consider suggestions). Depending on what you are measuring, respondents may not even be aware of what underlying construct (not necessarily something they even have a concept for) an item is trying to measure. Indeed, people may not even be able to accurately report on their own cognitive processes. Individuals’ implicit understanding may outpace their explicit theoretical understanding of the issue at hand (for example, people can often spot incorrect grammar, or a misapplied concept, without being able to provide explicit accounts of the rules governing the thing in question).
I agree there are some very significant advantages to the use of more qualitative instruments such as open-comments or interviews (I provide similar arguments here). In some cases these might be so extreme that it only makes sense to use these methods. That said, the disadvantages are potentially severe, so I would recommend against people being too eager to either switch to fully qualitative methods or add more open comment instruments to a mixed survey:
- Open-comment responses may greatly reduce comparability (and so the ability to aggregate responses at all, if that is one of your goals), because respondents may be functionally providing answers to different questions, employing different concepts
- Analysing such data typically raises a lot of issues of subjectivity and researcher degrees of freedom
  - You can attempt to overcome those issues by pre-registering even qualitative research (see here or here), and by following a fixed protocol, set in advance, that uses a more objective method to analyse and aggregate responses, but then this reintroduces the original issue of needing to force individuals’ responses into fixed boxes when they may have been thinking of things in a different manner
- Including both fixed-response and open-comment questions in the same survey may seem like the best of both worlds, and it is often the best approach, but open-comment questions are often dramatically more time-consuming and demanding than fixed-response questions, so their inclusion can greatly reduce the quality of the responses to the fixed questions
I think running separate qualitative and quantitative studies is worth seriously considering: either with initial qualitative work helping to develop hypotheses followed by quantitative study or with a wider quantitative study followed by qualitative work to delve further into the details. This can also be combined with separate exploratory and confirmatory stages of research, which is often recommended.
This latter point relates to the issue of preregistration, which you mention. It is common not to preregister analyses for exploratory research (where you don’t have existing hypotheses which you want to test and simply want to explore or describe possible patterns in the data) - though some argue you should preregister exploratory research anyway. I think there’s a plausible argument for erring on the side of preregistration in theory, based on the fact that preregistration allows reporting additional exploratory analyses anyway, or explicitly deviating from your preregistered analysis to run things differently if the data requires it (which sometimes it does if certain assumptions are not met). That said, it is quite possible for researchers to inappropriately preregister exploratory research and not deviate or report additional analyses, even where this means the analyses they are reporting are inappropriate and completely meaningless, so this is a pitfall worth bearing in mind and trying to avoid.
Another option would be to literally simulate your data (you could simulate data that either does or does not match your hypotheses, for example) and analyse that. This is potentially pretty straightforward, depending on the kind of data structure you anticipate.
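For instance, here is a minimal Python sketch of what that could look like. The variables and the hypothesis (that respondents who are more familiar with a risk scenario are also more concerned about it) are hypothetical, chosen purely for illustration:

```python
import random
import statistics

random.seed(0)

def simulate_responses(n, hypothesis_true):
    """Simulate n respondents' (familiarity, concern) ratings on a 0-6 scale.

    If hypothesis_true, concern is partly driven by familiarity;
    otherwise the two ratings are independent.
    """
    data = []
    for _ in range(n):
        familiarity = random.randint(0, 6)
        if hypothesis_true:
            concern = max(0, min(6, familiarity + random.randint(-2, 2)))
        else:
            concern = random.randint(0, 6)
        data.append((familiarity, concern))
    return data

def planned_analysis(data):
    """The analysis you intend to run on the real data (here, a Pearson correlation)."""
    familiarity, concern = zip(*data)
    return statistics.correlation(familiarity, concern)  # requires Python 3.10+

# Check that the planned analysis distinguishes the two worlds before collecting real data.
print("hypothesis true:  r =", round(planned_analysis(simulate_responses(50, True)), 2))
print("hypothesis false: r =", round(planned_analysis(simulate_responses(50, False)), 2))
```

Running the planned analysis on both simulated datasets checks that it would actually distinguish the cases you care about, before any real data arrives.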
Incidentally, I agreed with almost all the advice in this post except for some of the things in the “Other small lessons about survey design” section. In particular, “Almost always have just one question per page” and avoiding sliding scales in favour of lists seem like things I would not generally recommend (although having one question per page and using lists rather than scales is often fine). For “screening” questions about competence, unless you literally want to screen people out from taking the survey at all, you might also want to consider running these at the end of the survey rather than the beginning. Rather than excluding people from the survey entirely and not gathering their responses at all, you could gather their data and then conduct analyses excluding respondents who fail the relevant checks, if appropriate (whether it’s better to gather their data at all or not depends a lot on the specific case). Which order is better is a tricky question, depending on the specific case. One reason to have such questions later is that respondents can be annoyed by checks which seem like they are trying to test them (this most commonly comes up with comprehension/attention/instructional manipulation checks), which can influence later questions (the DVs you are usually interested in). Of course, in some circumstances, you may be concerned that the main questions will themselves influence responses to your ‘check’ in a way that would invalidate them.
For what it’s worth, I’m generally happy to offer comments on surveys people are running, and although I can’t speak for them, I imagine that would go for my colleagues on the survey team at Rethink Priorities too.
Thanks—despite already having received great feedback from you on a survey I’m developing, I still found this comment really useful.
Meaning people start getting tired or bored and so give less thoughtful responses to the fixed questions? (Rather than just some people stopping taking the survey, which would be a quantity issue.)
And do you think that holds even if you make it clear that those open comment boxes are very optional? In particular, I plan to say at the start of the survey something like “Feel free to skip questions about which you’d rather not express an opinion, e.g. because you don’t feel you know enough about it. Feel especially free to skip the questions about the robustness of your views and the comment boxes.” (I don’t then remind people at each comment box that that comment box is optional.)
I think all of the following (and more) are possible risks:
- People are tired/bored and so answer less effortfully/more quickly
- People are annoyed and so answer in a qualitatively different way
- People are tired/bored/annoyed and so skip more questions
- People are tired/bored/annoyed and drop out entirely
Note that people skipping questions or dropping out is not merely a matter of quantity (reduced numbers of responses), because the dropout/skipping is likely to be differential: precisely those respondents who are more likely to be bored/tired/annoyed by these questions, and to skip questions or drop out as a result, will be less likely to give responses.
Regrettably, I think that specifying extremely clearly that the questions are completely optional influences some respondents (it also likely makes many simply less likely to answer these questions), but doesn’t ameliorate the harm for others. You may be surprised how many people will provide multiple exceptionally long open comments and then complain that the survey took them longer than the projected average. That aside, depending on the context, I think it’s sometimes legitimate for people to be annoyed by the presence of lots of open comment questions even if they are explicitly stated to be optional because, in context, it may seem like they need to answer them anyway.
Thanks for the detailed reply, all of this makes sense!
I added a caveat to that section mentioning your disagreements with some of the points in the “Other small lessons about survey design” section.
Do you have any heuristics for when just one question per page is better, worse, or equally good as many questions per page?
(No worries if not—I can also just ask you for private feedback on my specific survey.)
I think it depends a lot on the specifics of your survey design. The most commonly discussed tradeoff in the literature is probably that having more questions per page, as opposed to more pages with fewer questions, leads to higher non-response and lower self-reported satisfaction, but people answer the former more quickly. But how to navigate this tradeoff is very context-dependent.
https://www.researchgate.net/publication/249629594_Design_of_Web_Questionnaires_The_Effects_of_the_Number_of_Items_per_Screen
https://sci-hub.ru/https://journals.sagepub.com/doi/full/10.1177/0894439316674459
But, in addition, I think there are a lot of other contextual factors that influence which is preferable. For example, if you want respondents to answer a number of questions pertaining to a number of subtly different prompts (which is pretty common in studies with a within-subjects component), then having all the questions for one prompt on one page may help make salient the distinction between the different prompts. There are other things you can do to aid this, like having gap pages between different prompts, though these can really enrage respondents.
Thanks for this, it was useful advice even for me as someone who has done quite a few surveys.
I agree with the suggestions and with most of what DM has said to follow up.
I definitely agree with this. An approach that we use at READI is to try to involve several ‘topic’ and ‘methodological’ experts in every research project to minimise the risk of errors and oversights. We share a protocol/plan with them before we preregister it so that they can assess materials. Then we engage them again at analysis and write up so they can check that we executed the plan appropriately.
It generally doesn’t take much of their time (maybe 2-20 hours each over a project—we should have a good estimate of the exact time taken in the future once we have had more projects complete (i.e., end in publication)).
In my experience, such experts are also relatively happy with the pay-off of a publication or acknowledgement for a few hours of work, so it is a relatively fair and sustainable exchange (again, we will soon have better data on that).
Finally, this probably won’t be useful for most readers but, just in case, these resources (a free chapter and two related blog posts (part 1 and part 2)) provide a lot of introductory advice for doing ‘audience research’.
I’d also suggest the Sage series on research methods as a good resource for non-experts who want at least a basic level of understanding of what to do. In this case, Fowler’s “Survey Research Methods” would have provided most of these insights without trial and error—it’s around 150 pages, but it’s not heavy reading.
Thanks for this! This is immediately very useful to me (and I may reach out with some specific questions at some point).
Thanks a lot for the thorough post-mortem, that was super interesting.
I found this point particularly interesting as someone who gives feedback fairly often—I think I’ve previously been pretty overconfident about suggesting concrete fixes (and often feel like feedback is incomplete if I don’t suggest a fix), and this was a useful update.
fwiw, it seems like you suggesting fixes would probably usually be low-cost both for you and for the person you’re suggesting the fixes to, assuming it takes you little time to write/say the ideas and the other person little time to consider them. And the person you’re suggesting the fixes to is able to just discard the ones that don’t seem useful. So I’d expect it’s often worthwhile for you to just say what comes to mind and let the onus be on the other person to decide whether and how to use your input.
(Also, I imagine you trying to suggest fixes might sometimes make it clearer what the problem you perceived was, but I’m not sure about that, and it could perhaps also confuse or distract from the real issue, so that seems a less important point.)