David_Moss comments on Lessons learned running the Survey on AI existential risk scenarios

David_Moss Oct 14, 2021, 9:22 AM
33 points
Thanks for the post. I think most of this is useful advice.
“Walkthroughs” are a good way to improve the questions
In the academic literature, these are also referred to as “cognitive interviews” (not to be confused with this use) and I generally recommend them when developing novel survey instruments. Readers could find out more about them here.
Testers are good at identifying flaws, but bad at proposing improvements… I’m told that this mirrors common wisdom in UI/UX design: that beta testers are good at spotting areas for improvement, but bad (or overconfident) at suggesting concrete changes.
This is also the conventional understanding in academia. Though there are some, mostly qualitatively oriented, philosophies that focus more on letting participants define how the research output is articulated, there’s generally no reason to think that respondents should be able to describe how a question should be asked (although, of course, if you are pretesting anyway, there is little reason not to consider suggestions). Depending on what you are measuring, respondents may not even be aware of what underlying construct (not necessarily something they even have a concept for) an item is trying to measure. Indeed, people may not even be able to accurately report on their own cognitive processes. Individuals’ implicit understanding may outpace their ability to give an explicit theoretical account of the issue at hand (for example, people can often spot incorrect grammar, or a misapplied concept, but not provide explicit accounts of the rules governing the thing in question).
Include relatively more “free-form” questions, or do interviews instead of a survey...
In interviews, you can be more certain that you and the respondent are thinking about questions in the same way (especially useful if your survey must deal with vague concepts)...
In interviews, you can get a more granular understanding of participants’ responses if desired, e.g. understanding relevant aspects of their worldviews, and choose to delve deeper into certain important aspects.
I agree there are some very significant advantages to the use of more qualitative instruments such as open-comments or interviews (I provide similar arguments here). In some cases these might be so extreme that it only makes sense to use these methods. That said, the disadvantages are potentially severe, so I would recommend against people being too eager to either switch to fully qualitative methods or add more open comment instruments to a mixed survey:
- Open comment responses may greatly reduce comparability (and so the ability to aggregate responses at all, if that is one of your goals), because respondents may be functionally providing answers to different questions, employing different concepts
- Analysing such data typically raises a lot of issues of subjectivity and researcher degrees of freedom
- You can attempt to overcome those issues by pre-registering even qualitative research (see here or here), and by following a fixed protocol in advance using a more objective method to analyse and aggregate responses, but then this reintroduces the original issues of needing to force individuals’ responses into fixed boxes when they may have been thinking of things in a different manner.
Including both fixed-response and open-comment questions in the same survey may seem like the best of both worlds and is often the best approach, but open-comment questions are often dramatically more time-consuming and demanding than fixed-response questions, so their inclusion can greatly reduce the quality of the responses to the fixed-response questions.
I think running separate qualitative and quantitative studies is worth seriously considering: either with initial qualitative work helping to develop hypotheses followed by quantitative study or with a wider quantitative study followed by qualitative work to delve further into the details. This can also be combined with separate exploratory and confirmatory stages of research, which is often recommended.
This latter point relates to the issue of preregistration, which you mention. It is common not to preregister analyses for exploratory research (where you don’t have existing hypotheses which you want to test and simply want to explore or describe possible patterns in the data) - though some argue you should preregister exploratory research anyway. I think there’s a plausible argument for erring on the side of preregistration in theory, based on the fact that preregistration allows reporting additional exploratory analyses anyway, or explicitly deviating from your preregistered analysis to run things differently if the data requires it (which sometimes it does if certain assumptions are not met). That said, it is quite possible for researchers to inappropriately preregister exploratory research and not deviate or report additional analyses, even where this means the analyses they are reporting are inappropriate and completely meaningless, so this is a pitfall worth bearing in mind and trying to avoid.
Ideally, I’d want to stress test this methodology by collecting ~10 responses and running it—probably you could just simulate this by going through the survey 10 times, wearing different hats.
Another option would be to literally simulate your data (you could simulate data that either does or does not match your hypotheses, for example) and analyse that. This is potentially pretty straightforward depending on the kind of data structure you anticipate.
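As a rough illustration of the simulation approach, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration rather than anything from the original survey: the 1-7 rating scale, the two groups, the effect size, and the helper names `simulate_responses` and `permutation_p_value` are all hypothetical. It simulates one dataset that matches a hypothesised group difference and one that does not, then runs the same planned analysis (here, a permutation test on the difference in means) on both.

```python
import random


def simulate_responses(n, mean, sd, rng):
    """Draw n simulated ratings on a 1-7 scale (rounded, clipped normal draws)."""
    return [min(7, max(1, round(rng.gauss(mean, sd)))) for _ in range(n)]


def mean_of(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, rng, n_permutations=2000):
    """Two-sided permutation test on the absolute difference in group means."""
    observed = abs(mean_of(group_a) - mean_of(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign responses to the two groups
        diff = abs(mean_of(pooled[:n_a]) - mean_of(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


rng = random.Random(2021)  # fixed seed so the simulation is reproducible

# Scenario 1 (assumed): data that matches the hypothesis of a group difference.
effect_a = simulate_responses(100, 3.0, 1.2, rng)
effect_b = simulate_responses(100, 4.5, 1.2, rng)

# Scenario 2 (assumed): data with no true difference between the groups.
null_a = simulate_responses(100, 4.0, 1.2, rng)
null_b = simulate_responses(100, 4.0, 1.2, rng)

p_effect = permutation_p_value(effect_a, effect_b, rng)
p_null = permutation_p_value(null_a, null_b, rng)

print(f"p-value when the hypothesis holds: {p_effect:.4f}")
print(f"p-value when there is no effect:   {p_null:.4f}")
```

Running the planned analysis on both simulated datasets is a cheap way to check that it runs end to end and that it behaves sensibly in each case before any real responses are collected.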
Incidentally, I agreed with almost all the advice in this post except for the things in the “Other small lessons about survey design” section. In particular, I think “Almost always have just one question per page” and not using sliding scales rather than lists, seem like things I would not generally recommend (although having one question per page and using lists rather than scales is often fine).
For “screening” questions for competence, unless you literally want to screen people out from taking the survey at all, you might also want to consider running these at the end of the survey rather than the beginning. Rather than excluding people from the survey entirely and not gathering their responses at all, you could gather their data, and then conduct analyses excluding respondents who fail the relevant checks, if appropriate (whether it’s better to gather their data at all or not depends a lot on the specific case). Which order is better is a tricky question. One reason to have such questions later is that respondents can be annoyed by checks which seem like they are trying to test them (this most commonly comes up with comprehension/attention/instructional manipulation checks), which can influence later questions (the DVs you are usually interested in). Of course, in some circumstances, you may be concerned that the main questions will themselves influence responses to your ‘check’ in a way that would invalidate them.
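To make the "gather the data, then exclude at analysis time" option concrete, here is a minimal Python sketch. The `Response` record, its field names, and the five rows of data are all made up for illustration. The point is only that recording the check result alongside the answers lets you report the analysis both with and without the respondents who failed the check.

```python
from dataclasses import dataclass


@dataclass
class Response:
    respondent_id: int
    rating: int          # answer to the main question (hypothetical 1-7 scale)
    passed_check: bool   # result of a check asked at the END of the survey


def mean_rating(responses):
    return sum(r.rating for r in responses) / len(responses)


# Made-up illustrative data, not real survey results.
responses = [
    Response(1, 6, True),
    Response(2, 5, True),
    Response(3, 1, False),
    Response(4, 7, True),
    Response(5, 2, False),
]

# Nobody was screened out up front; exclusion happens only at analysis time.
passed = [r for r in responses if r.passed_check]

full_mean = mean_rating(responses)
filtered_mean = mean_rating(passed)

print(f"All respondents (n={len(responses)}): mean rating {full_mean:.2f}")
print(f"Passed check    (n={len(passed)}): mean rating {filtered_mean:.2f}")
```

Reporting both numbers side by side also shows how sensitive the result is to the exclusion rule.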
For what it’s worth, I’m generally happy to offer comments on surveys people are running, and although I can’t speak for them, I imagine that would go for my colleagues on the survey team at Rethink Priorities too.
What links here?
- Lessons learned running the Survey on AI existential risk scenarios by Sam Clarke (Oct 13, 2021, 11:33 AM; 69 points)
- MichaelA🔸 Oct 14, 2021, 9:38 AM
  5 points
  Thanks—despite already having received great feedback from you on a survey I’m developing, I still found this comment also really useful.
  open comments questions are often dramatically more time-consuming and demanding than fixed response questions and so their inclusion can greatly reduce the quality of the responses to the fixed questions.
  1. Meaning people start getting tired or bored and so give less thoughtful responses to the fixed questions? (Rather than just some people stopping taking the survey, which would be a quantity issue.)
  2. And do you think that holds even if you make it clear that those open comment boxes are very optional? In particular, I plan to say at the start of the survey something like “Feel free to skip questions about which you’d rather not express an opinion, e.g. because you don’t feel you know enough about it. Feel especially free to skip the questions about the robustness of your views and the comment boxes.” (I don’t then remind people at each comment box that that comment box is optional.)
  - David_Moss Oct 14, 2021, 2:07 PM
    10 points
    I think all of the following (and more) are possible risks:
    
    - People are tired/bored and so answer less effortfully/more quickly
    - People are annoyed and so answer in a qualitatively different way
    - People are tired/bored/annoyed and so skip more questions
    - People are tired/bored/annoyed and drop out entirely
    Note that people skipping questions or dropping out is not merely a matter of quantity (reduced numbers of responses), because the dropout and skipping are likely to be differential. The respondents who stop responding will be precisely those who are more likely to be bored, tired, or annoyed by those questions, and more likely to skip questions or drop out when they are, so the responses you do collect come from a systematically different group.
    Regrettably, I think that specifying extremely clearly that the questions are completely optional influences some respondents (it also likely makes many simply less likely to answer these questions), but doesn’t ameliorate the harm for others. You may be surprised how many people will provide multiple exceptionally long open comments and then complain that the survey took them longer than the projected average. That aside, depending on the context, I think it’s sometimes legitimate for people to be annoyed by the presence of lots of open comment questions even if they are explicitly stated to be optional because, in context, it may seem like they need to answer them anyway.
- Sam Clarke Oct 15, 2021, 3:34 PM
  3 points
  Thanks for the detailed reply, all of this makes sense!
  
  I added a caveat to the final section mentioning your disagreements with some of the points in the “Other small lessons about survey design” section
- MichaelA🔸 Oct 14, 2021, 9:40 AM
  3 points
  In particular, I think “Almost always have just one question per page” and not using sliding scales rather than lists, seem like things I would not generally recommend (although having one question per page and using lists rather than scales is often fine).
  Do you have any heuristics for when just one question per page is better than, worse than, or as good as many questions per page?
  (No worries if not—I can also just ask you for private feedback on my specific survey.)
  - David_Moss Oct 15, 2021, 3:50 PM
    14 points
    I think it depends a lot on the specifics of your survey design. The most commonly discussed tradeoff in the literature is probably that having more questions per page, as opposed to more pages with fewer questions, leads to higher non-response and lower self-reported satisfaction, but people answer the former more quickly. But how to navigate this tradeoff is very context-dependent.
    All in all, the optimal number of items per screen requires a trade-off:
    More items per screen shorten survey time but reduce data quality (item nonresponse) and respondent satisfaction (with potential consequences for motivation and cooperation in future surveys). Because the negative effects of more items per screen mainly arise when scrolling is required, we are inclined to recommend placing four to ten items on a single screen, avoiding the necessity to scroll.
    https://www.researchgate.net/publication/249629594_Design_of_Web_Questionnaires_The_Effects_of_the_Number_of_Items_per_Screen
    In this context, survey researchers have to make informed decisions regarding which approach to use in different situations. Thus, they have to counterbalance the potential time savings and ease of application with the quality of the answers and the satisfaction of respondents. Additionally, they have to consider how other characteristics of the questions can influence this trade-off. For example, it would be expected that an increase in answer categories would lead to a considerable decrease in data quality, as the matrix becomes larger and harder to complete. As such, in addition to knowing which approach leads to better results, it is essential to know how characteristics of the questions, such as the number of categories or the device used, influence the trade-off between the use of grids and single-item questions.
    https://sci-hub.ru/https://journals.sagepub.com/doi/full/10.1177/0894439316674459
    But, in addition, I think there are a lot of other contextual factors that influence which is preferable. For example, if you want respondents to answer a number of questions pertaining to a number of subtly different prompts (which is pretty common in studies with a within-subjects component), then having all the questions for one prompt on one page may help make salient the distinction between the different prompts. There are other things you can do to aid this, like having gap pages between different prompts, though these can really enrage respondents.