Thank you for your responses and engagement. Overall, it seems like we agree 1 and 2 are problems; we still disagree about 3; and I don’t think I made my point on 4 clear, and your explanation raises more issues in my mind. While I think these 4 issues are themselves substantive, I worry they are the tip of an iceberg, since 1 and 2 are, in my opinion, relatively basic issues. I appreciate your offer to pay for further critique; I hope someone is able to take you up on it.
Great, I think we agree the approach outlined in the original report should be changed. Did the report actually use the percentage of total papers found? I don’t mean to be pedantic, but it’s germane to my greater point: was this really a miscommunication of the intended analysis, or did the report originally intend to use the number of papers found, as it seems to state and then execute on: “Confidence ratings are based on the number of methodologically robust (according to the two reviewers) studies supporting the claim. Low = 0-2 studies supporting, or mixed evidence; Medium = 3-6 studies supporting; Strong = 7+ studies supporting.”
It seems like we largely agree in not putting much weight on this study. However, I don’t think comparison against a baseline measurement mitigates the bias concerns much. For example, exposure to the protests is a strong signal of social desirability: it’s a chunk of society demonstrating to draw attention to the desirability of action on climate change. This exposure is present in the “after” measurement and absent in the “before” measurement, and is thus differential, potentially biasing the estimates. Such bias could be hiding a backlash effect.
The issue lies in defining “unusually influential protest movements”. This is crucial because you’re selecting on your outcome measurement, which is generally discouraged. The most cynical interpretation would be that you excluded all studies that didn’t find an effect because, by definition, these weren’t very influential protest movements.
Unfortunately, this is not a semantic critique. Call it what you will, but I don’t know what the confidences/uncertainties you are putting forward mean, and your readers would be wrong to assume they do. I didn’t read the entire OpenPhil report, but I didn’t see any examples of using low percentages to indicate high uncertainty. Can you explain concretely what your numbers mean?
My best guess is that this is a misinterpretation of the “90%” in a “90% confidence interval”. For example, maybe you’re interpreting a 90% CI of [2, 4] to mean we are highly confident the effect lies between 2 and 4, while a 10% CI of [2.9, 3.1] would mean we have very little confidence in the effect? This is incorrect: CIs can be constructed at any confidence level regardless of the size of the effect (from null to very large) or the variance in the effect.
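To make that concrete, here’s a minimal sketch (plain Python with made-up numbers, not anything from your report) constructing intervals at several confidence levels from one and the same estimate:

```python
# Hypothetical estimate and standard error, chosen only for illustration.
from statistics import NormalDist

estimate = 3.0   # point estimate, e.g. in percentage points
std_error = 0.6  # standard error of the estimate

for level in (0.90, 0.50, 0.10):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    lo, hi = estimate - z * std_error, estimate + z * std_error
    print(f"{level:.0%} CI: [{lo:.2f}, {hi:.2f}]")
```

All three intervals are computed from the same estimate and standard error; only the chosen coverage level changes, so a narrow 10% interval says nothing about how confident we are in the effect.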
Thank you for pointing to this additional information re your definition of variance; I hadn’t seen it. Unfortunately, it illustrates my point that these superficial methodological issues are likely just the tip of the iceberg. The definition you provide offers two radically different options for the bound of the range you’re describing: randomly selected or median protest. Which one is it? If it’s randomly selected, what prevents randomly selecting the most effective protests, in which case the range would be zero? Etc.
Lastly, I have to ask: in what regard do you not find these critiques methodological? The selection of outcome measures in a review, survey design, the construction of a research question, and the approach to communicating uncertainty all seem methodological; at least, these are topics commonly covered in research methods courses and textbooks.
Thanks for your quick reply Jacob! I think I still largely disagree on how substantive you think these are, and I address these points below. I also feel sad that your comments feel slightly condescending or uncharitable, which makes it difficult for me to have a productive conversation.
Great, I think we agree the approach outlined in the original report should be changed. Did the report actually use the percentage of total papers found? I don’t mean to be pedantic, but it’s germane to my greater point: was this really a miscommunication of the intended analysis, or did the report originally intend to use the number of papers found, as it seems to state and then execute on: “Confidence ratings are based on the number of methodologically robust (according to the two reviewers) studies supporting the claim. Low = 0-2 studies supporting, or mixed evidence; Medium = 3-6 studies supporting; Strong = 7+ studies supporting.”
On the first one: our aim was to examine all the papers (within our other criteria of recency, democratic context, etc.) that related to the impacts of protest on public opinion, policy change, voting behaviour, etc. We didn’t exclude any because they found negative or negligible results, as that would obviously be empirically extremely dubious.
2. It seems like we largely agree in not putting much weight on this study. However, I don’t think comparison against a baseline measurement mitigates the bias concerns much. For example, exposure to the protests is a strong signal of social desirability: it’s a chunk of society demonstrating to draw attention to the desirability of action on climate change. This exposure is present in the “after” measurement and absent in the “before” measurement, and is thus differential, potentially biasing the estimates. Such bias could be hiding a backlash effect.
I didn’t make this clear enough in my first comment (I’ve now edited it), but I think your social desirability critique feels somewhat off. Only 18% of people in the UK were supportive of these protests (according to our survey), and there was a fair bit of negative media attention about them. This makes it hard to believe that respondents would genuinely feel any positive social desirability bias, when the majority of the public actually disapprove of the protests. If anything, negative social desirability bias would be much more likely. I’m open to ideas on how we might test this post hoc with the data we have, but I’m not sure that’s possible.
3. The issue lies in defining “unusually influential protest movements”. This is crucial because you’re selecting on your outcome measurement, which is generally discouraged. The most cynical interpretation would be that you excluded all studies that didn’t find an effect because, by definition, these weren’t very influential protest movements.
Just to reiterate what I said above for clarity: our aim was to examine all the papers that related to the impacts of protest on public opinion, policy change, voting behaviour, etc. We didn’t exclude any because they found negative or negligible results, as that would obviously be empirically extremely dubious. The only reason we specified that our research looks at large and influential protest movements is that these are by default what academics study (as they are interesting and able to get published). There are almost no studies looking at the impact of small protests, which make up the majority of protests, so we can’t claim to have any solid understanding of their impacts. The research was largely aiming to understand the impacts of the largest/most well-studied protest movements, and I think that aim was fulfilled.
4. Unfortunately, this is not a semantic critique. Call it what you will, but I don’t know what the confidences/uncertainties you are putting forward mean, and your readers would be wrong to assume they do. I didn’t read the entire OpenPhil report, but I didn’t see any examples of using low percentages to indicate high uncertainty. Can you explain concretely what your numbers mean?
Sure: what we mean is that we’re 80% confident that our indicated answer is the true answer. For example, for our answers on policy change, we’re 40-60% confident that our finding (highlighted in blue) is correct, i.e. there’s a 40-60% chance we’ve got it wrong. One could also infer from where we’ve placed it on our summary table that, if it is wrong, the true answer is likely to be in the boxes immediately surrounding what we indicated.
For example, the Open Phil report uses confidence in a similar way; here is a quote:
“to indicate that I think the probability of my statement being true is >50%”
I understand that confidence intervals can be constructed for any effect size, but we indicate the effect sizes using the upper row in the summary table (and quantify them where we think it is reasonable to do so).
Lastly, I have to ask: in what regard do you not find these critiques methodological? The selection of outcome measures in a review, survey design, the construction of a research question, and the approach to communicating uncertainty all seem methodological; at least, these are topics commonly covered in research methods courses and textbooks.
The reasons I don’t find these critiques to highlight significant methodological flaws are:
I don’t think we selected the wrong outcome measure; rather, we didn’t communicate it particularly well, which I totally accept.
The survey design isn’t perfect, which I admit, but we didn’t put a lot of weight on it in our report, so in my view this isn’t pointing out a methodological issue with the report. Additionally, you expect high levels of positive social desirability bias, which is the opposite of what I would expect, given that the majority of the public (82% in our survey) don’t support the protests (and report this on the survey, indicating the social desirability bias doesn’t skew positive).
Similar to my first bullet point: I think the research question is well constructed (i.e. it wasn’t selecting on the outcome, as I clarified), but you’ve read it in a fairly uncharitable way (which, admittedly, is possible because we were vaguer than ideal).
Finally, I think we’ve communicated uncertainty in quite a reasonable way, and other feedback we’ve received indicates that people fully understood what we meant. We’ve received 4+ other pieces of feedback on our uncertainty communication which people found useful and informative, so I’m currently putting more weight on that than on your view. That said, I do think it can be improved, but I’m not sure it’s as much a methodological issue as a communication issue.
I also feel sad that your comments feel slightly condescending or uncharitable, which makes it difficult for me to have a productive conversation.
I’m really sorry to come off that way, James. Please know it’s not my intention, but duly noted, and I’ll try to do better in the future.
Got it; that’s helpful to know, and thank you for taking the time to explain!
Social desirability bias (SDB) is generally hard to test for post hoc, which is why it’s so important to design studies to avoid it. As the surveys suggest, not supporting protests doesn’t imply people don’t report support for climate action; so, for example, the responses about support for climate action could be biased upwards by the social desirability of climate action, even though those same respondents don’t support protests. Regardless, I don’t claim to know for certain that these estimates are biased upwards (or downwards, for that matter, in which case maybe the study is a false negative!). Instead, I’d argue the design itself is susceptible to social desirability and other biases. It’s difficult, if not impossible, to sort out how those biases affected the result, which is why I don’t find this study very informative. I’m curious why, if you think the results weren’t likely biased, you chose to down-weight it?
Understood; thank you for taking the time to clarify here. I agree this would be quite dubious. I don’t mean to be uncharitable in my interpretation: unfortunately, dubious research is the norm, and I’ve seen errors like this in the literature regularly. I’m glad they didn’t occur here!
Great, this makes sense and seems like standard practice. My misunderstanding arose from an error in the labeling of the tables: uncertainty level 1 is labeled “highly uncertain,” but this is not the case for all values in that range. For example, suppose you were 1% confident that protests led to a large change. Contrary to the label, we would be quite certain protests did not lead to a large change. It would make sense to label 20% confidence as highly uncertain, since it reflects a uniform distribution of confidence across the five effect-size bins, but confidences below that in fact reflect increasing certainty about the negation of the claim. I’d suggest using traditional confidence intervals here instead, as they’re more familiar and standard, e.g.: we believe the average effect of protests on voting behavior lies in the interval [1, 8] percentage points with 90% confidence, or [3, 6] pp with 80% confidence.
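To illustrate with a quick sketch (plain Python, my own made-up numbers, not from your report): with five effect-size bins, putting 20% on a bin is the uniform, maximally uncertain assignment, whereas putting 1% on “large change” is really a 99%-confident claim that there was no large change.

```python
# Five effect-size bins, as in the summary table; numbers are illustrative only.
from math import log2

def entropy(probs):
    """Shannon entropy in bits; higher means more uncertain overall."""
    return -sum(p * log2(p) for p in probs if p > 0)

n_bins = 5
for p_large in (0.01, 0.20, 0.80):
    # Put p_large on "large change" and spread the remainder evenly over the rest.
    rest = (1 - p_large) / (n_bins - 1)
    dist = [p_large] + [rest] * (n_bins - 1)
    print(f"P(large change) = {p_large:.0%}  ->  "
          f"P(no large change) = {1 - p_large:.0%}, entropy = {entropy(dist):.2f} bits")
```

Only the uniform 20% case is maximally uncertain; the 1% case is a near-certain statement that the claim is false, which is why the “highly uncertain” label doesn’t fit it.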
Further adding to my confusion, the phrase “which can also be interpreted as 0-100% confidence intervals” doesn’t reflect the standard usage of the term “confidence interval.”
The reasons I don’t find these critiques to highlight significant methodological flaws are:
Sorry, I think this was a miscommunication in our comments. I was referring to “Issues you raise are largely not severe nor methodological,” which gave me the impression you didn’t think the issues were related to the research methods. I understand your position here better.
Anyway, I’ll edit my top-level comment to reflect some of this new information; this generally updates me toward thinking this research may be more informative. I appreciate your taking the time to engage so thoroughly, and apologies again for giving an impression of anything less than the kindness and grace we should all aspire to.