After taking a closer look at the actual stats, I agree this analysis seems really difficult to do well, and I don’t put much weight on this particular set of tests. But your hypothesis is plausible and interesting, your data is strong, your regressions seem like the right general idea, and this seems like a proof of concept that this analysis could demonstrate a real effect. I’m also surprised that I can’t find any statistical analysis of COVID and Biden support anywhere, even though it seems very doable and very interesting. If I were you and wanted to pursue this further, I would figure out the strongest case that there might be an effect to be found here, then bring it to some people who have the stats skills and public platform to find the effect and write about it.
Statistically, I think you have two interesting hypotheses, and I’m not sure how you should test them or what you should control for. (Background: I’ve done undergrad intro stats-type stuff.)
Hypothesis A (Models 1 and 2) is that more COVID is correlated with more Biden support.
Hypothesis B (Model 3) is that more Biden support is correlated with more tests, which in turn has unclear causal effects on the COVID metrics.
I say “more COVID” to be deliberately ambiguous because I’m not sure which tracking metric to use. Should we expect Biden support to be correlated with tests, cases, hospitalizations, or deaths? And for each metric, should it be cumulative over time, or change over a given time period? What would it mean to find different effects for different metrics? Also, they’re all correlated with each other—does that bias your regression, or otherwise affect your results? I don’t know.
I also don’t know what controls to use. Controlling for state-level fixed effects seems smart, while controlling for date is interesting and potentially captures a different dynamic, but I have no idea how you should control for the correlated bundle of tests/cases/hospitalizations/deaths.
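To make the controls concrete, here's a toy sketch of what the state-fixed-effects-plus-time version of Hypothesis A might look like. Everything here is a placeholder: the data is synthetic, and the column names (`biden_support`, `cases_per_100k`, `state`, `date`) are made up for illustration, not taken from your actual panel.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic state-by-week panel with a small planted effect of cases
# on Biden support, purely to show the regression's shape.
rng = np.random.default_rng(0)
states = ["AZ", "FL", "MI", "PA", "WI"]
dates = pd.date_range("2020-03-01", periods=20, freq="W")
df = pd.DataFrame(
    [(s, d) for s in states for d in dates], columns=["state", "date"]
)
df["cases_per_100k"] = rng.gamma(2.0, 50.0, len(df))
df["biden_support"] = (
    48 + 0.01 * df["cases_per_100k"] + rng.normal(0, 2, len(df))
)
df["t"] = (df["date"] - df["date"].min()).dt.days  # days since start

# C(state) absorbs fixed state-level differences; t is a crude linear
# control for the shared national trend over the period.
model = smf.ols(
    "biden_support ~ cases_per_100k + C(state) + t", data=df
).fit()
print(round(model.params["cases_per_100k"], 3))
```

This only shows the mechanics; it doesn't answer the harder question of which COVID metric belongs on the right-hand side, or whether a linear time control is the right way to handle the date.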
Without resolving these issues, I think the strongest evidence in favor of either hypothesis would be a battery of different regressions that systematically tests many different implementations of the overall hypothesis, with most of them appearing to support it. I’m not sure what the right implementation is, I’d want someone with a strong statistics background to resolve these issues before really believing the result, and this method can fail; but if most implementations you can imagine point in the same direction, that’s at least a decent reason to investigate further.
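The battery idea can be sketched mechanically: run one simple specification per metric and per transformation (cumulative level vs. change), then check how many point the same way. Again, the data below is synthetic with a planted positive effect on the levels, and all names are placeholders rather than your actual variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
metrics = ["tests", "cases", "hospitalizations", "deaths"]

# Synthetic data: each metric gets a small planted positive effect
# on support, so the levels-based specifications should agree in sign.
df = pd.DataFrame({"biden_support": 48 + rng.normal(0, 2, n)})
for m in metrics:
    df[m] = rng.gamma(2.0, 30.0, n)
    df["biden_support"] += 0.02 * df[m]
    df[f"d_{m}"] = np.r_[0, np.diff(df[m])]  # period-over-period change

# One regression per (metric, transformation) cell; record the sign
# of each estimated coefficient.
signs = []
for m in metrics:
    for col in (m, f"d_{m}"):  # cumulative level vs. change
        fit = smf.ols(f"biden_support ~ {col}", data=df).fit()
        signs.append(np.sign(fit.params[col]))

# Share of specifications agreeing with the majority sign.
agree = max(signs.count(1), signs.count(-1)) / len(signs)
print(agree)
```

If most cells agree in sign, that's the weak-but-real "decent reason to investigate further"; if they split, that's informative too. A real version would also add the fixed-effects controls to each cell rather than running them bare.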
If you actually want to convince someone to look into this (with or without you), maybe do that battery of regressions, then write up a very generalized takeaway along the lines of “The hypothesis is plausible, the data is here, and the regressions don’t rule out the hypothesis. Do you want to look into whether or not there’s an effect here?”
Who’d be interested in this analysis? Strong candidates might include academics, think tanks, data journalism news outlets, and bloggers. The stats seem very difficult, perhaps difficult enough that academics are the best fit, but I don’t know. News outlets and bloggers that aren’t specifically data savvy probably aren’t capable of doing this analysis justice. Without working with someone with a very strong stats background, I’d be cautious about writing this for a public audience.
Not sure if you’re even interested in any of that, but FWIW I think those audiences would like your ideas and progress so far. If you’d like to talk about this more, I’m happy to chat; you can pick a time here. Cool analysis, kudos on thinking of an interesting topic, seriously following through with the analysis, and recognizing its limitations.
Thank you so much for putting so much thought into this and writing up all of that advice! Your uncertainties and hesitations about the stats itself are essentially the same as my own. Last night, I passed this around to a few people who know marginally more about stats than I do, and they suggested some further robustness checks that they thought would be appropriate. I spent a bunch of time today implementing those suggestions, identifying problems with my previous work, and re-doing that work differently. In the process, I think I significantly improved my understanding of the right (or at least good) way to approach this analysis. I did, however, end up with a quite different (and less straightforward) set of conclusions than I had yesterday. I’ve updated the GitHub repository to reflect the current state of the project, and I will likely update the shortform post in a few minutes, too. Now that I think the analysis is in much better shape (and, frankly, that you’ve encouraged me), I am more seriously entertaining the idea of trying to get in touch with someone who might be able to explore it further. I think it would be fun to chat about this, so I’ll probably book a time on your Calendly soon. Thanks again for all your help!
Glad to hear it! Very good idea to talk with a bunch of stats people; your updated tests are definitely beyond my understanding. Looking forward to talking (or not), and let me know if I can help with anything.
Thanks! I booked a slot on your Calendly—looking forward to speaking Thursday (assuming that still works)!