Yay for Bayesian regression (binomial, I’m guessing? You re-binned your attitude and donations responses? I think an ordered logit would be more appropriate here and result in less of a loss in resolution, or even a dirichlet, but then you’d lose yer ordering)! Those posteriors look decently tight, though I do have some questions!
I’m a little confused about what your control was, exactly. You have both points and distributions in your posterior plots, but you don’t have any control paragraph blurb in your Google Doc questionnaire. How did you evaluate your control? Did you give them a paragraph entirely unrelated to EA? These plots are the posterior estimates for p_binomial when each dummy variable for treatment is 0? Is “average treatment effect” some posterior predictive difference from the control p (i.e. why it’s exactly 0)?
On a related (and elucidatory) note, could you more explicitly clarify which models you fitted, exactly? Did you do any model comparison or averaging, or evaluate model adequacy? You mention “controlling for other variables in the survey” but I don’t see any e.g. demographic questions in your questionnaire. You said you “examined these relationships overall and among the critical subgroup of those with at least a bachelor’s degree”—did you do this by excluding everyone without a bachelor’s, or by modeling the effects of educational attainment and then doing model comparison to test the legitimacy of those effects (I’d think looking at the posterior for the interaction between your paragraph and education dummies would be the clearest test)? Did you use diffuse, “uninformative” priors (and hyperpriors)? Which ones, exactly?
I assume that since this is a hierarchical analysis you used MCMC (HMC?) to do the fitting. Are your posterior distributions smoothed substantially, e.g. with a kernel density estimator? Or did you just get fantastic performance? What diagnostics did you run to ensure MCMC health? How many chains did you run? Did you use stopping rules? In my experience, hierarchical regression models can be pretty finicky to fit as they get more complex.
Kudos on not just using some wackily inappropriate out-of-the-box frequentist test!
edit: also, what are the boxplot-looking things? 95% HPDIs? CIs? Some other %? Ah wait they’re the sd of your marginal samples?
It would be cool to provide the code, for both learning and verification purposes.
Unfortunately, because I used proprietary survey data/a proprietary R package to run this analysis, I don’t think I’ll be able to share the data and code.
Ah, interesting! What package? I’ve never heard of something like that before. Usually in the cold, mechanical heart of every R package is the deep desire to be used and shared as far as possible. If it’s just someone’s personal interface code, why not use something more publicly available? Can you write out your basic script in pseudocode (or just math/words?)? Especially the model and MCMC specification bits?
Sure, in an ideal world, software would all be free for everyone; alas, we do not live in such a world :p. I used the proprietary package because it did exactly what I needed and doesn’t require writing STAN code or anything myself. I’d rather not re-invent the wheel. I felt the tradeoff of transparency for efficiency and confidence in its accuracy was worth it, especially since I wouldn’t be able to share the data either way (such are the costs of getting these questions on a 1200 person survey without paying a substantial amount).
But the basic model was just a multilevel binomial model predicting the dependent variable using the treatments and questions asked earlier in the survey as controls.
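For anyone following along, here is one way a model of that general shape could be written down. This is only a guess at the structure, not the proprietary package's actual specification: the logit link, the varying treatment-by-education effects, and every prior below are assumptions made for illustration, since the thread never states them.

```latex
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) &= \alpha + \beta_{t[i],\,g[i]} + \mathbf{x}_i^{\top}\boldsymbol{\gamma} \\
\beta_{t,g} &\sim \mathrm{Normal}(\mu_t, \sigma) \\
\mu_t &\sim \mathrm{Normal}(0, 1), \qquad \sigma \sim \mathrm{HalfNormal}(1)
\end{aligned}
```

Here y_i is the binary (re-binned) response, t[i] is the message respondent i saw (with the control's effect fixed at 0), g[i] is the education group, and x_i holds the earlier survey questions used as controls. Priors on alpha and gamma are left out; the Normal(0, 1) and HalfNormal(1) choices are placeholders.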
Of course (though wheel reinvention can be super helpful educationally), but there are great free public R packages that interface to STAN (I use “rethinking” for my hierarchical Bayesian regression needs but I think RStan would work, too), so going with someone’s unnamed, private code isn’t necessary imo. How much did the survey cost (was it a lot longer than the included Google Doc, then? e.g. Did you have screening questions to make sure people read the paragraph?). And model+MCMC specification can have lots of fiddly bits that can easily lead us astray, I’d say.
Yeah, the survey was a lot longer. Typically general public surveys will cost over 10 dollars a complete, so getting 1200 cases for a survey like this can cost thousands of dollars.
I agree that model specification can be tricky, which is a reason I felt it well worth it to use the proprietary software I had access to that has been thoroughly vetted and code reviewed and is used frequently to run similar analyses rather than trying to construct my own.
I did not make sure people read the paragraph. I discussed the issue a bit in my discussion section, but one way a web survey might understate the effect is if people would pay closer attention and respond better to a friend delivering the message. OTOH, surveys do have some potential vulnerability to the Hawthorne effect, though that didn’t seem to express itself in the donations question.
Yep, and alongside it, of course, the raw data!
Yup, binomial.
The respondents in a treatment were each shown a message and asked how compelling they thought it was. The control was shown no message.
Yeah; the plots are the predicted values for those given a particular treatment, and Average Treatment Effect is the difference from the control.
I did not include every control used in the provided questionnaire. There was a mix of demographic/attitudinal/behavioral questions asked in the survey that I also used. These controls, particularly previous donations, were important for decreasing variance.
I used a multilevel model to estimate the effects among those with and without a bachelor’s degree. So, the bachelor’s estimate borrows power from those without a degree, reducing problems with overfitting.
These models used STAN, which handles these multilevel models well. Convergence was assessed with Gelman-Rubin statistics.
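A minimal sketch of that "difference with the control" idea, assuming the model yields paired posterior draws of the predicted probability for each arm. The function names and the numbers are invented for illustration and are not the survey's results. It also shows why the control's own effect sits at exactly 0: it is the control subtracted from itself, draw by draw.

```python
def treatment_effect_draws(p_treatment, p_control):
    """Pair up posterior draws and subtract: one effect value per draw."""
    if len(p_treatment) != len(p_control):
        raise ValueError("both arguments must hold the same number of draws")
    return [t - c for t, c in zip(p_treatment, p_control)]


def central_interval(draws, mass=0.95):
    """Equal-tailed interval from sorted draws (not an HPDI)."""
    ordered = sorted(draws)
    lower = round((1 - mass) / 2 * (len(ordered) - 1))
    upper = len(ordered) - 1 - lower
    return ordered[lower], ordered[upper]


# Made-up draws purely for illustration; these are NOT the survey's results.
control = [0.30, 0.32, 0.31, 0.29, 0.33]
message = [0.36, 0.35, 0.38, 0.34, 0.37]

effect = treatment_effect_draws(message, control)
baseline = treatment_effect_draws(control, control)  # control vs. itself

print(sum(effect) / len(effect))  # posterior mean of the effect
print(central_interval(effect))   # interval over the effect draws
print(baseline)                   # every entry is 0.0
```

With real output you would have thousands of draws per arm rather than five, and the interval would be correspondingly more meaningful.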
Ah, I guess that’s better than no control, and presumably paying attention to a paragraph of text doesn’t make someone substantially more or less generous. Did you fit a bunch of models with different predictors and test for a sufficient improvement of fit with each? Might do to be wary of overfitting in those regards maybe… though since those aren’t focal, Bayes tends to be pretty robust there, imo, so long as you used sensible priors.
“I used a multilevel model to estimate the effects among those with and without a bachelor’s degree. So, the bachelor’s estimate borrows power from those without a degree, reducing problems with overfitting.”
If I’m understanding correctly, you had a hyperprior on the effect of education level? With just two options? IDK that that would help you much (if you had more: e.g. HS, BA/S, MS, PhD, etc. it might, but I’d try to preserve ordering there, myself).
“These models used STAN, which handles these multilevel models well. Convergence was assessed with Gelman-Rubin statistics.”
STAN’s great, but certainly not magic or perfect, and though idk them personally I’m sure its authors would strongly advocate paranoia about its output. So you got convergence with multiple (2?) chains from a random (hopefully) starting value? R_hats were all 1? That’s good! Did all the other cheap diagnostics turn up ok (e.g. trace plots, autocorrelation times/ESS, marginal histograms, quick within-chain metrics, etc.)?
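For reference, the Gelman-Rubin statistic being discussed is simple enough to sketch from scratch. Below is the classic (non-split) version for a single parameter, written in plain Python as an illustration only; the function name is made up, and Stan's own R-hat differs in that it splits each chain in half first.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor (R-hat) for one parameter.

    `chains` is a list of equal-length lists of draws, one list per chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])  # W: average within-chain variance
    between = n * variance(chain_means)           # B: between-chain variance
    if within == 0:
        raise ValueError("chains have zero within-chain variance")

    # Pooled estimate of the posterior variance, then the ratio to W.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Two chains stuck in different places: R-hat is far above 1.
print(gelman_rubin([[0, 1, 0, 1], [10, 11, 10, 11]]))

# Two chains covering the same values: R-hat is at or below 1.
print(gelman_rubin([[1, 2, 3, 4], [4, 3, 2, 1]]))
```

Values near 1 only say the chains agree with each other; they can all agree and still be wrong, which is why the other checks listed above are worth the few extra minutes.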
No; I did not fit multiple models. Lasso regression was used to fit a propensity model using the predictors.
Using bachelor’s vs. non-bachelor’s has advantages in interpretability, so I think this was the right move for my purposes.
I did not spend an exorbitant amount of time investigating diagnostics, for the same reason I used a proprietary package that has been built for running these tests at a production level and has been thoroughly code reviewed. I don’t think it’s worth the time to construct an overly customized analysis.
Ah, gotcha. But re: code review, even the most beautifully constructed chains can fail, and how you specify your model can easily cause things to go kaboom even if the machine’s doing everything exactly how it’s supposed to. And it only takes a few minutes to drag your log files into something like Tracer and do some basic peace-of-mind checks (and others, e.g. examine bivariate posterior distributions to assess nonidentifiability wrt your demographic params). More sophisticated diagnostics are scattered across a few programs but don’t take too long to run either (unless you have e.g. hundreds or thousands of chains, like in marginal likelihood estimation w/ stepping stones… a friend’s actually coming out with a program soon, BONSAI, that automates a lot of that grunt work, which might be worth looking out for!). :]
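As a rough illustration of the bivariate check mentioned above: a quick first pass is to compute the correlation between every pair of parameters' posterior draws and look hard at any pair that moves in near lockstep, since a tight ridge in the joint posterior is one symptom of weak identifiability. The sketch below assumes the draws are already in a dict keyed by parameter name; the function names and the 0.9 cutoff are arbitrary choices for illustration, not a standard.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of draws."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 draws")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a parameter with constant draws has no correlation")
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(draws_by_param, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above threshold."""
    names = sorted(draws_by_param)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(draws_by_param[a], draws_by_param[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Toy draws, invented for illustration: "a" and "b" trade off perfectly.
toy = {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, -1, -1, 1]}
print(flag_correlated_pairs(toy))
```

A flagged pair is a prompt to plot the two parameters against each other, not a verdict; correlation alone misses curved or multimodal joint posteriors.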
(on phone at gym with shit wifi so can’t provide links/refs atm, sorry!)
Do you have any good textbooks or educational resources to learn these kinds of techniques?
Sure! Though unfortunately most of the stuff comes from scattered lectures, workshops, discussions, book chapters, seminars, papers, etc. But for intro multilevel Bayesian regression in R/STAN I’d say John Kruschke’s “Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan” and Richard McElreath’s “Statistical Rethinking: A Bayesian Course with Examples in R and Stan” would be really solid (Richard also has his course lectures up on YouTube if you prefer that, though I found his book super readable, so much so that when I took the class with him a few years back I skipped most of his lectures since the room was really hot. But don’t let that dissuade you from watching them, he’s a great guy/speaker and quite fun and funny!).
Purely in terms of building my own intuitions/understanding, though, I’ve found little more helpful than just looking up the relevant algorithms and implementing the damn things from scratch (to talk of reinventing square wheels above lol… though ofc you’d use the far superior underlying code others have written for your actual analysis).
Sounds interesting. Would love to take a look when you get a chance to provide the links.