Thanks for sharing the results and thanks, in particular, for including the results for the particular measures, rather than just the composite score.
“high writing scores predicted less engagement… Model (3) shows what is driving this: our measures of open-mindedness and commitment. It is unclear why this is. One story for open-mindedness could be that open-minded applicants are less likely to go all-in on EA socials and events and prefer to read widely. And a story for commitment could be that those most committed to the fellowship spent more time reading the extra readings and thus had less time for non-fellowship engagement.”
Taking the results at face value, it seems like this could be explained by your measures systematically measuring something other than what you take them to be measuring (i.e. a construct validity problem). For example, perhaps your measures of “open-mindedness” or “commitment” actually just tracked people’s inclination to acquiesce to social pressure, or something associated with it. Of course, I don’t know how you actually measured open-mindedness or commitment, so my speculation isn’t based on having any particular reason to think your measures were bad.
Of course, not taking the results at face value, it could just be idiosyncrasies of what you note was a small sample. It could be interesting to see plots of the relationships between some of the variables, to help get a sense of whether some of the effects could be driven by outliers, etc.
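For concreteness, a plot like the sketch below is the kind of check being suggested; the data file and column names are hypothetical placeholders, not the study’s actual variables:

```python
# Minimal sketch of an outlier check: scatter each application-score measure
# against the engagement outcome. The CSV file and column names
# ("open_mindedness", "commitment", "writing", "engagement") are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("applicant_scores.csv")  # hypothetical file

measures = ["open_mindedness", "commitment", "writing"]
fig, axes = plt.subplots(1, len(measures), figsize=(12, 4), sharey=True)
for ax, col in zip(axes, measures):
    ax.scatter(df[col], df["engagement"], alpha=0.7)
    ax.set_xlabel(col)
axes[0].set_ylabel("engagement")
plt.tight_layout()
plt.show()
```

With a sample this small, one scatter per measure makes it easy to see whether one or two outlying applicants are pulling a coefficient around.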
Thanks for the comment!

I think it’s completely plausible that these two measures were systematically measuring something other than what we took them to be measuring. The confusing part is what they were actually measuring, and why those traits had negative effects.
(The way we judged open-mindedness, for example, was by asking applicants to describe an instance in which they changed their mind in response to evidence.)
But I do think the most likely explanation is the small sample.
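For a rough sense of how much sampling noise to expect, here is a minimal simulation; the sample size of 20 is an assumption for illustration, not the fellowship’s actual number of participants:

```python
# Sketch: how much estimated correlations bounce around in small samples,
# even when the true correlation is exactly zero. n = 20 is an illustrative guess.
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 20, 10_000

corrs = np.empty(n_sims)
for i in range(n_sims):
    x = rng.normal(size=n)   # e.g. an application-score measure
    y = rng.normal(size=n)   # e.g. later engagement, unrelated by construction
    corrs[i] = np.corrcoef(x, y)[0, 1]

# With n = 20, sample correlations with |r| > 0.3 turn up fairly often by chance.
print(f"SD of estimated r: {corrs.std():.2f}")
print(f"Share with |r| > 0.3: {(np.abs(corrs) > 0.3).mean():.0%}")
```

Coefficients of a size that would look surprising in a large study can easily appear by chance at this scale, in either direction.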
I think we tend to confuse ‘lack of strong statistical significance’ with ‘no predictive power’.
A small amount of evidence can substantially improve our decision-making...
… even if we cannot conclude that ‘data with a correlation this large or larger would be very unlikely to be generated (p<0.05) if there were no correlation in the true population’.
We, very reasonably, substantially update our beliefs and guide our decisions based on small amounts of data. See, e.g., the ‘Bayes rule’ chapter of Algorithms to Live By.
I believe that for optimization and decision-making problems we should take a different approach, both to the design and to the assessment of results, than when we are trying to measure and test for scientific purposes. This relates to ‘reinforcement learning’ and to ‘exploration sampling’.
We need to make a decision in one direction or another, and we need to consider the costs and benefits of collecting and using these measures. I believe we should be taking a Bayesian approach, updating our belief distribution…
… and considering the value of the information generated (in industry, the ‘lift’, ‘profit curve’, etc.) in terms of how it improves our decision-making.
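As a minimal sketch of the kind of update being described, assuming a normal prior over a coefficient and a normal approximation to its estimate (all numbers below are illustrative, not taken from the post):

```python
# Minimal sketch of a Bayesian update on a regression coefficient.
# Prior: coefficient ~ Normal(prior_mean, prior_sd^2)
# Data:  point estimate b with standard error se, treated as Normal(coef, se^2).
# All numbers are illustrative placeholders, not values from the post.
import math

def update(prior_mean, prior_sd, b, se):
    """Posterior mean and sd for a normal prior combined with a normal estimate."""
    prior_prec, data_prec = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * b)
    return post_mean, math.sqrt(post_var)

# A noisy, statistically 'insignificant' estimate (b/se = 0.75 here, p ≈ 0.45)
# still moves the probability that the coefficient is positive from 50% to ≈ 72%.
post_mean, post_sd = update(prior_mean=0.0, prior_sd=0.5, b=0.3, se=0.4)
p_positive = 0.5 * (1 + math.erf(post_mean / (post_sd * math.sqrt(2))))
print(f"posterior mean {post_mean:.2f}, sd {post_sd:.2f}, P(coef > 0) ≈ {p_positive:.0%}")
```

The decision-relevant question is then whether a shift like that changes who gets selected, and whether the improvement is worth the cost of collecting and scoring the measure.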
Note: I am exploring these ideas and hoping to learn, share and communicate more. Maybe others in this forum have more expertise in ‘reinforcement learning’ etc.
Thanks for writing this!

This is very reasonable; ‘no predictive power’ is a simplification.
Purely academically, I am sure a well-reasoned Bayesian approach would get us closer to the truth. But I think the conclusions drawn still make sense, for three reasons:
1. I did not report them in the table, but the p-values for the insignificant coefficients were very high, often around p=0.85. I think this constitutes so little evidence that the Bayesian update would be too minor to be worth formally conducting (see the sketch after these three points for a rough sense of how small it is).
2. Given that we do have evidence of some other variables being predictive, updating in favour of weighting those higher still makes sense (although maybe to a lesser degree than I implied in the post).
3. The time applicants and facilitators spend on the many different criteria we used is a cost (and a meaningful one for smaller groups). I would guess that cutting down the number of variables used would increase productivity by more than we would lose from giving up the small updates that variables with little (but non-zero) predictive power allow.
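To illustrate the first point, here is a minimal sketch, reusing the normal-normal update from the sketch above, of how little a coefficient with p ≈ 0.85 moves the posterior (the prior width and standard error are illustrative assumptions):

```python
# Sketch: how little a p ≈ 0.85 coefficient moves a simple normal-normal posterior.
# A two-sided p of 0.85 corresponds to an estimate only about 0.19 standard
# errors away from zero. Prior width and standard error are illustrative.
import math

def update(prior_mean, prior_sd, b, se):
    prior_prec, data_prec = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * b)
    return post_mean, math.sqrt(post_var)

se = 0.4
b = 0.19 * se                      # point estimate implied by p ≈ 0.85
post_mean, post_sd = update(prior_mean=0.0, prior_sd=0.5, b=b, se=se)
p_positive = 0.5 * (1 + math.erf(post_mean / (post_sd * math.sqrt(2))))
print(f"P(coef > 0) moves from 50% to ≈ {p_positive:.0%}")   # roughly 56%
```

A move from 50% to roughly 56% is a real but tiny update, which is consistent with treating these coefficients as carrying very little evidence.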