Thanks for the post, but I don’t think you can conclude from your analysis that your criteria weren’t helpful, and the result is not necessarily that surprising.
If you look at NBA players, there’s not much correlation between how tall a player is and how much they get paid (or any other measure of how good they are). Does this mean NBA teams are making a mistake by choosing tall players? Of course not!
The problem with your analysis is a form of selection bias (sometimes discussed under ‘range restriction’ or ‘collider bias’). You are looking at the correlation between two variables (interview score and engagement) within a specific subpopulation: the people who scored highly on the interview. That subpopulation correlation may not be representative of the correlation between interview score and engagement in the broader relevant population, i.e. all students who applied to the fellowship. This is related to David Moss’s comment on range restrictions.
The correlation in the population is the thing you care about, not the correlation in your subpopulation. You want to know whether the scores are helpful for selecting people into or out of the fellowship. For this, you need to know about engagement of people not in the fellowship as well as people in the fellowship.
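To make the mechanism concrete, here’s a minimal Python sketch with invented numbers (the 0.6 coefficient and the 20% admission rate are assumptions for illustration, not anything from your data). Scores are genuinely predictive in the full applicant pool, yet the correlation among admitted applicants alone comes out much weaker:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical applicant pool where interview score genuinely
# predicts later engagement (true correlation ~0.5).
score = rng.normal(size=n)
engagement = 0.6 * score + rng.normal(size=n)

# Admit only the top 20% of applicants by interview score.
admitted = score > np.quantile(score, 0.8)

print(np.corrcoef(score, engagement)[0, 1])            # full pool: ~0.51
print(np.corrcoef(score[admitted],
                  engagement[admitted])[0, 1])         # admitted only: ~0.27
```

Selecting on the score compresses its variance among admitted applicants, which mechanically shrinks the observed correlation even though the underlying relationship hasn’t changed at all.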
This sort of thing comes up all the time, as in the basketball case. Another common example with a clear analogy to your case is grad school admissions. Among admitted students, GRE scores are (usually) not predictive of success. Does that mean schools shouldn’t select students based on the GRE? Only if the relationship between success and GRE scores for admitted students is representative of the relationship for unadmitted students, which is unlikely to be the case.
The simplest thing you could do to improve this would be to measure engagement for all the people who applied (or whom you interviewed, if you only have scores for them) and then re-estimate the correlation on the full sample rather than the selected subsample, as in the sketch above. This would give a better answer to your question of whether scores are predictive of engagement. The things included in your engagement measure seem pretty easy to observe, so this should be straightforward. However, a lot of them are explicitly tied to participation in the fellowship, which biases the measure towards fellows somewhat; if you could construct an alternative engagement measure that doesn’t include these, that would likely be better.
Broadly, I agree with your points. You’re right that we don’t care about the relationship in the subpopulation, but rather about the relationship in the broader population. However, there are a couple of things I think are important to note here:
First, as mentioned in my response to the comment on range restrictions, in some cases we did not reject many people at all. In those cases, our subpopulation was almost the entire population, which is not the case in the NBA or GRE examples.
Second, and possibly more importantly: we only know of maybe three cases of people who were rejected from the fellowship but later became involved in the group in any way at all. All of them were rejected, later reapplied, and completed the fellowship. We suspect this is both because the fellowship causes people to become engaged and because people who are rejected may be less likely to want to get involved. As a result, it wouldn’t really make sense to try to measure engagement in this group.
In general, we believe that in order to use a selection method based on subjective interview rankings—which are very time-consuming and open us up to the possibility of implicit bias—we need to have some degree of evidence that our selection method actually works. After two years, we have found none using the best available data.
That being said, this fall we ended up admitting everyone we interviewed. Once we know more about how engaged these fellows end up being, we can follow up with an analysis that truly covers the entire population.
> The simplest thing you could do to improve this would be to measure engagement for all the people who applied and then re-estimate the correlation on the full sample rather than the selected subsample… However, a lot of them are explicitly tied to participation in the fellowship, which biases the measure towards fellows somewhat; if you could construct an alternative engagement measure that doesn’t include these, that would likely be better.
The other big issue with this approach is that it would likely be confounded by the treatment effect of being selected for and undertaking the fellowship. That is, we would hope that going through the fellowship actually makes people more engaged, which would lead to people with higher scores (who get accepted to the fellowship) also having higher engagement scores.
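For what it’s worth, a toy simulation in the same style as the sketch above (all numbers invented, including the assumed +1.0 treatment effect) shows how this confound can manufacture a correlation out of nothing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Suppose interview scores carry NO information about engagement.
score = rng.normal(size=n)
baseline_engagement = rng.normal(size=n)  # independent of score

# Top scorers are admitted, and completing the fellowship itself
# raises engagement (an assumed treatment effect of +1.0).
admitted = score > np.quantile(score, 0.8)
measured_engagement = baseline_engagement + 1.0 * admitted

# The full-sample correlation comes out clearly positive (~0.26)
# even though the score itself is pure noise here.
print(np.corrcoef(score, measured_engagement)[0, 1])
```

So a positive full-sample correlation on its own wouldn’t distinguish “the scores were predictive” from “the fellowship worked.”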
But perhaps what you had in mind was combining the simple approach with a more complex approach, like randomly selecting people for the fellowship across the range of predictor scores and evaluating the effects of the fellowship as well as the effect of the initial scores?
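If it’s useful, here is a rough sketch of what that randomized design would buy, under an assumed data-generating process (the 0.3 score effect, the +1.0 fellowship effect, and the 50% lottery are all hypothetical): because admission is independent of score, a simple regression can estimate the two effects separately.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

score = rng.normal(size=n)
baseline = rng.normal(size=n)

# Hypothetical lottery: admission is randomized independently of
# score, so the two effects are no longer entangled.
admitted = rng.random(n) < 0.5

# Assumed truth: a modest score effect plus a fellowship effect.
engagement = 0.3 * score + 1.0 * admitted + baseline

# Ordinary least squares recovers both coefficients separately.
X = np.column_stack([np.ones(n), score, admitted])
coef, *_ = np.linalg.lstsq(X, engagement, rcond=None)
print(coef)  # approximately [0.0, 0.3, 1.0]
```

The key ingredient is the randomization itself: when admission is determined by the score, the score and treatment columns move together by construction and the regression cannot cleanly separate them.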