Interesting, but given the small sample and the limited range of scores (and I also agree with the points made by Moss and Rhys-Bernard), I’m not sure whether you have enough data/statistical power to say anything substantially informative or conclusive. Even saying ‘we have evidence that there is not a strong relation’ may be too strong.
To help us understand this, can you report (frequentist) confidence intervals around your estimates? (Or, even better, a Bayesian approach: start from a flat, minimally informative prior and report the posterior distribution in light of the data?)
I’ll try to say more on this later. A good reference is: Harms and Lakens (2018), “Making ‘null effects’ informative: statistical techniques and inferential frameworks”
Also, even ‘insignificant’ results may actually be rather informative for practical decision-making… if they cause us to substantially update our beliefs. We rationally make inferences and adjust our choices based on small amounts of data all the time, even if we can’t say something like ‘it is less than 1% likely that what I just saw would have been observed by chance’. Maybe 12% (p > 0.05!) of the time the dark cloud I see in the sky will fade away, but seeing this cloud still makes me decide to carry an umbrella… as now the expected benefits outweigh the costs.
I agree: I would not say that “we have evidence that there is not a strong relation”. But I do feel comfortable saying that we do not have evidence that there is any relation at all.
The 95% and 75% confidence intervals are extremely wide, given our small sample sizes (one way to compute such intervals is sketched after the list):
Spring 2019: −0.75 to 0.50 (95%) and −0.55 to 0.16 (75%)
Fall 2019: −0.37 to 0.69 and −0.19 to 0.43
Spring 2020: −0.67 to 0.66 and −0.37 to 0.37
Summer 2020: −0.60 to 0.51 and −0.38 to 0.26
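For concreteness, here is a minimal sketch (in Python, assuming numpy and scipy; not the code actually used for the numbers above) of how intervals like these can be computed from an observed correlation r and a cohort size n via the Fisher z-transform. The r = 0.10 and n = 12 in the example are purely hypothetical placeholders, since the cohort sizes are not listed here.

```python
# Minimal sketch: approximate confidence interval for a Pearson correlation
# via the Fisher z-transform. The r and n values below are hypothetical.
import numpy as np
from scipy import stats

def correlation_ci(r, n, level=0.95):
    """Approximate CI for a Pearson correlation using the Fisher z-transform."""
    z = np.arctanh(r)                         # transform r to the z scale
    se = 1.0 / np.sqrt(n - 3)                 # standard error of z
    z_crit = stats.norm.ppf(0.5 + level / 2)  # e.g. ~1.96 for a 95% interval
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)           # back-transform to the r scale

for level in (0.95, 0.75):
    lo, hi = correlation_ci(r=0.10, n=12, level=level)
    print(f"{int(level * 100)}% CI: {lo:.2f} to {hi:.2f}")
```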
The upper ends are very high, and there is certainly a possibility that our interview scoring process is actually good. But of the four observed correlations, two are negative and two are positive, and the highest positive observed correlation is only 0.10.
To somebody who has never been to San Francisco in the summer, it seems reasonable to expect it to rain. It’s cloudy, it’s dark, and it’s humid. You might even bring an umbrella! But after four days, you’ve noticed that it hasn’t rained on any of them, despite continued gloom. You also notice that almost nobody else is carrying an umbrella; many of those who are carrying one are only doing so because you told them you were! In this situation, it seems unlikely that you would need to see historical weather charts to conclude that the cloudy weather probably doesn’t imply what you thought it did.
This is analogous to our situation. We thought our interview scores would be helpful. But it’s been several years, and we haven’t seen any evidence that they have been. It’s costly to use this process, and we would like to see some benefit if we are going to use it. We have not seen that benefit in any of our four cohorts. So, it makes sense to leave the umbrella at home, for now.
Thanks for sharing the confidence intervals. I guess it might be reasonable to conclude from your experience that the interview scores have not been informative enough to justify their cost.
What I am saying is that it doesn’t seem (to me) that the data and evidence presented allow you to say that. (But maybe other analysis or inference from your experience, the ‘other people in San Francisco’ in your example, might in fact drive that conclusion.)
But glancing at just the evidence/confidence intervals suggests to me that there may be a substantial probability that there is in fact a strongly positive relationship and that the observed results are a fluke.
On the other hand, I might be wrong. I hope to get a chance to follow up on this in two ways (a rough sketch of both is below):
We could simulate a case where the measure has ‘the minimum correlation with the outcome that would make it worth using for selection’, and see how likely it would be, in such a case, to observe correlations as low as those you observed.
Or we could start with a minimally informative ‘prior’ over our beliefs about the measure and do a Bayesian updating exercise in light of your observations; we could then examine the posterior distribution and ask whether it justifies discontinuing the use of these scores.
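To make these two follow-ups concrete, here is a rough sketch of both (in Python, assuming numpy). Every specific number in it is an illustrative assumption rather than a value from the analysis above: a ‘minimum worthwhile’ true correlation of 0.3, a cohort of n = 12, and an observed correlation of 0.10.

```python
# Rough sketch of both follow-up ideas; every number here is an illustrative
# assumption (a "minimum worthwhile" true correlation of 0.3, a cohort of
# n = 12, an observed correlation of 0.10), not a value from the analysis.
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r_obs = 12, 0.3, 0.10

# 1) Simulation: if the true correlation were just large enough to make the
#    measure worth selecting on, how often would a cohort of size n show an
#    observed correlation as low as r_obs or lower?
def simulate_r(r_true, n, rng):
    cov = [[1.0, r_true], [r_true, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(x, y)[0, 1]

sims = np.array([simulate_r(r_true, n, rng) for _ in range(10_000)])
print(f"P(observed r <= {r_obs} | true r = {r_true}) ~ {np.mean(sims <= r_obs):.2f}")

# 2) Bayesian updating on a grid: start from a flat (minimally informative)
#    prior over the true correlation and update it with a normal approximation
#    to the sampling distribution of the Fisher-z-transformed observed r.
grid = np.linspace(-0.99, 0.99, 397)   # candidate values of the true correlation
prior = np.ones_like(grid)             # flat prior
z_obs, se = np.arctanh(r_obs), 1.0 / np.sqrt(n - 3)
likelihood = np.exp(-0.5 * ((z_obs - np.arctanh(grid)) / se) ** 2)
posterior = prior * likelihood
posterior /= posterior.sum()
print(f"P(true r >= {r_true} | data) ~ {posterior[grid >= r_true].sum():.2f}")
```

If the simulated probability of seeing correlations this low under the ‘minimum worthwhile’ scenario came out small, that would support dropping the scores; if it came out large, it would support the concern that four small cohorts cannot yet distinguish the two cases.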