First of all, thanks a lot for spending the time to turn this into a model and polishing it enough to be shared.
Issue:
It seems like the model might have trouble filtering out people who have detailed but wrong models. I encounter this a lot in the nutrition literature, where very detailed and technical models, with complex evidence from combinations of in vitro, animal, and some human studies, compete against outcome-measuring RCTs. As near as I can tell, an expert with a detailed but wrong model can potentially get past 3 of the 4 filters: P, I, and T. They will have a harder time with F, but my current guess is that the vast majority of experts fail F, because that is where you have loaded most of the epistemic rigor. Consider how rarely (if ever) you have heard a response like the example given for F from real-life researchers. You might say, “all is well; the vast majority fail and the ones left are highly reliable.” It seems to me, however, that we must rely on the lower-quality evidence from people failing the F filter all the time, simply because in the vast majority of cases there is little to no evidence that really passes muster, and yet we must make a decision anyway.
Side note: in my estimation, The Cambridge Handbook of Expertise would lend support to most of the “work” here being done by F, as the opportunity for rapid, measurable feedback is one of the core predictors of performance it points to.
Potential improvement:
Rather than a binary pass/fail for experts, we would like a metric that grades the material they present. According to the forecasting literature, even crude metrics outperform estimates that do not use metrics. Cochrane’s metric for risk of bias, for example, is simply a list of 5 common sources of bias, each of which the reviewer grades as low, high, or unclear, with a short summary of the reasoning. A very simple version would be rating each of the PIFT criteria similarly. This also gives some path forward for improvement over time: checking whether a low or high score on a particular dimension actually predicts subsequent expert performance.
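To make this concrete, here is a minimal sketch in Python of what a Cochrane-style rubric over the PIFT criteria could look like. The grade levels, the example expert, and the reasoning strings are all assumptions for illustration, not part of the original model.

```python
# Minimal sketch of a Cochrane-style rubric over the PIFT criteria.
# Grade levels and example entries are illustrative assumptions.
from dataclasses import dataclass, field

GRADES = ("low", "unclear", "high")  # level of concern, as in Cochrane's risk-of-bias tool


@dataclass
class CriterionRating:
    grade: str       # one of GRADES
    reasoning: str   # short free-text summary of why

    def __post_init__(self):
        if self.grade not in GRADES:
            raise ValueError(f"grade must be one of {GRADES}, got {self.grade!r}")


@dataclass
class ExpertAssessment:
    expert: str
    ratings: dict = field(default_factory=dict)  # criterion letter -> CriterionRating

    def summary(self) -> str:
        return "; ".join(f"{c}: {r.grade} ({r.reasoning})" for c, r in self.ratings.items())


# Example: grade each PIFT criterion instead of issuing a single pass/fail verdict.
assessment = ExpertAssessment(
    expert="hypothetical nutrition researcher",
    ratings={
        "P": CriterionRating("low", "publishes directly in the relevant subfield"),
        "I": CriterionRating("unclear", "possible funding conflict"),
        "F": CriterionRating("high", "no rapid, measurable feedback loop on predictions"),
        "T": CriterionRating("low", "offers a detailed causal model"),
    },
)
print(assessment.summary())
```

The point is just that a per-criterion grade plus a one-line justification is cheap to record and can later be checked against how the expert actually performed.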
I hope you interpret detailed feedback as a +1 and not too punishing. I am greatly encouraged by seeing work on what I consider core areas of improving the quality of EA research.
Potential improvement: Rather than a binary pass/fail for experts, we would like a metric that grades the material they present.
Agreed. I tried to make it binary for the sake of generating good examples, but the world is much messier. In the spreadsheet version I use, I try to assign each marker a rating from “none” to “high.”
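For illustration only, a minimal sketch of how such ordinal ratings might be combined into a crude score; the intermediate levels and the averaging rule are my assumptions, not necessarily what the spreadsheet actually does.

```python
# Minimal sketch of an ordinal "none" to "high" rating per marker.
# The intermediate levels and the averaging rule are assumptions for illustration.
ORDINAL = {"none": 0, "low": 1, "medium": 2, "high": 3}


def crude_score(marker_ratings: dict) -> float:
    """Average the ordinal ratings into a single rough number."""
    return sum(ORDINAL[r] for r in marker_ratings.values()) / len(marker_ratings)


print(crude_score({"P": "high", "I": "medium", "F": "none", "T": "low"}))  # 1.5
```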
Issue: It seems like the model might have trouble filtering out people who have detailed but wrong models.
100%. The model above is only good for assessing necessary conditions, not sufficient ones. I.e., someone can pass all four conditions above and still not be an expert.
How worthwhile do you think it would be for someone to read the handbook?
I think a skim/outline is worthwhile. It includes lots of object-level data, which isn’t a great use of time.