[Edited on 19 Nov 2021: I removed links to my models and report, as I was asked to do so.]
Just to clarify, our (Derek Foster’s/Rethink Priorities’) estimated Effect Size of ~0.01–0.02 DALYs averted per paying user assumes a counterfactual of no treatment for anxiety. It is misleading to estimate total DALYs averted without taking into account the proportion of users who would have sought other treatment, such as a different app, and the relative effectiveness of that treatment.
In our Main Model, these inputs are named “Relative impact of Alternative App” and “Proportion of users who would have used Alternative App”. The former is by default set at 1, because the other leading apps seem(ed) likely to be at least as effective as Mind Ease, though we didn’t look at them in depth independently of Hauke. The latter defaults to 0; I suppose this was to get an upper bound of effectiveness, and because of the absence of relevant data, though I don’t recall my thought process at the time. (If it’s set to 1, the counterfactual impact is of course 0.)
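To make the role of these two inputs concrete, here is a minimal sketch of how they would combine multiplicatively (the function and variable names are illustrative, not the actual structure of the Main Model):

```python
def counterfactual_dalys_averted(raw_effect, prop_alt_app, rel_impact_alt_app):
    """Illustrative combination of the two counterfactual inputs.

    raw_effect: DALYs averted per paying user vs. a no-treatment counterfactual.
    prop_alt_app: proportion of users who would have used an alternative app.
    rel_impact_alt_app: impact of that alternative relative to Mind Ease.
    """
    return raw_effect * (1 - prop_alt_app * rel_impact_alt_app)

# Model defaults (proportion = 0) give the upper bound: nothing is subtracted.
print(counterfactual_dalys_averted(0.02, prop_alt_app=0, rel_impact_alt_app=1))  # 0.02
# If every user would otherwise use an equally effective app, the counterfactual impact is 0.
print(counterfactual_dalys_averted(0.02, prop_alt_app=1, rel_impact_alt_app=1))  # 0.0
```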
Our summary, copied in a previous comment, also stresses that the estimate is per paying user. I don’t remember exactly why, but our report says:
Other elements of the MindEase evaluation (i.e. parts not done by Rethink Priorities) consider a “user” to be a paying user, i.e. someone who has downloaded the app and purchased a monthly or annual plan. For consistency, we will adopt the same definition. (Note that this is a very important assumption, as the average effect size and retention is likely to be many times smaller for those who merely download or install the app.)
As far as I can tell (correct me if I’m wrong), your “Robust, uncertainty-adjusted DALYs averted per user” figure is essentially my theoretical upper-bound estimate with no adjustments for realistic counterfactuals. It seems likely (though I have no evidence as such) that:
Many users would otherwise use a different app.
Those apps are roughly as effective as MindEase.
The users who are least likely to use another app, such as people in developing countries who were given free access, are unlikely to be paying (and therefore perhaps less likely to regularly use/benefit from it) – not to mention issues with translation to different cultures/languages.
So 0.02 DALYs averted per user seems to me like an extremely optimistic average effect size, based on the information we had around the middle of last year.
Hi Derek, hope you are doing well. Thank you for sharing your views on this analysis that you completed while you were at Rethink Priorities.
The difference between your estimates and Hauke’s certainly made our work more interesting.
A few points that may be of general interest:
For both analysts we used three estimates: an ‘optimistic guess’, a ‘best guess’, and a ‘pessimistic guess’.
For users from middle-income countries we doubled the impact estimates. Without reviewing our report/notes in detail, I don’t recall the rationale for the specific value of this multiplier. The basic idea is that high-income countries are better-served, more competitive markets, so apps are more likely to find users with worse counterfactuals in middle-income countries.
The estimates were meant to be conditional on Mind Ease achieving some degree of success; we simply assumed the impact of failure scenarios is 0. Hauke’s analysis seems to have made clearer use of this aspect: not only is his reading of the literature more optimistic, but he is also more optimistic about how much more effective a successful Mind Ease would be relative to the competition.
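To illustrate how that conditioning works (the numbers here are placeholders, not the probabilities or values from the report):

```python
# Illustrative only: impact estimates are conditional on some degree of success,
# and failure scenarios are assumed to contribute 0.
p_success = 0.1                  # placeholder success probability (not from the report)
impact_given_success = 0.01      # DALYs averted per paying user, conditional on success
expected_impact = p_success * impact_given_success + (1 - p_success) * 0
print(expected_impact)           # 0.001
```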
Indeed, the values we used for Derek’s analysis for high-income countries were all less than 0.01. We simplified the three estimates into the single value of 0.01 for Derek’s analysis by taking a weighted average across the two types of countries and rounding up (I think the true number may be closer to 0.006). The calculations in the post use rounded values so that they are easier for a reader to follow. Nevertheless, the results are in line with our more detailed calculations in the original report.
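As a worked illustration of that weighted average (the country split and the per-country values below are assumptions chosen only to reproduce the ~0.006 figure, not the inputs from the original report):

```python
# Assumed inputs, for illustration only.
high_income_effect = 0.005                     # DALYs averted per paying user (assumed)
middle_income_effect = 2 * high_income_effect  # doubled for middle-income users, per the earlier point
share_high, share_middle = 0.8, 0.2            # assumed split of paying users by country type

weighted = share_high * high_income_effect + share_middle * middle_income_effect
print(weighted)             # 0.006 with these assumed inputs
print(round(weighted, 2))   # 0.01, the simplified value used in the post
```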
As with the rounding, we simplified the explanation of the robustness tilt we applied. It wasn’t just about Derek vs Hauke; it was also applied along the dimensions of the business analysis (e.g. success probabilities). We simplified the framing of the robustness tilt, both here and in a ‘Fermi Estimate’ section of the original report, because we believed it was conceptually clearer to talk about only one dimension.
What would I suggest to someone who would like to penalize the estimate more or less for all the uncertainty? Adjust the impact return.
How can you adjust the impact return in a consistent way? Of course, to make analyses like this useful you would want to do them in a consistent fashion, and there isn’t a gold standard for how to control the strength of the robustness tilts we used. But you can think of the tilt we applied (in the original report) as analogous to being told a coin is fair (50/50) and then assuming it is biased to 80% heads (where heads is the side you don’t want). This expresses how different our tilted probability distribution was from the distribution in the base model (the effect on the impact estimate was more severe: 1 − 0.02/(0.25/2 + 0.01/2) ≈ 85%). There is a way of assessing this “degree of coin-equivalent tilt” for any tilt of any model. So if you felt another startup had the same level of uncertainty as Mind Ease, you could tilt your model of it until you get the same degree of tilt. This would give you some consistency, rather than basing the tilts purely on analyst intuition (though of course there is basically no way to avoid some bias). If a much better way to consistently manage these tilts were developed, we would happily use it.
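For readers who want to see how a “degree of coin-equivalent tilt” could be computed, here is a minimal sketch. It assumes the distortion between the base and tilted distributions is measured with KL divergence and then mapped onto the bias of a coin with the same divergence from a fair coin; this is an illustrative choice of measure, not necessarily the exact one used in the report:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def coin_equivalent_bias(kl_value, tol=1e-9):
    """Find the bias b in [0.5, 1) such that KL(Bernoulli(b) || fair coin) equals kl_value.

    Interpretation: tilting a fair coin to 'b heads' distorts it by the same amount
    (in KL terms) as the tilt distorted the base model.
    """
    lo, hi = 0.5, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mid, 0.5) < kl_value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The 80/20 coin from the analogy corresponds to roughly 0.19 nats of divergence from a
# fair coin; any model tilt with the same divergence would count as an equally strong
# tilt under this measure.
reference_kl = kl_bernoulli(0.8, 0.5)
print(reference_kl)                        # ~0.193
print(coin_equivalent_bias(reference_kl))  # ~0.8, recovering the coin bias as a sanity check
```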
Overall, this analysis is just one example of how one might deal with all the things that make such assessments difficult, including impact uncertainty, business uncertainty, and analyst disagreement. The key point is the need to summarize all the uncertainty in a way that is useful to busy, non-technical decision makers who aren’t going to look at the underlying distributions. We look forward to seeing how techniques in this regard evolve as more and more impact assessments are done and shared publicly.