I’m really glad to see an attack on this problem, so thanks for having a go. It’s an important issue that can be easy to lose track of.
Unfortunately I think there are some technical issues with your attempt to address it. To help organise the comment thread, I’m going to go into details in comments that are children of this one.
Edit: I originally claimed that the technical issues were serious. I’m now less confident—perhaps something in this space will be useful, although the difficulty in estimating some of the parameters makes me wary of applying it as is.
the difficulty in estimating some of the parameters makes me wary of applying it as is.
I agree that these expected-value estimates shouldn’t be taken (even somewhat) literally. But I think toy models like this one can still be important for checking the internal consistency of one’s reasoning. That is: if you can create a model that says X, this doesn’t mean you should treat X as true; but if you can’t create a reasonable model that says X, this is pretty strong evidence that X isn’t true.
In this case, the utility would be in allowing you to inspect the sometimes-unintuitive interplay between your guesses at an estimate’s R^2, the distributional parameters, and the amount of regression. While you shouldn’t plug in guesses at the parameters and expect the result to be correct, you can still use such a model to constrain the parameter space you want to think about.
2) We need to know the means of the distributions to do the standardization—after all, if an intervention was estimated to be below the mean, we should expect it to regress upwards.
Trickier for an EA context, as the groups that do evaluation focus their efforts on what appear to be the most promising things, so there isn’t a clear handle on the ‘mean global health intervention’, which may be our distribution of interest. To some extent, though, this problem solves itself if the underlying distributions of interest are log-normal or similarly fat-tailed and you are confident your estimate lies far from the mean (whatever it is): log(X − something small) approximates to log(X)
Sadly I don’t think a log-normal distribution solves this problem for you, because to apply your model I think you are working entirely in the log domain, so taking log(X) − log(something small) rather than log(X − something small). Then the choice of the small thing can have quite an effect on the answer.
For example when you regressed the estimate of cost-effectiveness of malaria nets, you had an implicit mean cost-effectiveness of 1 DALY/$100,000. If you’d assumed instead 1 DALY/$10,000, you’d have regressed to $77/DALY instead of $97/DALY.
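To make the sensitivity concrete, here is a minimal sketch of log-domain shrinkage toward an assumed mean. The $5/DALY point estimate and the 0.7 shrinkage factor are hypothetical stand-ins, not the post’s actual parameters; the point is only that moving the assumed mean moves the regressed answer.

```python
import math

def regress_log_estimate(estimate, assumed_mean, shrink):
    """Shrink a log-domain estimate toward an assumed mean.

    estimate, assumed_mean: cost-effectiveness in $/DALY.
    shrink: fraction of the log-distance from the mean that survives
    regression (hypothetical; the right value depends on how well
    estimates track true values).
    """
    regressed_log = math.log10(assumed_mean) + shrink * (
        math.log10(estimate) - math.log10(assumed_mean)
    )
    return 10 ** regressed_log

# Same $5/DALY estimate, two assumed means an order of magnitude apart:
print(regress_log_estimate(5, 1e5, 0.7))  # ~97.6 $/DALY
print(regress_log_estimate(5, 1e4, 0.7))  # ~48.9 $/DALY
```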
I noticed something else: you may have lost track of expectations when translating between log-normal and normal.
The log of the median of a log-normal distribution is the same as the median of the normal distribution which you get by taking logs, and the same as the mean of that latter distribution. But the log of the mean of the log-normal distribution will be higher.
This affects what happens with your regressions. Assuming the initial estimates are point estimates, you end up with a distribution of possible values after regressing. With normal distributions everything behaves nicely: the mean is equal to the median, and your calculations are correct.
With log-normal distributions, we normally want to make decisions based on expected value, which will be higher than the median I think you’ve produced. In particular, the log-normal will be wider (and the mean:median ratio higher) when there is less correlation between true values and estimates. This looks like it might reduce the importance of the r-value relative to your calculations, at least if you do care about expectations, and at the upper end of the range.
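The mean-median gap is easy to quantify. For X ~ LogNormal(mu, sigma^2) the median is exp(mu) but the mean is exp(mu + sigma^2 / 2), so the mean:median ratio grows rapidly with the residual spread; the sigma values below are arbitrary illustrations.

```python
import math

def mean_median_ratio(sigma):
    """Mean divided by median for LogNormal(mu, sigma^2); independent of mu."""
    return math.exp(sigma ** 2 / 2)

# Wider residual uncertainty (e.g. lower correlation between estimates
# and true values) pushes the expectation further above the median.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, mean_median_ratio(sigma))  # ratios ~1.13, ~1.65, ~7.39
```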
I’m not sure how this all cashes out. But wanted to flag it as something else which may significantly affect the answer!
I’m not sold on estimating R^2 just by group (this is a totally outside view on error). I think you can often say something quite sensible about the relative strengths of different estimates you make for yourself. There should be a way to incorporate such information, and I’m not sure what it is.
It looks like it might need more thought on how to generalise to distributions which aren’t normal or log-normal. Generally I think we should expect tails to be a little thicker than log-normal; this should have the effect of reducing the impact of regression, but I don’t off-hand see how to follow this through quantitatively.
Generally I think we should expect tails to be a little thicker than log-normal
What’s your reasoning for this?
I don’t off-hand see how to follow this through quantitatively.
In general it depends on the structure of the distribution, or more precisely the copula between the ground truth and estimate. If you assume a Gaussian copula then you can easily reduce to the bivariate-normal case that Gregory describes here, but model uncertainty in the estimate probably fattens the copula tails a bit.
Generally I think we should expect tails to be a little thicker than log-normal
What’s your reasoning for this?
That there are various mechanisms (of which I only feel like I understand a few) in complex systems which produce power-law type tails. These can enter as factors, and the convergence back to log-normal we’d expect from the central limit theorem is slowish in the tails.
On the other hand we’ll see additive effects too, which could pull the tails in more tightly than log-normal. I maintain a reasonable amount of uncertainty over what kind of tail we’ll eventually think is most appropriate, but while I have that uncertainty I don’t want to say that the tails have to be thin.
Of course this is all qualitative reasoning which really only affects the behaviour quite far down the tail. I think for practical purposes log-normal is often a decent assumption (leaving you with the not-insignificant problem of how to pick parameters).
That there are various mechanisms (of which I only feel like I understand a few) in complex systems which produce power-law type tails. These can enter as factors, and the convergence back to log-normal we’d expect from the central limit theorem is slowish in the tails.
It seems like this probably depends a lot on what type of intervention you’re studying. I guess I would expect x-risks to have power-law-ish distributions, but I can’t think of very many power-law factors that would influence e.g. scaling up a proven global health intervention.
I agree that the distribution will depend on the kind of intervention. When you take into account indirect effects you may get some power-law type behaviour even in interventions where it looks unlikely, though—for instance coalescing broader societal support around an intervention so that it gets implemented far more than your direct funding provides for.
Our distribution of beliefs about the cost-effectiveness of scaling up something which is “proven” is likely to have particularly thin tails compared to dealing with “unproven” things, as by proof we tend to mean high-quality evidence that substantially tightens the possibilities. I’m not sure whether it changes the eventual tail to a qualitatively different kind of behaviour, or if they’re just quantitatively narrower distributions, though.
Carl has argued convincingly that the [edit: normal and] log-normal priors are too thin-tailed here:
I think it’s worth pointing out some of the strange implications of a normal prior for charity cost-effectiveness.
For instance, it appears that one can save lives hundreds of times more cheaply through vaccinations in the developing world than through typical charity expenditures aimed at saving lives in rich countries, according to experiments, government statistics, etc.
But a normal distribution assigns a probability of one in tens of thousands that a sample will be more than 4 standard deviations above the median, and one in hundreds of billions that a charity will be more than 7 standard deviations from the median. The odds get tremendously worse as one goes on. If your prior was that charity cost-effectiveness levels were normally distributed, then no conceivable evidence could convince you that a charity could be 100x as good as the 90th percentile charity. The probability of systematic error or hoax would always be ludicrously larger than the chance of such an effective charity. One could not believe, even in hindsight, that paying for Norman Borlaug’s team to work on the Green Revolution, or administering smallpox vaccines (with all the knowledge of hindsight), actually did much more good than typical. The gains from resources like GiveWell would be small compared to acting like an index fund and distributing charitable dollars widely.
Such denial seems unreasonable to me, and I think to Holden. However, if one does believe that there have in fact been multiple interventions that turned out 100x as effective as the 90th percentile charity, then one should reject a normal prior. When a model predicts that the chance of something happening is less than 10^-100, and that thing goes on to happen repeatedly in the 20th century, the model is broken, and one should try to understand how it could be so wrong.
Another problem with the normal prior (and, to a lesser but still problematic extent, a log-normal prior) is that it would imply overconfident conclusions about the physical world.
For instance, consider threats of human extinction. Using measures like “lives saved” or “happy life-years produced” and counting future generations, the gain from averting a human extinction scales with the expected future population of humanity. There are pretty well-understood extinction risks with well-understood interventions, where substantial progress has been made: with a trickle of a few million dollars per year in funding (for a couple of decades), 90% of dinosaur-killer-size asteroids were tracked and checked for future impacts on Earth. So, if future populations are large, then by measures like happy life-years there will be at least some ultra-effective interventions.
If humanity can set up a sustainable civilization and harness a good chunk of the energy of the Sun, or colonize other stars, then really enormous prosperous populations could be created: see Nick Bostrom’s paper on astronomical waste for figures.
From this we can get something of a reductio ad absurdum for the normal prior on charity effectiveness. If we believed a normal prior then we could reason as follows:
1. If humanity has a reasonable chance of surviving to build a lasting advanced civilization, then some charity interventions are immensely cost-effective, e.g. the historically successful efforts in asteroid tracking.
2. By the normal (or log-normal) prior on charity cost-effectiveness, no charity can be immensely cost-effective (with overwhelming probability).
3. Therefore, humanity is doomed to premature extinction, stagnation, or an otherwise cramped future.
I find this “Charity Doomsday Argument” pretty implausible. Why should intuitions about charity effectiveness let us predict near-certain doom for humanity from our armchairs? Long-term survival of civilization on Earth and/or space colonization are plausible scenarios, not to be ruled out in this a priori fashion.
To really flesh out the strangely strong conclusion of these priors, suppose that we lived to see spacecraft intensively colonize the galaxy. There would be a detailed history leading up to this outcome, technical blueprints and experiments supporting the existence of the relevant technologies, radio communication and travelers from other star systems, etc. This would be a lot of evidence by normal standards, but the normal (or log-normal) priors would never let us believe our own eyes: someone who really held a prior like that would conclude they had gone insane or that some conspiracy was faking the evidence.
Yet if I lived through an era of space colonization, I think I could be convinced that it was real. I think Holden could be convinced that it was real. So a prior which says that space colonization is essentially impossible does not accurately characterize our beliefs.
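As a sanity check on the tail probabilities in Carl’s argument, the standard-normal survival function can be computed from nothing but the complementary error function:

```python
import math

def normal_sf(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(normal_sf(4))  # ~3.2e-05: one in tens of thousands
print(normal_sf(7))  # ~1.3e-12: one in hundreds of billions
```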
1) The variances of the two distributions need to be standardized.
Less of a big deal; we’d generally hope, and aim, for our estimates of expected value to be drawn from a similar distribution to the actual expected values. If they’re not, our estimates are systematically wrong somehow.
Our estimates could be systematically wrong in that they represent, for instance, before-regression estimates. We don’t even know the true distribution, and the generating mechanism for the estimates is sufficiently different that I wouldn’t feel confident that the variances should look similar.
Our estimates could be systematically wrong in that they represent, for instance, before-regression estimates.
If you assume the estimates are unbiased, as Gregory does, then before-regression estimates are not systematically wrong; they merely have variance.
Gregory isn’t even claiming to solve the biased-estimate case, which (IMO) is wise, since adding bias (or arbitrary distributions, or arbitrary copulae between the estimate and true distribution) would drastically increase the number of model parameters, perhaps past the optimum point on the trade-off between model uncertainty and parameter uncertainty.
I agree that the language in this post makes the divergence of the toy model from the true model seem smaller than it is, but I don’t think I’d call that a “serious technical problem!”
Even without bias, you need to know the ratio of the standard deviations of the distribution of true values and the distribution of estimates. The post assumes they are equal, which I wasn’t happy about (though I realise now that the fix for assuming they’re not equal is not that hard).
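The fix is the standard bivariate-normal regression formula, which scales the shrinkage by the ratio of the two standard deviations; with equal SDs it reduces to the equal-variance case the post assumes. The numbers below are made up for illustration.

```python
def regressed_mean(y, mu_x, sd_x, mu_y, sd_y, rho):
    """E[X | Y = y] for bivariate-normal (X, Y) with correlation rho."""
    return mu_x + rho * (sd_x / sd_y) * (y - mu_y)

# Estimates twice as spread out as true values: an estimate 2 units
# (i.e. 1 estimate-SD) above the mean regresses to rho true-value SDs.
print(regressed_mean(2.0, mu_x=0.0, sd_x=1.0, mu_y=0.0, sd_y=2.0, rho=0.9))  # 0.9
```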
You’re right about the strength of the criticism, I should have edited that sentence and will do so now. I had weakened my claim about the strength of criticism in emails with Greg, but should have done so here too.
So, in our toy model with an R-square of 0.9, an estimate which is 1SD above the mean estimate puts the expected value at 0.9SD above the mean expected value.
I think there’s a confusion here about what the different distributions are. Normally when thinking about regression I think of having a prior over cost-effectiveness of the intervention at hand, and a distribution representing model uncertainty, which tells you the likelihood of having got your model output, given varying true values for the parameter. If those are the distributions you have, and they’re both normal with the same variance, then regression would end up centring on the mid-point (and so be linear with the number of SDs up you are).
I think that the distributions you are looking at, however, are the prior, and a prior over the distribution of the estimate numbers. Given this, the amount of regression is not linear with the number of standard deviations out. I think rather it goes up super-linearly.
Working with the perspective you are bringing in terms of the different distributions could be a useful angle on the problem. It’s not obvious to me it’s better than the normal approach to regression, though, mostly because it seems harder to give an inside view of correlation than of possible model error.
In the multivariate-normal case, the two approaches are exactly equivalent: if you know the marginals (unconditional true effectiveness and unconditional estimate value), and R^2, then you know the entire shape of the distribution (and hence the distribution of the true mean given the estimated mean).
A model in which the estimate is bivariate normal with R^2=0.9 to the ground truth corresponds to an estimate distribution of, if my stats is right, X~N(0, 0.9), E~N(0, 0.1), Y=X+E (where X is the ground truth, E the error, and Y the estimate; the second arguments are variances; this is true up to an affine transformation). As such, it follows from e.g. this theorem cited on Wikipedia that the actual mean scales linearly with the measured mean, although the coefficient of correlation is not quite what Gregory said (it’s R, not R^2).
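Ben’s parameterisation can be checked directly: with Var(X) = 0.9 and Var(E) = 0.1, R^2 comes out at 0.9, and the shrinkage expressed in standard-deviation units is the correlation R = sqrt(0.9), not R^2.

```python
import math

# X ~ N(0, 0.9) is the truth, E ~ N(0, 0.1) independent noise,
# Y = X + E is the estimate (second arguments are variances).
var_x, var_e = 0.9, 0.1
var_y = var_x + var_e
cov_xy = var_x  # Cov(X, X + E) = Var(X) since E is independent of X
rho = cov_xy / math.sqrt(var_x * var_y)

# E[X | Y = y] = (cov_xy / var_y) * y; convert to SD units of each variable.
slope = cov_xy / var_y
shrink_in_sd_units = slope * math.sqrt(var_y) / math.sqrt(var_x)

print(rho ** 2)            # 0.9: the stated R^2
print(shrink_in_sd_units)  # sqrt(0.9) ~ 0.949: equals R, not R^2
```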
Ryan, unless I’m dramatically misreading this post, it is about a normal, not a log-normal, distribution. Their behaviours are very different.
Carl starts off objecting to a normal prior then goes on to explain why normal and log-normal priors both look too thin-tailed.
Thanks Ben, you’re exactly right. I’d convinced myself of the contrary with a spurious geometric argument.