Some ideas for the presentation of the table to make it more digestible:
1. Is the table downloadable? Can it be made downloadable?
2. Can the table cell/font sizes and table height be made adjustable? It would be nice to be able to fit more of it (ideally all of it) on my screen at once. Just zooming out in my browser doesn’t work, since the table shrinks, too, and the same cells are displayed.
3. What about description boxes that pop up when you click on (or hover over) a cell (description/motivation of the feature itself, a box with the footnotes/text/sources when you click on the given cell)? Could also stick to informal recognizable names (cows, ants) where possible and put the taxon in a popup to save on space.
4. Different colour cells for “Likely No”, “Lean No”, “Unknown”, “Lean Yes”, “Likely Yes” (e.g. red, pink, grey, light green, green).
Was the mirror test experiment with ants missed or was it intentionally excluded? If the latter, why? It seems the journal it was published in is not very reputable, and the results have not been replicated independently.
What are the plans for maintaining/expanding this database? Would you consider making a wiki or open source version and allowing contribution from others (possibly through some formal approval process)?
I imagine it could be a useful resource not just for guiding our beliefs about the consciousness of invertebrates, but also the consciousness of other forms of life (and AI in the future).
One suggestion: I think it could be useful to have a column for the age at which each feature is first observable in humans on average (or include these in the entries for humans, as applicable).
tl;dr: even using priors, with more options and hazier probabilities, there tend to be more options whose estimates are too sensitive to supporting information (or just optimistically biased due to your priors), and these options look disproportionately good. This is still an optimizer’s curse in practice.
This is an issue of the models and priors. If your models and priors are not right… then you should update over your priors and use better models. Of course they can still be wrong… but that’s true of all beliefs, all reasoning, etc.
If you assume from the outside (unbeknownst to the agent) that they are all fair, then you’re not showing a problem with the agent’s reasoning, you’re just using relevant information which they lack.
In practice, your models and priors will almost always be wrong, because you lack information; there’s some truth of the matter of which you aren’t aware. It’s unrealistic to expect us to have good guesses for the priors in all cases, especially with little information or precedent as in hazy probabilities, a major point of the OP.
You’d hope that more information would tend to allow you to make better predictions and bring you closer to the truth, but when optimizing, even with correctly specified likelihoods and after updating over priors as you said should be done, the predictions for the selected coin can be more biased in expectation with more information (results of coin flips). On the other hand, the predictions for any fixed coin will not be any more biased in expectation over the new information, and if the prior’s EV hadn’t matched the true mean, the predictions would tend to be less biased.
More information (flips) per option (coin) would reduce the bias of the selection on average, but, as I showed, more options (coins) would increase it, too, because you get more chances to be unusually lucky.
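To make this concrete, here is a minimal simulation of the fair-coin setup (the function name, the Beta(1,1) prior and all parameter values are my own illustrative choices, not from the thread): every coin is truly fair, we compute each coin’s posterior EV from k flips, and select the maximum. On average the selected coin’s posterior EV overshoots the true value of 0.5; the overshoot grows with the number of coins and shrinks with the number of flips per coin.

```python
import random

def selected_posterior_mean(n_coins, k_flips, trials=500, seed=0):
    """Average posterior EV of the coin with the highest posterior EV.

    All coins are actually fair (p = 0.5). Each coin gets a uniform
    Beta(1, 1) prior, so after observing h heads in k flips its
    posterior mean is (h + 1) / (k + 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = max(
            (sum(rng.random() < 0.5 for _ in range(k_flips)) + 1) / (k_flips + 2)
            for _ in range(n_coins)
        )
        total += best
    return total / trials

if __name__ == "__main__":
    few_coins = selected_posterior_mean(n_coins=2, k_flips=10)
    many_coins = selected_posterior_mean(n_coins=50, k_flips=10)
    many_flips = selected_posterior_mean(n_coins=50, k_flips=100)
    # The true value is 0.5 for every coin; the selected coin's posterior
    # EV exceeds it, more so with more coins, less so with more flips.
    print(few_coins, many_coins, many_flips)
```

Note that the bias appears even though each individual coin’s posterior EV is a perfectly calibrated estimate on its own; the selection step alone creates it.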
My prior would not be uniform; it would be a point mass at 0.5! What else could “unbiased coins” mean?
The intent here again is that you don’t know the coins are fair.
Bayesian EV estimation doesn’t do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework; yes, it will require a different solution in that context, but they are separate.
The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.
How would you do this in practice? Specifically, how would you get an idea of the magnitude for the correction you should make?
Maybe you could test your own (or your group’s) prediction calibration and bias, but it’s not clear how exactly you should incorporate this information, and it’s likely these tests won’t be very representative when you’re considering the kinds of problems with hazy probabilities mentioned in the OP.
I suspect experience sampling is much more costly and time-consuming to get data on than alternatives, and there’s probably much less data. Life satisfaction or other simple survey questions about subjective wellbeing might be good enough proxies, and there’s already a lot of available data out there.
Here’s a pretty comprehensive post on using subjective wellbeing:
A Happiness Manifesto: Why and How Effective Altruism Should Rethink its Approach to Maximising Human Welfare by Michael Plant
Another good place to read more about this is https://whatworkswellbeing.org/our-work/measuring-evaluating/
Deliberately offsetting a harm through a “similar” opposite benefit means deliberately restricting that donation to a charity from a restricted subset of possible charities, and it may be less effective than the ones you’ve ruled out.
Offsetting could also justify murder, because there are life-saving charities.
Also related: https://forum.effectivealtruism.org/posts/eeBwfLfB3iQkpDhz6/at-what-cost-carnivory
I know the post is satirical, but I think it’s worth pointing out that ego depletion, the idea that self-control or willpower draws upon a limited pool of mental resources that can be used up, is on shaky ground: the effect failed to replicate in a few meta-analyses, although an older meta-analysis did find it.
This paper (Schuyler, J. R., & Nieman, T. (2007, January 1). Optimizer’s Curse: Removing the Effect of this Bias in Portfolio Planning. Society of Petroleum Engineers. doi:10.2118/107852-MS; earlier version) has some simple recommendations for dealing with the Optimizer’s Curse:
The impacts of the OC will be evident for any decisions involving ranking and selection among alternatives and projects. As described in Smith and Winkler, the effects increase when the true values of alternatives are more comparable and when the uncertainty in value estimations is higher. This makes intuitive sense: We expect a higher likelihood of making incorrect decisions when there is little true difference between alternatives and where there is significant uncertainty in our ability to assess value.
(...) Good decision-analysis practice suggests applying additional effort when we face closely competing alternatives with large uncertainty. In these cases, we typically conduct sensitivity analyses and value-of-information assessments to evaluate whether to acquire additional information. Incremental information must provide sufficient additional discrimination between alternatives to justify the cost of acquiring the additional information. New information will typically reduce the uncertainty in our value estimates, with the additional benefit of reducing the magnitude of OC.
The paper’s focus is actually on a more concrete Bayesian approach, based on modelling the population from which potential projects are sampled.
I made a long top-level comment that I hope will clarify some problems with the solution proposed in the original paper.
I ask the same question I asked of OP: give me some guidance that applies for estimating the impact of maximizing actions that doesn’t apply for estimating the impact of randomly selected actions.
This is a good point. Somehow, I think you’d want to adjust your posterior downward based on the set or the number of options under consideration, and on how unlikely the data that makes the intervention look good is. This is not really useful, since I don’t know how much you should adjust these. Maybe there’s a way to model this explicitly, but it seems like you’d be trying to model your selection process itself before you’ve defined it, and then looking for a selection process which satisfies some properties.
You might also want to spend more effort looking for arguments and evidence against each option the more options you’re considering.
When considering a larger number of options, you could use some randomness in your selection process or spread funding further (although the latter will be vulnerable to the satisficer’s curse if you’re using cutoffs).
What do you mean by “the priors”?
What I mean is the case where I haven’t decided on a prior, and multiple different priors (even an infinite set of them) seem equally reasonable to me.
I’m going to try to clarify further why I think the Bayesian solution in the original paper on the Optimizer’s Curse is inadequate.
The Optimizer’s Curse is defined by Proposition 1: informally, the expectation of the estimated value of your chosen intervention overestimates the expectation of its true value when you select the intervention with the maximum estimate.
The proposed solution is to instead maximize the posterior expected value of the variable being estimated (conditional on your estimates, the data, etc.), with a prior distribution for this variable, and this is purported to be justified by Proposition 2.
However, Proposition 2 holds no matter which priors and models you use; there are no restrictions at all in its statement (or proof). It doesn’t actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world. It only tells you that your maximum posterior EV equals your corresponding prior’s EV (taking both conditional on the data, or neither, although the posterior EV is already conditional on the data).
Something I would still call an “optimizer’s curse” can remain even with this solution when we are concerned with the values of future measurements rather than just the expected values of our posterior distributions based on our subjective priors. I’ll give four examples: the first just to illustrate, and the other three from the real world:
1. Suppose you have n different fair coins, but you aren’t 100% sure they’re all fair, so you have a prior distribution over the future frequency of heads (it could be symmetric in heads and tails, so the expected value would be 1/2 for each), and you use the same prior for each coin. You want to choose the coin which has the maximum future frequency of landing heads, based on information about the results of finitely many new coin flips from each coin. If you select the one with the maximum expected posterior, and repeat this trial many times (flip each coin multiple times, select the one with the max posterior EV, and then repeat), you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average. I would still call this an “optimizer’s curse”, even though it followed the recommendations of the original paper. Of course, in this scenario, it doesn’t matter which coin is chosen.
Now, suppose all the coins are as before except for one which is actually biased towards heads, and you have a prior for it which will give a lower posterior EV conditional on k heads and no tails than the other coins would (e.g. you’ve flipped it many times before with particular results to achieve this; or maybe you already know its bias with certainty). You will record the results of k coin flips for each coin. With enough coins, and depending on the actual probabilities involved, you could be less likely to select the biased coin (on average, over repeated trials) based on maximum posterior EV than by choosing a coin randomly; you’ll do worse than chance.
(Math to demonstrate the possibility of the posteriors working this way for k heads out of k: you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins, i.e. p(μ_i) = 1 for μ_i in the interval [0,1]; then p(μ_i | k heads) = (k+1)μ_i^k, and E[μ_i | k heads] = (k+1)/(k+2), which goes to 1 as k goes to infinity. You could have a prior which gives certainty to your biased coin having any true average frequency < 1, so any of the unbiased coins which lands heads k out of k times will beat it for k large enough.)
If you flip each coin k times, there’s a number of coins, n, such that the true probability (not your modelled probability) of at least one of the n−1 other coins getting k heads is strictly greater than 1 − 1/n, i.e. 1 − (1 − 1/2^k)^(n−1) > 1 − 1/n (for k=2, you need n > 8, and for k=10, you need n > 9360, so n grows pretty fast as a function of k). This means, with probability strictly greater than 1 − 1/n, you won’t select the biased coin, so with probability strictly less than 1/n, you will select the biased coin. So you actually do worse than random choice, because of how many different coins you have and how likely one of them is to get very lucky. You would have even been better off on average ignoring all of the new k×n coin flips and sticking to your priors, if you already suspected the biased coin was better (i.e. if you had a prior with mean > 1/2).
2. A common practice in machine learning is to select the model with the greatest accuracy on a validation set among multiple candidates. Suppose that the validation and test sets are a random split of a common dataset for each problem. You will find, under repeated trials (not necessarily identical; they could be over different datasets/problems, with different models), that by choosing the model with the greatest validation accuracy, this value will tend to be greater than its accuracy on the test set. If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with the highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval. This depends on the particular dataset and machine learning models being used. Part of this problem is just that we aren’t accounting for the possibility of overfitting in our model of the accuracies, but fixing this on its own wouldn’t solve the extra bias introduced by having more models to choose from.
3. Due to the related satisficer’s curse, when doing multiple hypothesis tests, you should adjust your p-values upward, or your p-value cutoffs (false positive rate, significance level threshold) downward, in specific ways to better predict replicability. There are corrections for the cutoff that account for the number of tests being performed; a simple one (the Šidák correction) is that if you want a family-wise false positive rate of α across m tests, you can use a per-test cutoff of 1 − (1 − α)^(1/m).
4. The satisficer’s curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest. I think this is basically the same problem as 3.
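The cutoff adjustment in point 3 can be checked numerically. A small sketch (the function names are my own): the Šidák-corrected per-test cutoff 1 − (1 − α)^(1/m) keeps the probability of at least one false positive across m independent true-null tests at exactly α.

```python
def sidak_cutoff(alpha: float, m: int) -> float:
    """Per-test p-value cutoff giving a family-wise false positive rate
    of alpha across m independent tests (the Šidák correction)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def family_wise_rate(per_test_alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - per_test_alpha) ** m

if __name__ == "__main__":
    alpha, m = 0.05, 20
    cutoff = sidak_cutoff(alpha, m)
    # Using the naive 0.05 cutoff on 20 tests inflates the family-wise
    # rate to ~0.64; the corrected cutoff restores it to 0.05.
    print(cutoff, family_wise_rate(cutoff, m), family_wise_rate(alpha, m))
```

The corrected cutoff is close to the simpler Bonferroni cutoff α/m, but exact under independence.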
Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you’ve been exposed to or thought of yourself, you’d similarly find a bias towards interventions with “lucky” observations and arguments. For the intervention you do select compared to an intervention chosen at random, you’re more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention’s actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn’t correct for the number of interventions under consideration.
You seem to be using “people all agree” as a stand-in for “the optimizer’s curse has been addressed”. I don’t get this. Addressing the optimizer’s curse has been mathematically demonstrated. Different people can disagree about the specific inputs, so people will disagree, but that doesn’t mean they haven’t addressed the optimizer’s curse.
Maybe we’re thinking about the optimizer’s curse in different ways.
The proposed solution of using priors just pushes the problem to selecting good priors. It’s also only a solution in the sense that it reduces the likelihood of mistakes happening (discovered in hindsight, and under the assumption of good priors), but not provably to its minimum, since it does not eliminate the impacts of noise. (I don’t think there’s any complete solution to the optimizer’s curse, since, as long as estimates are at least somewhat sensitive to noise, “lucky” estimates will tend to be favoured, and you can’t tell in principle between “lucky” and “better” interventions.)
If you’re presented with multiple priors, and they all seem similarly reasonable to you, but depending on which ones you choose, different actions will be favoured, how would you choose how to act? It’s not just a matter of different people disagreeing on priors, it’s also a matter of committing to particular priors in the first place.
If one action is preferred with almost all of the priors (perhaps rare in practice), isn’t that a reason (perhaps insufficient) to prefer it? To me, using this could be an improvement over just using priors, because I suspect it will further reduce the impacts of noise, and if it is an improvement, then just using priors never fully solved the problem in practice in the first place.
I agree with the rest of your comment. I think something like that would be useful.
What do you mean by “a good position”?
I’m getting a little confused about what sorts of concrete conclusions we are supposed to take away from here.
I’m not saying we shouldn’t use priors or that they’ll never help. What I am saying is that they don’t address the optimizer’s curse just by including them, and I suspect they won’t help at all on their own in some cases.
Maybe checking sensitivity to priors and further promoting interventions whose value depends less on them (among some set of “reasonable” priors) would help. You could see this as a special case of Chris’s suggestion to “Entertain multiple models”.
Perhaps you could even use an explicit model to combine the estimates or posteriors from multiple models into a single one in a way that either penalizes sensitivity to priors or gives less weight to more extreme estimates, but a simpler decision rule might be more transparent or otherwise preferable. From my understanding, GiveWell already uses medians of its analysts’ estimates this way.
Ah, I guess we’ll have to switch to a system of epistemology which doesn’t bottom out in unproven assumptions. Hey hold on a minute, there is none.
I get your point, but the snark isn’t helpful.
Yes, but it’s very hard to attack any particular prior as well.
I don’t think this leaves you in a good position if your estimates and rankings are very sensitive to the choice of “reasonable” priors. Chris illustrated this in his post at the end of part 2 (with the atheist example), and in part 3.
You could try to choose some compromise between these priors, but there are multiple “reasonable” ways to compromise. You could introduce a prior on these priors, but you could run into the same problem with multiple “reasonable” choices for this new prior.
I think even more people have things in the bads set, and there will be more agreement on these values, too, e.g. suffering, cruelty and injustice. The question is then a matter of weight.
Most people (and probably most EAs) aren’t antinatalists, so you would expect, for them, the total good to outweigh the total bad. Or, they haven’t actually thought about it enough.
OTOH, while current mental health issues may prevent altruism, prior experiences of suffering may lead to increased empathy and compassion.
A few more: energy (nuclear fusion, green tech, energy storage), medical physics, quantum computing (and its medical applications), risks from space and preparedness for worst case scenarios (like ALLFED).
By preventing one pregnancy in Vietnam, we save approximately 30 mammals, 850 chickens and 1,395 fish from being produced in factory-farmed conditions (or 35,626 welfare points).
Is this only from the animal products the child would have eaten themself? Should the consumption from that child’s descendants be included?
None of the GiveWell/ACE top or standout charities are working in these areas.
FWIW, TLYCS recommends PSI and DMI; DMI is also one of GiveWell’s standout charities, and both do family planning work.
FWIW, this is aimed at developing countries.
Couldn’t you say the same about GiveWell’s evaluation of AMF, TLYCS’s evaluation of PSI or the evaluation of any other charity or intervention that would predictably affect population sizes? ACE doesn’t consider impacts on wild animals for most of the charities/interventions it looks into, either, despite the effects of agriculture on wild animals.
My impression is that Charity Science/Entrepreneurship prioritizes global health/poverty and animal welfare, so we shouldn’t expect them to consider the effects on technological advancement or GCRs any more than we should expect GiveWell, TLYCS or ACE to.
They have worked on evaluating animal welfare, though, so it would be nice to see this work applied here for wild animals.
EDIT: Oh, is the concern that they’re looking at a more biased subset of possible effects (by focusing primarily on effects that seem positive)?
For the Rethink Priorities project, why not also look into consciousness in plant species (e.g. mimosa and some carnivorous plants), AI (especially reinforcement learning) and animal/brain simulations (e.g. OpenWorm)? Whether or not they’re conscious (or conscious in a way that’s morally significant), they can at least provide some more data to adjust our credences in the consciousness of different animal species; they can still be useful for comparisons.
I understand that there will be little research to use here, but I expect this to mean proportionately less time will be spent on them.
My rough answer to this is: If someone wants to die (after thinking about it for a long time and having time to reflect on it), let them die.
Some people don’t have the choice to die, because they’re prevented from it, like victims of abuse/torture or certain freak accidents.
I don’t see how the atrocities that are experienced by humans outweigh the benefits, given that the vast majority of humans seem to have a pretty decent will to live.
I think this is a problem with the idea of “outweigh”. Utilitarian interpersonal tradeoffs can be extremely cruel and unfair. If you think the happiness can aggregate to outweigh the worst instances of suffering:
1. How many additional happy people would need to be born to justify subjecting a child to a lifetime of abuse and torture?
2. How many extra years of happy life for yourself would you need to justify subjecting a child to a lifetime of abuse and torture?
The framings might provoke very different immediate reactions (2 seems much more accusatory, because the person benefitting from another’s abuse and torture is the one making the decision to subject them to it), but for someone just aggregating by summation, like a classical utilitarian, they’re basically the same.
I think it’s put pretty well here, too:
There’s ongoing sickening cruelty: violent child pornography, chickens are boiled alive, and so on. We should help these victims and prevent such suffering, rather than focus on ensuring that many individuals come into existence in the future. When spending resources on increasing the number of beings instead of preventing extreme suffering, one is essentially saying to the victims: “I could have helped you, but I didn’t, because I think it’s more important that individuals are brought into existence. Sorry.”