I have a bunch of thoughts on this, and would like to spend time thinking of more. Here are a few:
---
I’ve been advising this effort and gave feedback on it (some of which he explicitly included in the post, in the “Caveats and warnings” section). Correspondingly, I think it’s a good early attempt, but definitely feel like things are still fairly early. Doing deep evaluation of some of this without getting more empirical data (for instance, surveying people to see which ones might have taken this advice, or having probing conversations with Guesstimate users) seems necessary to get a decent picture. However, it is a lot of work. This effort was much more a matter of Nuño intuitively estimating all the parameters, which can get you kind of far, but shouldn’t be understood to be substantially more than that. Rubrics like these can be seen as much more authoritative than they actually are.
Reasons to expect these estimates to be over-positive
I tried my best to encourage Nuño to be fair and unbiased, but I’m sure he felt incentivized to give positive grades. I don’t believe I gave feedback to encourage him to change the scores favorably, but I did request that he make the uncertainty clearer in this post. This wasn’t because I thought I did poorly in the rankings; it’s more because I thought this was just a rather small amount of work for the claims being made. I imagine this will be an issue going forward with evaluation, especially since people being evaluated might be seen as possibly holding grudges or similar later on. It is not enough for them to not retaliate; the problem is that, from an evaluator’s perspective, there’s a chance that they might.
Also, I imagine there is some selection pressure toward a positive outcome. One of the reasons I have been advising his efforts is that they are very related to my interests, so it would make sense that he might be more positive towards my previous efforts than would be others with different interests. This is one challenging thing about evaluation: typically, the people who best understand the work have the advantage of better understanding its quality, but the disadvantage of typically being biased towards thinking this type of work is good.
Note that none of the projects wound up with a negative score, for example. I’m sure that at least one really should have if we were clairvoyant, although it’s not obvious to me which one at this point.
Reasons to expect these estimates to be over-negative
I personally care a whole lot more about being able to be neutral, and also about seeming neutral, than I do about my projects being evaluated favorably at this stage. I imagine this could have been the case for Nuño as well. So it’s possible there was some over-compensation here, but my guess is that you should expect things to be biased on the positive side regardless.
Tooling
I think this work brings to light how valuable improved tooling (better software solutions) could be. A huge spreadsheet can be kind of a mess, and things get more complicated if multiple users (like myself) were to try to make rankings. I’ve been investigating no-code options and would like to do some iteration here.
One change that seems obvious would be for reviews to be posted on the same page as the corresponding blog post. This could be done in the comments or in the post itself, like a GitHub status icon.
Decision Relevance
I’m hesitant to update much, due to the rather low weight I place on this. I was very uncertain about the usefulness of my projects before this, and I’m still uncertain about it afterwards. I agree that most of it is probably not going to be valuable at all unless I specifically, or much less likely someone else, continue this work into a more accessible or valuable form.
If it’s true that Guesstimate is actually far more important than anything else I’ve worked on, it would probably update me to focus a lot more on software. Recently I’ve been more focused on writing and mentorship than on technical development, but I’m considering changing back.
I think I would have paid around $1,000 or so for a report like this for my own use. Looking back, the main value would perhaps come from talking through the thoughts with the people doing the rating. We haven’t done this yet, but might going forward. I’d probably pay at least $10,000 or so if I were sure that it was “fairly correct”.
The value of research in neglected areas
I think one big challenge with research is that you either focus on an active area or a neglected one. In active areas, marginal contributions may be less valuable because others are much more likely to come up with them. There’s one model where there is basically a bunch of free prestige lying around, and if you get there first you are taking zero-sum gains directly from someone else. In the EA community in particular, I don’t want to play zero-sum games with other people. However, for neglected work, it seems very possible that almost no one will continue with it. My read is that neglected work is generally a fair bit more risky. There are instances where it goes well, and this could actually encourage a whole field to emerge (though this takes a while). There are other instances where no one happens to be interested in continuing this kind of research, and it dies before being able to be useful at all.
I think of my main research as being in areas I feel are very neglected. This can be exciting, but it has the obvious challenge that it is difficult for the work to be adopted by others, and so far this has been the case.
(I assume you mean something like “and getting more empirical data”, not “without”?)
I think it’d indeed be interesting and useful to combine the sort of intuitive estimation approach Nuño adopted here with some gathering of empirical data. Nuño (or whoever) could perhaps randomly select a subset of posts/outputs to gather empirical data on, to reduce how time-consuming/costly the data collection is.
Two potential methods of data collection that come to mind are:
Surveys
E.g., Rethink Priorities’ “impact survey”, my survey which was inspired by Rethink’s, 80k’s annual survey, and a recent Happier Lives Institute survey
Some extra discussion here
Interviews
E.g., Rethink do “structured interviews with key decision-makers and leaders at EA organizations[, and seek] interviewees’ feedback on the general importance of our work for them and for the community, what they have and have not found helpful in what we’ve done, what we can do in the future that would be useful for them, and ways we can improve.”
In fact, one possibility would be to use the intuitive estimation approach on the work of one of the orgs/people who already have a bunch of this sort of data relevant to that work (after checking that the org/people are happy to have their work used for this process), and then look at the empirical data, and see how they compare.
(I recently started working for Rethink, but the views in this comment are just my personal views.)
That’s quite useful, thanks
This seems like a neat idea to me. We’ll investigate it.
“Expected Error, or how wrong you expect to be” ended up with a −1, because of the negative comments.
Good catch
Regarding tooling, it may be very helpful to be able to input subjective distributions; uncertainty seems to me to be very important here, especially if we expect this kind of tool to be used by a low number of people.
Yea, I’d love to see things like this, but it’s all a lot of work. The existing tooling is quite bad, and it will probably be a while before we could rig it up with Foretold/Guesstimate/Squiggle.
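To illustrate what “inputting subjective distributions” could buy over point estimates in a spreadsheet, here is a minimal Monte Carlo sketch. This is a toy, not the actual Foretold/Guesstimate/Squiggle implementation: the function names, the 90% confidence-interval parameterization, and the example numbers are all assumptions made up for the illustration.

```python
import math
import random

def lognormal_from_90ci(low, high):
    """Return a sampler for a lognormal distribution whose 90% CI is (low, high).

    This mirrors the low-to-high interval style of input that tools like
    Guesstimate use, but the parameterization here is a simplified sketch.
    """
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * 1.645)  # 90% CI spans ±1.645 sd
    return lambda: random.lognormvariate(mu, sigma)

def propagate(cells, n=10_000):
    """Multiply independent cell distributions together by Monte Carlo sampling,
    returning the 5th, 50th, and 95th percentiles of the resulting product."""
    samples = []
    for _ in range(n):
        product = 1.0
        for sample in cells:
            product *= sample()
        samples.append(product)
    samples.sort()
    return samples[int(0.05 * n)], samples[int(0.50 * n)], samples[int(0.95 * n)]

# Hypothetical cells: "value per user" times "number of users", each a 90% CI.
low, mid, high = propagate([lognormal_from_90ci(2, 20), lognormal_from_90ci(100, 1000)])
```

The point of the sketch is that the output carries an uncertainty interval rather than a single number, which is exactly what a flat spreadsheet cell loses.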
Another idea—to be able to use units in cells such that the end result will depend on them. Say, for scale one can write “20000 QALYs” or “400 BCLs” (Broiler Chicken Lives) or “2% XpC” (X-risk per Century)
Yep, I think this is quite useful/obvious (if I understand it correctly). Work, though :)
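For concreteness, the unit-tagged-cell idea above could be sketched as a small parser. Everything here is hypothetical: the `UNITS` registry, the conversion weights between QALYs, BCLs, and XpC (real conversions would be contested value judgments, so these are pure placeholders), and the parsing rules are all invented for the illustration.

```python
import re

# Hypothetical registry mapping unit tokens to a common base scale.
# The unit names come from the comment above; the weights are placeholders,
# NOT real moral-weight conversions.
UNITS = {
    "QALYs": lambda x: x,            # quality-adjusted life years, used as the base unit
    "BCLs": lambda x: x * 0.01,      # Broiler Chicken Lives: placeholder weight
    "XpC": lambda x: x * 1_000_000,  # X-risk per century: placeholder weight
}

def parse_cell(text):
    """Parse a cell like '20000 QALYs' or '2% XpC' into a base-unit value."""
    match = re.fullmatch(r"\s*([\d_.,]+)\s*(%?)\s*([A-Za-z]+)\s*", text)
    if not match:
        raise ValueError(f"unparseable cell: {text!r}")
    number = float(match.group(1).replace(",", "").replace("_", ""))
    if match.group(2) == "%":
        number /= 100
    unit = match.group(3)
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit}")
    return UNITS[unit](number)
```

With a registry like this, the end result of a spreadsheet column can depend on the unit written in each cell, as the comment suggests, while the hard part (choosing defensible conversion weights) stays explicit in one place.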