I agree this is an important question that would be of value to other organizations as well. We’ve already consulted with 80K, CE and AAC about it, but still feel this is an area we have a lot more work to do on. It isn’t explicitly pointed out in our open questions doc, but when we talk about measuring and evaluating our counterfactual benefits and harms, this question has been top of mind for us.
The short version of our current thinking is separated into short-term measurement and long-term measurement. We expect that longer term this kind of evaluation will be easier, since we’ll at least have career trajectories to evaluate. Counterfactual impact estimation is always challenging without an experimental setup, which is hard to do at scale, but I think 80K and OpenPhil have put out multiple surveys that try to extract estimates of counterfactual impact and do so reasonably well given the challenges, so we’ll probably do something similar. Also, at that point, we could compare our results to theirs, which could be a useful barometer. In the specific context of our effect on people taking existing priority paths, I think it’ll be interesting to compare the chosen career paths of people who have discovered 80K through our website relative to those who discovered 80K from other sources.
Our larger area of focus at the moment is how to evaluate the effect of our work in the short term, when we can’t yet see our long-term effect on people’s careers. We plan on measuring proxies, such as changes to their values, beliefs and plans. We expect whatever proxy we use in the short term to be very noisy and based on a small sample size, so we plan on relying heavily on qualitative methods. This is one of the reasons we reached out to a lot of people who are experienced in this space (and we’re incredibly grateful they agreed to help): we think their intuition is an invaluable proxy for figuring out if we’re heading in the right direction.
This is an area that we believe is important and we still have a lot of uncertainty about, so additional advice from people with significant experience in this domain would be highly appreciated.
Thanks, that all sounds reasonable :)
Here’s a (potentially stupid) idea for a mini RCT-type evaluation of this that came to mind: You could perhaps choose some subset of applicants for advising calls, and then randomly assign half of those to go through your normal process and half to be simply referred to 80k. And 80k could perhaps do the same in the other direction.
You could perhaps arrange for these referred people to definitely be spoken to (rather than not being accepted for advising or waiting for many months). And/or you could choose the subset for this random allocation to ensure the people are fairly good fits for either organisation’s focus (rather than e.g. someone who’ll very clearly focus on longtermism or someone who’ll very clearly focus on global health & poverty).
And then you could see whether the outcomes differ depending on which org the people were randomly assigned to speak to. Including seeing if the people assigned to speak to 80k were substantially more likely to then pursue their priority paths, and if so, whether they stuck with that, whether they liked it, and whether they seem to be doing well at it.
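A minimal sketch of the random split described above, assuming applicants are identified by some ID. The function name `assign_arms`, the arm labels, and the fixed seed are all hypothetical and chosen only for illustration; neither organisation's actual process is implied.

```python
import random


def assign_arms(applicant_ids, seed=0):
    """Randomly split applicants into two roughly equal arms.

    Hypothetical helper: the arm names and the seed are illustrative
    assumptions, not anything either organisation has specified.
    """
    rng = random.Random(seed)       # seeded so the allocation can be audited later
    shuffled = list(applicant_ids)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "normal_process": shuffled[:half],
        "referred_to_80k": shuffled[half:],
    }


if __name__ == "__main__":
    arms = assign_arms(["a1", "a2", "a3", "a4", "a5", "a6"])
    for arm, ids in arms.items():
        print(arm, ids)
```

Fixing and recording the seed before looking at any applicant details is one simple way to make sure nobody can steer particular people into a particular arm.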
I raise this as food for thought rather than as a worked-out plan. It’s possible that anything remotely like this would be too complicated and time-consuming to be worthwhile. And even if something like this is worth doing, maybe various details would need to be added or changed.
I like this, but have a few concerns. First, you need to pick good outcome metrics, and most are high-variance and not very informative / objective. I also think the hoped-for outcomes are different, since 80k wants a few people to pick high-priority career paths, and Probably Good wants slight marginal improvements along potentially non-ideal career paths. And lastly, you can’t reliably randomize, since many people who might talk to Probably Good will be looking at 80k as well. Given all of that, I worry that even if you pick something useful to measure, the power / sample size needed, given individual variance, would be very large.
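To make the power concern concrete, here is a rough sketch of the standard normal-approximation sample size formula for comparing two proportions (for example, the share of people in each arm who go on to pursue a priority path). The function name `n_per_arm` and the 10% vs 15% rates in the example are hypothetical placeholders, not estimates from either organisation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate people needed per arm for a two-sided two-proportion z-test.

    Uses the usual normal approximation; the input rates are assumptions
    the analyst has to supply, and the result is only a rough guide.
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_treatment) / 2        # pooled rate under the null
    spread_null = sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


if __name__ == "__main__":
    # Hypothetical rates: 10% in one arm vs 15% in the other.
    print(n_per_arm(0.10, 0.15))
```

With rates like these the answer comes out in the hundreds per arm, and it grows quickly as the expected difference shrinks. Contamination (people in one arm also talking to the other organisation) would dilute the observed difference and push the required sample higher still.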
Still, I’d be happy to help Sella / Omer work through this and set it up, since I suspect they will get more applicants than they will be able to handle, and randomizing seems like a reasonable choice—and almost any type of otherwise useful follow-up survey can be used in this way once they are willing to randomize.