do you have a sense of how to interpret the differences between options? E.g. I could imagine that basically everyone always gives an answer between 5 and 6, so a difference of 5.1 vs 5.9 is huge. I could also imagine that scores are uniformly distributed across the entire range of 1-7, in which case 5.1 vs 5.9 isn't that big.
Relatedly, I like how you included "positive action" as a comparison point, but I wonder if it's worth including something which is widely agreed to be mediocre (Effective Lawnmowing?) so that we can get a sense of how bad some of the lower scores are.
Thanks Ben!
I don't think there's a single way to interpret the magnitude of the differences or the absolute scores (e.g. a single effect size), so it's best to examine this in a number of different ways.
One way to interpret the difference between the ratings is to look at the probability of superiority scores. For example, for Study 3 we showed that ~78% of people would be expected to rate AI safety (6.00) higher than longtermism (4.75). In contrast, for AI safety vs effective giving (5.65) it's about 61%, and vs GCRR (5.95) it's only about 51%.
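For readers who want the mechanics, here is a minimal sketch (not our analysis code) of how a probability-of-superiority score can be computed from two sets of 1-7 ratings; the ratings below are made up purely for illustration:

```python
import numpy as np

def prob_superiority(x, y):
    # P(random rating from x > random rating from y), counting ties as 0.5
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return float(np.mean(x > y) + 0.5 * np.mean(x == y))

# Hypothetical 1-7 liking ratings (illustrative only)
ai_safety   = [6, 7, 5, 6, 6, 7, 4, 6]
longtermism = [5, 4, 5, 6, 3, 5, 4, 5]

print(prob_superiority(ai_safety, longtermism))  # ~0.83 for these made-up numbers
```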
You can also examine the (raw and weighted) distributions of the responses. This allows one to assess directly how many people "Like a great deal", "Dislike a great deal", and so on.
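As an illustration of what we mean by raw vs weighted distributions, the sketch below tabulates hypothetical 1-7 responses with hypothetical survey weights (not our actual data or weighting scheme):

```python
import numpy as np

# Hypothetical 1-7 liking responses and hypothetical survey weights
ratings = np.array([7, 6, 6, 5, 4, 7, 3, 6, 5, 6])
weights = np.array([1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7, 1.0, 1.0, 1.0])

for level in range(1, 8):
    mask = ratings == level
    raw = mask.mean()                               # unweighted share
    weighted = weights[mask].sum() / weights.sum()  # weighted share
    print(f"{level}: raw {raw:.0%}, weighted {weighted:.0%}")
```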
You can also look at different measures, which have a more concrete interpretation than liking. We did this with one (interest in hearing more information about a topic). But in future studies we'll include additional concrete measures, so we know e.g. how many people say they would get involved with x movement.
I agree that comparing these responses to other similar things outside of EA (like "positive action", but on the negative side) would be another useful way to gauge the meaning of these responses.
One other thing to add is that the design of these studies isn't optimised for assessing the effect of different names in absolute terms, because every subject evaluated every item ("within-subjects"). This allows greater statistical power more cheaply, but the evaluations are also more likely to be implicitly comparative. To get an estimate of something like the difference in the number of people who would be interested in x rather than y (assuming they would only encounter one or the other in the wild at a single time), we'd want to use a between-subjects design where people only evaluate one item and indicate their interest in it. The sketch below makes the contrast concrete.
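Here is the kind of estimate a between-subjects design would give (entirely made-up numbers): each respondent sees only one name and says whether they would be interested, and the quantity of interest is simply the difference in proportions.

```python
import numpy as np

# Hypothetical between-subjects data: each person saw only ONE name
# and indicated interest (1) or not (0)
interested_x = np.array([1] * 140 + [0] * 60)   # 200 people shown name x
interested_y = np.array([1] * 110 + [0] * 90)   # 200 people shown name y

diff = interested_x.mean() - interested_y.mean()
se = np.sqrt(interested_x.var(ddof=1) / len(interested_x)
             + interested_y.var(ddof=1) / len(interested_y))
print(f"Estimated difference in interest: {diff:.1%} (95% CI ±{1.96 * se:.1%})")
```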