Thanks Ben!
I don’t think there’s a single way to interpret the magnitude of the differences or the absolute scores (e.g. a single effect size), so it’s best to examine this in a number of different ways.
One way to interpret the difference between the ratings is to look at the probability of superiority scores. For example, for Study 3 we showed that ~78% of people would be expected to rate AI safety (6.00) higher than longtermism (4.75). In contrast, for AI safety vs effective giving (5.65) it’s 61%, and for AI safety vs GCRR (5.95) it’s only about 51%.
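For anyone who wants to compute this sort of figure themselves, here’s a minimal sketch of how probability of superiority can be estimated from paired (within-subjects) ratings. The function and the example ratings are purely illustrative, not our data or analysis code:

```python
import numpy as np

def probability_of_superiority(a, b):
    """Share of respondents rating item A above item B, counting ties as 1/2."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a > b) + 0.5 * np.mean(a == b)

# Made-up 7-point ratings, for illustration only
ai_safety = np.array([6, 7, 5, 6, 7, 4, 6])
longtermism = np.array([5, 4, 5, 3, 6, 4, 5])
print(probability_of_superiority(ai_safety, longtermism))
```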
You can also examine the (raw and weighted) distributions of the responses. This lets one assess directly how many people responded “Like a great deal”, “Dislike a great deal”, and so on.
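As a sketch of what examining raw versus weighted response distributions can look like in practice (the response labels and weights below are placeholders, not our actual weighting scheme):

```python
import pandas as pd

# Placeholder responses and survey weights, for illustration only
df = pd.DataFrame({
    "response": ["Like a great deal", "Like somewhat", "Neither",
                 "Dislike somewhat", "Like a great deal"],
    "weight":   [1.2, 0.8, 1.0, 0.9, 1.1],
})

raw = df["response"].value_counts(normalize=True)
weighted = df.groupby("response")["weight"].sum() / df["weight"].sum()
print(pd.concat([raw.rename("raw"), weighted.rename("weighted")], axis=1))
```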
You can also look at different measures that have a more concrete interpretation than liking. We did this with one (interest in hearing more information about a topic), but in future studies we’ll include additional concrete measures, so we know, for example, how many people say they would get involved with x movement.
I agree that comparing these responses to responses for other similar things outside of EA (like “positive action”, but on the negative side) would be another useful way to gauge the meaning of these responses.
One other thing to add is that the design of these studies isn’t optimised for assessing the effect of different names in absolute terms, because every subject evaluated every item (“within-subjects”). This gives greater statistical power more cheaply, but the evaluations are also more likely to be implicitly comparative. To estimate something like the difference in the number of people who would be interested in x rather than y (assuming they would only encounter one or the other in the wild at a single time), we’d want a between-subjects design, where each person evaluates only one item and indicates their interest in it.
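To make that power trade-off concrete, here’s a rough simulation sketch of why the within-subjects design tends to buy power: each respondent acts as their own control, so item differences aren’t swamped by person-to-person differences in how the scale is used. All parameter values are invented for illustration and aren’t drawn from our studies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_power(n=200, item_diff=0.3, person_sd=1.0, noise_sd=1.0, sims=1000):
    """Rough power comparison of within- vs between-subjects designs (illustrative)."""
    within_hits = between_hits = 0
    for _ in range(sims):
        # Within-subjects: every person rates both items
        baseline = rng.normal(0, person_sd, n)
        rating_a = baseline + item_diff + rng.normal(0, noise_sd, n)
        rating_b = baseline + rng.normal(0, noise_sd, n)
        within_hits += stats.ttest_rel(rating_a, rating_b).pvalue < 0.05

        # Between-subjects: each person rates only one item (same total sample size)
        group_a = rng.normal(0, person_sd, n // 2) + item_diff + rng.normal(0, noise_sd, n // 2)
        group_b = rng.normal(0, person_sd, n // 2) + rng.normal(0, noise_sd, n // 2)
        between_hits += stats.ttest_ind(group_a, group_b).pvalue < 0.05

    return within_hits / sims, between_hits / sims

print(simulate_power())  # within-subjects power is typically noticeably higher
```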