We can get a better sense of the magnitude of the effect here with some further calculations. If we take all the people who have both pre- and post-FTX satisfaction responses (n = 951), we see that 4% of them have a satisfaction score that went up, 53% stayed the same, and 43% went down. That’s quite a striking negative impact. Of those whose scores went down, 67% had a reduction of only 1 point, 22% of 2 points, and 7%, 3%, and 1% had reductions of 3, 4, and 5 points respectively.
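As a rough illustration, here is a minimal Python sketch of how these change proportions could be computed from paired ratings. The data and column names are hypothetical stand-ins, not the actual survey data:

```python
import pandas as pd

# Hypothetical paired satisfaction ratings; in the real survey n = 951.
df = pd.DataFrame({
    "pre_ftx":  [7, 6, 8, 5, 9, 6],
    "post_ftx": [7, 5, 8, 3, 8, 6],
}).dropna()  # keep only respondents with both responses

change = df["post_ftx"] - df["pre_ftx"]

# Proportion of respondents whose score went up, stayed the same, or went down
print("up:  ", (change > 0).mean())
print("same:", (change == 0).mean())
print("down:", (change < 0).mean())

# Distribution of decrease sizes among those whose score went down
decreases = -change[change < 0]
print(decreases.value_counts(normalize=True).sort_index())
```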
We can also try to translate this effect into some more commonly used effect size metrics. Firstly, we can utilise a nice summary effect size metric for these ratings known as probability of superiority (PSup), which makes relatively few assumptions about the data: mainly that ratings can be ordered within the same respondent, i.e. higher ratings really are higher and lower ratings really are lower. This metric summarises the difference over time by taking the proportion of cases in which the score was higher pre-FTX (42.7%), adding half the proportion of cases in which the score was the same pre and post FTX (0.5 × 53.2% = 26.6%), and summing these quantities (69.3%). This can be read as an approximation of the proportion of people who, forced to choose, would report being more satisfied before rather than after. If everyone was more satisfied before, PSup would be 100%; if everyone was more satisfied after, PSup would be 0%; and if people were just as likely to be more satisfied before as after, PSup would be 50%. In this case, we get a PSup of 69.3%. This corresponds to an effect size in standard deviation units (like Cohen’s d) of approximately 0.7.
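For concreteness, here is a minimal Python sketch of the PSup calculation and of one standard way of converting it to a d-like effect size, assuming the normal-model relation PSup = Φ(d/√2). The function name is illustrative, and the conversion's normality assumption is ours rather than something stated in the text:

```python
import numpy as np
from scipy.stats import norm

def prob_superiority(pre, post):
    """Paired probability of superiority: P(pre > post) + 0.5 * P(pre == post)."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    return np.mean(pre > post) + 0.5 * np.mean(pre == post)

# Using the proportions reported above rather than raw data:
psup = 0.427 + 0.5 * 0.532   # = 0.693

# Convert PSup to standard deviation units, assuming normality:
# PSup = Phi(d / sqrt(2))  =>  d = sqrt(2) * Phi^{-1}(PSup)
d = np.sqrt(2) * norm.ppf(psup)
print(round(psup, 3), round(d, 2))  # ~0.693, ~0.71
```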
We would encourage people not to just look up whether these are small or large effects in a table that would say, e.g. from Wikipedia, that 0.7 falls in the ‘medium’ effect size bin. Think about how you would respond to this kind of question, what a difference of 1 or more points would mean in your head, and what the proportions of people giving different responses might substantively mean to them. How one can best interpret effect sizes varies greatly with context.