I want to point out that the ethical schools of thought that you’re (probably) most anti-aligned with (e.g., that certain behaviors and even thoughts are deserving of eternal divine punishment) are also far more prominent in the West, proportionately even more so than the ones you’re aligned with.
For what it’s worth, we recently ran a cross-cultural survey (n > 1,000 after extensive filtering) on endorsement of eternal extreme punishment, with questions like “If I could create a system that makes deserving people feel unbearable pain forever, I would” and “If hell didn’t exist, or if it stopped existing, we should create it [...]”.
~16–19% of Chinese respondents consistently endorsed such statements, compared to ~10–14% of US respondents—despite China being majority atheist/agnostic.[1]
Of course, online surveys are notoriously unreliable, especially on such abstract questions. But if these results hold up, concerns about eternal punishment would actually count against a China-dominated future, not in favor of one.
On individual questions, agreement rates were usually much higher, especially in China and other non-Western countries. The above numbers reflect a conservative conjunctive measure filtering for consistency across multiple questions.
Interesting! (And troubling—well above the lizardman constant.) It would be worth doing some qualitative follow-up on this, e.g., by having these consistently retributivist respondents chat with an LLM instructed to collect qualitative data and gently nudge them toward more suffering-averse views, to see how deeply held or changeable those beliefs are.
If I’m interpreting this correctly, 25% of people in China think that at least 58% of all people in the world deserve eternal unbearable pain (with similar results in 3 other countries). This is so crazy that I think there must be another explanation, e.g., results got mixed up, or a lot of people weren’t paying attention and just answered randomly.
Thanks for flagging this, but you’re looking at an unfiltered sample (N=2,980) which includes almost all participants regardless of data quality. All statistics in the main text use a filtered sample (N=1,084), which excludes participants who failed two attention checks, reported not answering honestly, gave invalid birth years, or strongly violated additivity (see the relevant section in the main post, including footnotes 94 and 95, for more details). The unfiltered numbers should be ignored, as they clearly include many inattentive participants. (We will update the supplementary materials to label them clearly; sorry about the confusion.)
Here is the table that only includes participants who passed our inclusion criteria (the 7th table in the selected stats doc).
As you can see, the numbers are much lower: 25% of Chinese respondents believe that at least 10% of people deserve unbearable pain forever.
(Note: this table shows N=1,036 rather than the N=1,084 in the main text; the small discrepancy likely reflects a stricter additivity filter. I’m confirming with my co-author Clare Harris who analyzed the survey and wrote the supplementary materials.)
Thanks for clarifying. Still, this suggests that the Chinese participants were on average much less conscientious about answering truthfully/carefully than the US/UK ones, which implies that even the filtered samples may still be relatively noisier.
Here is what Perplexity (with GPT-5.2 Thinking) suggested, among other ideas, when I asked “Are there standard methods for dealing with this in surveying/statistics?” (sorry, I don’t know how good the answer actually is):
Model carelessness (don’t only drop)
If you’re worried that “filtered China” still contains more residual noise, a more formal option is to use response-time mixture models / latent-class approaches that treat careful vs. careless responding as latent states and use RT information to infer them (reducing reliance on arbitrary RT cutoffs).
This can yield posterior probabilities of careless responding that you can use to down-weight or run analyses both with and without likely-careless respondents.
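To make the suggestion above concrete, here is a minimal sketch of the idea (my own illustration, not anything from the survey or from a specific package): fit a two-component Gaussian mixture to log response times via EM, treating the faster component as putatively careless, and use the resulting posterior probabilities as down-weights. It assumes log-RTs are roughly a mixture of two normals, which is itself a strong assumption.

```python
import numpy as np

def careless_posteriors(log_rts, n_iter=200):
    """EM for a two-component Gaussian mixture on log response times.

    Returns, for each respondent, the posterior probability of belonging
    to the faster (putatively careless) component.
    """
    x = np.asarray(log_rts, dtype=float)
    med = np.median(x)
    # Initialize by splitting at the median.
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each respondent.
        dens = (pi / (sigma * np.sqrt(2 * np.pi))
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    fast = np.argmin(mu)  # component with the smaller mean log-RT
    return resp[:, fast]

# Simulated data: 80% careful (slower) vs. 20% careless (faster) respondents.
rng = np.random.default_rng(0)
rts = np.concatenate([rng.normal(3.0, 0.3, 800),   # careful, log-seconds
                      rng.normal(1.5, 0.3, 200)])  # careless
p_careless = careless_posteriors(rts)
# Down-weight likely-careless respondents rather than dropping them outright.
weights = 1 - p_careless
```

In practice one would then rerun the main analyses with these weights, and also with likely-careless respondents excluded, to see whether conclusions move.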
Hi, happy to speak to the methodological points here.
Thanks for sharing the link and suggestion. We agree that understanding how much we can trust the data is crucial for interpreting our results, so thank you for engaging with this critically.
We didn’t measure individual reaction times for questions so using RT modelling isn’t an option. Modelling carelessness in other ways (e.g., modelling it as a latent tendency) would be fascinating, but I don’t endorse the assumptions we’d need to model carelessness as a latent variable. (I like how Rohrer & Paulewicz 2025 point out that latent variable modelling requires making some strong assumptions.) So even though we could try to model carelessness, I currently don’t think we should do so.
The headline results were designed such that it would be very unlikely for participants selecting at random to meet our definition of “consistent and concerning” endorsers. In the main piece, the focus is on those who agreed with the hell question, AND selected “Forever” for the duration question (the last of 11 options), AND selected 1% or more for the proportion question. The supplementary materials additionally report those who endorsed BOTH the “endorses system” question and the “would create” question, AND selected “Forever” for the duration question, AND selected 1% or more for the proportion question. Partly due to the design of these “headline” results variables, the proportion of the sample meeting our definition of “consistent and concerning” turned out to be robust to the removal of all attention checks (i.e., the inclusion of everyone who didn’t drop out before the questions of interest, with no filtering).
Regarding the other things in the list you linked to from that LLM chat, though, we did do most of those things. For example, we included unobtrusive checks and multiple different quality measures, not just attention checks—I’d be interested in your thoughts on the checks outlined in our supplementary folder. And importantly, for our headline results, we did sensitivity analyses and shared the results (including confidence intervals) in our supplementary materials folder.
(Also, just to address the point about the N varying between questions—that’s because different numbers of participants completed some questions; slightly fewer completed the duration and hell questions because we had tested different wordings for both questions early in the study, before they were replaced with new versions that were used for the rest of the study.)
Would be happy to answer follow-up questions too. Thanks!