The (un)reliability of moral judgments: A survey and systematic(ish) review

(This is a cross-post from my blog. The for­mat­ting there is a bit nicer. Most im­por­tantly, the ta­bles are less squished. Also, the in­ter­nal links work.)


Meta: This post at­tempts to sum­ma­rize the in­ter­dis­ci­plinary work on the (un)re­li­a­bil­ity of moral judge­ments. As that work con­tains many differ­ent per­spec­tives with no grand syn­the­sis and no clear win­ner (at pre­sent), this post is un­able to offer a sin­gle, neat con­clu­sion to take away. In­stead, this post is worth read­ing if the (un)re­li­a­bil­ity of moral judge­ments seems im­por­tant to you and you’d like to un­der­stand what the cur­rent state of in­ves­ti­ga­tion is.

We’d like our moral judg­ments to be re­li­able—to be sen­si­tive only to fac­tors that we en­dorse as morally rele­vant. Ex­per­i­men­tal stud­ies on a va­ri­ety of pu­ta­tively ir­rele­vant fac­tors—like the or­der­ing of dilem­mas and in­ci­den­tal dis­gust at time of eval­u­a­tion—give some (but not strong, due to method­olog­i­cal is­sues and limited data) rea­son to be­lieve that our moral judg­ments do in prac­tice track these ir­rele­vant fac­tors. The­o­ries about the ori­gins and op­er­a­tions of our moral fac­ul­ties give fur­ther rea­son to sus­pect that our moral judg­ments are not perfectly re­li­able. There are a va­ri­ety of re­sponses which try to re­ha­bil­i­tate our moral judg­ments—by deny­ing the val­idity of the ex­per­i­men­tal stud­ies, by block­ing the in­fer­ence to the peo­ple and situ­a­tions of most con­cern, by ac­cept­ing their limited re­li­a­bil­ity and shrug­ging, by ac­cept­ing their limited re­li­a­bil­ity and work­ing to over­come it—but it’s not yet clear whether any of them do or can suc­ceed.

(This post is painfully long. Coping advice: Each subsection within Direct (empirical) evidence, within Indirect evidence, and within Responses is pretty independent—feel free to dip in and out as desired. I’ve also put a list-formatted summary at the end of each of these sections boiling down each subsection to one or two sentences.)


“Dan is a stu­dent coun­cil rep­re­sen­ta­tive at his school. This semester he is in charge of schedul­ing dis­cus­sions about aca­demic is­sues. He of­ten picks top­ics that ap­peal to both pro­fes­sors and stu­dents in or­der to stim­u­late dis­cus­sion.”

Is Dan’s be­hav­ior morally ac­cept­able? On first glance, you’d be in­clined to say yes. And even on the sec­ond and third glance, ob­vi­ously, yes. Dan is a stand-up guy. But what if you’d been ex­per­i­men­tally ma­nipu­lated to feel dis­gust while read­ing the vi­gnette? If we’re to be­lieve (Wheatley and Haidt 2005), there’s a one-third chance you’d judge Dan as morally sus­pect. ‘One sub­ject jus­tified his con­dem­na­tion of Dan by writ­ing “it just seems like he’s up to some­thing.” Another wrote that Dan seemed like a “pop­u­lar­ity seek­ing snob.”’

The possibility that moral judgments track irrelevant factors like incidental disgust at the moment of evaluation is (to me, at least) alarming. But now that you’ve been baited, we can move on to the boring, obligatory formalities.

What are moral judg­ments?

A moral judgment is a belief that some moral proposition is true or false. It is the output of a process of moral reasoning. When I assent to the claim “Murder is wrong”, I’m making a moral judgment.

(Quite a bit of work in this area talks about moral in­tu­itions rather than moral judg­ments. Mo­ral in­tu­itions are more about the im­me­di­ate sense of some­thing than about the all-things-con­sid­ered, re­flec­tive judg­ment. One model of the re­la­tion­ship be­tween in­tu­itions and judg­ments is that in­tu­itions are the raw ma­te­rial which are re­fined into moral judg­ments by more so­phis­ti­cated moral rea­son­ing. We will talk pre­dom­i­nately about moral judg­ments be­cause:

  1. It’s hard to get at intuitions in empirical studies. I don’t have much faith in directions like “Give us your immediate reaction.”

  2. Mo­ral judg­ments are ul­ti­mately what we care about in­so­far as we call the things that mo­ti­vate moral ac­tion moral judg­ments.

  3. It’s not clear that moral in­tu­itions and moral judg­ments are always dis­tinct. There is rea­son to be­lieve that, at least for some peo­ple some of the time, moral in­tu­itions are not re­fined be­fore be­com­ing moral judg­ments. In­stead, they are sim­ply ac­cepted at face value.

On the other hand, in this post, we are in­ter­ested in judg­men­tal un­re­li­a­bil­ity driven by in­tu­itional un­re­li­a­bil­ity. We won’t fo­cus on ad­di­tional noise that any sub­se­quent moral rea­son­ing may layer on top of un­re­li­a­bil­ity in moral in­tu­itions.)

What would it mean for moral judg­ments to be un­re­li­able?

The sim­plest case of un­re­li­able judg­ments is when pre­cisely the same moral propo­si­tion is eval­u­ated differ­ently at differ­ent times. If I tell you that “Mur­der is wrong in con­text A.” to­day and “Mur­der is right in con­text A.” to­mor­row, my judg­ments are very un­re­li­able in­deed.

A more gen­eral sort of un­re­li­a­bil­ity is when our moral judg­ments as ac­tu­ally man­i­fested track fac­tors that seem, upon re­flec­tion, morally ir­rele­vant[1]. In other words, if two propo­si­tions are iden­ti­cal on all fac­tors that we en­dorse as morally rele­vant, our moral judg­ments about these propo­si­tions should be iden­ti­cal. The fear is that, in prac­tice, our moral judg­ments do not always ad­here to this rule be­cause we pay un­due at­ten­tion to other fac­tors.

Th­ese in­fluen­tial but morally ir­rele­vant fac­tors (at­tested to vary­ing de­grees in the liter­a­ture as we’ll see be­low) in­clude things like:

  • Order: The moral acceptability of a vignette depends on the order in which it’s presented relative to other vignettes.

  • Disgust and cleanliness: The moral acceptability of a vignette depends on how disgusted or clean the moralist (i.e. the person judging moral acceptability) feels at the time.

(The claim that cer­tain fac­tors are morally ir­rele­vant is it­self part of a moral the­ory. How­ever, some fac­tors seem to be morally ir­rele­vant on a very wide range of moral the­o­ries.)

Why do we care about the alleged un­re­li­a­bil­ity of moral judg­ments?

“The [Restric­tion­ist] Challenge, in a nut­shell, is that the ev­i­dence of the [hu­man philo­soph­i­cal in­stru­ment]’s sus­cep­ti­bil­ity to er­ror makes live the hy­poth­e­sis that the cathe­dra lacks re­sources ad­e­quate to the re­quire­ments of philo­soph­i­cal en­quiry.” —(J. M. Wein­berg 2017a)

We’re mostly going to bracket metaethical concerns here and assume that moral propositions with relatively stable truth-like values are possible and desirable and that our apprehension of these propositions should satisfy certain properties.

Given that, the over­all state­ment of the Un­re­li­a­bil­ity of Mo­ral Judg­ment Prob­lem looks like this:

  1. Eth­i­cal and metaeth­i­cal ar­gu­ments sug­gest that cer­tain fac­tors are not rele­vant for the truth of cer­tain moral claims and ought not to be con­sid­ered when mak­ing moral judg­ments.

  2. Empirical investigation and theoretical arguments suggest that the moral judgments of some people, in some cases, track these morally irrelevant factors.

  3. There­fore, for some peo­ple, in some cases, moral judg­ments track fac­tors they ought not to.

Of course, how wor­ri­some that con­clu­sion is de­pends on how we in­ter­pret the “some”s. We’ll ad­dress that in the fi­nal sec­tion. Be­fore that, we’ll look at the sec­ond premise. What is the ev­i­dence of un­re­li­a­bil­ity?

Direct (em­piri­cal) evidence

We now turn to the cen­tral ques­tion: “Are our moral in­tu­itions re­li­able?”. There’s a fairly broad set of ex­per­i­men­tal stud­ies ex­am­in­ing this ques­tion.

(When we ex­am­ine each of the pu­ta­tively ir­rele­vant moral fac­tors be­low, for the sake of brevity[2], I’ll as­sume it’s ob­vi­ous why there’s at least a prima fa­cie case for ir­rele­vance.)


I at­tempted a sys­tem­atic re­view of these stud­ies. My search pro­ce­dure was as fol­lows:

  1. I searched for “ex­per­i­men­tal philos­o­phy” and “moral psy­chol­ogy” on Library Ge­n­e­sis and se­lected all books with rele­vant ti­tles. If I was in doubt based on the ti­tle alone, I looked at the book’s brief de­scrip­tion on Library Ge­n­e­sis.

  2. I then ex­am­ined the table of con­tents for each of these books and read the rele­vant chap­ters. If I was in doubt as to the rele­vance of a chap­ter, I read its in­tro­duc­tion or did a quick skim.

  3. I searched for “re­li­a­bil­ity of moral in­tu­itions” on Google Scholar and se­lected rele­vant pa­pers based on their ti­tles and ab­stracts.

  4. I browsed the “ex­per­i­men­tal philos­o­phy: ethics” sec­tion of PhilPapers and se­lected rele­vant pa­pers based on their ti­tles and ab­stracts.

  5. Any relevant paper (as judged by title and abstract) that was cited in the works gathered in steps 1-4 was also selected for review.

When se­lect­ing works, I was look­ing for ex­per­i­ments that ex­am­ined how moral (not episte­molog­i­cal—an­other com­mon sub­ject of ex­per­i­ment) in­tu­itions about the right­ness or wrong­ness of be­hav­ior co­varied with fac­tors that are prima fa­cie morally ir­rele­vant. I was open to any sort of sub­ject pop­u­la­tion though most stud­ies ended up ex­am­in­ing WEIRD col­lege stu­dents or work­ers on on­line sur­vey plat­forms like Ama­zon Me­chan­i­cal Turk.

I ex­cluded ex­per­i­ments that ex­am­ined other moral in­tu­itions like:

  • whether moral claims are relative

  • if moral be­hav­ior is in­ten­tional or un­in­ten­tional (Knobe 2003)

There were also sev­eral stud­ies that ex­am­ined peo­ple’s re­sponses to Kah­ne­man and Tver­sky’s Asian dis­ease sce­nario. Even though this sce­nario has a strong moral di­men­sion, I ex­cluded these stud­ies on the grounds that any strangeness here was most likely (as judged by my in­tu­itions) a re­sult of non-nor­ma­tive is­sues (i.e. failure to ac­tu­ally calcu­late or con­sider the full im­pli­ca­tions of the sce­nario).

For each in­cluded study, I ex­tracted in­for­ma­tion like sam­ple size and the au­thors’ statis­ti­cal anal­y­sis. Some pu­ta­tively ir­rele­vant fac­tors—or­der and dis­gust—had enough stud­ies that ho­mog­e­niz­ing and com­par­ing the data seemed fruit­ful. In these cases, I com­puted the effect size for each data point (The code for these calcu­la­tions can be found here).

η² is a measure of effect size like the more popular (I think?) Cohen’s d. However, instead of measuring the standardized difference of the means of two populations (like d), η² measures the fraction of variation explained. That means η² is just like R². The somewhat arbitrary conventional classification is that η² ≈ 0.01 represents a small effect, η² ≈ 0.06 represents a medium effect and anything much larger (conventionally η² ≥ 0.14) counts as a large effect.
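To make the measure concrete, here is a minimal sketch (my own illustration, not the author’s linked code) of computing η² from raw ratings; the two groups of “agreement ratings” are hypothetical:

```python
# Sketch of computing eta-squared: the between-group sum of squares
# divided by the total sum of squares (fraction of variance explained
# by group membership). Illustrative only.
def eta_squared(*groups):
    """Between-group sum of squares over total sum of squares."""
    all_vals = [x for g in groups for x in g]
    grand_mean = sum(all_vals) / len(all_vals)
    ss_total = sum((x - grand_mean) ** 2 for x in all_vals)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Hypothetical agreement ratings under two vignette orderings:
order_a = [5, 6, 6, 7, 5, 6]
order_b = [4, 5, 4, 5, 5, 4]
print(round(eta_squared(order_a, order_b), 2))  # prints 0.55: a large effect
```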

For the fac­tors with high cov­er­age—or­der and dis­gust—I also cre­ated fun­nel plots. A fun­nel plot is a way to as­sess pub­li­ca­tion bias. If ev­ery­thing is on the up and up, the plot should look like an up­side down fun­nel—effect sizes should spread out sym­met­ri­cally as we move down from large sam­ple stud­ies to small sam­ple stud­ies. If re­searchers only pub­lish their most pos­i­tive re­sults, we ex­pect the fun­nel to be very lop­sided and for the effect size es­ti­mated in the largest study to be the small­est.
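The funnel-plot logic can also be sketched numerically. In the toy simulation below (entirely my own; the true effect, the noise model, and the crude “publish only impressive results” filter are invented assumptions), publication bias makes small published studies overestimate the effect far more than large ones—the lopsided funnel described above:

```python
import random

random.seed(1)
TRUE_EFFECT = 0.05

def simulate_study(n):
    # Estimation noise shrinks roughly with the square root of sample size.
    return random.gauss(TRUE_EFFECT, 0.8 / n ** 0.5)

# 1000 simulated studies at four sample sizes.
studies = [(n, simulate_study(n)) for n in [20, 50, 100, 400] * 250]

# Crude publication-bias filter: only "impressive" effects get written up.
published = [(n, e) for n, e in studies if e > 0.1]

def mean_effect(pairs, size):
    vals = [e for n, e in pairs if n == size]
    return sum(vals) / len(vals)

# In a funnel plot of `published`, the small-sample end bulges rightward:
# the published small studies exaggerate the effect much more than the
# published large ones do.
print(round(mean_effect(published, 20), 2), round(mean_effect(published, 400), 2))
```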


Order

Generally, the manipulation in these studies is to present vignettes in different sequences to investigate whether earlier vignettes influence moral intuitions about later vignettes. For example, if a subject receives:

  1. a trol­ley prob­lem where sav­ing five peo­ple re­quires kil­ling one by flip­ping a switch, and

  2. a trol­ley prob­lem where sav­ing five peo­ple re­quires push­ing one per­son with a heavy back­pack into the path of the trol­ley,

do subjects give the same responses to these vignettes regardless of the order in which they’re encountered?

The findings seem to be roughly that:

  1. there are some vi­gnettes which elicit sta­ble moral in­tu­itions and are not sus­cep­ti­ble to or­der effects

  2. more marginal sce­nar­ios are af­fected by or­der of presentation

  3. the or­der effect seems to op­er­ate like a ratchet in which peo­ple are will­ing to make sub­se­quent judg­ments stric­ter but not laxer

But should we actually trust the studies? I give brief comments on the methodology of each study in the appendix. Overall, these studies seemed pretty methodologically standard to me—no major red flags.

The quantitative results follow. The summary is that while there’s substantial variation in effect size and some publication bias, I’m inclined to believe there’s a real effect here.

Stud­ies of moral in­tu­itions and or­der effects

| Study | Independent variable | Dependent variable | Sample size | Result | Effect size |
|---|---|---|---|---|---|
| [@petrinovich1996influence], study 2, form 1 | Ordering of inaction vs action | Scale of agreement | 30 vs 29 | | |
| [@petrinovich1996influence], study 2, form 2 | Ordering of inaction vs action | Scale of agreement | 30 vs 29 | | |
| [@haidt1996social], mazda | Ordering of act vs omission | Rating act worse | 45.5 vs 45.5[^estimate] | | |
| [@haidt1996social], crane | Ordering of act vs omission | Rating act worse | 34.5 vs 34.5 | | |
| [@haidt1996social], mazda | Ordering of social roles | Rating friend worse | 45.5 vs 45.5 | | |
| [@haidt1996social], crane | Ordering of social roles | Rating foreman worse | 34.5 vs 34.5 | | |
| [@lanteri2008experimental] | Ordering of vignettes | Obligatory or not | 31 vs 31 | | |
| [@lanteri2008experimental] | Ordering of vignettes | Acceptable or not | 31 vs 31 | p = 0.0011 | |
| [@lombrozo2009role] | Ordering of trolley switch vs push | Rating of permissibility | 56 vs 56 | | |
| [@zamzow2009variations] | Ordering of vignettes | Right or wrong | 8 vs 9 | | |
| [@wright2010intuitional], study 2 | Ordering of vignettes | Right or wrong | 30 vs 30 | | |
| [@schwitzgebel2012expertise], philosophers | Within-pair vignette orderings | Number of pairs judged equivalent | 324 | | |
| [@schwitzgebel2012expertise], academic non-philosophers | Within-pair vignette orderings | Number of pairs judged equivalent | 753 | | |
| [@schwitzgebel2012expertise], non-academics | Within-pair vignette orderings | Number of pairs judged equivalent | 1389 | | |
| [@liao2012putting] | Ordering of vignettes | Rating of permissibility | 48.3 vs 48.3 vs 48.3 | | |
| [@wiegmann2012order] | Most vs least agreeable first | Rating of shouldness | 25 vs 25 | | |

[Figure: (pseudo-)forest plot showing reported effect sizes for order manipulations]

While there’s clearly dispersion here, that’s to be expected given the heterogeneity of the studies. The most important source of that heterogeneity (I’d guess) is the vignettes used[3]. The more difficult the dilemma, the more I’d expect order effects to matter, and I’d expect some vignettes to show no order effect. I’m not going to endorse murder for fun no matter which vignette you precede it with. Given all this, a study could presumably drive the effect size from ordering arbitrarily low with the appropriate choice of vignettes. On the other hand, it seems like there probably is some upper bound on the magnitude of order effects, and more careful studies and reviews could perhaps tease that out.
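That dilution point is easy to illustrate. In the toy simulation below (my own, with invented numbers), mixing order-insensitive vignettes into a study drags the measured η² toward zero:

```python
# Toy model: "sensitive" vignettes shift ratings by 1 under the
# manipulated ordering; "insensitive" ones (murder for fun) don't budge.
import random

random.seed(4)

def eta_squared(a, b):
    both = a + b
    grand_mean = sum(both) / len(both)
    ss_total = sum((x - grand_mean) ** 2 for x in both)
    ss_between = (len(a) * (sum(a) / len(a) - grand_mean) ** 2
                  + len(b) * (sum(b) / len(b) - grand_mean) ** 2)
    return ss_between / ss_total

def simulated_study(frac_sensitive, n=300):
    def rating(shift):
        sensitive = random.random() < frac_sensitive
        return random.gauss(shift if sensitive else 0.0, 1.0)
    control = [rating(0.0) for _ in range(n)]
    manipulated = [rating(1.0) for _ in range(n)]
    return eta_squared(control, manipulated)

hard_only = simulated_study(1.0)    # every vignette is order-sensitive
mostly_easy = simulated_study(0.2)  # 80% order-insensitive vignettes
print(round(hard_only, 2), round(mostly_easy, 2))
```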

[Figure: funnel plot showing reported effect sizes for order manipulations]

The fun­nel plot seems to in­di­cate some pub­li­ca­tion bias, but it looks like the effect may be real even af­ter ac­count­ing for that.


Wording

Unfortunately, I only found one paper directly testing this factor. In this study, half the participants had their trolley problem described with:

(a) “Throw the switch, which will re­sult in the death of the one in­no­cent per­son on the side track” and (b) “Do noth­ing, which will re­sult in the death of the five in­no­cent peo­ple.”

and the other half had their prob­lem de­scribed with:

(a) “Throw the switch, which will re­sult in the five in­no­cent peo­ple on the main track be­ing saved” and (b) “Do noth­ing, which will re­sult in the one in­no­cent per­son be­ing saved.”


The ac­tual con­se­quences of each ac­tion are the same in each con­di­tion—it’s only the word­ing which has changed. The study (with each of two in­de­pen­dent sam­ples) found that in­deed peo­ple’s moral in­tu­itions varied based on the word­ing:

Stud­ies of moral in­tu­itions and word­ing effects

| Study | Independent variable | Dependent variable | Sample size | Result | Effect size |
|---|---|---|---|---|---|
| [@petrinovich1993empirical], general class | Wording of vignettes | Scale of agreement | 361 | | |
| [@petrinovich1993empirical], biomeds | Wording of vignettes | Scale of agreement | 60 | | |

While the effects are quite large here, it’s worth not­ing that in other stud­ies in other do­mains fram­ing effects have dis­ap­peared when prob­lems were more fully de­scribed (Küh­berger 1995). (Kuhn 1997) found that even word­ings which were plau­si­bly equiv­a­lent led sub­jects to al­ter their es­ti­mates of im­plicit prob­a­bil­ities in vi­gnettes.

Dis­gust and cleanliness

In stud­ies of dis­gust[4], sub­jects are ma­nipu­lated to feel dis­gust via mechanisms like:

  1. re­call­ing and vividly writ­ing about a dis­gust­ing ex­pe­rience,

  2. watch­ing a clip from Trainspot­ting,

  3. be­ing ex­posed to a fart spray, and

  4. be­ing hyp­no­tized to feel dis­gust at the word “of­ten” (Yes, this is re­ally one of the stud­ies).

In stud­ies of clean­li­ness, sub­jects are ma­nipu­lated to feel clean via mechanisms like:

  1. do­ing sen­tence un­scram­bling tasks with words about clean­li­ness,

  2. wash­ing their hands, and

  3. be­ing in a room where Win­dex was sprayed.

After dis­gust or clean­li­ness is in­duced (in the non-con­trol sub­jects), sub­jects are asked to un­der­take some morally-loaded ac­tivity (usu­ally mak­ing moral judg­ments about vi­gnettes). The hy­poth­e­sis is that their re­sponses will be differ­ent be­cause talk of moral pu­rity and dis­gust is not merely metaphor­i­cal—feel­ings of clean­li­ness and in­ci­den­tal dis­gust at the time of eval­u­a­tion have a causal effect on moral eval­u­a­tions. Con­fus­ingly, the ex­act na­ture of this pu­ta­tive re­la­tion­ship seems rather pro­tean: it de­pends sub­tly on whether the sub­ject or the ob­ject of a judg­ment feels clean or dis­gusted and can be me­di­ated by pri­vate body con­scious­ness and re­sponse effort.

As the above para­graph may sug­gest, I’m pretty skep­ti­cal of a shock­ing frac­tion of these stud­ies (as dis­cussed in more de­tail in the ap­pendix). Some re­cur­ring rea­sons:

  1. the ma­nipu­la­tions of­ten seem quite weak (e.g. sen­tence un­scram­bling, a spritz of Lysol on the ques­tion­naire),

  2. the ma­nipu­la­tion checks of­ten fail but the au­thors never seem par­tic­u­larly trou­bled by this or the fact that they find their pre­dicted re­sults de­spite the ap­par­ent failure of ma­nipu­la­tion,

  3. au­thors seem more in­clined to ex­plain noisy or ap­par­ently con­tra­dic­tory re­sults by com­pli­cat­ing their the­ory than by falsify­ing their the­ory, and

  4. mul­ti­ple di­rect repli­ca­tions have failed.

The quan­ti­ta­tive re­sults fol­low. I’ll sum­ma­rize them in ad­vance by draw­ing at­ten­tion to the mis­shapen fun­nel plot which I take as strong sup­port for my method­olog­i­cal skep­ti­cism. The ev­i­dence mar­shaled so far does not seem to sup­port the claim that dis­gust and clean­li­ness in­fluence moral judg­ments.

Stud­ies of moral in­tu­itions and dis­gust or clean­li­ness effects

| Study | Independent variable | Dependent variable | Sample size | Result | Effect size |
|---|---|---|---|---|---|
| [@wheatley2005hypnotic], experiment 1 | Hypnotic disgust cue | Scale of wrongness | 45 | | |
| [@wheatley2005hypnotic], experiment 2 | Hypnotic disgust cue | Scale of wrongness | 63 | | |
| [@schnall2008clean], experiment 1 | Clean word scramble | Scale of wrongness | 20 vs 20 | | |
| [@schnall2008clean], experiment 2 | Disgusting movie clip | Scale of wrongness | 22 vs 22 | | |
| [@schnall2008disgust], experiment 1 | Fart spray | Likert scale | 42.3 vs 42.3 vs 42.3 | | |
| [@schnall2008disgust], experiment 2 | Disgusting room | Scale of appropriacy | 22.5 vs 22.5 | Not significant | |
| [@schnall2008disgust], experiment 3 | Describe disgusting memory | Scale of appropriacy | 33.5 vs 33.5 | Not significant | |
| [@schnall2008disgust], experiment 4 | Disgusting vs sad vs neutral movie clip | Scale of appropriacy | 43.3 vs 43.3 vs 43.3 | | |
| [@horberg2009disgust], study 2 | Disgusting vs sad movie clip | Scale of rightness and wrongness | 59 vs 63 | | |
| [@liljenquist2010smell], experiment 1 | Clean scent in room | Money returned | 14 vs 14 | | |
| [@liljenquist2010smell], experiment 2 | Clean scent in room | Scale of volunteering interest | 49.5 vs 49.5 | | |
| [@liljenquist2010smell], experiment 2 | Clean scent in room | Willingness to donate | 49.5 vs 49.5 | | |
| [@zhong2010clean], experiment 1 | Antiseptic wipe for hands | Scale of immoral to moral | 29 vs 29 | | |
| [@zhong2010clean], experiment 2 | Visualize clean vs dirty and nothing | Scale of immoral to moral | 107.6 vs 107.6 vs 107.6 | | |
| [@zhong2010clean], experiment 2 | Visualize dirty vs nothing | Scale of immoral to moral | 107.6 vs 107.6 vs 107.6 | | |
| [@zhong2010clean], experiment 3 | Visualize clean vs dirty | Scale of immoral to moral | 68 vs 68 | | |
| [@eskine2011bad] | Sweet, bitter or neutral drink | Scale of wrongness | 18 vs 15 vs 21 | | |
| [@david2011effect] | Presence of disgust-conditioned word | Scale of wrongness | 61 | Not significant | |
| [@tobia2013cleanliness], undergrads | Clean scent on survey | Scale of wrongness | 84 vs 84 | | |
| [@tobia2013cleanliness], philosophers | Clean scent on survey | Scale of wrongness | 58.5 vs 58.5 | Not significant | |
| [@huang2014does], study 1 | Clean word scramble | Scale of wrongness | 111 vs 103 | | |
| [@huang2014does], study 2 | Clean word scramble | Scale of wrongness | 211 vs 229 | | |
| [@johnson2014does], experiment 1 | Clean word scramble | Scale of wrongness | 114.5 vs 114.5 | | |
| [@johnson2014does], experiment 2 | Washing hands | Scale of wrongness | 58 vs 68 | | |
| [@johnson2016effects], study 1 | Describe disgusting memory | Scale of wrongness | 222 vs 256 | | |
| [@johnson2016effects], study 2 | Describe disgusting memory | Scale of wrongness | 467 vs 467 | | |
| [@daubman2014] | Clean word scramble | Scale of wrongness | 30 vs 30 | | |
| [@daubman2013] | Clean word scramble | Scale of wrongness | 30 vs 30 | | |
| [@johnson2014] | Clean word scramble | Scale of wrongness | 365.6 vs 365.5 | | |

[Figure: (pseudo-)forest plot showing reported effect sizes for disgust/cleanliness manipulations]

[Figure: funnel plot showing reported effect sizes for disgust/cleanliness manipulations]

This fun­nel plot sug­gests pretty heinous pub­li­ca­tion bias. I’m in­clined to say that the ev­i­dence does not sup­port claims of a real effect here.


Gender

This factor has extra weight within the field of philosophy because it’s been offered as an explanation for the relative scarcity of women in academic philosophy (Buckwalter and Stich 2014): if women’s philosophical intuitions systematically diverge from those of men and from canonical answers to various thought experiments, they may find themselves discouraged.

Stud­ies on this is­sue typ­i­cally just send sur­veys to peo­ple with a se­ries of vi­gnettes and an­a­lyze how the re­sults vary de­pend­ing on gen­der.

I ex­cluded (Buck­walter and Stich 2014) en­tirely for rea­sons de­scribed in the ap­pendix.

Here are the quan­ti­ta­tive re­sults:

Stud­ies of moral in­tu­itions and gen­der effects

| Study | Independent variable | Dependent variable | Sample size | Result |
|---|---|---|---|---|
| [@lombrozo2009role], trolley switch | Gender | Scale of permissibility | 74.7 vs 149.3 | |
| [@lombrozo2009role], trolley push | Gender | Scale of permissibility | 74.7 vs 149.3 | |
| [@seyedsayamdost2015gender], plank of Carneades, MTurk | Gender | Scale of blameworthiness | 70 vs 86 | |
| [@seyedsayamdost2015gender], plank of Carneades, SurveyMonkey | Gender | Scale of blameworthiness | 48 vs 50 | |
| [@adleberg2015men], violinist | Gender | Scale from forbidden to obligatory | 52 vs 84 | |
| [@adleberg2015men], magistrate and the mob | Gender | Scale from bad to good | 71 vs 87 | |
| [@adleberg2015men], trolley switch | Gender | Scale of acceptability | 52 vs 84 | |

As we can see, there doesn’t seem to be good ev­i­dence for an effect here.

Cul­ture and so­cioe­co­nomic status

There’s just one study here[5]. It tested re­sponses to moral vi­gnettes across high and low so­cioe­co­nomic sta­tus sam­ples in Philadelphia, USA and Porto Ale­gre and Re­cife, Brazil.

As mentioned in the appendix, I find the seemingly very artificial dichotomization of the outcome measure a bit strange in this study.

Here are the quan­ti­ta­tive re­sults:

Stud­ies of moral in­tu­itions and cul­ture/​SES effects

| Study | Independent variable | Dependent variable | Sample size | Result |
|---|---|---|---|---|
| [@haidt1993affect], adults | Culture | Acceptable or not | 90 vs 90 | |
| [@haidt1993affect], children | Culture | Acceptable or not | 90 vs 90 | |
| [@haidt1993affect], adults | SES | Acceptable or not | 90 vs 90 | |
| [@haidt1993affect], children | SES | Acceptable or not | 90 vs 90 | |

The study found that Amer­i­cans and those of high so­cioe­co­nomic sta­tus were more likely to judge dis­gust­ing but harm­less ac­tivi­ties as morally ac­cept­able.


Personality

There’s just one survey here examining how responses to vignettes varied with Big Five personality traits.

Stud­ies of moral in­tu­itions and per­son­al­ity effects

| Study | Independent variable | Dependent variable | Sample size | Result |
|---|---|---|---|---|
| [@feltz2008fragmented], experiment 2 | Extraversion | Is it wrong? Yes or no | 162 | |


Actor/observer

In these studies, one version of the vignette has some stranger as the central figure in the dilemma. The other version puts the survey’s subject in the moral dilemma. For example, “Should Bob throw the trolley switch?” versus “Should you throw the trolley switch?”.

I’m ac­tu­ally mildly skep­ti­cal that in­con­sis­tency here is nec­es­sar­ily any­thing to dis­ap­prove of. Sub­jects know more about them­selves than about ar­bi­trary char­ac­ters in vi­gnettes. That ex­tra in­for­ma­tion could be jus­tifi­able grounds for differ­ent eval­u­a­tions. For ex­am­ple, if sub­jects un­der­stand them­selves to be more likely than the av­er­age per­son to be haunted by util­i­tar­ian sac­ri­fices, that could ground differ­ent de­ci­sions in moral dilem­mas call­ing for util­i­tar­ian sac­ri­fice.

Nev­er­the­less, the quan­ti­ta­tive re­sults fol­low. They gen­er­ally find there is a sig­nifi­cant effect.

Stud­ies of moral in­tu­itions and ac­tor/​ob­server effects

| Study | Independent variable | Dependent variable | Sample size | Result |
|---|---|---|---|---|
| [@nadelhoffer2008actor], trolley switch, undergrads | Actor vs observer | Morally permissible? Yes or no | 43 vs 42 | 90% permissible in observer condition; 65% permissible in actor condition |
| [@tobia2013moral], trolley switch, philosophers | Actor vs observer | Morally permissible? Yes or no | 24.5 vs 24.5 | 64% permissible in observer condition; 89% permissible in actor condition |
| [@tobia2013moral], Jim and the natives, undergrads | Actor vs observer | Morally obligated? Yes or no | 20 vs 20 | 53% obligatory in observer condition; 19% obligatory in actor condition |
| [@tobia2013moral], Jim and the natives, philosophers | Actor vs observer | Morally obligated? Yes or no | 31 vs 31 | 9% obligatory in observer condition; 36% obligatory in actor condition |
| [@tobia2013cleanliness], undergrads | Actor vs observer | Scale of wrongness | 84 vs 84 | |
| [@tobia2013cleanliness], philosophers | Actor vs observer | Scale of wrongness | 58.5 vs 58.5 | Not significant |


Summary

  • Order: Lots of studies; overall there seems to be evidence of an effect

  • Word­ing: Just one study, big effect, strikes me as plau­si­ble that there’s an effect here

  • Dis­gust and clean­li­ness: Lots of stud­ies, lots of method­olog­i­cal prob­lems and lots of pub­li­ca­tion bias, I round this to no good ev­i­dence of the effect

  • Gender: A medium number of studies, which generally don’t find evidence of an effect

  • Cul­ture and so­cioe­co­nomic sta­tus: One study, found effect, seems hard to imag­ine there’s no effect here

  • Per­son­al­ity: One study, found effect

  • Ac­tor/​ob­server: A cou­ple of stud­ies, found big effects, strikes me as plau­si­ble that there’s an effect here

Indi­rect evidence

Given that the di­rect ev­i­dence isn’t quite defini­tive, it may be use­ful to look at some in­di­rect ev­i­dence. By that, I mean we’ll look at (among other things) some un­der­ly­ing the­o­ries about how moral in­tu­itions op­er­ate and what bear­ing they have on the ques­tion of re­li­a­bil­ity.

Heuris­tics and biases

No com­plex hu­man fac­ulty is perfectly re­li­able. This is no sur­prise and per­haps not of great im­port.

But we have evidence that some faculties are not only “not perfect” but systematically and substantially biased. The heuristics and biases program of research (heavily associated with Kahneman and Tversky) has shown[6] serious limitations in human rationality. A review of that literature is out of scope here, but the list of alleged aberrations is extensive. Scope insensitivity—the failure of people, for example, to care twice as much about twice as many oil-covered seagulls—is one example I find compelling.

How rele­vant these prob­lems are for moral judg­ment is a mat­ter of some in­ter­pre­ta­tion. An ar­gu­ment for rele­vance is this: even sup­pos­ing we have sui generis moral fac­ul­ties for judg­ing purely nor­ma­tive claims, much day-to-day “moral” rea­son­ing is ac­tu­ally pru­den­tial rea­son­ing about how best to achieve our ends given con­straints. This sort of pru­den­tial rea­son­ing is squarely in the crosshairs of the heuris­tics and bi­ases pro­gram.

At a minimum, prominent heuristics and biases researcher Gerd Gigerenzer endorses the hypothesis that the heuristics underlying moral behavior are “largely” the same as the heuristics underlying other behavior (Gigerenzer 2008). He explains, “Moral intuitions fit the pattern of heuristics, in our ‘narrow’ sense, if they involve (a) a target attribute that is relatively inaccessible, (b) a heuristic attribute that is more easily accessible, and (c) an unconscious substitution of the target attribute for the heuristic attribute.” Condition (a) is satisfied by many accounts of morality, and heuristic attributes as mentioned in (b) abound (e.g. how bad does it feel to think about action A). It seems unlikely that the substitution described in (c) fails to happen only in the domain of moral judgments.


Now we’ll look at un­re­li­a­bil­ity at a lower level.

A distinction is sometimes drawn between joint evaluations—choice—and single evaluations—judgment. In a choice scenario, an actor has to choose between multiple options presented to them simultaneously. For example, picking a box of cereal in the grocery store requires choice. In a judgment scenario, an actor makes some evaluation of an option presented in isolation. For example, deciding how much to pay for a used car is a judgment scenario.

For both tasks, lead­ing mod­els are (as far as I un­der­stand things) fun­da­men­tally stochas­tic.

Judg­ment tasks are de­scribed by the ran­dom util­ity model in which, upon in­tro­spec­tion, an ac­tor sam­ples from a dis­tri­bu­tion of pos­si­ble val­u­a­tions for an op­tion rather than find­ing a sin­gle, fixed val­u­a­tion (Glim­cher, Dor­ris, and Bayer 2005). This makes sense at the neu­ronal level be­cause lik­ing is en­coded as the firing rate of a neu­ron and firing rates are stochas­tic.
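A minimal sketch of the random utility idea (my own illustration with assumed numbers, not the cited paper’s model): because each introspection draws a fresh sample from the valuation distribution, repeated judgments of the same marginal option come out differently.

```python
import random

random.seed(2)

def judge(mean_value, noise_sd=1.0, threshold=0.0):
    """One judgment: sample a valuation and approve if it clears the bar."""
    return random.gauss(mean_value, noise_sd) > threshold

# An option the actor values only marginally (mean just above threshold):
verdicts = [judge(0.3) for _ in range(1000)]
approval_rate = sum(verdicts) / len(verdicts)
print(approval_rate)  # well away from both 0 and 1: the "same" judgment flips
```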

Choice tasks are de­scribed by the drift diffu­sion model in which the cur­rent dis­po­si­tion to act starts at 0 on some axis and takes a bi­ased ran­dom walk (drifts) (Rat­cliff and McKoon 2008). Away from zero, on two op­po­site sides, are thresh­olds rep­re­sent­ing each of the two op­tions. Once the cur­rent dis­po­si­tion drifts past a thresh­old, the cor­re­spond­ing op­tion is cho­sen. Be­cause of the ran­dom noise in the drift pro­cess, there’s no guaran­tee that the thresh­old fa­vored by the bias will always be the first one crossed. Again, the ran­dom­ness in this model makes sense be­cause neu­rons are stochas­tic.

Plot showing the drift diffusion model

Ex­am­ple of ten ev­i­dence ac­cu­mu­la­tion se­quences for the drift diffu­sion model, where the true re­sult is as­signed to the up­per thresh­old. Due to the ad­di­tion of noise, two se­quences have pro­duced an in­ac­cu­rate de­ci­sion. From Wikipe­dia.

So for both choice and judgment tasks, low-level models and neural considerations suggest that we should expect noise rather than perfect reliability. And we should probably expect this to apply equally in the moral domain. Indeed, experimental evidence suggests that a drift diffusion model can be fit to moral decisions (Crockett et al. 2014) (Hutcherson, Bushong, and Rangel 2015).
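The drift diffusion process described above can be sketched in a few lines. The drift rate, noise scale, thresholds, and step size below are illustrative assumptions, not fitted parameters:

```python
import random

def drift_diffusion_choice(drift=0.1, noise=1.0, threshold=2.0, dt=0.01, seed=None):
    """Simulate one decision as a biased random walk between two thresholds.

    Returns +1 if the upper (drift-favored) threshold is crossed first,
    -1 if the lower one is. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    x = 0.0
    while abs(x) < threshold:
        # biased step plus Gaussian noise, scaled for the time increment
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0, 1)
    return 1 if x > 0 else -1

# Even with a positive drift, noise sometimes carries the walk across the
# "wrong" threshold -- the model is stochastic, not deterministic.
choices = [drift_diffusion_choice(seed=i) for i in range(1000)]
upper_rate = choices.count(1) / len(choices)
```

Running many trials makes the point: the drift-favored option wins most of the time, but a substantial minority of walks end at the other threshold, just as in the figure.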

Dual process

Josh Greene’s dual pro­cess the­ory of moral in­tu­itions (Greene 2007) sug­gests that we have two differ­ent types of moral in­tu­itions origi­nat­ing from two differ­ent cog­ni­tive sys­tems. Sys­tem 1 is emo­tional, au­to­matic and pro­duces char­ac­ter­is­ti­cally de­on­tolog­i­cal judg­ments. Sys­tem 2 is non-emo­tional, re­flec­tive and pro­duces char­ac­ter­is­ti­cally con­se­quen­tial­ist judg­ments.

He makes the fur­ther claim that these de­on­tolog­i­cal, sys­tem 1 judg­ments ought not to be trusted in novel situ­a­tions be­cause their au­to­mat­ic­ity means they fail to take new cir­cum­stances into ac­count.


Genes

All complex behavioral traits have substantial genetic influence (Plomin et al. 2016). Naturally, moral judgments are part of “all”. This means certain traits relevant for moral judgment are evolved. But an evolved trait is not necessarily an adaptation. A trait only rises to the level of adaptation if it was the result of natural selection (as opposed to, for example, random drift).

If our evolved fac­ul­ties for moral judg­ment are not adap­ta­tions (i.e. they are ran­dom and not the product of se­lec­tion), it seems clear that they’re un­likely to be re­li­able.

On the other hand, might adaptations be reliable? Alas, even if our moral intuitions are adaptive, this is no guarantee that they track the truth. First, knowledge is not always fitness relevant. For example, “perceiving gravity as a distortion of space-time” would have been no help in the ancestral environment (Krasnow 2017). Second, asymmetric costs and benefits for false positives and false negatives mean that perfect calibration isn’t necessarily optimal. Prematurely condemning a potential hunting partner as untrustworthy comes at minimal cost if there are other potential partners around, while getting literally stabbed in the back during a hunt would be very costly indeed. Finally, because we are socially embedded, wrong beliefs can increase fitness if they affect how others treat us.

Even if our moral in­tu­itions are adap­ta­tions and were re­li­able in the an­ces­tral en­vi­ron­ment, that’s no guaran­tee that they’re re­li­able in the mod­ern world. There’s rea­son to be­lieve that our moral in­tu­itions are not well-tuned to “evolu­tion­ar­ily novel moral dilem­mas that in­volve iso­lated, hy­po­thet­i­cal, be­hav­ioral acts by un­known strangers who can­not be re­warded or pun­ished through any nor­mal so­cial pri­mate chan­nels”. (Miller 2007) (Though for a con­trary point of view about so­cial con­di­tions in the an­ces­tral en­vi­ron­ment, see (Turner and Maryan­ski 2013).) This claim is es­pe­cially per­sua­sive if we be­lieve that (at least some of) our moral in­tu­itions are the re­sult of a fun­da­men­tally re­ac­tive, ret­ro­spec­tive pro­cess like Greene’s sys­tem 1[7].

(If you’re still skep­ti­cal about the role of biolog­i­cal evolu­tion in our fac­ul­ties for moral judg­ment, Tooby and Cos­mides’s so­cial con­tract the­ory is of­ten taken to be strong ev­i­dence for the evolu­tion of some speci­fi­cally moral fac­ul­ties. Tooby and Cos­mides are ad­vo­cates of the mas­sive mod­u­lar­ity the­sis ac­cord­ing to which the hu­man brain is com­posed of a large num­ber of spe­cial pur­pose mod­ules each perform­ing a spe­cific com­pu­ta­tional task. So­cial con­tract the­ory finds that peo­ple are much bet­ter at de­tect­ing vi­o­la­tions of con­di­tional rules when those rules en­code a so­cial con­tract. Tooby and Cos­mides[8] take this to mean that we have evolved a spe­cial-pur­pose mod­ule for an­a­lyz­ing obli­ga­tion in so­cial ex­change which can­not be ap­plied to con­di­tional rules in the gen­eral case.)

(There’s a lot more re­search on the deep roots of co­op­er­a­tion and moral­ity in hu­mans: (Boyd and Rich­er­son 2005), (Boyd et al. 2003), (Hauert et al. 2007), (Singer and oth­ers 2000).)

Univer­sal moral grammar

Lin­guists have ob­served a poverty of the stim­u­lus—chil­dren learn how to speak a nat­u­ral lan­guage with­out any­where near enough lan­guage ex­pe­rience to pre­cisely spec­ify all the de­tails of that lan­guage. The solu­tion that Noam Chom­sky came up with is a uni­ver­sal gram­mar—hu­mans have cer­tain lan­guage rules hard-coded in our brains and lan­guage ex­pe­rience only has to be rich enough to se­lect among these, not con­struct them en­tirely.

Re­searchers have made similar claims about moral­ity (Sri­pada 2008). The ar­gu­ment is that chil­dren learn moral rules with­out enough moral ex­pe­rience to pre­cisely spec­ify all the de­tails of those rules. There­fore, they must have a uni­ver­sal moral gram­mar—in­nate fac­ul­ties that en­code cer­tain pos­si­ble moral rules. There are of course ar­gu­ments against this claim. Briefly: moral rules are much less com­plex than lan­guages, and (some) lan­guage learn­ing must be in­duc­tive while moral learn­ing can in­clude ex­plicit in­struc­tion.

If our hard-coded moral rules pre­clude us from learn­ing the true moral rules (a pos­si­bil­ity on some metaeth­i­cal views), our moral judg­ments would be very un­re­li­able in­deed (Millhouse, Ayars, and Ni­chols 2018).


Culture

I’ll take it as fairly obvious that our moral judgments are culturally influenced[9] (see e.g. (Henrich et al. 2004)). A common story for the role of culture in moral judgments and behavior is that norms of conditional cooperation arose to solve cooperation problems inherent in group living (Curry 2016) (Hechter and Opp 2001). But, just as we discussed with biological evolution, these selective pressures aren’t necessarily aligned with the truth.

One of the al­ter­na­tive ac­counts of moral judg­ments as a product of cul­ture is the so­cial in­tu­ition­ism of Haidt and Bjork­lund (Haidt and Bjork­lund 2008). They ar­gue that, at the in­di­vi­d­ual level, moral rea­son­ing is usu­ally a post-hoc con­fab­u­la­tion in­tended to sup­port au­to­matic, in­tu­itive judg­ments. De­spite this, these con­fab­u­la­tions have causal power when passed be­tween peo­ple and in so­ciety at large. Th­ese so­cially-en­dorsed con­fab­u­la­tions ac­cu­mu­late and even­tu­ally be­come the ba­sis for our pri­vate, in­tu­itive judg­ments. Within this model, it seems quite hard to ar­rive at the con­clu­sion that our moral judg­ments are highly re­li­able.

Mo­ral disagreements

There’s quite a bit of literature on the implications of enduring moral disagreement. I’ll just briefly mention that, on many metaethical views, it’s not trivial to reconcile perfectly reliable moral judgments and enduring moral disagreement. (While I think this is an important line of argument, I’m giving it short shrift here because: 1. the fact of moral disagreement is no revelation, and 2. it’s hard to make it bite—it’s too easy to say, “Well, we disagree because I’m right and they’re wrong.”)


  • Heuris­tics and bi­ases: There’s lots of ev­i­dence that hu­mans are not in­stru­men­tally ra­tio­nal. This prob­a­bly ap­plies at least some­what to moral judg­ments too since pru­den­tial rea­son­ing is com­mon in day-to-day moral judg­ments.

  • Neu­ral: Com­mon mod­els of both choice and judg­ment are fun­da­men­tally stochas­tic which re­flects the stochas­tic­ity of neu­rons.

  • Dual pro­cess: Sys­tem 1 moral in­tu­itions are au­to­matic, ret­ro­spec­tive and un­trust­wor­thy.

  • Genes: Mo­ral fac­ul­ties are at least partly evolved. Adap­ta­tions aren’t nec­es­sar­ily truth-track­ing—es­pe­cially when re­moved from the an­ces­tral en­vi­ron­ment.

  • Cul­ture: Cul­ture in­fluences moral judg­ment and cul­tural forces don’t nec­es­sar­ily in­cen­tivize truth-track­ing.

  • Mo­ral dis­agree­ment: There’s a lot of dis­agree­ment about what’s moral and it’s hard to both ac­cept this and claim that moral judg­ments are perfectly re­li­able.


Responses

Depending on how skeptical of skepticism you’re feeling, all of the above might add up to serious doubts about the reliability of our moral intuitions. How might we respond to these doubts? There are a variety of approaches discussed in the literature. I will group these responses loosely based on how they fit into the structure of the Unreliability of Moral Judgment Problem:

  • The first type of re­sponse sim­ply ques­tions the in­ter­nal val­idity of the em­piri­cal stud­ies call­ing the core of premise 2 into ques­tion.

  • The sec­ond type of re­sponse ques­tions the ex­ter­nal val­idity of the stud­ies thereby as­sert­ing that the “some”s in premise 2 (“some peo­ple, in some cases”) are nar­row enough to de­fuse any real threat in the con­clu­sion.

  • The third type of re­sponse ac­cepts the whole ar­gu­ment and ar­gues that it’s not too wor­ri­some.

  • The fourth type of re­sponse ac­cepts the whole ar­gu­ment and ar­gues that we can take coun­ter­mea­sures.

In­ter­nal validity

If the ex­per­i­men­tal re­sults that pur­port to show that moral judg­ments are un­re­li­able lack in­ter­nal val­idity, the ar­gu­ment as a whole lacks force. On the other hand, the in­val­idity of these stud­ies isn’t af­fir­ma­tive ev­i­dence that moral judg­ments are re­li­able and the in­di­rect ev­i­dence may still be wor­ry­ing.

The val­idity of the stud­ies is dis­cussed in the di­rect ev­i­dence sec­tion and in the ap­pendix so I won’t re­peat it here[10]. I’ll sum­ma­rize my take as: the clean­li­ness/​dis­gust stud­ies have low val­idity, but the or­der stud­ies seem plau­si­ble and I be­lieve that there’s a real effect there, at least on the mar­gin. Most of the other fac­tors don’t have enough high-qual­ity stud­ies to draw even a ten­ta­tive con­clu­sion. Nev­er­the­less, when you add in my pri­ors and the in­di­rect ev­i­dence, I be­lieve there’s rea­son to be con­cerned.


Expertise

The most popular response among philosophers (surprise, surprise) is the expertise defense: the moral judgments of the folk may track morally irrelevant factors, but philosophers have acquired special expertise which immunizes them from these failures[11]. There is an immediate appeal to the argument: what does expertise mean if not increased skill? There is even supporting evidence in the form of trained philosophers’ improved performance on cognitive reflection tests (these tests pose questions with intuitive but incorrect answers; for example, “A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?” (Frederick 2005)).

Alas, that’s where the good news ends and the trou­ble be­gins. As (Wein­berg et al. 2010) de­scribes it, the ex­per­tise defense seems to rely on a folk the­ory of ex­per­tise in which ex­pe­rience in a do­main in­evitably im­proves skill in all ar­eas of that do­main. En­gage­ment with the re­search on ex­pert perfor­mance sig­nifi­cantly com­pli­cates this story.

First, it seems to be the case that not all domains are conducive to the development of expertise. For example, training and experience do not produce expertise in psychiatry and stock brokerage according to (Dawes 1994) and (Shanteau 1992). Clear, immediate and objective feedback appears necessary for the formation of expertise (Shanteau 1992). Unfortunately, it’s hard to construe whatever feedback is available to moral philosophers considering thought experiments and edge cases as clear, immediate and objective (Clarke 2013) (Weinberg 2007).

Se­cond, “one of the most en­dur­ing find­ings in the study of ex­per­tise [is that there is] lit­tle trans­fer from high-level profi­ciency in one do­main to profi­ciency in other do­mains—even when the do­mains seem, in­tu­itively, very similar” (Fel­tovich, Pri­etula, and Eric­s­son 2006). Chess ex­perts have ex­cel­lent re­call for board con­figu­ra­tions, but only when those con­figu­ra­tions could ac­tu­ally arise dur­ing the course of a game (De Groot 2014). Sur­gi­cal ex­per­tise car­ries over very lit­tle from one sur­gi­cal task to an­other (Nor­man et al. 2006). Thus, ev­i­dence of im­proved cog­ni­tive re­flec­tion is not a strong in­di­ca­tor of im­proved moral judg­ment[12]. Nor is ev­i­dence of philo­soph­i­cal ex­cel­lence on any task other than moral judg­ment it­self likely to be par­tic­u­larly com­pel­ling. (And even “moral judg­ment” may be too broad and in­co­her­ent a thing to have uniform skill at.)

Third, it’s not ob­vi­ous that ex­per­tise im­mu­nizes from bi­ases. Stud­ies have claimed that Olympic gym­nas­tics judges and pro­fes­sional au­di­tors are vuln­er­a­ble to or­der effects de­spite be­ing ex­pert in other re­gards (Brown 2009) (Damisch, Muss­weiler, and Pless­ner 2006).

Finally, there is direct empirical evidence that philosophers’ moral judgments continue to track putatively morally irrelevant factors[13]. See (K. P. Tobia, Chapman, and Stich 2013), (K. Tobia, Buckwalter, and Stich 2013) and (Schwitzgebel and Cushman 2012), already described above. ((Schulz, Cokely, and Feltz 2011) find similar results for another type of philosophical judgment.)

So, in sum, while there’s an im­me­di­ate ap­peal to the ex­per­tise defense (surely we can trust in­tu­itions honed by years of philo­soph­i­cal work), it looks quite trou­bled upon deeper ex­am­i­na­tion.

Ecolog­i­cal validity

It’s always a pop­u­lar move to spec­u­late that the lab isn’t like the real world and so lab re­sults don’t ap­ply in the real world. De­spite the saucy tone in the pre­ced­ing sen­tence, I think there are real con­cerns here:

  • Read­ing vi­gnettes is not the same as ac­tu­ally ex­pe­rienc­ing moral dilem­mas first-hand.

  • Stated judg­ments are not the same as ac­tual judg­ments and be­hav­iors. For ex­am­ple, none of the stud­ies men­tioned do­ing any­thing (be­yond stan­dard anonymiza­tion) to com­bat so­cial de­sir­a­bil­ity bias.

How­ever, it’s not clear to me how these is­sues li­cense a be­lief that real moral judg­ments are likely to be re­li­able. One can per­haps hope that we’re more re­li­able when the stakes truly mat­ter, but it would take a more de­tailed the­ory for the ecolog­i­cal val­idity crit­i­cisms to have an im­pact.


Sufficient

One way of limiting the force of the argument against the reliability of moral judgments is simply to point out that many judgments are reliable and immune to manipulation. This is certainly true; order effects are not omnipotent. I’m not going to go out and murder anyone just because you prefaced the proposal with the right vignette.

Another re­sponse to the ex­per­i­men­tal re­sults is to claim that even though peo­ple’s rat­ings as mea­sured with a Lik­ert scale changed, the num­ber of peo­ple ac­tu­ally switch­ing from moral ap­proval to dis­ap­proval or vice versa (i.e. mov­ing from one half of the Lik­ert scale to the other) is un­re­ported and pos­si­bly small (De­ma­ree-Cot­ton 2016).

The first re­sponse to this re­sponse is that the truth of this claim even as re­ported in the pa­per mak­ing the ar­gu­ment de­pends on your defi­ni­tion of “small”. I think a 20% prob­a­bil­ity of switch­ing from moral ap­proval to dis­ap­proval based on the or­der­ing of vi­gnettes is not small.

The sec­ond set of re­sponses to this at­tempted de­fusal is as fol­lows. Even if ex­per­i­ments only found shifts in de­gree of ap­proval or dis­ap­proval, that would be wor­ry­ing be­cause:

  • Real moral de­ci­sions of­ten in­volve trade-offs be­tween two wrongs or two rights and the de­gree of right­ness or wrong­ness of each com­po­nent in such dilem­mas may de­ter­mine the fi­nal judg­ment ren­dered. (An­dow 2016)

  • Much philo­soph­i­cal work in­volves thought ex­per­i­ments on the mar­gin where ap­proval and dis­ap­proval are closely bal­anced (Wright 2016). Even small effects from im­proper sources at the bor­der can lead to the wrong judg­ment. Wrong judg­ments on these marginal thought ex­per­i­ments can cas­cade to more mun­dane and im­por­tant judg­ments if we take them at face value and ap­ply some­thing like re­flec­tive equil­ibrium.

Ecolog­i­cally rational

Gerd Gigeren­zer likes to make the ar­gu­ment (con­tra Kah­ne­man and Tver­sky; some more ex­cel­lent aca­demic slap fights here) that heuris­tics are ecolog­i­cally ra­tio­nal (Todd and Gigeren­zer 2012). By this, he means that they are op­ti­mal in a given con­text. He also talks about less-is-more effects in which sim­ple heuris­tics ac­tu­ally out­perform more com­pli­cated and ap­par­ently ideal strate­gies[14].

One could per­haps make an analo­gous ar­gu­ment for moral judg­ments: though they don’t always con­form to the dic­tates of ideal the­ory, they are near op­ti­mal given the en­vi­ron­ment in which they op­er­ate. Though we can’t re­liti­gate the whole ar­gu­ment here, I’ll point out that there’s lots of push­back against Gigeren­zer’s view. Another re­sponse to the re­sponse would be to high­light ways in which moral judg­ment is unique and the ecolog­i­cal val­idity re­sponse doesn’t ap­ply to moral heuris­tics.

Se­cond-or­der reliability

Even if we were to ac­cept that our moral judg­ments are un­re­li­able, that might not be fatal. If we could judge when our moral judg­ments are re­li­able—if we had re­li­able sec­ond-or­der moral judg­ments—we could rely upon our moral judg­ments only in do­mains where we knew them to be valid.

In­deed, there’s ev­i­dence that, in gen­eral, we are more con­fi­dent in our judg­ments when they turn out to be cor­rect (Gigeren­zer, Hoffrage, and Klein­bölt­ing 1991). But sub­se­quent stud­ies have sug­gested our con­fi­dence ac­tu­ally tracks con­sen­su­al­ity rather than cor­rect­ness (Ko­riat 2008). Peo­ple were highly con­fi­dent when asked about pop­u­lar myths (for ex­am­ple, whether Syd­ney is the cap­i­tal of Aus­tralia). This pos­si­bil­ity of con­sen­sual, con­fi­dent wrong­ness is pretty wor­ry­ing (Willi­ams 2015).

Jennifer Wright has two papers examining this possibility empirically. In (Wright 2010), she found that more confident epistemological and ethical judgments were less vulnerable to order effects. Thus, lack of confidence in a philosophical intuition may be a reliable indicator that the intuition is unreliable. (Wright 2013) purports to address related questions, but I found it unconvincing for a variety of reasons.

Mo­ral engineering

The fi­nal re­sponse to ev­i­dence of un­re­li­a­bil­ity is to ar­gue that we can over­come our defi­cien­cies by ap­pli­ca­tion of care­ful effort. Eng­ineer­ing re­li­able sys­tems and pro­cesses from un­re­li­able com­po­nents is a re­cur­ring theme in hu­man progress. The phys­i­cal sci­ences work with im­pre­cise in­stru­ments and over­come that limi­ta­tion through care­ful de­sign of pro­ce­dures and statis­ti­cal com­pe­tence. In dis­tributed com­put­ing, we’re able to build re­li­able sys­tems out of un­re­li­able com­po­nents.

As a mo­ti­vat­ing ex­am­ple, imag­ine a set of lit­mus strips which turn red in acid and blue in base (Wein­berg 2016). Now sup­pose that each strip has only a 51% chance of perform­ing cor­rectly—red in an acid and blue in base. Even in the face of this rad­i­cal un­re­li­a­bil­ity, we can drive our con­fi­dence to an ar­bi­trar­ily high level by test­ing the ma­te­rial with more and more pH strips (as long as each test is in­de­pen­dent).
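The arithmetic behind the litmus example can be made concrete. The sketch below (assuming independent, identically reliable strips and simple majority voting, with an odd number of strips to avoid ties) computes how quickly confidence grows with the number of tests:

```python
from math import comb

def majority_correct_prob(n_strips, p_correct=0.51):
    """Probability that a majority of n_strips independent strips (each
    individually correct with probability p_correct) gives the right answer."""
    return sum(
        comb(n_strips, k) * p_correct**k * (1 - p_correct) ** (n_strips - k)
        for k in range(n_strips // 2 + 1, n_strips + 1)
    )

# One strip is barely better than a coin flip, but aggregation helps:
# majority_correct_prob(1) is 0.51, while majority_correct_prob(1001)
# exceeds 0.7, and confidence keeps climbing as strips are added.
```

With per-strip accuracy as low as 0.51 the growth is slow, which is part of the point: independence and sheer quantity do the work, not the quality of any single test.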

This anal­ogy pro­vides a com­pel­ling mo­ti­va­tion for co­her­ence norms. By de­mand­ing that our moral judg­ments across cases co­here, we are im­plic­itly ag­gre­gat­ing noisy data points into a larger sys­tem that we hope is more re­li­able. It may also mo­ti­vate an in­creased defer­ence to an “out­side view” which ag­gre­gates the moral judg­ments of many.

(Hue­mer 2008) pre­sents an­other con­struc­tive re­sponse to the prob­lem of un­re­li­able judg­ments. It pro­poses that con­crete and mid-level in­tu­itions are es­pe­cially un­re­li­able be­cause they are the most likely to be in­fluenced by cul­ture, biolog­i­cal evolu­tion and emo­tions. On the other hand, fully ab­stract in­tu­itions are prone to over­gen­er­al­iza­tions in which the full im­pli­ca­tions of a claim are not ad­e­quately un­der­stood. If ab­stract judg­ments and con­crete judg­ments are to be dis­trusted, what’s left? Hue­mer pro­poses that for­mal rules are un­usu­ally trust­wor­thy. By for­mal rules, he is refer­ring to rules which im­pose con­straints on other rules but do not them­selves pro­duce moral judg­ments. Ex­am­ples in­clude tran­si­tivity (If A is bet­ter than B and B is bet­ter than C, A must be bet­ter than C.) and com­po­si­tion­al­ity (If do­ing A is wrong and do­ing B is wrong, do­ing both A and B must be wrong.).
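To illustrate how a formal rule can constrain other judgments without itself producing any, here is a small sketch of transitivity used as a consistency check. The pair encoding of “better than” judgments is my own hypothetical device, not anything from Huemer:

```python
from itertools import permutations

def transitivity_violations(better):
    """Find triples (a, b, c) where a > b and b > c but not a > c.

    `better` is a set of ordered pairs (x, y) meaning "x is judged better
    than y". Purely illustrative of a formal rule as a consistency check,
    not a model of any real elicitation procedure.
    """
    items = {x for pair in better for x in pair}
    return [
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if (a, b) in better and (b, c) in better and (a, c) not in better
    ]

judgments = {("A", "B"), ("B", "C"), ("C", "A")}  # an intransitive cycle
violations = transitivity_violations(judgments)
```

Note that the checker renders no verdict on whether A, B, or C is actually better; it only flags sets of judgments that cannot all be trusted at once, which is exactly the role Huemer assigns to formal rules.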

Other in­ter­est­ing work in this area in­cludes (Wein­berg et al. 2012), (J. M. Wein­berg 2017b), and (Talbot 2014).

(Wein­berg 2016) sum­ma­rizes this per­spec­tive well:

“Philo­soph­i­cal the­ory-se­lec­tion and em­piri­cal model-se­lec­tion are highly similar prob­lems: in both, we have a data stream in which we ex­pect to find both sig­nal and noise, and we are try­ing to figure out how best to ex­ploit the former with­out in­ad­ver­tently build­ing the lat­ter into our the­o­ries or mod­els them­selves.”


  • In­ter­nal val­idity: The ex­per­i­men­tal ev­i­dence isn’t great, but it still seems hard to be­lieve that our moral judg­ments are perfectly re­li­able.

  • Ex­per­tise: Naive ap­peals to ex­per­tise are un­likely to save us given the liter­a­ture on ex­pert perfor­mance.

  • Ecological validity: The experiments that have been conducted are indeed different from in vivo moral judgments, but it’s not currently obvious that the move from synthetic to natural moral dilemmas will improve judgment.

  • Suffi­cient: Even if the effects of pu­ta­tively ir­rele­vant fac­tors are rel­a­tively small, that still seems con­cern­ing given that moral de­ci­sions of­ten in­volve com­plex trade-offs.

  • Ecolog­i­cally ra­tio­nal: I’m not par­tic­u­larly con­vinced by Gigeren­zer’s view of heuris­tics as ecolog­i­cally ra­tio­nal and I’m even more in­clined to doubt that this is solid ground for moral judg­ments.

  • Se­cond-or­der re­li­a­bil­ity: Our sense of the re­li­a­bil­ity of our moral judg­ments prob­a­bly isn’t pure noise. But it’s also prob­a­bly not perfect.

  • Mo­ral en­g­ineer­ing: Ac­knowl­edg­ing the un­re­li­a­bil­ity of our moral judg­ments and work­ing to ame­lio­rate it through care­ful un­der­stand­ing and de­signed coun­ter­mea­sures seems promis­ing.


Our moral judg­ments are prob­a­bly un­re­li­able. Even if this fact doesn’t jus­tify full skep­ti­cism, it jus­tifies se­ri­ous at­ten­tion. A ful­ler un­der­stand­ing of the limits of our moral fac­ul­ties would help us de­ter­mine how to re­spond.

Ap­pendix: Qual­i­ta­tive dis­cus­sion of methodology


Order

  • (Haidt and Baron 1996): No immediate complaints here. I will note that order effects weren’t the original purpose of the study and just happened to show up during data analysis.

  • (Petrinovich and O’Neill 1996): No im­me­di­ate com­plaints.

  • (Lan­teri, Che­lini, and Rizzello 2008): No im­me­di­ate com­plaints.

  • (Lom­brozo 2009): No im­me­di­ate com­plaints.

  • (Zamzow and Nichols 2009): “Interestingly, while we found judgments of the bystander case seem to be impacted by order of presentation, our results trend in the opposite direction of Petrinovich and O’Neill. They found that people were more likely to not pull the switch when the bystander case was presented last. This asymmetry might reflect the difference in questions asked—‘what is the right thing to do?’ versus ‘what would you do?’”

    Or it might re­flect noise.

  • (Wright 2010): No im­me­di­ate com­plaints.

  • (Sch­witzgebel and Cush­man 2012): How many times can you pull the same trick on peo­ple? You kind of have to hope the an­swer is “A lot” for this study since it asked each sub­ject 17 ques­tions in se­quence test­ing the or­der sen­si­tivity of sev­eral differ­ent sce­nar­ios. The au­thors do ac­knowl­edge the pos­si­bil­ity for learn­ing effects.

  • (Liao et al. 2012): No im­me­di­ate com­plaints.

  • (Wieg­mann, Okan, and Nagel 2012): No im­me­di­ate com­plaints.


Wording

  • (Petrinovich, O’Neill, and Jorgensen 1993): No immediate complaints.

Dis­gust and cleanliness

  • (Wheatley and Haidt 2005): Hyp­no­sis seems pretty weird. I’m not sure how much ex­ter­nal val­idity hyp­not­i­cally-in­duced dis­gust has. Espe­cially af­ter ac­count­ing for the fact that the re­sults only in­clude those who were suc­cess­fully hyp­no­tized—it seems pos­si­ble that those es­pe­cially sus­cep­ti­ble to hyp­no­sis are differ­ent from oth­ers in some way that is rele­vant to moral judg­ments.

  • (Sch­nall et al. 2008): In ex­per­i­ment 1, the mean moral judg­ment in the mild-stink con­di­tion was not sig­nifi­cantly differ­ent from the mean moral judg­ment in the strong-stink con­di­tion de­spite a sig­nifi­cant differ­ence in mean dis­gust. This doesn’t seem ob­vi­ously con­gru­ent with the un­der­ly­ing the­ory and it seems slightly strange that this pos­si­ble anomaly passed com­pletely un­men­tioned.

    More con­cern­ing to me is that, in ex­per­i­ment 2, the dis­gust ma­nipu­la­tion did not work as judged by self-re­ported dis­gust. How­ever, the ex­per­i­menters be­lieve the “dis­gust ma­nipu­la­tion had high face val­idity” and went on to find that the re­sults sup­ported their hy­poth­e­sis when look­ing at the di­choto­mous vari­able of con­trol con­di­tion ver­sus dis­gust con­di­tion. When a ma­nipu­la­tion fails to change a pu­ta­tive cause (as mea­sured by an in­stru­ment), it seems quite strange for the down­stream effect to change any­way. (Again, it strikes me as un­for­tu­nate that the au­thors don’t de­vote any real at­ten­tion to this.) It seems to sig­nifi­cantly raise the like­li­hood that the re­sults are re­flect­ing noise rather than in­sight.

    The non-sig­nifi­cant re­sults re­ported here were not, ap­par­ently, the au­thors’ main in­ter­est. Their pri­mary hy­poth­e­sis (which the ex­per­i­ments sup­ported) was that dis­gust would in­crease sever­ity of moral judg­ment for sub­jects high in pri­vate body con­scious­ness (Miller, Mur­phy, and Buss 1981).

  • (Sch­nall, Ben­ton, and Har­vey 2008): The clean­li­ness ma­nipu­la­tion in ex­per­i­ment 1 seems very weak. Sub­jects com­pleted a scram­bled-sen­tences task with 40 sets of four words. Con­trol con­di­tion par­ti­ci­pants re­ceived neu­tral words while clean­li­ness con­di­tion par­ti­ci­pants had clean­li­ness and pu­rity re­lated words in half their sets.

    In­deed, no group differ­ences be­tween the con­di­tions were found in any mood cat­e­gory in­clud­ing dis­gust which seems plau­si­bly an­tag­o­nis­tic to the clean­li­ness primes. It’s not clear to me why this part of the pro­ce­dure was in­cluded if they ex­pected both con­di­tions to pro­duce in­dis­t­in­guish­able scores. It sug­gests to me that re­sults for the ma­nipu­la­tion weren’t as hoped and the pa­per just doesn’t draw at­ten­tion to it? (In their defense, the pa­per is quite short.)

    The ex­per­i­menters went on to find that clean­li­ness re­duced the sever­ity of moral judg­ment which, as dis­cussed el­se­where, seems a bit wor­ry­ing in light of the po­ten­tially failed ma­nipu­la­tion.

    In ex­per­i­ment 2, “Be­cause of the dan­ger of mak­ing the cleans­ing ma­nipu­la­tion salient, we did not ob­tain ad­di­tional dis­gust rat­ings af­ter the hand-wash­ing pro­ce­dure.” which seems prob­le­matic given pos­si­ble difficul­ties with ma­nipu­la­tions by this lead au­thor el­se­where in this pa­per and by this lead au­thor in an­other pa­per from the same year.

    Al­to­gether, this pa­per strikes me as very repli­ca­tion crisis-y. (I think es­pe­cially be­cause it echoes the in­fa­mous study about prim­ing young peo­ple to walk more slowly with words about ag­ing (Doyen et al. 2012).) (I looked it up af­ter writ­ing all this out and it turns out oth­ers agree.)

  • (Hor­berg et al. 2009): No im­me­di­ate com­plaints.

  • (Liljenquist, Zhong, and Galinsky 2010): I’m tempted to say that experiment 1 is one of those results we should reject just because the effect size is implausibly big. “The only difference between the two rooms was a spray of citrus-scented Windex in the clean-scented room” and yet they get a Cohen’s d of 1.03 in a variant on the dictator game. This would mean ~85% of people in the control condition would share less than the non-control average. If an effect of this size were real, it seems like we’d have noticed and be dousing ourselves with Windex before tough negotiations.

  • (Zhong, Stre­jcek, and Si­vanathan 2010): This study found ev­i­dence for the claim that par­ti­ci­pants who cleansed their hands judged morally-in­flected so­cial is­sues more harshly. Wait, what? Isn’t that the op­po­site of what the other stud­ies found? Not to worry, there’s a sim­ple rec­on­cili­a­tion. The clean­li­ness and dis­gust primes in those other stud­ies were some­how about the tar­get of judg­ment whereas the clean­li­ness primes in this study are about cleans­ing the self.

    It also finds that a dirt­i­ness prime is no differ­ent than the con­trol con­di­tion but, since it’s pri­mar­ily in­ter­ested in the clean­li­ness prime, it makes no com­ment on this re­sult.

  • (Eskine, Kac­inik, and Prinz 2011): No im­me­di­ate com­plaints.

  • (David and Olatunji 2011): It’s a bit weird that their eval­u­a­tive con­di­tion­ing pro­ce­dure pro­duced pos­i­tive emo­tions for the con­trol word which had been paired with neu­tral images. They do briefly ad­dress the con­cern that the eval­u­a­tive con­di­tion­ing ma­nipu­la­tion is weak.

    Kudos to the authors for not trying too hard to explain away the null result: “This finding questions the generality of the role between disgust and morality.”

  • (K. P. To­bia, Chap­man, and Stich 2013): Without any back­ing the­ory pre­dict­ing or ex­plain­ing it, the find­ing that “the clean­li­ness ma­nipu­la­tion caused stu­dents to give higher rat­ings in both the ac­tor and ob­server con­di­tions, and caused philoso­phers to give higher rat­ings in the ac­tor con­di­tion, but lower rat­ings in the ob­server con­di­tion.” strikes me as likely to be noise rather than in­sight.

  • (Huang 2014): The hy­poth­e­sis un­der test in this pa­per was that re­sponse effort mod­er­ates the effect of clean­li­ness primes. If we ig­nore that and just look at whether clean­li­ness primes had an effect, there was a null re­sult in both stud­ies.

    This kind of militant reluctance to falsify hypotheses is part of what makes me very skeptical of the disgust/​cleanliness literature:

    “De­spite be­ing a [failed] di­rect repli­ca­tion of SBH, JCD differed from SBH on at least two sub­tle as­pects that might have re­sulted in a slightly higher level of re­sponse effort. First, whereas un­der­grad­u­ate stu­dents from Univer­sity of Ply­mouth in England “par­ti­ci­pated as part of a course re­quire­ment” in SBH (p. 1219), un­der­grad­u­ates from Michi­gan State Univer­sity in the United States par­ti­ci­pated in ex­change of “par­tial fulfill­ment of course re­quire­ments or ex­tra credit” in JCD (p. 210). It is plau­si­ble that stu­dents who par­ti­ci­pated for ex­tra credit in JCD may have been more mo­ti­vated and at­ten­tive than those who were re­quired to par­ti­ci­pate, lead­ing to a higher level of re­sponse effort in JCD than in SBH. Se­cond, JCD in­cluded qual­ity as­surance items near the end of their study to ex­clude par­ti­ci­pants “ad­mit­ting to fabri­cat­ing their an­swers” (p. 210); such fea­tures were not re­ported in SBH. It is pos­si­ble that re­searchers’ rep­u­ta­tion for screen­ing for IER re­sulted in a more effort­ful sam­ple in JCD.”

  • (Johnson, Cheung, and Donnellan 2014b): Wow, much power: 0.99. This is a failed replication of (Schnall, Benton, and Harvey 2008).

  • (Johnson et al. 2016): The power level for study 1, it’s over 99.99%! This is a failed replication of (Schnall, Haidt, Clore, and Jordan 2008).
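Power figures like these come out of a standard calculation. Here’s a minimal normal-approximation version for a two-sample design (the effect size and sample size in the example are made-up inputs for illustration, not the studies’ actual numbers):

```python
from scipy.stats import norm

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test to detect a
    true standardized effect d with n_per_group subjects per arm
    (normal approximation to the noncentral t)."""
    z_crit = norm.ppf(1 - alpha / 2)
    delta = d * (n_per_group / 2) ** 0.5  # noncentrality parameter
    return norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)

# A hypothetical illustration: detecting d = 0.6 with 100 per group.
print(round(two_sample_power(0.6, 100), 3))  # ~0.989
```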

  • (Ugazio, Lamm, and Singer 2012): I ex­cluded this study from quan­ti­ta­tive re­view be­cause they did this: “As the re­sults ob­tained in Ex­per­i­ment 1a did not repli­cate pre­vi­ous find­ings sug­gest­ing a prim­ing effect of dis­gust in­duc­tion on moral judg­ments [...] we performed an­other ex­per­i­ment [...] the analy­ses that fol­low on the data ob­tained in Ex­per­i­ments 1a and 1b are col­lapsed”. Pretty egre­gious.


  • (Lom­brozo 2009): No im­me­di­ate com­plaints.

  • (Seyed­sayam­dost 2015): This is a failed repli­ca­tion of (Buck­walter and Stich 2014). I didn’t in­clude (Buck­walter and Stich 2014) in the quan­ti­ta­tive re­view be­cause it wasn’t an in­de­pen­dent ex­per­i­ment but se­lec­tive re­port­ing by de­sign: “Fiery Cush­man was one of the re­searchers who agreed to look for gen­der effects in data he had col­lected [...]. One study in which he found them [...]” (Buck­walter and Stich 2014).

  • (Adle­berg, Thomp­son, and Nah­mias 2015): No ma­jor com­plaints.

    They did do a post hoc power analysis, which isn’t quite a real thing: “observed” power is just a transformation of the p-value, so it can’t tell you anything the p-value didn’t.
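To see why post hoc power adds nothing, here’s a sketch for a two-sided z-test: plugging the observed effect back into the power calculation yields a quantity that depends only on the p-value.

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """'Observed' (post hoc) power for a two-sided z-test, computed by
    treating the observed effect as the true effect.  Note it is a
    monotone function of the p-value alone."""
    z_obs = norm.ppf(1 - p_value / 2)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

# A result that just barely reached p = .05 always has observed power
# of about 50%, whatever the study:
print(round(observed_power(0.05), 2))  # 0.5
```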

Cul­ture and so­cioe­co­nomic status

  • (Haidt, Kol­ler, and Dias 1993): They used the Bon­fer­roni pro­ce­dure to cor­rect for mul­ti­ple com­par­i­sons which is good. On the other hand:

    “Sub­jects were asked to de­scribe the ac­tions as perfectly OK, a lit­tle wrong, or very wrong. Be­cause we could not be cer­tain that this scale was an in­ter­val scale in which the mid­dle point was per­ceived to be equidis­tant from the end­points, we di­chotomized the re­sponses, sep­a­rat­ing perfectly OK from the other two re­sponses.”

    Why did they create a non-dichotomous instrument only to dichotomize it after data collection? I’m worried that the dichotomization was decided post hoc, upon seeing the non-dichotomized data and analysis.
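To make the dichotomization worry concrete, here’s a toy example (the response distributions are invented) in which two groups differ plainly on the full three-point scale but become indistinguishable once responses are collapsed to OK vs. not-OK:

```python
# Hypothetical response distributions over (perfectly OK, a little
# wrong, very wrong) for two groups that differ in severity:
group_a = {"ok": 0.30, "little": 0.50, "very": 0.20}
group_b = {"ok": 0.30, "little": 0.20, "very": 0.50}

# On the full 3-point scale (scored 0/1/2) the group means differ...
mean_a = 0 * group_a["ok"] + 1 * group_a["little"] + 2 * group_a["very"]
mean_b = 0 * group_b["ok"] + 1 * group_b["little"] + 2 * group_b["very"]
print(mean_a, mean_b)  # 0.9 vs 1.2

# ...but after dichotomizing into OK vs. not-OK, the difference
# vanishes entirely:
print(1 - group_a["ok"], 1 - group_b["ok"])  # 0.7 vs 0.7
```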


  • (Feltz and Cokely 2008): No im­me­di­ate com­plaints.


  • (Nadelhoffer and Feltz 2008): No im­me­di­ate com­plaints.

  • (K. To­bia, Buck­walter, and Stich 2013): No im­me­di­ate com­plaints.


Adle­berg, Toni, Mor­gan Thomp­son, and Eddy Nah­mias. 2015. “Do Men and Women Have Differ­ent Philo­soph­i­cal In­tu­itions? Fur­ther Data.” Philo­soph­i­cal Psy­chol­ogy 28 (5). Tay­lor & Fran­cis: 615--41.

Alexan­der, Joshua. 2016. “Philo­soph­i­cal Ex­per­tise.” A Com­pan­ion to Ex­per­i­men­tal Philos­o­phy. Wiley On­line Library, 557--67.

Amodei, Dario, Chris Olah, Ja­cob Stein­hardt, Paul Chris­ti­ano, John Schul­man, and Dan Mané. 2016. “Con­crete Prob­lems in Ai Safety.” arXiv Preprint arXiv:1606.06565.

An­dow, James. 2016. “Reli­able but Not Home Free? What Fram­ing Effects Mean for Mo­ral In­tu­itions.” Philo­soph­i­cal Psy­chol­ogy 29 (6). Tay­lor & Fran­cis: 904--11.

Arbesfeld, Ju­lia, Tri­cia Col­lins, Demetrius Bald­win, and Kim­berly Daub­man. 2014. “Clean Thoughts Lead to Less Se­vere Mo­ral Judg­ment.” http://​​www.Psy­​​repli­ca­tion.php?at­tempt=MTc3.

Boyd, Robert, Her­bert Gin­tis, Sa­muel Bowles, and Peter J Rich­er­son. 2003. “The Evolu­tion of Altru­is­tic Pu­n­ish­ment.” Pro­ceed­ings of the Na­tional Academy of Sciences 100 (6). Na­tional Acad Sciences: 3531--5.

Boyd, Robert, and Peter J Rich­er­son. 2005. The Ori­gin and Evolu­tion of Cul­tures. Oxford Univer­sity Press.

Brown, Charles A. 2009. “Order Effects and the Au­dit Ma­te­ri­al­ity Re­vi­sion Choice.” Jour­nal of Ap­plied Busi­ness Re­search (JABR) 25 (1).

Buck­walter, Wesley, and Stephen Stich. 2014. “Gen­der and Philo­soph­i­cal In­tu­ition.” Ex­per­i­men­tal Philos­o­phy 2. Oxford Univer­sity Press Oxford: 307--46.

Clarke, Steve. 2013. “In­tu­itions as Ev­i­dence, Philo­soph­i­cal Ex­per­tise and the Devel­op­men­tal Challenge.” Philo­soph­i­cal Papers 42 (2). Tay­lor & Fran­cis: 175--207.

Crock­ett, Molly J. 2013. “Models of Mo­ral­ity.” Trends in Cog­ni­tive Sciences 17 (8). El­se­vier: 363--66.

Crock­ett, Molly J, Zeb Kurth-Nel­son, Jenifer Z Siegel, Peter Dayan, and Ray­mond J Dolan. 2014. “Harm to Others Outweighs Harm to Self in Mo­ral De­ci­sion Mak­ing.” Pro­ceed­ings of the Na­tional Academy of Sciences 111 (48). Na­tional Acad Sciences: 17320--5.

Curry, Oliver Scott. 2016. “Mo­ral­ity as Co­op­er­a­tion: A Prob­lem-Cen­tred Ap­proach.” In The Evolu­tion of Mo­ral­ity, 27--51. Springer.

Damisch, Lysann, Thomas Muss­weiler, and Hen­ning Pless­ner. 2006. “Olympic Medals as Fruits of Com­par­i­son? As­simila­tion and Con­trast in Se­quen­tial Perfor­mance Judg­ments.” Jour­nal of Ex­per­i­men­tal Psy­chol­ogy: Ap­plied 12 (3). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 166.

David, Bieke, and Bunmi O Olatunji. 2011. “The Effect of Dis­gust Con­di­tion­ing and Dis­gust Sen­si­tivity on Ap­praisals of Mo­ral Trans­gres­sions.” Per­son­al­ity and In­di­vi­d­ual Differ­ences 50 (7). El­se­vier: 1142--6.

Dawes, RM. 1994. “Psy­chother­apy: The Myth of Ex­per­tise.” House of Cards: Psy­chol­ogy and Psy­chother­apy Built on Myth, 38--74.

De Groot, Adri­aan D. 2014. Thought and Choice in Chess. Vol. 4. Walter de Gruyter GmbH & Co KG.

De­ma­ree-Cot­ton, Joanna. 2016. “Do Fram­ing Effects Make Mo­ral In­tu­itions Un­re­li­able?” Philo­soph­i­cal Psy­chol­ogy 29 (1). Tay­lor & Fran­cis: 1--22.

Doyen, Stéphane, Olivier Klein, Cora-Lise Pi­chon, and Axel Cleere­mans. 2012. “Be­hav­ioral Prim­ing: It’s All in the Mind, but Whose Mind?” PloS One 7 (1). Public Library of Science: e29081.

Duben­sky, Ca­ton, Leanna Dun­smore, and Kim­berly Daub­man. 2013. “Clean­li­ness Primes Less Se­vere Mo­ral Judg­ments.” http://​​www.Psy­​​repli­ca­tion.php?at­tempt=MTQ5.

Eskine, Ken­dall J, Natalie A Kac­inik, and Jesse J Prinz. 2011. “A Bad Taste in the Mouth: Gus­ta­tory Dis­gust In­fluences Mo­ral Judg­ment.” Psy­cholog­i­cal Science 22 (3). Sage Publi­ca­tions Sage CA: Los An­ge­les, CA: 295--99.

Fel­tovich, Paul J, Michael J Pri­etula, and K An­ders Eric­s­son. 2006. “Stud­ies of Ex­per­tise from Psy­cholog­i­cal Per­spec­tives.” The Cam­bridge Hand­book of Ex­per­tise and Ex­pert Perfor­mance, 41--67.

Feltz, Adam, and Ed­ward T Cokely. 2008. “The Frag­mented Folk: More Ev­i­dence of Stable In­di­vi­d­ual Differ­ences in Mo­ral Judg­ments and Folk In­tu­itions.” In Pro­ceed­ings of the 30th An­nual Con­fer­ence of the Cog­ni­tive Science So­ciety, 1771--6. Cog­ni­tive Science So­ciety Austin, TX.

Fred­er­ick, Shane. 2005. “Cog­ni­tive Reflec­tion and De­ci­sion Mak­ing.” Jour­nal of Eco­nomic Per­spec­tives 19 (4): 25--42.

Gigeren­zer, Gerd. 2008. “Mo­ral In­tu­ition= Fast and Fru­gal Heuris­tics?” In Mo­ral Psy­chol­ogy, 1--26. MIT Press.

Gigeren­zer, Gerd, Ulrich Hoffrage, and Heinz Klein­bölt­ing. 1991. “Prob­a­bil­is­tic Men­tal Models: A Brunswikian The­ory of Con­fi­dence.” Psy­cholog­i­cal Re­view 98 (4). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 506.

Glim­cher, Paul W, Michael C Dor­ris, and Han­nah M Bayer. 2005. “Phys­iolog­i­cal Utility The­ory and the Neu­roe­co­nomics of Choice.” Games and Eco­nomic Be­hav­ior 52 (2). El­se­vier: 213--56.

Greene, Joshua D. 2007. “Why Are Vmpfc Pa­tients More Utili­tar­ian? A Dual-Pro­cess The­ory of Mo­ral Judg­ment Ex­plains.” Trends in Cog­ni­tive Sciences 11 (8). El­se­vier: 322--23.

Haidt, Jonathan, and Jonathan Baron. 1996. “So­cial Roles and the Mo­ral Judge­ment of Acts and Omis­sions.” Euro­pean Jour­nal of So­cial Psy­chol­ogy 26 (2). Wiley On­line Library: 201--18.

Haidt, Jonathan, and Fredrik Bjork­lund. 2008. “So­cial In­tu­ition­ists An­swer Six Ques­tions About Mo­ral­ity.” Oxford Univer­sity Press, Forth­com­ing.

Haidt, Jonathan, Silvia He­lena Kol­ler, and Maria G Dias. 1993. “Affect, Cul­ture, and Mo­ral­ity, or Is It Wrong to Eat Your Dog?” Jour­nal of Per­son­al­ity and So­cial Psy­chol­ogy 65 (4). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 613.

Hauert, Christoph, Arne Traulsen, Han­nelore Brandt, Martin A Nowak, and Karl Sig­mund. 2007. “Via Free­dom to Co­er­cion: The Emer­gence of Costly Pu­n­ish­ment.” Science 316 (5833). Amer­i­can As­so­ci­a­tion for the Ad­vance­ment of Science: 1905--7.

Hechter, Michael, and Karl-Dieter Opp. 2001. So­cial Norms. Rus­sell Sage Foun­da­tion.

Hen­rich, Joseph, Robert Boyd, Sa­muel Bowles, Colin Camerer, Ernst Fehr, Her­bert Gin­tis, and Richard McElreath. 2001. “In Search of Homo Eco­nomi­cus: Be­hav­ioral Ex­per­i­ments in 15 Small-Scale So­cieties.” Amer­i­can Eco­nomic Re­view 91 (2): 73--78.

Hen­rich, Joseph Pa­trick, Robert Boyd, Sa­muel Bowles, Ernst Fehr, Colin Camerer, Her­bert Gin­tis, and oth­ers. 2004. Foun­da­tions of Hu­man So­cial­ity: Eco­nomic Ex­per­i­ments and Ethno­graphic Ev­i­dence from Fif­teen Small-Scale So­cieties. Oxford Univer­sity Press on De­mand.

Hor­berg, Eliz­a­beth J, Christo­pher Oveis, Dacher Kelt­ner, and Adam B Co­hen. 2009. “Dis­gust and the Mo­r­al­iza­tion of Pu­rity.” Jour­nal of Per­son­al­ity and So­cial Psy­chol­ogy 97 (6). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 963.

Huang, Ja­son L. 2014. “Does Clean­li­ness In­fluence Mo­ral Judg­ments? Re­sponse Effort Moder­ates the Effect of Clean­li­ness Prim­ing on Mo­ral Judg­ments.” Fron­tiers in Psy­chol­ogy 5. Fron­tiers: 1276.

Hue­mer, Michael. 2008. “Re­vi­sion­ary In­tu­ition­ism.” So­cial Philos­o­phy and Policy 25 (1). Cam­bridge Univer­sity Press: 368--92.

Hutch­er­son, Cen­dri A, Ben­jamin Bushong, and An­to­nio Ran­gel. 2015. “A Neu­ro­com­pu­ta­tional Model of Altru­is­tic Choice and Its Im­pli­ca­tions.” Neu­ron 87 (2). El­se­vier: 451--62.

John­son, David J, Felix Che­ung, and Brent Don­nel­lan. 2014a. “Clean­li­ness Primes Do Not In­fluence Mo­ral Judg­ment.” http://​​www.Psy­​​repli­ca­tion.php?at­tempt=MTcy.

John­son, David J, Felix Che­ung, and M Brent Don­nel­lan. 2014b. “Does Clean­li­ness In­fluence Mo­ral Judg­ments?” So­cial Psy­chol­ogy. Ho­grefe Pub­lish­ing.

John­son, David J, Jes­sica Wort­man, Felix Che­ung, Me­gan Hein, Richard E Lu­cas, M Brent Don­nel­lan, Charles R Eber­sole, and Rachel K Narr. 2016. “The Effects of Dis­gust on Mo­ral Judg­ments: Test­ing Moder­a­tors.” So­cial Psy­cholog­i­cal and Per­son­al­ity Science 7 (7). Sage Publi­ca­tions Sage CA: Los An­ge­les, CA: 640--47.

Knobe, Joshua. 2003. “In­ten­tional Ac­tion and Side Effects in Or­di­nary Lan­guage.” Anal­y­sis 63 (3). JSTOR: 190--94.

Ko­riat, Asher. 2008. “Sub­jec­tive Con­fi­dence in One’s An­swers: The Con­sen­su­al­ity Prin­ci­ple.” Jour­nal of Ex­per­i­men­tal Psy­chol­ogy: Learn­ing, Me­mory, and Cog­ni­tion 34 (4). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 945.

Korn­blith, Hilary. 2010. “What Reflec­tive En­dorse­ment Can­not Do.” Philos­o­phy and Phenomenolog­i­cal Re­search 80 (1). Wiley On­line Library: 1--19.

Kras­now, Max M. 2017. “An Evolu­tion­ar­ily In­formed Study of Mo­ral Psy­chol­ogy.” In Mo­ral Psy­chol­ogy, 29--41. Springer.

Kuhn, Kris­tine M. 1997. “Com­mu­ni­cat­ing Uncer­tainty: Fram­ing Effects on Re­sponses to Vague Prob­a­bil­ities.” Or­ga­ni­za­tional Be­hav­ior and Hu­man De­ci­sion Pro­cesses 71 (1). El­se­vier: 55--83.

Küh­berger, An­ton. 1995. “The Fram­ing of De­ci­sions: A New Look at Old Prob­lems.” Or­ga­ni­za­tional Be­hav­ior and Hu­man De­ci­sion Pro­cesses 62 (2). El­se­vier: 230--40.

Lan­teri, Ales­san­dro, Chiara Che­lini, and Sal­va­tore Rizzello. 2008. “An Ex­per­i­men­tal In­ves­ti­ga­tion of Emo­tions and Rea­son­ing in the Trol­ley Prob­lem.” Jour­nal of Busi­ness Ethics 83 (4). Springer: 789--804.

Liao, S Matthew, Alex Wieg­mann, Joshua Alexan­der, and Ger­ard Vong. 2012. “Put­ting the Trol­ley in Order: Ex­per­i­men­tal Philos­o­phy and the Loop Case.” Philo­soph­i­cal Psy­chol­ogy 25 (5). Tay­lor & Fran­cis: 661--71.

Lil­jen­quist, Katie, Chen-Bo Zhong, and Adam D Gal­in­sky. 2010. “The Smell of Virtue: Clean Scents Pro­mote Re­ciproc­ity and Char­ity.” Psy­cholog­i­cal Science 21 (3). Sage Publi­ca­tions Sage CA: Los An­ge­les, CA: 381--83.

Lom­brozo, Ta­nia. 2009. “The Role of Mo­ral Com­mit­ments in Mo­ral Judg­ment.” Cog­ni­tive Science 33 (2). Wiley On­line Library: 273--86.

Miller, Ge­offrey F. 2007. “Sex­ual Selec­tion for Mo­ral Virtues.” The Quar­terly Re­view of Biol­ogy 82 (2). The Univer­sity of Chicago Press: 97--125.

Miller, Lynn C, Richard Mur­phy, and Arnold H Buss. 1981. “Con­scious­ness of Body: Pri­vate and Public.” Jour­nal of Per­son­al­ity and So­cial Psy­chol­ogy 41 (2). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 397.

Millhouse, Tyler, Alis­a­beth Ayars, and Shaun Ni­chols. 2018. “Learn­abil­ity and Mo­ral Na­tivism: Ex­plor­ing Wilde Rules.” In Method­ol­ogy and Mo­ral Philos­o­phy, 73--89. Rout­ledge.

Nadelhoffer, Thomas, and Adam Feltz. 2008. “The Ac­tor—Ob­server Bias and Mo­ral In­tu­itions: Ad­ding Fuel to Sin­nott-Arm­strong’s Fire.” Neu­roethics 1 (2). Springer: 133--44.

Nor­man, Ge­off, Kevin Eva, Lee Brooks, and Stan Ham­stra. 2006. “Ex­per­tise in Medicine and Surgery.” The Cam­bridge Hand­book of Ex­per­tise and Ex­pert Perfor­mance 2006: 339--53.

Parpart, Paula, Matt Jones, and Bradley C Love. 2018. “Heuris­tics as Bayesian In­fer­ence Un­der Ex­treme Pri­ors.” Cog­ni­tive Psy­chol­ogy 102. El­se­vier: 127--44.

Petrinovich, Lewis, and Pa­tri­cia O’Neill. 1996. “In­fluence of Word­ing and Fram­ing Effects on Mo­ral In­tu­itions.” Ethol­ogy and So­cio­biol­ogy 17 (3). El­se­vier: 145--71.

Petrinovich, Lewis, Pa­tri­cia O’Neill, and Matthew Jor­gensen. 1993. “An Em­piri­cal Study of Mo­ral In­tu­itions: Toward an Evolu­tion­ary Ethics.” Jour­nal of Per­son­al­ity and So­cial Psy­chol­ogy 64 (3). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 467.

Plomin, Robert, John C DeFries, Valerie S Knopik, and Je­nae M Nei­der­hiser. 2016. “Top 10 Repli­cated Find­ings from Be­hav­ioral Ge­net­ics.” Per­spec­tives on Psy­cholog­i­cal Science 11 (1). Sage Publi­ca­tions Sage CA: Los An­ge­les, CA: 3--23.

Rat­cliff, Roger, and Gail McKoon. 2008. “The Diffu­sion De­ci­sion Model: The­ory and Data for Two-Choice De­ci­sion Tasks.” Neu­ral Com­pu­ta­tion 20 (4). MIT Press: 873--922.

Sch­nall, Si­mone, Jen­nifer Ben­ton, and So­phie Har­vey. 2008. “With a Clean Con­science: Clean­li­ness Re­duces the Sever­ity of Mo­ral Judg­ments.” Psy­cholog­i­cal Science 19 (12). SAGE Publi­ca­tions Sage CA: Los An­ge­les, CA: 1219--22.

Sch­nall, Si­mone, Jonathan Haidt, Ger­ald L Clore, and Alexan­der H Jor­dan. 2008. “Dis­gust as Em­bod­ied Mo­ral Judg­ment.” Per­son­al­ity and So­cial Psy­chol­ogy Bul­letin 34 (8). Sage Publi­ca­tions Sage CA: Los An­ge­les, CA: 1096--1109.

Schulz, Eric, Ed­ward T Cokely, and Adam Feltz. 2011. “Per­sis­tent Bias in Ex­pert Judg­ments About Free Will and Mo­ral Re­spon­si­bil­ity: A Test of the Ex­per­tise Defense.” Con­scious­ness and Cog­ni­tion 20 (4). El­se­vier: 1722--31.

Sch­witzgebel, Eric, and Fiery Cush­man. 2012. “Ex­per­tise in Mo­ral Rea­son­ing? Order Effects on Mo­ral Judg­ment in Pro­fes­sional Philoso­phers and Non-Philoso­phers.” Mind & Lan­guage 27 (2). Wiley On­line Library: 135--53.

Sch­witzgebel, Eric, and Joshua Rust. 2016. “The Be­hav­ior of Ethi­cists.” A Com­pan­ion to Ex­per­i­men­tal Philos­o­phy. Wiley On­line Library, 225.

Seyed­sayam­dost, Hamid. 2015. “On Gen­der and Philo­soph­i­cal In­tu­ition: Failure of Repli­ca­tion and Other Nega­tive Re­sults.” Philo­soph­i­cal Psy­chol­ogy 28 (5). Tay­lor & Fran­cis: 642--73.

Shanteau, James. 1992. “Com­pe­tence in Ex­perts: The Role of Task Char­ac­ter­is­tics.” Or­ga­ni­za­tional Be­hav­ior and Hu­man De­ci­sion Pro­cesses 53 (2). El­se­vier: 252--66.

Singer, Peter, and oth­ers. 2000. A Dar­wi­nian Left: Poli­tics, Evolu­tion and Co­op­er­a­tion. Yale Univer­sity Press.

Sin­nott-Arm­strong, Walter, and Chris­tian B Miller. 2008. Mo­ral Psy­chol­ogy: The Evolu­tion of Mo­ral­ity: Adap­ta­tions and In­nate­ness. Vol. 1. MIT Press.

Sri­pada, Chan­dra Sekhar. 2008. “Na­tivism and Mo­ral Psy­chol­ogy: Three Models of the In­nate Struc­ture That Shapes the Con­tents of Mo­ral Norms.” Mo­ral Psy­chol­ogy 1. MIT Press Cam­bridge: 319--43.

Talbot, Brian. 2014. “Why so Nega­tive? Ev­i­dence Ag­gre­ga­tion and Arm­chair Philos­o­phy.” Syn­these 191 (16). Springer: 3865--96.

To­bia, Kevin, Wesley Buck­walter, and Stephen Stich. 2013. “Mo­ral In­tu­itions: Are Philoso­phers Ex­perts?” Philo­soph­i­cal Psy­chol­ogy 26 (5). Tay­lor & Fran­cis: 629--38.

To­bia, Kevin P, Gretchen B Chap­man, and Stephen Stich. 2013. “Clean­li­ness Is Next to Mo­ral­ity, Even for Philoso­phers.” Jour­nal of Con­scious­ness Stud­ies 20 (11-12).

Todd, Peter M, and Gerd Ed Gigeren­zer. 2012. Ecolog­i­cal Ra­tion­al­ity: In­tel­li­gence in the World. Oxford Univer­sity Press.

Turner, Jonathan H, and Alexan­dra Maryan­ski. 2013. “The Evolu­tion of the Neu­rolog­i­cal Ba­sis of Hu­man So­cial­ity.” In Hand­book of Neu­roso­ciol­ogy, 289--309. Springer.

Ugazio, Giuseppe, Claus Lamm, and Ta­nia Singer. 2012. “The Role of Emo­tions for Mo­ral Judg­ments Depends on the Type of Emo­tion and Mo­ral Sce­nario.” Emo­tion 12 (3). Amer­i­can Psy­cholog­i­cal As­so­ci­a­tion: 579.

Wein­berg, Jonathan, Stephen Crowley, Chad Gon­ner­man, Ian Van­de­walker, and Stacey Swain. 2012. “In­tu­ition & Cal­ibra­tion.” Es­says in Philos­o­phy 13 (1). Pa­cific Univer­sity Libraries: 256--83.

Wein­berg, Jonathan M. 2007. “How to Challenge In­tu­itions Em­piri­cally Without Risk­ing Skep­ti­cism.”

---------. 2016. “Ex­per­i­men­tal Philos­o­phy, Noisy In­tu­itions, and Messy In­fer­ences.” Ad­vances in Ex­per­i­men­tal Philos­o­phy and Philo­soph­i­cal Method­ol­ogy. Blooms­bury Pub­lish­ing, 11.

---------. 2017a. “What Is Nega­tive Ex­per­i­men­tal Philos­o­phy Good for?” In The Cam­bridge Com­pan­ion to Philo­soph­i­cal Method­ol­ogy, 161--83. Cam­bridge Univer­sity Press.

---------. 2017b. “Knowl­edge, Noise, and Curve-Fit­ting: A Method­olog­i­cal Ar­gu­ment for Jtb?”

Wein­berg, Jonathan M, and Joshua Alexan­der. 2014. “In­tu­itions Through Thick and Thin.” In­tu­itions. Oxford Univer­sity Press, USA, 187--231.

Wein­berg, Jonathan M, Chad Gon­ner­man, Cameron Buck­ner, and Joshua Alexan­der. 2010. “Are Philoso­phers Ex­pert In­tu­iters?” Philo­soph­i­cal Psy­chol­ogy 23 (3). Tay­lor & Fran­cis: 331--55.

Wheatley, Thalia, and Jonathan Haidt. 2005. “Hyp­notic Dis­gust Makes Mo­ral Judg­ments More Se­vere.” Psy­cholog­i­cal Science 16 (10). SAGE Publi­ca­tions Sage CA: Los An­ge­les, CA: 780--84.

Wieg­mann, Alex, Yas­mina Okan, and Jonas Nagel. 2012. “Order Effects in Mo­ral Judg­ment.” Philo­soph­i­cal Psy­chol­ogy 25 (6). Tay­lor & Fran­cis: 813--36.

Willi­ams, Evan G. 2015. “The Pos­si­bil­ity of an On­go­ing Mo­ral Catas­tro­phe.” Eth­i­cal The­ory and Mo­ral Prac­tice 18 (5). Springer: 971--82.

Wright, Jen­nifer. 2013. “Track­ing In­sta­bil­ity in Our Philo­soph­i­cal Judg­ments: Is It In­tu­itive?” Philo­soph­i­cal Psy­chol­ogy 26 (4). Tay­lor & Fran­cis: 485--501.

Wright, Jen­nifer Cole. 2010. “On In­tu­itional Sta­bil­ity: The Clear, the Strong, and the Paradig­matic.” Cog­ni­tion 115 (3). El­se­vier: 491--503.

---------. 2016. “In­tu­itional Sta­bil­ity.” A Com­pan­ion to Ex­per­i­men­tal Philos­o­phy. Wiley On­line Library, 568--77.

Zam­zow, Jen­nifer L, and Shaun Ni­chols. 2009. “Vari­a­tions in Eth­i­cal In­tu­itions.” Philo­soph­i­cal Is­sues 19 (1). Black­well Pub­lish­ing Inc Mal­den, USA: 368--88.

Zhong, Chen-Bo, Bren­dan Stre­jcek, and Niro Si­vanathan. 2010. “A Clean Self Can Ren­der Harsh Mo­ral Judg­ment.” Jour­nal of Ex­per­i­men­tal So­cial Psy­chol­ogy 46 (5). El­se­vier: 859--62.

  1. The first, sim­plest sort of un­re­li­a­bil­ity can be sub­sumed in this frame­work by con­sid­er­ing the time of eval­u­a­tion as a morally ir­rele­vant fac­tor. ↩︎

  2. This was origi­nally writ­ten in more in­no­cent times be­fore the post had sprawled to more than 12,000 words. ↩︎

  3. This het­ero­gene­ity is also why I don’t com­pute a fi­nal, sum­mary mea­sure of the effect size. ↩︎
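For the curious, the usual way to quantify this kind of between-study heterogeneity is Cochran’s Q and the I² statistic; here’s a sketch with invented effect sizes and variances (not numbers from the reviewed studies):

```python
# Per-study effect estimates and their sampling variances (invented):
effects = [0.60, -0.05, 0.10, 0.45]
variances = [0.04, 0.05, 0.03, 0.06]

# Fixed-effect pooled estimate, weighting by inverse variance:
weights = [1 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled estimate.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1

# I^2: proportion of variation attributable to heterogeneity
# rather than sampling error.
i_squared = max(0.0, (q - df) / q)
print(round(q, 2), round(i_squared, 2))  # Q well above df => heterogeneity
```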

  4. There are some stud­ies ex­am­in­ing only dis­gust and some ex­am­in­ing only clean­li­ness, but I’ve grouped the two here since these ma­nipu­la­tions are con­cep­tu­ally re­lated and many au­thors have ex­am­ined both. ↩︎

  5. There are quite a few cross-cul­tural stud­ies of things like the ul­ti­ma­tum game (Hen­rich et al. 2001). I ex­cluded those be­cause they are not purely moral—the ul­ti­ma­tum-giver is also try­ing to pre­dict the be­hav­ior of the ul­ti­ma­tum-re­cip­i­ent. ↩︎

  6. Yes, not all results in works like Thinking, Fast and Slow have held up and some of the results are in areas prone to replication issues. It still seems unlikely that all such results will be swept away and we’ll be left to conclude that humans were perfectly rational all along. ↩︎

  7. We can also phrase this as fol­lows: Some of our moral in­tu­itions are the re­sult of model-free re­in­force­ment learn­ing (Crock­ett 2013). In the ab­sence of a model spec­i­fy­ing ac­tion-out­come links, these moral in­tu­itions are nec­es­sar­ily ret­ro­spec­tive. Framed in this ML way, the con­cern is that our moral in­tu­itions are not ro­bust to dis­tri­bu­tional shift (Amodei et al. 2016). ↩︎
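A minimal sketch of that footnote’s dynamic: a model-free (TD-style) learner caches a value built retrospectively from past rewards, so when the environment shifts its cached judgment lags behind. Everything below is a toy illustration, not a model from the cited papers:

```python
# One cached value, updated model-free from reward prediction errors.
def td_update(value, reward, lr=0.1):
    return value + lr * (reward - value)

value = 0.0
for _ in range(100):              # long history: the action pays +1
    value = td_update(value, reward=+1.0)
value_before_shift = value
print(round(value_before_shift, 2))  # ~1.0

for _ in range(3):                # distributional shift: it now pays -1
    value = td_update(value, reward=-1.0)
print(round(value, 2))            # ~0.46: still endorses the old world
```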

  8. Aside: There is some amaz­ing aca­demic trash talk in chap­ter 2 of (Sin­nott-Arm­strong and Miller 2008). Just ut­ter con­tempt drip­ping from ev­ery para­graph on both sides (Jerry Fodor ver­sus Tooby and Cos­mides). For ex­am­ple, “Those fa­mil­iar with Fodor’s writ­ing know that he usu­ally re­s­ur­rects his grand­mother when he wants his in­tu­ition to do the work that a good com­pu­ta­tional the­ory should.”. ↩︎

  9. The sep­a­ra­tion be­tween cul­ture and genes is par­tic­u­larly un­clear when look­ing at norms and moral judg­ment since both cul­ture and genes are plau­si­bly work­ing to solve (at least some of) the same prob­lems of so­cial co­op­er­a­tion. One syn­the­sis is to sup­pose that cer­tain fac­ul­ties even­tu­ally evolved to fa­cil­i­tate some cul­turally-origi­nated norms. ↩︎

  10. I will add one com­plaint that ap­plies to pretty much all of the stud­ies: they treat cat­e­gor­i­cal scale data (e.g. re­sponses on a Lik­ert scale) as ra­tio scale. But this sort of thing seems ram­pant so isn’t a mark of ex­cep­tional un­re­li­a­bil­ity in this cor­ner of the liter­a­ture. ↩︎
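One concrete way the ordinal-as-ratio worry bites: under a monotone re-coding of the categories (which ordinal data cannot rule out), the ordering of group means can flip. A toy illustration with invented responses:

```python
# Responses are category labels 1-4 on a Likert-style item (invented).
group_x = [3, 3, 3, 3]
group_y = [1, 4, 4, 1]

def mean_under(coding, group):
    return sum(coding[r] for r in group) / len(group)

equal_spacing = {1: 1, 2: 2, 3: 3, 4: 4}
stretched_top = {1: 1, 2: 2, 3: 3, 4: 10}  # still order-preserving

# Which group looks "more severe" depends on the arbitrary spacing:
print(mean_under(equal_spacing, group_x), mean_under(equal_spacing, group_y))  # 3.0 2.5
print(mean_under(stretched_top, group_x), mean_under(stretched_top, group_y))  # 3.0 5.5
```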

  11. There’s also the slightly subtler claim that expertise does not purify moral intuitions and judgments, but that it helps philosophers understand and accommodate their cognitive flaws (Alexander 2016). We’ll not explicitly examine this claim any further here. ↩︎

  12. There is even rea­son to be­lieve that re­flec­tion is some­times harm­ful (Korn­blith 2010) (Wein­berg and Alexan­der 2014). ↩︎

  13. There’s also the in­ter­est­ing but some­what less rele­vant work of Sch­witzgebel and Rust (Sch­witzgebel and Rust 2016) in which they re­peat­edly find that ethi­cists do not be­have more morally (ac­cord­ing to their met­rics) than non-ethi­cists. ↩︎

  14. Gigeren­zer ex­plains this sur­pris­ing re­sult by ap­peal­ing to the bias-var­i­ance trade­off—com­pli­cated strate­gies over-fit to the data they hap­pen to see and fail to gen­er­al­ize. Another ex­pla­na­tion is that heuris­tics rep­re­sent an in­finitely strong prior and that the “ideal” pro­ce­dures Gigeren­zer tested against rep­re­sent an un­in­for­ma­tive prior (Parpart, Jones, and Love 2018). ↩︎