I work on AI alignment. Right now, I’m using ideas from decision theory to design and train safer artificial agents.
I also do work in ethics, focusing on the moral importance of future generations.
You can email me at thornley@mit.edu.
I’m not sure but I think maybe I also have a different view than you on what problems are going to be bottlenecks to AI development. e.g. I think there’s a big chance that the world would steam ahead even if we don’t solve any of the current (non-philosophical) problems in alignment (interpretability, shutdownability, reward hacking, etc.).
try to make them “more legible” to others, including AI researchers, key decision makers, and the public
Yes, I agree this is valuable, though I think it’s valuable mainly because it increases the probability that people use future AIs to solve these problems, rather than because it will make people slow down AI development or try very hard to solve them pre-TAI.
I don’t think philosophical difficulty is that much of an increase to the difficulty of alignment, mainly because I think that AI developers should (and likely will) aim to make AIs corrigible assistants rather than agents with their own philosophical views that they try to impose on the world. And I think it’s fairly likely that we can use these assistants (if we succeed in getting them and aren’t disempowered by a misaligned AI instead) to help a lot with these hard philosophical questions.
I didn’t mean to imply that Wei Dai was overrating the problems’ importance. I agree they’re very important! I was making the case that they’re also very intractable.
If I thought solving these problems pre-TAI would be a big increase to the EV of the future, I’d take their difficulty to be a(nother) reason to slow down AI development. But I think I’m more optimistic than you and Wei Dai about waiting until we have smart AIs to help us on these problems.
I’m a philosopher who’s switched to working on AI safety full-time. I also know there are at least a few philosophers at Anthropic working on alignment.
With regard to your Problems in AI Alignment that philosophers could potentially contribute to:
I agree that many of these questions are important and that more people should work on them.
But a fair number of them are discussed in conventional academic philosophy, e.g.:
How to resolve standard debates in decision theory?
Infinite/multiversal/astronomical ethics
Fair distribution of benefits
What is the nature of philosophy?
What constitutes correct philosophical reasoning?
How should an AI aggregate preferences between its users?
What is the nature of normativity?
And these are all difficult, controversial questions.
For each question, you have to read and deeply think about at least 10 papers (and likely many more) to get a good understanding of the question and its current array of candidate answers.
Any attempt to resolve the question would have to grapple with a large number of considerations and points that have previously been made in relation to the question.
Probably, you need to write something at least book-length.
(And it’s very hard to get people to read book-length things.)
In trying to do this, you probably don’t find any answer that you’re really confident in.
I think most philosophers’ view on the questions they study is: ‘It’s really hard. Here’s my best guess.’
Or if they’re confident of something, it’ll be a small point within existing debates (e.g. ‘This particular variant of this view is subject to this fatal objection.’).
And even if you do find an answer you’re confident in, you’ll have a very hard time convincing other philosophers of that answer.
They’ll bring up some point that you hadn’t thought of.
Or they’ll differ from you in their bedrock intuitions, and it’ll be hard for either of you to see any way to argue the other out of their bedrock intuition.
In some cases—like population ethics and decision theory—we have proofs that every possible answer will have some unsavory implication. You have to pick your poison, and different philosophers will make different picks.
And on inductive grounds, I suspect that many other philosophical questions also have no poison-free answers.
Derek Parfit is a good example here.
He spent decades working on On What Matters, trying to settle the questions of ethics and meta-ethics.
He really tried to get other philosophers to agree with him.
But very few do. The general consensus in philosophy is that it’s not a very convincing book.
And I think a large part of the problem is a difference of bedrock intuitions. For example, Bernard Williams simply ‘didn’t have the concept of a normative reason,’ and there was nothing that Parfit could do to change that.
It also seems like there’s not much of an appetite among AI researchers for this kind of work.
If there were, we might see more discussions of On What Matters, or any of the other existing works on these questions.
When I decided to start working on AI, I seriously considered working on the kinds of questions you list. But due to the points above, I chose to do my current work instead.
Makes sense! Unfortunately any x-risk cost-effectiveness calculation has to be a little vibes-based because one of the factors is ‘By how much would this intervention reduce x-risk?’, and there’s little evidence to guide these estimates.
Whether longtermism is a crux will depend on what we mean by ‘long,’ but I think concern for future people is a crux for x-risk reduction. If future people don’t matter, then working on global health or animal welfare is the more effective way to improve the world. The more optimistic of the calculations that Carl and I do suggest that, by funding x-risk reduction, we can save a present person’s life for about $9,000 in expectation. But we could save about 2 present people if we spent that money on malaria prevention, or we could mitigate the suffering of about 12.6 million shrimp if we donated to SWP.
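To make the comparison concrete, here’s a quick sketch of the per-dollar arithmetic implied by the figures above (all of the inputs are just the rough estimates quoted in this comment, not authoritative numbers):

```python
# Rough comparison of the cost-effectiveness figures quoted above,
# restated on a per-unit basis. All inputs are the comment's rough estimates.

budget = 9_000  # dollars

xrisk_lives = 1             # ~1 present life saved in expectation per $9,000 (optimistic x-risk estimate)
malaria_lives = 2           # ~2 present lives saved per $9,000 (malaria prevention)
shrimp_helped = 12_600_000  # ~12.6 million shrimp helped per $9,000 (SWP)

print(f"x-risk reduction:   ${budget / xrisk_lives:,.0f} per present life saved (in expectation)")
print(f"malaria prevention: ${budget / malaria_lives:,.0f} per present life saved")
print(f"shrimp welfare:     ${budget / shrimp_helped:.4f} per shrimp helped")
```

If present people (and animals) are all that matter, the x-risk figure is dominated by the alternatives; the case for x-risk reduction gets its force from the value of the future.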
Oops, yes, the fundamentals of my case and Bruce’s are very similar. I should have read Bruce’s comment!
The claim we’re discussing—about the possibility of small steps of various kinds—sounds kinda like a claim that gets called ‘Finite Fine-Grainedness’/‘Small Steps’ in the population axiology literature. It seems hard to convincingly argue for, so in this paper I present a problem for lexical views that doesn’t depend on it. I sort of gestured at it above with the point about risk without making it super precise. The one-line summary is that expected welfare levels are finitely fine-grained.
Oh yep, nice point, though note that, e.g., there are uncountably many reals between 1,000,000 and 1,000,001, and yet it still seems correct (at least talking loosely) to say that 1,000,001 is only a tiny bit bigger than 1,000,000.
But in any case, we can modify the argument to say that S* feels only a tiny bit worse than S. Or instead we can modify it so that S is the temperature (in degrees Celsius) of a fire that causes suffering that can just about be outweighed, and S* is the temperature of a fire that causes suffering that just about can’t be outweighed.
Also you might be interested in this paper from Andreas Mogensen which discusses a similar idea.
Nice post! Here’s an argument that extreme suffering can always be outweighed.
Suppose you have a choice between:
(S+G): The most intense suffering S that can be outweighed, plus a population G that’s good enough to outweigh it, so that S+G is good overall: better than an empty population.
(S*+nG): The least intense suffering S* that can’t be outweighed, plus a population that’s n times better than the good population G.
If extreme suffering can’t be outweighed, we’re required to choose S+G over S*+nG, no matter how big n is. But that seems implausible. S* is only a tiny bit worse than S, and n could be enormous. To make the implication seem more implausible, we can imagine that the improvement nG comes about by extending the lives of an enormous number of people who died early in G, or by removing (non-extreme) suffering from the lives of an enormous number of people who suffer intensely (but non-extremely) in G.
We can also make things more difficult by introducing risk into the case (in this sort of way). Suppose now that the choice is between:
(S+G): The most intense suffering S that can be outweighed, plus a population G that’s good enough to outweigh it, so that S+G is good overall: better than an empty population.
(Risky S*+nG): With probability 1 − p (for some very small p), the most intense suffering S that can be outweighed. With probability p, the least intense suffering S* that can’t be outweighed. Plus (with certainty) a population that’s n times better than the good population G.
We’ve amended the case so that the move from S+G to Risky S*+nG now involves just a p increase in the probability of a tiny increase in suffering (from S to S*). As before, the move also improves the lives of those in the good population G by as much as you like. Plausibly, each p increase (for very small p) in the probability of getting S* instead of S (together with an n increase in the quality of G, for very large n) is an improvement. Then with Transitivity, we get the result that S*+nG is better than S+G, and therefore that extreme suffering can always be outweighed.
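To spell the chain out a bit more formally (the lotteries $A_i$, the increment $p$, and the multipliers $n_i$ are just my own labels for the moves described above):

$$A_i = \big(S^* \text{ with probability } ip, \;\; S \text{ with probability } 1 - ip\big) + n_i G, \qquad i = 0, 1, \dots, 1/p,$$

where $n_0 = 1$ and each $n_{i+1}$ is as large as we like relative to $n_i$. So $A_0$ is S+G, $A_1$ is Risky S*+nG, and $A_{1/p}$ is S* plus an extremely good population. If each move from $A_i$ to $A_{i+1}$ is an improvement (a mere $p$ increase in the chance of S* rather than S, offset by an enormous improvement to the good population), then Transitivity implies that $A_{1/p}$ is better than $A_0$, and hence that even the supposedly unoutweighable suffering S* can in fact be outweighed.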
I think the view that extreme suffering can’t always be outweighed has some counterintuitive prudential implications too. It implies that basically we should never think about how happy our choices would make us. Almost always, we should think only about how to minimize our expected quantities of extreme suffering. Even when we’re, e.g., choosing between chocolate and vanilla at the ice cream shop, we should first determine which choice minimizes our expected quantity of extreme suffering. Only if we conclude that these quantities are exactly the same should we even consider which of chocolate and vanilla tastes nicer. That seems counterintuitive to me.
Note also that you can accept outweighability and still believe that extreme suffering is really bad. You could, e.g., think that 1 second of a cluster headache can only be outweighed by trillions upon trillions of years of bliss. That would give you all the same practical implications without the theoretical trouble.
Nice point, but I think it comes at a serious cost.
To see why, consider a different case. In X, ten billion people live awful lives. In Y, those same ten billion people live wonderful lives. Clearly, Y is much better than X.
Now consider instead Y* which is exactly the same as Y except that we also add one extra person, also with a wonderful life. As before, Y* is much better than X for the original ten billion people. If we say that the value of adding the extra person is undefined and that this undefined value renders the value of the whole change from X to Y* undefined, we get the implausible result that Y* is not better than X. Given plausible principles linking betterness and moral requirements, we get the result that we’re permitted to choose X over Y*. That seems very implausible, and so it counts against the claim that adding people results in undefined comparisons.
Thanks!
Wait, all the LLMs get 90+ on ARC? I thought LLMs were supposed to do badly on ARC.
You should read the post! Section 4.1.1 makes the move that you suggest (rescuing PAVs by de-emphasising axiology). Section 5 then presents arguments against PAVs that don’t appeal to axiology.
I think my objections still work if we ‘go anonymous’ and remove direct information about personal identity across different options. We just need to add some extra detail. Let the new version of One-Shot Non-Identity be as follows. You have a choice between: (1) combining some pair of gametes A, which will eventually result in the existence of a person with welfare 1, and (2) combining some other pair of gametes B, which will eventually result in the existence of a person with welfare 100.
The new version of Expanded Non-Identity is then the same as the above, except it also has available option (3): combine the pair of gametes A and the pair of gametes B, which will eventually result in the existence of two people each with welfare 10.
Narrow views claim that each option is permissible in One-Shot Non-Identity. What should they say about Expanded Non-Identity? The same trilemma applies. It seems implausible to say that (1) is permissible, because (3) looks better. It seems implausible to say that (3) is permissible, because (2) looks better. And if only (2) is permissible, then narrow views imply the implausible-seeming Losers Can Dislodge Winners.
Now consider wide views and Two-Shot Non-Identity, again redescribed in terms of combining pairs of gametes A and B. You first choose whether to combine pair A (which would eventually result in the existence of a person with welfare 1), and then later choose whether to combine pair B (which would eventually result in the existence of a person with welfare 100). Suppose that you know your predicament in advance, and suppose that you choose to combine pair A. Then (your view implies) you’re required to combine pair B, even if that choice occurs many decades later, and even though you wouldn’t be required to combine pair B if you hadn’t (many decades earlier) chosen to combine pair A. Now consider a slightly different case: you first choose whether to combine pair C (which would eventually result in the existence of a person with welfare 101), then later choose whether to combine pair B. Suppose that you know your predicament in advance, and suppose that you decline to combine pair C. Many decades later, you face the choice of whether to combine pair B. Your view seems to imply that you’re not permitted to do so. There are thus cases where (all else being equal) you’re not even permitted to create a person who would enjoy a wonderful life.
Here’s my understanding of the dialectic here:
Me: Some wide views make the permissibility of pulling both levers depend on whether the levers are lashed together. That seems implausible. It shouldn’t matter whether we can pull the levers one after the other.
Interlocutor: But lever-lashing doesn’t just affect whether we can pull the levers one after the other. It also affects what options are available. In particular, lever-lashing removes the option to create both Amy and Bobby, and removes the option to create neither Amy nor Bobby. So if a wide view has the permissibility of pulling both levers depend on lever-lashing, it can point to these facts to justify its change in verdicts. These views can say: it’s permissible to create just Amy when the levers aren’t lashed because the other options are on the table; it’s wrong to create just Amy when the levers are lashed because the other options are off the table.
Me: (Side note: this explanation doesn’t seem particularly satisfying. Why does the presence or absence of these other options affect the permissibility of creating just Amy?). If that’s the explanation, then the resulting wide view will say that creating just Amy is permissible in the four-button case. That’s against the spirit of wide PAVs, so wide views won’t want to appeal to this explanation to justify their change in verdicts given lever-lashing. So absent some other explanation of some wide views’ change in verdicts occasioned by lever-lashing, this implausible-seeming change in verdicts remains unexplained, and so counts against these views.
Thanks! I’d like to think more at some point about Dasgupta’s approach plus resolute choice.
In Parfit’s case, we have a good explanation for why you’re rationally required to bind yourself: doing so is best for you.
Perhaps you’re morally required to bind yourself in Two-Shot Non-Identity, but why? Binding yourself isn’t better for Amy. And if it’s better for Bobby, it seems that can only be because existing is better for Bobby than not-existing, and then there’s pressure to conclude that we’re required to create Bobby in Just Bobby, contrary to the claims of PAVs.
And suppose that (for whatever reason) you can’t bind yourself in Two-Shot Non-Identity, so that the choice to create Bobby (having previously created Amy) remains open. In that case, it seems like our wide view must again make permissibility depend on lever-lashing or past choices. If the view says that you’re required to create Bobby (having previously created Amy), permissibility depends on past choices. If the view says that you’re permitted to decline to create Bobby (having previously created Amy), permissibility depends on lever-lashing (since, on wide views, you wouldn’t be permitted to pull both levers if they were lashed together).
I said a little in another thread. If we get aligned AI, I think it’ll likely be a corrigible assistant that doesn’t have its own philosophical views that it wants to act on. And then we can use these assistants to help us solve philosophical problems. I’m imagining in particular that these AIs could be very good at mapping logical space, tracing all the implications of various views, etc. So you could ask a question and receive a response like: ‘Here are the different views on this question. Here’s why they’re mutually exclusive and jointly exhaustive. Here are all the most serious objections to each view. Here are all the responses to those objections. Here are all the objections to those responses,’ and so on. That would be a huge boost to philosophical progress. Progress has been slow so far because human philosophers take entire lifetimes just to fill in one small part of this enormous map, and because humans make errors so later philosophers can’t even trust that small filled-in part, and because verification in philosophy isn’t much quicker than generation.