Trying to better understand the practical epistemology of EA, and how we can improve upon it.
Violet Hour
I don’t quite agree with your summary.
Kat explicitly acknowledges at the end of this comment that “[they] made some mistakes … learned from them and set up ways to prevent them”, so it feels a bit unfair to say that Non-Linear as a whole hasn’t acknowledged any wrongdoing.
OTOH, Ben’s testimony here in response to Emerson is a bit concerning, and supports your point more strongly.[1] It’s also one of the remarks I’m most curious to hear Emerson respond to. I’ll quote Ben in full because I don’t think this comment is on the EA Forum.
I did hear your [Emerson’s?] side for 3 hours and you changed my mind very little and admitted to a bunch of the dynamics (“our intention wasn’t just to have employees, but also to have members of our family unit”) and you said my summary was pretty good. You mostly laughed at every single accusation I brought up and IMO took nothing morally seriously and the only ex ante mistake you admitted to was “not firing Alice earlier”. You didn’t seem to understand the gravity of my accusations, or at least had no space for honestly considering that you’d seriously hurt and intimidated some people.
I think I would have been much more sympathetic to you if you had told me that you’d been actively letting people know about how terrible an experience your former employees had, and had encouraged people to speak with them, and if you at literally any point had explicitly considered the notion that you were morally culpable for their experiences.
This is only Ben’s testimony, so take that for what it’s worth. But this context feels important, because (at least just speaking personally) genuine acknowledgment and remorse for any wrongdoing feels pretty crucial for my overall evaluation of Non-Linear going forward.
- ^
I also sympathize with the general vibe of your remark, and the threats to sue contribute to the impression of going on the defensive rather than admitting fault.
Here’s a dynamic that I’ve seen pop up more than once.
Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form “but I don’t want (e.g.) alignment to be doomed — it would be a huge relief if I’m wrong!”
It seems uncontroversial that Person A would like to be shown that they’re wrong in a way that vindicates their initial forecast as ex ante reasonable.
It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.
In my experience, it’s much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) “convenient beliefs” to hold.
If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. An optimistic update might also mean coming to think that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or with the unusual merits of their chosen community.
To clarify: I’m not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I’m simply claiming that: if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn’t necessarily expect optimistic pushback to be uniformly welcomed, for all of the standard reasons that updates of the form “I could’ve done better on a task I care about” can be hard to accept.
I’m a bit unclear on why you characterise 80,000 Hours as having a “narrower” cause focus than (e.g.) Charity Entrepreneurship. CE’s page cites the following cause areas:
Animal Welfare
Health and Development Policy
Mental Health and Happiness
Family Planning
Capacity Building (EA Meta)
Meanwhile, 80k provide a list of the world’s “most pressing problems”:
Risks from AI
Catastrophic Pandemics
Nuclear War
Great Power Conflict
Climate Change
These areas feel comparably “broad” to me? Likewise, Longview, whom you list as part of the “AI x-risk community”, states six distinct focus areas for their grantmaking — only one of which is AI. Unless I’ve missed a recent pivot from these orgs, both Longview and 80k feel more similar to CE in terms of breadth than to Animal Advocacy Careers.
I agree that you need “specific values and epistemic assumptions” to agree with the areas these orgs have highlighted as most important, but I think you need specific values and epistemic assumptions to agree with more standard near-termist recommendations for impactful careers and donations, too. So I’m a bit confused about what the difference between “question” and “answer” communities is meant to denote aside from the split between near/longtermism.[1] Is the idea that (for example) CE is more skeptically focused on exploring the relative priorities of distinct cause areas, whereas organizations like Longview and 80k are more focused on funnelling people+money into areas which have already been decided as the most important? Or something else?
I do think it’s correct to note that the more ‘longtermist’ side of the community works with different values and epistemics to the more ‘neartermist’ side of the community, and I think it would be beneficial to emphasise this more. But given that you note there are already distinct communities in some sense (e.g., there are x-risk specific conferences), what other concrete steps would you like to see implemented in order to establish distinct communities?
- ^
I’m aware that many people justify focus on areas like biorisk and AI in virtue of the risks posed to the present generation, and might not subscribe to longtermism as a philosophical thesis. I still think that the ‘longtermist’ moniker is useful as a sociological label — used to denote the community of people who work on cause areas that longtermists are likely to rate as among the highest priorities.
Alignment, Goals, & The Gut-Head Gap: A Review of Ngo et al.
Thanks, and sorry I’m late getting back to you. I’ll respond to the various parts in turn.
I don’t find Carlsmith et al’s estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we’re fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.
My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom/hope). In presenting a ‘conjunctive’ argument, Carlsmith betrays a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:
Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment) and the fact that “any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved.” is enough to provide a disjunctive frame!
Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.
Nate’s Framing
I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:
For humanity to be dead by 2070, only one of the following needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.
For this to be a disjunctive argument for doom, all of the following need to be true:
If humanity has < 20 years to prepare for AGI, then doom is highly likely.
Etc …
That is, the first point requires an argument which shows the following:
A Conjunctive Case for the Disjunctive Case for Doom:[1]
Even if we have a competent alignment-research culture, and
Even if the technical challenge of alignment is also pretty easy, nevertheless
Humanity is likely to go extinct if it has <20 years to prepare for AGI.
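Setting aside the probabilistic hedging (‘highly likely’) and treating each claim as a material implication, the relationship between the two framings can be stated compactly (D for doom; A, B, C for the three disjuncts):

```latex
% A disjunctive doom argument is equivalent to a conjunction of conditionals:
(A \lor B \lor C) \rightarrow D
\;\equiv\;
(A \rightarrow D) \land (B \rightarrow D) \land (C \rightarrow D)
```

Defending the disjunctive frame therefore means defending each conditional separately.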
If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:
Obviously, the argument directly entails the following: Groups of competent alignment researchers would fail to make ‘sufficient progress’ on alignment within <20 years, even if the technical challenge of alignment is “pretty easy”.
There have to be some premises here which help make sense of why this would be true. What’s the bar for a competent ‘alignment culture’?
If the bar is low, then the claim does not seem obviously true. If the bar for ‘competent alignment-research culture’ is very high, then I think you’ll need an assumption like the one below.
With extremely high probability, the default expectation should be that the values of future AIs are unlikely to care about continued human survival, or the survival of anything we’d find valuable.
I will note that this assumption seems required to motivate the disjunctive framing above, rather than following from the framing above.
The arguments I know of for claims like this do seem to rely on strong claims about the sort of ‘plan search’ algorithms we’d expect future AIs to instantiate. For example, Rob claims that we’re on track to produce systems which approximate ‘randomly sample from the space of simplicity-weighted plans’. See discussion here.
As Paul notes, “there are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world.”
By default, the values of future AIs are likely to include broadly-scoped goals, which will involve rapacious influence-seeking.
I agree that there are instrumentally convergent goals, which include some degree of power/influence-seeking. But I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
It’s not enough to have a moderate desire for influence. I think it’s plausible that the default path involves systems who do ‘normal-ish human activities’ in pursuit of more local goals. I quote a story from Katja Grace in my shortform here.
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.
I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.
Future Responses
This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?
I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to spend a lot of work trying to understand the logical structure of your argument, which requires a decent chunk of time-investment.
Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.
We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.
- ^
Suggestions for better argument names are not being taken at this time.
Based solely on my own impression, I’d guess that one reason for the lack of engagement on your original question stems from the fact that it felt like you were operating within a very specific frame, and I sensed that untangling the specific assumptions of your frame (and consequently a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions are driving your estimates, and so I consequently felt unsure as to which counter-arguments you’d consider relevant to your key cruxes.
(For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you’ve read these responses, why did you find the responses uncompelling? Which specific arguments did you find faulty?)
Here’s one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:
“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”
When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.
For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.
For humanity to be dead by 2070, only one premise below needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.
Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. Also, that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.
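To illustrate why the choice of frame matters so much, here’s a toy calculation. The probabilities are invented purely for illustration (not anyone’s actual estimates): three independent claims, each 50% likely, yield very different doom estimates depending on whether doom requires all of them or just one.

```python
import math

# P(each claim holds); assumed independent, purely illustrative numbers.
claim_probs = [0.5, 0.5, 0.5]

# Conjunctive frame: doom requires ALL claims to hold.
p_doom_conjunctive = math.prod(claim_probs)

# Disjunctive frame: doom follows if ANY single claim holds.
p_doom_disjunctive = 1 - math.prod(1 - p for p in claim_probs)

print(p_doom_conjunctive)  # 0.125
print(p_doom_disjunctive)  # 0.875
```

The same three claims give a 12.5% or an 87.5% doom estimate depending solely on the framing — which is why arguing for the framing itself matters so much.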
Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for believing P(doom|AGI) > 10%, either. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I’m also >10% confident in many claims despite lacking ‘detailed, technical arguments’ for them, and I think we can do a lot better than we’re doing currently.
I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions if that’s what we genuinely believe the arguments suggest. I’m also glad that you offered the ‘social explanation’ for people’s low doom estimates, even though I think it’s incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I’d like to offer an analogous argument: I think many arguments for P(doom | AGI) > 90% are the result of overreliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I’d be interested to read them (and to elaborate on my dissatisfaction with existing arguments for high doom estimates).
Thanks! :)
Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc.) to inherit human-like cognitive properties (including normative cognition, like plan search), and I see a small-but-non-negligible chance that we end up with extinction (or worse).
On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story either consists in the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren’t necessary for human extinction, but it does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.
Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments, I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:
Does your position rule out the claim that “humans model other human beings using the same architecture that they use to model themselves”?
To me, this seems like an instance where ‘value reasoning’ and ‘descriptive reasoning’ rely on similar cognitive resources. If LLMs inherit this human-like property (Quintin claims they do), would that update you towards optimism? If not, why not?
I take it that the notion of ‘intelligence’ we’re working with is related to planning. If future AI systems inherit human-like cognition wrt plan search, then I think this is a reason to expect that AI cognition will also inherit not-completely-alien-to-human values — even if there are, in some sense, distinct cognitive mechanisms undergirding ‘values’ and ‘non-values’ reasoning in humans.
This is because the ‘search over plans’ process has both normative and descriptive components. I don’t think the claim about LLMs distilling human cognition constitutes anything like a guarantee that future LLMs will have values we’d really like, and nor is it a call for complacency about the emergence of misaligned goals. I just think it constitutes meaningful evidence against the human extinction claim.
As I write this, I’m starting to think that your claim about distinct cognitive mechanisms primarily seems like an argument for doom conditioned on ‘LLMs mostly don’t distill human cognition’, but doesn’t seem like an independent argument for doom conditioned on LLMs distilling human cognition. If LLMs distill the plan search component of human cognition, this feels like a meaningful update against doom. If LLMs mostly fail to distill the parts of human cognition involved in plan search, then cognitive convergence might happen because (e.g.) the Natural Abstraction Hypothesis is true, and ‘human values’ aren’t a natural abstraction. In that case, it seems correct to say that cognitive convergence constitutes, at best, a small update against doom. (The cognitive convergence would occur due to structural properties of patterns in the world, rather than arising as the result of LLMs distilling more specifically human thought patterns related to values)
So I feel like ‘the degree to which we should expect future AIs to converge with human-like cognitive algorithms for plan search’ might be a crux for you?
A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.
The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.
Goal-directed superhuman AI systems will be built (let’s say conditioned on TAI).
If goal-directed superhuman AI systems are built, their values will result in human extinction if realized.
If goal-directed superhuman AI systems are built, they’ll be able to realize their values — even if their values would result in human extinction if realized.
Thus: Humanity will go extinct.
I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
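For concreteness, the ~6% figure is just the product of the three numbers, reading the premises as a chain of conditionals (and treating Premises 2 and 3 as independent given Premise 1). A quick sketch, using the rough, not-to-be-taken-seriously estimates above:

```python
# Rough conditional probabilities for the three-premise argument above.
# These numbers are illustrative immediate impressions, not careful estimates.
p_goal_directed = 0.98  # P(goal-directed superhuman AI is built | TAI)
p_doom_values = 0.07    # P(its values would cause extinction if realized | built)
p_can_realize = 0.83    # P(it could realize such values anyway | built)

p_extinction = p_goal_directed * p_doom_values * p_can_realize
print(round(p_extinction, 3))  # 0.057, i.e. ~6%
```

As the text notes, most of the action is in the 7% on Premise 2: doubling it roughly doubles the headline number.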
Some more specific objections:
Obviously Premise 2 is doing a lot of the work here. I think that one of the main arguments for believing in Premise 2 is a view like Rob’s, which holds that current ML is on track to produce systems which are, “in the ways that matter”, more like ‘randomly sample (simplicity-weighted) plans’ than anything recognizably human. If future systems are sampling from simplicity-weighted plans to achieve arbitrary goals, then Premise 2 does start to look very plausible.
This basically just seems like an extremely strong claim about the inductive biases of ML systems, and my (likely unsatisfying) response boils down to: (1) I don’t see any strong argument for believing it, and (2) I see some arguments for the alternative conclusion.
I find myself really confused when trying to think about this debate. In a discussion of Rob’s post, Daniel Kokotajlo says: “IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine.”
I think I just don’t get the intuition behind his argument (tagging @kokotajlod in case he wants to correct any misunderstandings). I don’t really like ‘burden of proof’ talk, but my instinct is to say “look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
Perhaps there are near-term predictions which would bear on the dispute between the two hypotheses? I currently interpret the disagreement here as a disagreement about the relevant outcome space over which we should be uncertain, which feels hard to adjudicate. But, right now, I struggle to see the argument for the more doomy outcome space.
More on Premise 2: Paul Christiano offers various considerations which count against doom which appear to go through without having “solved alignment”. These considerations feel less forceful to me than the points in the bullet point above, but they still serve to make Premise 2 seem less likely.
“Given how small the [resource costs of keeping humans around] are, even tiny preferences one way or the other will dominate incidental effects from grabbing more resources”.
“There are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world”
“Most humans and human societies would be willing to spend much more than 1 trillionth of their resources (= $100/year for all of humanity) for a ton of random different goals”
Paul also mentions “decision-theoretic arguments for cooperation”, including a passing reference to ECL.
I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.
AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.
Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.
I believe: as you develop gradually more capable agentic systems, there are dynamic pressures towards a certain kind of coherency. I don’t think that claim alone establishes the existence of dynamic pressures towards ‘dangerous maximizing cognition’.
I think that AGI cognition (like our own) may well involve schemas, like (say) being loyal, or virtuous. We don’t argmax(virtue). Rather, the virtue schema also applies to the process by which we search over plans.
So I don’t see why ‘having superhuman AIs run Walmart’ necessarily leads to doom, because they might just be implementing schemas like “be a good business professional”, rather than “find the function f(.) which is most ‘business-professional-like’, then maximize f(.) — regardless of whether any human would consider f(.) to represent anything corresponding to ‘good business professional’”.
Alex Turner has a related comment here.
On Premise 3: I feel unsatisfied, so far, by accounts of AI takeover scenarios. Admittedly, it seems kinda mad for me to say “I’m confident that an AI with greater cognitive power than all of humanity couldn’t kill us if it wanted to”, which is one reason that I’m only at ~⅙ chance that we’d survive in that situation.
But I also don’t know how much my conclusion is swayed by a sense of wanting to avoid the hubris of “man, it would be Really Dumb if you said a highly capable AI couldn’t kill us if it wanted to, and then we end up dead”, rather than a more obviously virtuous form of cognition.
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’. The arguments attempting to move from ‘AI with superhuman cognitive abilities’ to ‘human extinction’ feel fuzzier than I’d like.
If superhuman systems don’t foom, we might have marginally superhuman systems, who are able to be thwarted before they kill literally everyone (while still doing a lot of damage). Constraints like ‘accessing the relevant physical infrastructure’ might dominate the gains from greater cognitive efficiency.
I also feel pretty confused about how much actual real-world power would be afforded to AIs in light of their highly advanced cognition (a relevant recent discussion), which further brings down my confidence in Premise 3.
I’m also assuming that: conditioned on an AI instrumentally desiring to kill all humans, deceptive alignment is likely. I haven’t read posts like this one which might challenge that assumption. If I came to believe that deceptive alignment was highly unlikely, this could lower the probability of either Premise 2 or Premise 3.
Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:
“Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.”
Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.
Psychologists talk of ‘g’ because performance on tasks we intuitively think of as cognitive is correlated across tasks, and also correlates with some important life outcomes. I don’t know how well the unidimensional notion of intelligence will transfer to advanced AI systems. The fact that some AIs perform decently on IQ tests without being good at much else is at least weak evidence against the generality of the more unidimensional ‘intelligence’ concept.
However, I agree that there’s a well-defined sense in which we can say that AIs are more cognitively capable than all of humanity combined. I also think that my earlier point about expecting future systems to exhibit human-like inductive biases makes the argument in the bullet point above substantially weaker.
I still remain uneasy about the extent to which a unidimensional notion of ‘capabilities’ can feed into claims about takeoffs and takeover scenarios, and I’m currently unclear on whether this makes a practical difference.
(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)
Nice post!
I think I’d want to revise your first taxonomy a bit. To me, one (perhaps the primary) disagreement among ML researchers regarding AI risk consists of differing attitudes to epistemological conservatism, which I think extends beyond making conservative predictions. Here’s why I prefer my framing:
As you note, to say that someone makes a conservative prediction comes with other connotations, like predictions being robust to uncertainty.
If I say that someone has a conservative epistemology, I think this more faithfully captures the underlying disposition — namely, that they are conservative about the power of abstract theoretical arguments to deliver strong conclusions in the absence of more straightforwardly relevant empirical data.
I don’t interpret the most conservative epistemologists as primarily driven by a fear of making ‘extreme’ predictions. Rather, I interpret them as expressing skepticism about the presence of any evidential signal offered by certain modes of more abstract argumentation.
For example, Richard has a more conservative epistemology than you, though obviously he is highly non-conservative relative to most. David Thorstad seems more conservative still. The hard-nosed lab scientist with little patience for philosophy is yet more conservative than David.
I also think that the language of conservative epistemology helps counteract (what I see as) a mistaken frame motivating this post. (I’ll try to motivate my claim, but I’ll note that I remain a little fuzzy on exactly what I’m trying to gesture at.)
The mistaken frame I see is something like “modeling conservative epistemologists as if they were making poor strategic choices within a non-conservative world-model”. You state:
The level of concern and seriousness I see from ML researchers discussing AGI on any social media platform or in any mainstream venue seems wildly out of step with “half of us think there’s a 10+% chance of our work resulting in an existential catastrophe”.
I have concerns about you inferring this claim from the survey data provided,[1] but perhaps more pertinently for my point: I think you’re implicitly interpreting the reported probabilities as something like all-things-considered credences in the proposition researchers were queried about. I’m much more tempted to interpret the probabilities offered by researchers as meaning very little. Sure, they’ll provide a number on a survey, but this doesn’t represent ‘their’ probability of an AI-induced existential catastrophe.
I don’t think that most ML researchers have, as a matter of psychological fact, any kind of mental state that’s well-represented by a subjective probability about the chance of an AI-induced existential catastrophe. They’re more likely to operate with a conservative epistemology, in a way that isn’t neatly translated into probabilistic predictions over an outcome space that includes the outcomes you are most worried about. I think many people are likely to filter out the hypothesis given the perceived lack of evidential support for the outcome.
I actually do think the distinction between ‘conservative predictions’ and ‘conservative decision-making’ is helpful, though I’m skeptical about its relevance for analyzing different attitudes to AI risk.
Here’s one place I think the distinction between ‘conservative predictions’ and ‘conservative decision-making’ would be useful: early decisions about COVID.
Many people (including epidemiologists!) claimed that we lacked evidence about the efficacy of masks for preventing COVID, and consequently didn’t suggest that people wear masks anyway.
I think ‘masks might help prevent COVID’ would have been in the outcome space of relevant decision-makers, and so we can describe their decision-making as (overly) conservative, even given their conservative predictions.
However, I think that ‘literal extinction from AGI’ just isn’t in the outcome space of many ML researchers, because arguments for that claim become harder to make as your epistemology becomes more conservative.
I don’t think that ‘[Person] will offer a probability when asked in a survey’ provides much evidence about whether that outcome is in [Person]’s outcome space in anything like a stable way.
If my analysis is right, then a first-pass at the practical conclusions might consist in being more willing to center arguments about alignment from a more empirically grounded perspective (e.g. here), or more directly attempting to have conversations about the costs and benefits of more conservative epistemological approaches.
- ^
First, there are obviously selection effects present in surveying OpenAI and DeepMind researchers working on long-term AI. Citing this result without caveat feels similar to using (e.g.) PhilPapers survey results revealing that most specialists in philosophy of religion are theists to support the claim that most philosophers are theists. I can also imagine similar selection effects being present (though to lesser degrees) in the AI Impacts Survey. Given selection effects, and given that response rates from the AI Impacts survey were ~17%, I think your claim is misleading.
I haven’t read Kosoy & Diffractor’s stuff, but I will now!
FWIW I’m pretty skeptical that their framework will be helpful for making progress in practical epistemology (which I gather is not their main focus anyway?). That said, I’d be very happy to learn that I’m wrong here, so I’ll put some time into understanding what their approach is.
Thanks :)
I’m sympathetic to the view that calibration on questions with larger bodies of obviously relevant evidence isn’t transferable to predictions on more speculative questions. Ultimately I believe that the amount of skill transfer is an open empirical question, though I think the absence of strong theorizing about the relevant mechanisms involved counts heavily against deferring to (e.g.) Metaculus predictions about AI timelines.
A potential note of disagreement on your final sentence. While I think focusing on calibration can Goodhart us away from some of the most important sources of epistemic insight, there are “predictions” (broadly construed) that I think we ought to weigh more highly than “domain-relevant specific accomplishments and skills”.
E.g., if you’re sympathetic to EA’s current focus on AI, then I think it’s sensible to think “oh, maybe Yudkowsky was onto something”, and upweight the degree to which you should engage in detail with his worldview, and potentially defer to the extent that you don’t possess a theory which jointly explains both his foresight and the errors you currently think he’s making.
My objection to ‘Bayesian Mindset’ and the use of subjective probabilities to communicate uncertainty is (in part) due to the picture imposed by the probabilistic mode of thinking, which is something like “you have a clear set of well-identified hypotheses, and the primary epistemic task is to become calibrated on such questions.” This leads me to suspect that EAs are undervaluing the ‘novel hypotheses generation’ component of predictions, though there is still a lot of value to be had from (novel) predictions.
Probabilities, Prioritization, and ‘Bayesian Mindset’
I’m not Joe, but I thought I’d offer my attempt. It’s a little more than a few lines (~350 words), though hopefully it’s of some use.
Moral anti-realists often think about moral philosophy, even though they believe there are no moral facts to discover. If there are no facts to be discovered, we might ask “why bother? What’s the point of doing ethics?”
Joe provides three possible reasons:
Through moral theorizing, we can better understand which sets of principles it’s possible to consistently endorse.
Sometimes, ethical theorizing can help you discover a tension among the different principles you’re drawn to. EJT offers one nice example in his comment. Joe takes various impossibility results in population ethics to provide another example.
Through moral theorizing, you can also develop a better self-understanding. If you’re just muddling along without ever reflecting on your principles, you don’t know what you stand for.
Consider the total utilitarian. They’ve engaged in moral theorizing, and now better understand (so claims Joe) what they stand for. If you never reflect on your values, you forgo some degree of agency. You forgo the ability to properly push for what you care about, because to a large degree you don’t know exactly what you care about.
We can call the first two benefits of moral theorizing ‘static benefits’ (not Joe’s term). Moral theorizing can benefit you by taking your psychology as given, providing you with tools to better understand it, and making your existing principles more coherent. However, there’s also a more dynamic benefit to be had from moral theorizing.
Moral theorizing can help you construct the person you want to be. This benefit is harder to precisely convey.
My analogy: I like going to galleries with friends who know more about the visual arts than I do. Sometimes, I’ll look at a painting and just not get it. Then, my friend will point out a detail I’ve missed, and get me to look again.
In many cases, this will make me like the painting more. It’s not that my friend provided me with more self-understanding by informing me that “I liked the painting all along”. Rather, I’ve grown to like the painting more through seeing it more clearly. Ethical theorizing can provide a similar benefit. When we engage in ethical theorizing, we “look again” or “look more deeply” at who we are. This is partly about understanding who we already were, and partly about understanding who we want to become.
Thanks for the comment!
(Fair warning, my response will be quite long)
I understand you to be offering two potential stories to justify ‘speculativeness-discounting’.
First, EAs don’t (by and large) apply a speculativeness-discount ex post. Instead, there’s a more straightforward ‘Bayesian+EUM’ rationalization of the practice. For instance, the epistemic practice of EAs may be better explained with reference to more common-sense priors, potentially mediated by orthodox biases.
Or perhaps EAs do apply a speculativeness-discount ex post. This too can be justified on Bayesian grounds.
We often face doubts about our ability to reason through all the relevant considerations, particularly in speculative domains. For this reason, we update on higher-order uncertainty, and implement heuristics which themselves are justified on Bayesian grounds.
In my response, I’ll assume that your attempted rationale for Principle 4 involves justifying the norm with respect to the following two views:
Expected Utility Maximization (EUM) is the optimal decision-procedure.
The relevant probabilities to be used as inputs into our EUM calculation are our subjective credences.
The ‘Common Sense Priors’ Story
I think your argument in (1) is very unlikely to provide a rationalization of EA practice on ‘Bayesian + EUM’ grounds.[1]
Take Pascal’s Mugging. The stakes can be made high enough that the value involved can easily swamp your common-sense priors. Of course, people have stories for why they shouldn’t give the money to the mugger. But these stories are usually generated because handing over their wallet is judged to be ridiculous, rather than the judgment arising from an independent EU calculation. I think other fanatical cases will be similar. The stakes involved under (e.g.) various religious theories and our ability to acausally affect an infinite amount of value are simply going to be large enough to swamp our initial common-sense priors.
Thus, I think the only feasible ‘Bayes+EUM’ justification you could offer would have to rely on your ‘higher-order evidence’ story about the fallibility of our first-order reasoning, which we’ll turn to below.
The ‘Higher-Order Evidence’ Story
I agree that we can say: “we should be fanatical insofar as my reasoning is correct, but I am not confident in my reasoning.”
The question, then, is how to update after reflecting on your higher-order evidence. I can see two options: either you have some faith in your first-order reasoning, or no faith.
Let’s start with the case where you have some faith in your first-order reasoning. Higher-order evidence about your own reasoning might decrease the confidence in your initial conclusion. But, as you note, “we might find that the EV of pursuing the speculative path warrants fanaticism”. So, what to do in that case?
I think it’s true that many people will cite considerations of the form “let’s pragmatically deprioritize the high EV actions that are both speculative and fanatical, in anticipation of new evidence”. I don’t think that provides a sufficient justificatory story of the epistemic norms to which most of us hold ourselves.
Suppose we knew that our evidential situation was as good as it’s ever going to be. Whatever evidence we currently have about (e.g.) paradoxes in infinite ethics, or the truth of various religions constitutes ~all the evidence we’re ever going to have.
I still don’t expect people to follow through on the highest EV option, when that option is both speculative and fanatical.
Under MEC, EAs should plausibly be funneling all their money into soteriological research. Or perhaps you don’t like MEC, and think we should work out the most plausible worldview under which we can affect strongly-Ramsey-many sentient observers.[2]
Or maybe you have a bounded utility function. In that case, imagine that the world already contains a sufficiently large number of suffering entities. How blasé are you, really, about the creation of arbitrarily many suffering-filled hellscapes?
There’s more to say here, but the long and short of it is: if you fail to reach a point where you entirely discount certain forms of speculative reasoning, I don’t think you’ll be able to recover anything like Principle 4. My honest view is that many EAs have a vague hope that such theories will recover something approaching normality, but very few people actually try to trace out the implications of such theories on their own terms, and follow through on these implications. I’m sympathetic to this quote from Paul Christiano:
I tried to answer questions like “How valuable is it to accelerate technological progress?” or “How bad is it if unaligned AI takes over the world?” and immediately found that EU maximization with anything like “utility linear in population size” seemed to be unworkable in practice. I could find no sort of common-sensical regularization that let me get coherent answers out of these theories, and I’m not sure what it would look like in practice to try to use them to guide our actions.
Higher-Order Evidence and Epistemic Learned Helplessness
Maybe you’d like to say: “in certain domains, we should assign our first-order calculations about which actions maximize EU zero weight. The heuristic ‘sometimes assign first-order reasoning zero weight’ can be justified on Bayesian grounds.”
I agree that we should sometimes assign our first-order calculations about which actions maximize EU zero weight. I’m doubtful that Bayesianism or EUM play much of a role in explaining why this norm is justified.
When we’re confronted with the output of an EUM calculation that feels off, we should listen to the parts of us which tell us to check again, and ask why we feel tempted to check again.
If we’re saying “no, sorry, sometimes I’m going to put zero weight on a subjective EU calculation”, then we’re already committed to a view under which subjective EU calculations only provide action-guidance in the presence of certain background conditions.
If we’re willing to grant that, then I think the interesting justificatory story is a story which informs us of what the background conditions for trusting EU calculations actually are — rather than attempts to tell post hoc stories about how our practices can ultimately be squared with more foundational theories like Bayesianism + EUM.
If you’re interested, I’ll have a post in April touching on these themes. :)
Hm, I still feel as though Sanjay’s example cuts against your point somewhat. For instance, you mentioned encountering the following response:
“It is better for us to have AGI first than [other organization], that is less safety minded than us.”
To the extent that regulations slow down potential AGI competitors in China, I’d expect stronger incentives towards safety, and a correspondingly lower chance of encountering potentially dangerous capabilities races. So, even if export bans don’t directly slow down the frontier of AI development, it seems plausible that such bans could indirectly do so (by weakening the incentives to sacrifice safety for capabilities development).
Your post + comment suggests that you nevertheless expect such regulation to have ~0 effect on AGI development races, although I’m unsure which parts of your model are driving that conclusion. I can imagine a couple of alternative pictures, with potentially different policy implications.
Your model could involve potential participants in AGI development races viewing themselves primarily in competition with other (e.g.) US firms. This, combined with short timelines, could lead you to expect the export ban to have ~0 effect on capabilities development.
On this view, you would be skeptical about the usefulness of the export ban on the basis of skepticism about China developing AGI (given your timelines), while potentially being optimistic about the counterfactual value of domestic regulation relating to chip production.
If this is your model, I might start to wonder “Could the chip export ban affect the regulatory Overton Window, and increase the chance of domestic chip controls?”, in a way that makes the Chinese export ban potentially indirectly helpful for slowing down AGI.
To be clear, I’m not saying the answer to my question above is “yes”, only that this is one example of a question that I’d have on one reading of your model, which I wouldn’t have on other readings.
Alternatively, your model might instead be skeptical about the importance of compute, and consequently skeptical about the value of governance regimes surrounding a wide variety of even-somewhat-quixotic-suggestions relating to domestic chip regulation.
I sensed that you might have a less compute-centric view based on your questions to leading AI researchers, asking if they “truly believe there are any major obstacles left” which major AI companies were unable to “tear down with their [current?] resources”.
Based on that question – alongside your assigning a significant probability to <5 year timelines – I sensed that you might have a (potentially not-publicly-disclosable) impression about the current rate of algorithmic progress.[1]
I don’t want to raise overly pernickety questions, and I’m glad you’re sharing your concerns. I’m asking for more details about your underlying model because the audience here will consist of people who (despite being far more concerned about AGI than the general population) are on average far less concerned – and on average know less about the technical/governance space – than you are. If you’re skeptical about the value of extant regulation affecting AGI development, it would be helpful at least for me (and I’m guessing others?) to have a bit more detail on what’s driving that conclusion.
- ^
I don’t mean to suggest that you couldn’t have more ‘compute-centric’ reasons for believing in short timelines, only that some of your claims (+tone) updated me a bit in this direction.
(Second Comment)
2. On seeing ourselves whole
You say, in response to messy pluralism:
“We can talk, individually, about each of a zillion little choice vectors one by one; but we don’t know where they push in combination, what they are doing, what explains them; what they represent. We can see ourselves making any given specific choice. But we can’t see ourselves whole.”
I love the sentiment you express here. I engage in moral reasoning as an attempt to see (and indeed construct) myself whole. With that said, I’m unsure how much “self-knowledge” we actually lose by adopting messy pluralism. I want to look at three components of the quote, and explain how I see myself whole in response to each.
A. What are my little choice vectors doing?
At an abstract level, my choice vectors are pushing me towards actions I can genuinely stand behind. They’re pushing me towards actions which, if I reflect, I can prescribe for all agents with my fuzzy, inchoate values in the decision-context I find myself.
B. What explains my little choice vectors?
Well, there’ll be some causal stories of the ordinary, standard type. But you know this. I take it that, through this question, you’re asking: what rationalizes my choices? What makes it the case that I am acting agentically, and with responsibility? Thus the final question.
C. What do my choice vectors represent?
In the ideal case, they represent something like my answer in (A): that is, they represent the actions I’d prescribe for all agents with my fuzzy, inchoate values in my decision-context.
You might reasonably point out that this response is largely uninformative. What do my fuzzy, inchoate values actually represent?
To see myself whole is to see myself as I actually am. That means, yes, seeing myself as someone who is genuinely committed to certain principles, and seeing myself as someone who can be surprised by what’s entailed by my principles. But to see myself honestly is also to see myself as someone in the process of becoming more whole; it’s to see myself as someone who has not yet (fully, at least) worked themselves out.
So, what do my choice vectors represent? They represent a desire to alleviate the distress of those who are (and will be) suffering. They represent a desire to face up to the vast scale of the world, and a desire to face up to the fact that the world may not be how I wish it to be. And my choice vectors represent a desire to “look again” at morality, and to allow for the possibility that there’s something I might have missed.
D. Concluding messy pluralism
I think that acknowledging some degree of messy pluralism is part of what allows me to see myself whole. It allows me to encounter my values (my heart, my sentiment, whatever) as they actually are, rather than the values of some hypothetical, more precisely systematized offshoot of me.
I agree that, to see oneself whole, one should look at the totality of one’s choices and principles, and then ask “wait, what exactly is going on here?”. Indeed, I think that this is particularly important to do when certain tensions in our principles or actions are brought to light.
That said, I’m skeptical of how much more “self-knowledge” the utilitarian framework actually provides. The utilitarian can say, of course, that they are a force for “total utility”. But what does this mean, exactly? What’s the mapping between valenced experiential states and welfare numbers, and, indeed, what justifies any particular mapping?
When we get to what exactly we mean by total utility, I do become unsure about what the utilitarian is “a force” for. I think this is clear in population axiology and infinite ethics. Of course, there are more humdrum cases when this is clearer (though so too for the right kind of pluralist), and we may hope for some clever workaround in infinite cases.
But, insofar as utilitarians hope for and try to construct workarounds to (e.g.) cases in infinite ethics, then I think we’ve shown that real-life utilitarians are primarily a force for something more fuzzy and foundational than straightforward utilitarianism. This force, after all, is what motivates utilitarians to reject claims of equal welfare between (some) infinite worlds. At bottom, I think, the utilitarian doesn’t really have much more of a sense of what “force” they are than (at least some) pluralists — they’re primarily using “total utility” as a placeholder for a set of more complicated sentiments.
This is a really wonderful post, Joe. When I receive notifications for your posts, I feel like I’m put in touch with the excitement that people in the 1800s might have felt when being delivered newspapers containing serially published chapters of famous novels. : )
Okay, enough buttering up. Onto objections.
I very much like your notions of taking responsibility, and of seeing yourself whole. However, I object to certain ways you take yourself to be applying these criteria.
(I’ll respond in two comments; the points are related, but I wanted to make it easier to respond to each point independently)
1. Understanding who we are, and who it’s possible to be
My first point of pushback: I think that your suggested way of engaging with population axiology can, in many cases, impede one’s ability to take full responsibility for one’s values, through improperly narrowing the space of who it’s possible to be.
When I ask myself why I care about understanding what it’s possible to be, it’s because I care about who I can be — what sort of thing, with what principles, will the world allow me to be?
In your discussion of Utopia and Lizards, you could straightforwardly bring out a contradiction in the views of your interlocutor, because you engineered a direct comparison between concrete worlds, in a way that was analogous to the repugnant conclusion.
Moreover, your interlocutor endorsed certain principles that were collectively inconsistent. You need to have your interlocutor endorse principles, because you don’t get inconsistency results from mere behavior.
People can just decide between concrete worlds however they like. You can only show that someone is inconsistent if they take themselves to be acting on the basis of incompatible principles.
I agree that doing ethics (broadly construed) can, for the anti-realist, help them understand which sets of principles it even makes sense to endorse as a whole. So I agree with your abstract claim about ethics helping the anti-realist see which principles they can coherently endorse together. But I also believe that certain kinds of formal theorizing can inhibit our sense of what (or who) it’s possible to be, because certain kinds of theorizing can (incorrectly) lead us to believe that we are operating within a space which captures the only possible way to model our moral commitments.
For instance: I don’t think that I’m committed to a well-defined, impartial, and context-independent aggregate welfare ranking with the property of finite fine-grainedness. The axioms of Arrhenius’ impossibility theorem (to which you allude) quantify over welfare levels with well-defined values.
If I reflect on my principles, I don’t find this aggregate welfare measure directly, nor do I see that it’s entailed by any of my other commitments. If I decide on one concrete world over another, I don’t take this to be grounded in a claim about aggregate welfare.
I don’t mean to say that I think there are no unambiguous cases where societies (worlds) are happier than others. Rather, I mean to say that granting some determinate welfare rankings over worlds doesn’t mean that I’m thereby committed to the existence of a well-defined, impartial welfare ranking over worlds in every context.
So: I think I have principles which endorse the claim: ‘Utopia > Lizards’, and I don’t think that leaves me endorsing some unfortunate preference about concrete states of affairs. In Utopia and Lizards, Z (to me) seems obviously worse than A+. In the original Mere Addition Paradox, it’s a bit trickier, because Parfit’s original presentation assumes the existence of ‘an’ aggregate welfare-level, which is meant to represent some (set of) concrete state of affairs. And I think more would need to be said in order to convince me that there’s some fact of the matter about which concrete situations instantiate Parfit’s puzzle.
How does this all relate to your initial defense of moral theorizing? In short, I think that moral theorizing can have benefits (which you suggest), but — from my current perspective — I feel as though moral theorizing can also impose an overly narrow picture of what a consistent moral self-conception must look like.
I have some interest in this, although I’m unsure whether I’d have time to read the whole book — I’m open to collaborations.
I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
I think you misstate the degree to which janus’ framework is uncontroversial.
I think you misstate the implications of janus’ framework, and I think this weakens your argument against LLM moral patienthood.
I’ll start with the first point. In your post, you state the following.
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn’t useful to think of LLMs as “simulating stuff” … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth “behind” the masks.
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
Functionally, the model behaves as though it believes that ‘it’ is Claude.[1]
The model’s outputs are produced via a process which involves ‘predicting’ or ‘simulating’ the sorts of outputs that its learned representation of ‘Claude’ would output.
The model receives information suggesting that the prior outputs of Claude failed to live up to HHH standards.
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Of course, it’s sensible to be uncertain about such views, but they pose a challenge to the claim that we cannot gather evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.
E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.