RobertM (LessWrong dev & admin as of July 5th, 2022)
I think the modal no-Anthropic counterfactual does not have an alignment-agnostic AI company that’s remotely competitive with OpenAI, which means there’s no external target for this Amazon investment. It’s not an accident that Anthropic was founded by former OpenAI staff who were substantially responsible for OpenAI’s earlier GPT scaling successes.
I don’t know if it’s commonly agreed upon; that’s just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
Re: ontological shifts, see this arbital page: https://arbital.com/p/ontology_identification.
The fact that natural selection produced species with different goals/values/whatever isn’t evidence that that’s the only way to get those values, because “selection pressure” isn’t a mechanistic explanation. You need more info about how values are actually implemented to rule out the possibility that a proposed route other than natural selection could reproduce them.
I’m not claiming that evolution is the only way to get those values, merely that there’s no reason to expect you’ll get them by default by a totally different mechanism. The fact that we don’t have a good understanding of how values form even in the biological domain is a reason for pessimism, not optimism.
At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
Random aside, but I think this paragraph’s core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) is unjustified, and none of the citations provide the claimed support.
The first post (re: the evolutionary analogy as evidence for a sharp left turn) received substantial pushback in the comments, and that pushback seems more correct to me than not; in any case, the post seems to misunderstand the position it’s arguing against.
The second post presents an interesting case for a set of claims that are different from “there is no distinction between inner and outer alignment”; I do not consider it to be a full refutation of that conceptual distinction. (See also Steven Byrnes’ comment.)
The third post is at best playing games with the definitions of words (or misunderstanding the thing it’s arguing against), at worst is just straightforwardly wrong.
I have less context on the fourth post, but from a quick skim of both the post and the comments, I think the way it’s most relevant here is as a demonstration of how important it is to be careful and precise with one’s claims. (The post is not making an argument about whether AIs will be “rigid utility maximizing consequentialists”, it is making a variety of arguments about whether coherence theorems necessarily require that whatever ASI we might build will behave in a goal-directed way. Relatedly, Rohin’s comment a year after writing that post indicated that he thinks we’re likely to develop goal-directed agents; he just doesn’t think that’s entailed by arguments from coherence theorems, which may or may not have been made by e.g. Eliezer in other essays.)
My guess is that you did not include the fifth post as a smoke test to see if anyone was checking your citations, but I am having trouble coming up with a charitable explanation for its inclusion in support of your argument.
I’m not really sure what my takeaway is here, except that I didn’t go scouring the essay for mistakes; the citation of Quintin’s post was just the first thing that jumped out at me, since it wasn’t all that long ago. I think the claims made in the paragraph are basically unsupported by the evidence, and the evidence itself is substantially mischaracterized. Based on other comments, it looks like this is true of a bunch of other substantial claims and arguments in the post.
- ^
Though I’m sort of confused about what this back-and-forth is talking about, since it’s referencing behind-the-scenes stuff that I’m not privy to.
Please stop saying that mind-space is an “enormously broad space.” What does that even mean? How have you established a measure on mind-space that isn’t totally arbitrary?
Why don’t you make the positive case that the space of possible (or, if you wish, likely) minds is one in which minds have values compatible with the fulfillment of human values? I think we have pretty strong evidence that not all minds are like this, even within the space of minds produced by evolution.
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
Concepts do seem to be convergent to some degree (though note that ontological shifts at increasing levels of intelligence seem likely), but I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
The argument w.r.t. capabilities is disanalogous.
Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities—though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].
We are not doing the same thing to select for alignment, because “alignment” is:
- an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc., rather than “observable external behavior” (which is what we use to evaluate capabilities & steer training; see the sketch below)
- something that might be inextricably tied to the form of general intelligence which by default puts us in the “dangerous capabilities” regime, or if not strongly bound in theory, then strongly bound in practice
I do think this disagreement is substantially downstream of a disagreement about what “alignment” represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
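To make that asymmetry concrete, here’s a minimal hypothetical sketch (the names capability_eval and model.generate, and the grader setup, are invented for illustration, not any real eval harness): capability evaluation, like the loss used to steer training, only ever consumes a model’s observable outputs.

```python
from typing import Callable, Sequence

def capability_eval(model, prompts: Sequence[str], graders: Sequence[Callable[[str], float]]) -> float:
    """Score a model purely on its observable outputs (hypothetical harness).

    Nothing here reads weights or internal representations; the losses used
    to steer training toward capabilities are behavioral in the same way.
    """
    scores = [grade(model.generate(prompt)) for prompt, grade in zip(prompts, graders)]
    return sum(scores) / len(scores)

# There is no known counterpart, e.g. alignment_eval(model) -> float, that
# reads off whether a model's internal goals are human-compatible; that is
# the sense in which we are not "doing the same thing" for alignment.
```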
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff.
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its internal algorithmic representations a bit cheaper, but in practice we observe that it takes alignment researchers weeks or months to figure out what extremely tiny slices of two-generations-old LLMs are doing.
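For concreteness, here’s roughly what fine-tuning looks like as a procedure (a generic PyTorch-style sketch, not any particular lab’s pipeline): gradient descent adjusts the weights against a loss on observable outputs, and at no point does it require, or produce, an account of what those weights compute.

```python
import torch
import torch.nn.functional as F

def finetune(model: torch.nn.Module, data_loader, lr: float = 1e-5) -> None:
    """Generic fine-tuning loop (illustrative sketch, assumes a supervised data_loader).

    The network is treated as a black box: all we need is a differentiable
    loss on its outputs. Nothing in this process tells us what algorithms
    the resulting weights implement; that remains a separate, far more
    expensive interpretability problem.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for inputs, targets in data_loader:
        logits = model(inputs)
        loss = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```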
I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that’s remotely favorable to us is the evidence we have from evolution[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won’t apply to any AI trained with something resembling modern methods.
Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
Anticipating the argument that, since we’re doing the training, we can shape the goals of the systems—this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don’t have either, right now.
- ^
Which, mind you, is still unfavorable; I think the goals of most animal species, were they to be extrapolated outward to superhuman levels of intelligence, would not result in worlds that we would consider very good. Just not nearly as unfavorable as what I think the actual distribution we’re facing is.
Dialogue: What is the optimal frontier for due diligence?
Every public company in America has a legally-mandated obligation to maximize shareholder returns
This is false. (The analogy between corporations and unaligned AGI is misleading for many other reasons, of course, not the least of which is that corporations are not actually coherent singleton agents, but are made of people.)
We don’t know how to do that. It’s something that falls out of its training, but we currently don’t know how to even predict what goal any particular training setup will result in, let alone aim for a specific one.
The goal you specify in the prompt is not the goal that the AI is acting on when it responds. Consider: if someone tells you, “Your goal is now [x]”, does that change your (terminal) goals? No, because those don’t come from other people telling you things (or other environmental inputs)[1].
Understanding a goal that’s been put into writing, and having that goal, are two very different things.
- ^
This is a bit of an exaggeration, because humans don’t generally have very coherent goals, and will “discover” new goals or refine existing ones as they learn new things. But I think it’s basically correct to say that there’s no straightforward relationship between telling a human to have a goal, and them having it, especially for adults (i.e. a trained model).
I think that’s strongly contra Eliezer’s model, which is shaped something like “succeeding at solving the alignment problem eliminates most sources of existential risk, because aligned AGI will in fact be competent to solve for them in a robust way”. This does obviously imply something about the ability of random humans to spin up unmonitored nanofactories, or even just push a bad yaml file. Maybe there’ll be some much more clever solution(s) for various possible problems? /shrug
I do agree with this, in principle:
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.
...though I don’t think it buys us more than a couple points; I think people dramatically underestimate how high the ceiling is for humans, and that a reasonably smart human familiar with the right ideas would stand a decent chance at executing a takeover if placed into the position of an AI (assuming sped-up cognition, plus whatever actuators current systems typically possess).
However, I think this is wrong:
LLMs distill human cognition
LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the training objective (next-token prediction). This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next-token prediction seems very different from whatever it is that humans do. (Humans are capable of next-token prediction, but notably we are much worse at it than even GPT-3.)
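For reference, the objective in question is just cross-entropy on the next token; here’s a toy sketch, assuming logits shaped (batch, sequence, vocab):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction objective (toy sketch).

    logits: (batch, seq_len, vocab_size) model outputs at each position
    tokens: (batch, seq_len) input token ids

    The model at position t is scored on predicting token t+1; the objective
    never references world states, goals, or anything resembling values.
    """
    preds = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(preds, targets)
```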
Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn’t expect an LLM to want anything remotely relevant to humans. (Most of the things that humans want are lossy proxies for things that improved IGF (inclusive genetic fitness) in the ancestral environment, many of which generalized extremely poorly out of distribution. What are the lossy proxies for minimizing prediction loss that a sufficiently-intelligent LLM would end up with? I don’t know, but I don’t see why they’d have anything to do with the very specific things that humans value.)
None of those obviously mean the same thing (“runaway AI” might sort of gesture at it, but it’s still pretty ambiguous). Intelligence explosion is the thing it’s pointing at, though I think there are still a bunch of conflated connotations that don’t necessarily make sense as a single package.
I think “hard takeoff” is better if you’re talking about the high-level “thing that might happen”, and “recursive self improvement” is much clearer if you’re talking about the usually-implied mechanism by which you expect hard takeoff.
I think people should take a step back and take a bird’s-eye view of the situation:
The author persistently conflates multiple communities: “tech, EA (Effective Altruists), rationalists, cybersecurity/hackers, crypto/blockchain, Burning Man camps, secret parties, and coliving houses”. In the Bay Area, “tech” is literally a double-digit percentage of the population.
The first archived snapshot of the website of the author’s consultancy (“working with survivors, communities, institutions, and workplaces to prevent and heal from sexual harassment and sexual assault”) was recorded in August 2022.
According to the CEA Community Health team: “The author emailed the Community Health team about 7 months ago, when she shared some information about interpersonal harm; someone else previously forwarded us some anonymous information that she may have compiled. Before about 7 months ago, we hadn’t been in contact with her.”
This would have been late July 2022.
From the same comment by the CEA Community Health team: “We have emailed the author to tell her we will not be contracting her services.”
Implied: the author attempted to sell her professional services to CEA.
The author, in the linked piece: “To be clear, I’m not advocating bans of the accused or accusers—I am advocating for communities to do more, for thorough investigations by trained/experienced professionals, and for accountability if an accusation is found credible. Untrained mediators and community representatives/liaisons who are only brought on for their popularity and/or nepotistic ties to the community, without thought to expertise, experience, or qualifications, such as the one in the story linked above (though there are others), often end up causing the survivors greater trauma.” (Emphasis mine.)
The author: “In February 2023, I calculated that I personally knew of/dealt with thirty different incidents in which there was a non-trivial chance the Centre for Effective Altruism or another organization(s) within the EA ecosystem could potentially be legally liable in a civil suit for sexual assault, or defamation/libel/slander for their role/action (note: I haven’t added the stories I’ve received post-February to this tally, nor do I know if counting incidents an accurate measure (eg, accused versus accusers) also I’ve gotten several stories since that time; nor is this legal advice and to get a more accurate assessment, I’d want to present the info to a legal team specializing in these matters). Each could cost hundreds of thousands and years to defend, even if they aren’t found liable. Of course, without discovery, investigation, and without consulting legal counsel, this is a guess/speculative, and I can’t say whether they’d be liable or rise to the level of a civil suit—not with certainty without formal legal advice and full investigations.” (Emphasis in original.)
The author: “In response to my speculation, the community health team denied they knew of my work prior to August 2022, and that it was not connected to EA. Three white community health team members have strongly insinuated that I’ve lied and treated me – an Asian-American – in much the gaslighting, silencing way that survivors reporting rape fear being treated. Many of the women who have publicly spoken up about sexual misconduct in EA are of Asian descent. As I stated in the previous paragraph, I haven’t yet consulted with lawyers, but I personally believe this is defamatory. Additionally, the Centre and Effective Ventures Foundation are in headquartered in a jurisdiction that is much more harsh on defamation than the one I’m in.” (Emphasis in original.)
The author: “Unlike most of these mediators and liaisons, I have training/formal education, mentorship, and years of specific experience. If/When I choose to consult with lawyers about the events described in the paragraph above, there might be a settlement if my speculations of liability are correct (or just to silence me on the sexual misconduct and rapes I do know of). If (again, speculative) that doesn’t happen and we continue into a discovery process, I’m curious as to what could be uncovered.” (Emphasis in original.)
I don’t doubt that the author cares about preventing sexual assault, and mitigating the harms that come from it. They do also seem to care about something that requires dropping dark hints of potential legal remedies they might pursue, with scary-sounding numbers and mentions of venue-shopping attached to them.
Relevant, I think, is Gwern’s later writing on Tool AIs:
There are similar general issues with Tool AIs as with Oracle AIs:
- a human checking each result is no guarantee of safety; even Homer nods. An extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements it.
- an intelligent, never mind superintelligent Tool AI, will have built-in search processes and planners which may be quite intelligent themselves, and in ‘planning how to plan’, discover dangerous instrumental drives and the sub-planning process execute them. (This struck me as mostly theoretical until I saw how well GPT-3 could roleplay & imitate agents purely by offline self-supervised prediction on large text databases—imitation learning is (batch) reinforcement learning too! See Decision Transformer for an explicit use of this.)
- developing a Tool AI in the first place might require another AI, which itself is dangerous
Personally, I think the distinction is basically irrelevant in terms of safety concerns, mostly for reasons outlined by the second bullet-point above. The danger is in the fact that “useful answers” you might get out of a Tool AI are those answers which let you steer the future to hit narrow targets (approximately described as “apply optimization power” by Eliezer & such).
If you manage to construct a training regime for something that we’d call a Tool AI, which nevertheless gives us something smart enough that it does better than humans in terms of creating plans which affect reality in specific ways[1], then it approximately doesn’t matter whether or not we give it actuators to act in the world[2]. It has to be aiming at something; whether or not that something is friendly to human interests won’t depend on what name we give the AI.
I’m not sure how to evaluate the predictions themselves. I continue to think that the distinction is basically confused and doesn’t carve reality at the relevant joints, and I think progress to date supports this view.
I think you are somewhat missing the point. The point of a treaty with an enforcement mechanism which includes bombing data centers is not to engage in implicit nuclear blackmail, which would indeed be dumb (from a game theory perspective). It is to actually stop AI training runs. You are not issuing a “threat” which you will escalate into greater and greater forms of blackmail if the first one is acceded to; the point is not to extract resources in non-cooperative ways. It is to ensure that the state of the world is one where there is no data center capable of performing AI training runs of a certain size.
The question of whether this would be correctly understood by the relevant actors is important but separate. I agree that in the world we currently live in, it doesn’t seem likely. But if you in fact lived in a world which had successfully passed a multilateral treaty like this, it seems much more possible that people in the relevant positions had updated far enough to understand that whatever was happening was at least not the typical realpolitik.
2. If the world takes AI risk seriously, do we need threats?
Obviously if you live in a world where you’ve passed such a treaty, the first step in response to a potential violation is not going to be “bombs away!”, and nothing Eliezer wrote suggests otherwise. But the fact that you have these options available ultimately bottoms out in your BATNA still being to bomb the data center.
3. Don’t do morally wrong things
I think conducting cutting-edge AI capabilities research is pretty immoral, and in this counterfactual world that is a much more normalized position, even if the consensus is that the chance of x-risk absent a very strong plan for alignment is something like 10%. You can construct the least convenient possible world, in which some poor country has decided, for perfectly innocent reasons, to build data centers that will predictably get bombed, but unless you think the probability mass on something like that happening is noticeable, I don’t think it should be a meaningful factor in your reasoning. Like, we do not let people involuntarily subject others to Russian roulette, which is roughly analogous to the epistemic situation in a world where 10% x-risk is the consensus position, and our response to someone actively preparing to go play such a round of roulette, while declaring their intention to do so in order to get some unrelated real benefit out of it, would be to stop them.
4. Nuclear exchanges could be part of a rogue AI plan
I mean, no: in this world you’re already dead, and a nuclear exchange would in fact cost the AI quite a lot, so I expect many fewer nuclear wars in worlds where we’ve accidentally created an unaligned ASI.
He proposes instituting an international treaty, which seems to be aiming for the reference class of existing treaties around the proliferation of nuclear and biological weapons. He is not proposing that the United States issue unilateral threats of nuclear first strikes.
ETA: feel free to ignore the below, given your caveat, though if you choose to write an expanded form of any of the arguments later, you may find it helpful to have some early objections.
Correct me if I’m wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit:
Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don’t very much value the well-being of others don’t have the power to actually expropriate everyone else’s resources by force. (We have evidence of what happens when those conditions break down to any meaningful degree; it isn’t super pretty.)
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment. In particular, the claim that “GPT-4 seems to be honest, kind, and helpful after relatively little effort” seems to be treating GPT-4’s behavior as meaningfully reflecting its internal preferences or motivations, which I think is “not even wrong”. I think it’s extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren’t centrally pointed at being honest, kind, and helpful.
re: endogenous response to AI—I don’t see how this is relevant once you have ASI. To the extent that it might be relevant, it’s basically conceding the argument: that the reason we’ll be safe is that we’ll manage to avoid killing ourselves by moving too quickly. (Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past. One that some people are actively optimizing for, but also one that other people are optimizing against.)
re: perfectionism—I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time. Assuming that this will continue to be true is again assuming the conclusion (that AI will not be superhuman in any relevant sense). I also feel like there’s an implicit argument here about how value isn’t fragile that I disagree with, but I might be reading into it.
I’m not totally sure what analogy you’re trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive. Human efforts to preserve animal species are a drop in the bucket compared to the casual disregard with which we optimize over them and their environments for our benefit. I’m sure animals sometimes attempt to defend their territory against human encroachment. Has the human response to this been to shrug and back off? Of course, there are some humans who do care about animals having fulfilled lives by their own values. But even most of those humans do not spend their lives tirelessly optimizing for their best understanding of the values of animals.