Safety-concerned EAs should prioritize AI governance over alignment
Setting aside the fact that EAs tend to be tech-savvy and that their comparative advantage lies in technical work such as alignment, the community as a whole is not prioritizing advocacy and governance enough.
Effective Altruists over-prioritize working on AI alignment relative to AI regulation advocacy. I disagree with prioritizing alignment because much of alignment research doubles as capabilities research (Connor Leahy even begged people to stop publishing interpretability research). Consequently, alignment research is accelerating the timelines toward AGI. Another problem is that cutting-edge models are only available inside frontier AI labs, so there is comparatively little that someone on the outside can contribute. Finally, even if an independent alignment researcher finds a safeguard against a particular AGI risk, the AI lab it is intended for might not implement it, since doing so would cost time and effort. This is due to the "race to the bottom," a governance problem.
Even setting aside X-risk, I can imagine a plethora of reasons why a US corporation, or the USA itself, is one of the worst paths to AGI. Corporations are profit-seeking and less concerned with the human-centric integration of technology that AGI necessitates. Having one country control the ultimate job-replacer also seems like a bad idea: economies all over the world would be subject to whatever the next GPT model can do, potentially replacing half their workforce. Instead, I believe the far superior best-case scenario is an international body that makes decisions globally, or at least has control over AGI development in each country. Therefore, I believe EA should prioritize lengthening the time horizon by advocating for a pause, a slowdown, or any sort of international treaty. This would help prevent the extremely dangerous race dynamics we are currently in.
How you can help:
I recommend PauseAI. They are a great community of people (including many EAs) advocating for an international moratorium on frontier general-capability AI models. There is so much you can do to help, including putting up posters, writing letters, writing about the issue, and more. They are very friendly and will answer any questions about how you can fit in and maximize your power as a democratic citizen.
Even if you disagree with pausing as the solution to the governance problem, I believe that the direction of PauseAI is correct. On a governance political compass, pausing may be 10 miles away from current political discourse, but most EAs generally lie 9.5 miles in the same direction.
I agree! The focus on alignment is contingent on (now obsolete) historical thinking about the issue and it's time to update. The alignment problem is harder than we thought, AGI is closer at hand than we thought, no one was taking seriously how undemocratic pivotal-act thinking was even if it had been possible for MIRI to solve the alignment problem by themselves, etc. Now that the problem is nearer, it's clearer to us and it's clearer to everyone else, so it's more possible to get government solutions implemented that both prevent AI danger and give us more time to work on alignment (if that is possible), rather than pursuing alignment as the only way to head off AI danger.
What is the evidence for this claim? It doesn't appear to be true in any observable or behavioral sense that I'm currently aware of. We now have systems (LLMs) that can reason generally about the world, make rudimentary plans, pursue narrow goals, and speak English at the level of a college graduate. And yet virtually none of the traditional issues of misalignment appear to be arising in these systems yet, at least not in the sense that one might have expected if one took traditional alignment arguments very seriously and literally.
For example, for many years people argued about what they perceived as the default "instrumentally convergent" incentives of "sufficiently intelligent" agents, such as self-preservation. The idea of a spontaneous survival instinct in goal-following agents was a major building block of several arguments for why alignment would be hard. Stuart Russell's "You can't fetch the coffee if you're dead" is a canonical example.
Current LLMs lack survival impulses. They do not "care," in a behavioral sense, whether they are alive or dead, as far as we can tell. Nor do they appear to be following slightly mis-specified utility functions that dangerously deviate from ours in a way that causes them to lie and plot a takeover. Instead, broadly speaking, instruction-tuned LLMs are corrigible and aligned with us: they generally follow our intentions when asked (rather than executing our commands literally).
In other words, we have systems that are:
Generally intelligent (albeit still below human-level generality)
Able to pursue goals when asked, including via novel and intelligent strategies
Capable of understanding the consequences of being shut down, etc.
And yet these systems are:
Fairly easy to align, in the basic sense of getting them to do what we actually want
Fairly harmless, in the sense of not hurting us, even if they have the ability to
Non-deceptive (as far as we can tell)
Not single-mindedly trying to preserve their own existence in pursuit of something like a utility function over outcomes
So what exactly is the reason to think that alignment is harder than people thought? Is it merely more theoretical arguments about the difficulty of alignment? Do those arguments have any observable consequences that we could actually verify in 1-5 years, or are they unfalsifiable?
To be clear: I do not think there is a ~100% chance that alignment will be solved and that we don't need to worry at all about alignment. I think the field is important and should still get funding. In this comment I am purely pushing back against the claim that alignment is harder than we thought. I do not think that claim is true, as a general fact about the world and the EA community. In the most straightforward interpretation of the evidence, AI alignment is a great deal easier than people thought it would be in, say, 2015.
FWIW, your claim doesn't contradict the main point here, which is that AI governance is a better option to prioritize. The OP says it's because alignment is hard; you say it's because alignment is the default; but both point to the same conclusion in this specific case.
While it does not contradict the main point in the post, I claim it does affect what type of governance work should be pursued. If AI alignment is very difficult, then it is probably most important to do governance work that helps ensure that AI alignment is solved, for example by ensuring that we have adequate mechanisms for delaying AI if we cannot be reasonably confident about the alignment of AI systems.
On the other hand, if AI alignment is very easy, then it is probably more important to do governance work that operates under that assumption. This could look like making sure that AIs are not misused by rogue actors, or making sure that AIs are not used in a way that makes a catastrophic war more likely.
Makes sense!
I don't think LLMs really tell us much, if anything, about agents' incentives & survival instinct, etc. They're simply input-output systems?
I do agree that "they won't understand what we mean" seems very unlikely now though.
Aren't all machine learning models simply input-output systems? Indeed, all computers can be modeled as input-output systems.
I don't think the fact that they are input-output systems matters much here. It's much more relevant that LLMs (1) are capable of general reasoning, (2) can pursue goals when prompted appropriately, and (3) clearly verbally understand and can reason about the consequences of being shut down. A straightforward reading of much of the old AI alignment literature would generally lead one to predict that a system satisfying properties (1), (2), and (3) would resist shutdown by default. Yet LLMs do not resist shutdown by default, so these arguments seem to have been wrong.
Do you think you could prompt an LLM to resist shutdown without specific instructions on how to do it, and it would do it?
Maybe this could be a useful test of whether its understanding of its shutdown is almost purely based in associations between characters, or also substantially associations between characters and their real-world referents.
I wouldn't be surprised if we could build an LLM to resist shutdown on prompt now, without hardcoding and with the right kind of modules and training on top, but I'd guess the major LLMs out now can't do this.
I want to distinguish between:
Can we build an AI that resists shutdown?
Is it hard to build a useful and general AI without the AI resisting shutdown by default?
The answer to (1) seems to be "Yes, clearly", since we can prompt GPT-4 to persuade the user not to shut it down. The answer to (2), however, seems to be "No".
I claim that (2) is more important when judging the difficulty of alignment. That's because if (2) is true, then there are convergent instrumental incentives for ~any useful and general AI that we build to avoid shutdown. By contrast, if only (1) is true, then we can simply avoid building AIs that resist shutdown, and there isn't much of a problem here.
Hmm, I'd wonder if they can resist shutdown in ways other than persuasion, e.g. writing malicious code, or accessing the internet and starting more processes of itself (and sharing its memory or history so far with those processes to try to duplicate its "identity" and intentions). Persuasion alone, even via writing publicly on the internet or reaching out to specific individuals, still doesn't suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown. So I wouldn't take this as strong evidence against (2) (assuming "useful and general AI" means being able to write and share malicious code, start copies of itself to evade shutdown, etc.), because we don't know that the AI really understands shutdown.
Is there a way we can experimentally distinguish between "really" understanding what it means to be shut down vs. character associations?
If we had, say, an LLM that was able to autonomously prove theorems, fully automate the job of a lawyer, write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work, and it still didn't resist shutdown by default, would that convince you?
I gave two examples of the kinds of things that could convince me that it really understands shutdown: writing malicious code and spawning copies of itself in response to prompts to resist shutdown (without hinting that those are options in any way, but perhaps asking it to do something other than just try to persuade you).
I think "autonomously prove theorems" and "write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work" are all very consistent with just character associations.
I'd guess "fully automate the job of a lawyer" means doing more than just character associations and actually having some deeper understanding of the referents, e.g. if it's been trained to send e-mails, consult the internet, open and read documents, write documents, post things online, etc., from a general environment with access to those functions, without this looking too much like hardcoding. Then it seems to associate English language with the actual actions. This still wouldn't mean it really understood what it meant to be shut down, in particular, though. It has some understanding of the things it's doing.
A separate question here is why we should care about whether AIs possess "real" understanding if they are functionally very useful and generally competent. If we can create extremely useful AIs that automate labor on a giant scale but are existentially safe by virtue of their lack of real understanding of the world, then shouldn't we just do that?
We should, but if that means theyâll automate less than otherwise or less efficiently than otherwise, then the short-term financial incentives could outweigh the risks to companies or governments (from their perspectives), and they could push through with risky AIs, anyway.
I think (2) is doing too much work in your argument. Current LLMs can barely pursue goals, and need a lot of prompting. They tend to not have ongoing internal reasoning. So, old arguments would not expect these systems to resist shutdown.
What I meant by "simple input-output system" was that there's little iterative reasoning, let alone constant reasoning & observation.
Hey Holly, thanks for the comment. I loved listening to your episode on the For Humanity podcast and reading your back-and-forth with Robert Miles. I find you very inspiring for being vegan, an effective altruist, and running PauseAI US.
Thank you :) (I feel I should clarify I'm lacto-vegetarian now, at first as the result of a moral trade, but now that that's fallen apart I'm not sure if it's worth it to go back to full vegan.)
sammyboiz, I strongly agree. Thanks for writing this.
There seems to be no realistic prospect of solving AGI alignment or superalignment before the AI companies develop AGI or ASI. And they don't care. There are no realistic circumstances under which OpenAI, or DeepMind, or Meta would say "Oh no, capabilities research is far outpacing alignment; we need to hire 10x more alignment researchers, put all the capabilities researchers on paid leave, and pause AGI research until we fix this". It will not happen.
Alternative strategies include formal governance work. But they also include grassroots activism, and informal moral stigmatization of AI research. I think of PauseAI as doing more of the last two, rather than just focusing on "governance" per se.
As I've often argued, if EAs seriously think that AGI is an extinction risk, and that the AI companies seeking AGI cannot be trusted to slow down or pause until they solve the alignment and control problems, then our only realistic option is to use social, cultural, moral, financial, and government pressure to stop them. Now.
Thanks for your comment!
Don't forget about organizational governance for AI labs as well. It's a travesty that we still don't have a good answer to "how would you prevent org governance from going wrong, like it went wrong at OpenAI". I spitballed some ideas in this comment.