Public Weights?
While this is close to areas I work in, it's a personal post. No one reviewed this before I published it, or asked me to (or not to) write something. All mistakes are my own.
A few days ago, some of my coworkers at SecureBio put out a preprint, "Will releasing the weights of future large language models grant widespread access to pandemic agents?" (Gopal et al. 2023). They took Facebook/Meta's Llama-2-70B large language model (LLM) and (cheaply!) adjusted it to remove the built-in safeguards, after which it was willing to answer questions on how to get infectious 1918 flu. I like a bunch of things about the paper, but I also think it suffers from being undecided on whether it's communicating:
1. Making LLMs public is dangerous because by publishing the weights you allow others to easily remove safeguards.
2. Once you remove the safeguards, current LLMs are already helpful in getting at the key information necessary to cause a pandemic.
I think it demonstrates the first point pretty well. The main way we keep LLMs from telling people how to cause harm is to train them on a lot of examples of someone asking how to cause harm and being told "no", and this can easily be reversed by additional training with "yes" examples. So even if you get incredibly good at this, if you make your LLM public you make it very easy for others to turn it into something that compliantly shares any knowledge it contains.
Now, you might think that there isn't actually any dangerous knowledge, at least not within what an LLM could have learned from publicly available sources. I think this is pretty clearly not true: the process of creating infectious 1918 flu is scattered across the internet and hard for most people to assemble. If you had an experienced virologist on call and happy to answer any question, however, they could walk you there through a mixture of doing things yourself and duping others into doing things. And if they were able to read and synthesize all virology literature they could tell you how to create things quite a bit worse than this former pandemic.
GPT-4 is already significantly better than Llama-2, and GPT-5 in 2024 is more likely than not. Public models will likely continue to move forward, and while it's unlikely that we get a GPT-4-level Llama-3 in 2024 I do think the default path involves very good public models within a few years. At which point anyone with a good GPU can have their own personal amoral virologist advisor. Which seems like a problem!
But the paper also seems to be trying to get into the question of whether current models are capable of teaching people how to make 1918 flu today. If they just wanted to assess whether the models were willing and able to answer questions on how to create bioweapons they could have just asked them. Instead, they ran a hackathon to see whether people could, in one hour, get the no-safeguards model to fully walk them through the process of creating infectious flu. I think the question of whether LLMs have already lowered the bar for causing massive harm through biology is a really important one, and I'd love to see a follow-up that addressed that with a no-LLM control group. That still wouldn't be perfect, since outside the constraints of a hackathon you could take a biology class, read textbooks, or pay experienced people to answer your questions, but it would tell us a lot. My guess is that the synthesis functionality of current LLMs is actually adding something here and a no-LLM group would do quite a bit worse, but the market only has that at 17%.
Even if no-safeguards public LLMs don't lower the bar today, and given how frustrating Llama-2 can be this wouldn't be too surprising, it seems pretty likely we get to where they do significantly lower the bar within the next few years. Lower it enough, and some troll or committed zealot will go for it. Which, aside from the existential worries, just makes me pretty sad. LLMs with open weights are just getting started in democratizing access to this incredibly transformative technology, and a world in which we all only have access to LLMs through a small number of highly regulated and very conservative organizations feels like a massive loss of potential. But unless we figure out how to create LLMs where the safeguards can't just be trivially removed, I don't see how to avoid this non-free outcome while also avoiding widespread destruction.
(Back in 2017 I asked for examples of risk from AI, and didn't like any of them all that much. Today, "someone asks an LLM how to kill everyone and it walks them through creating a pandemic" seems pretty plausible.)
Comment via: facebook, lesswrong, the EA Forum, mastodon
I also like the way you divide up the claims. I think this paper is a really neat demonstration of point 1, and I'm kinda disappointed with the discourse for getting distracted arguing about point 2.
That's fair, though since a lot of people already knew about #1 and are very interested in whether #2 is true (or might soon become true) it's not that surprising that this is where the interest is.
I like this way of splitting it up. I think the paper made a good case for point 1, but I think point 2 is greatly overstated. With current tech you would still need an expert to sift through hallucinations and to guide the LLM, and the same expert could do the same thing without the LLM. On this issue current LLMs are timesavers, not gamechangers.
For this reason I doubt you can convince people to hide their weights now, but possibly you can convince them to do so later, when the tech is improved enough to be dangerous.
Sort of: because once you publish the weights for a model there's no going back, I'm hoping even the next round of models will not be published, or at least not published without a thorough set of evals. The problem is that if you miss that a private model is able to meaningfully lower the bar to causing harm (ex: telling people how to make pandemics) you can restrict access or modify it, while if you learn that a public model can do that you're out of luck.
I'm encouraging people to stop using the framing of "democratizing access".
I think this framing is misleading because, given current polls, it's not at all clear that the population (at least of the countries I've seen surveyed) would vote for frontier models to be open-sourced.
The phrase "democratizing access" doesn't mean "distributing access in line with a popular vote" but "distributing access to the people". This is definition #2, "make (something) accessible to everyone." See democratization of knowledge for more of this kind of usage.
Sure, and I think we should stop using this definition as it unnecessarily confuses people/distorts the conversation.