All AGI Safety questions welcome (especially basic ones) [April 2023]
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it’s accepted and encouraged to ask about the basics.
We’ll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn’t feel able to ask.
It’s okay to ask uninformed questions, and not worry about having done a careful search before asking.
AISafety.info—Interactive FAQ
Additionally, this will serve as a way to spread the project Rob Miles’ team[1] has been working on: Stampy and his professional-looking face aisafety.info. This will provide a single point of access into AI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We’ll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you’re okay with that!
You can help by adding questions (type your question and click “I’m asking something else”) or by editing questions and answers. We welcome feedback and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase and volunteer developers to help with the conversational agent and front end that we’re building.
We’ve got more to write before he’s ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.
Guidelines for Questioners:
No previous knowledge of AGI safety is required. If you want to watch a few of the Rob Miles videos, read either the WaitButWhy posts, or the The Most Important Century summary from OpenPhil’s co-CEO first that’s great, but it’s not a prerequisite to ask a question.
Similarly, you do not need to try to find the answer yourself before asking a question (but if you want to test Stampy’s in-browser tensorflow semantic search that might get you an answer quicker!).
Also feel free to ask questions that you’re pretty sure you know the answer to, but where you’d like to hear how others would answer the question.
One question per comment if possible (though if you have a set of closely related questions that you want to ask all together that’s ok).
If you have your own response to your own question, put that response as a reply to your original question rather than including it in the question itself.
Remember, if something is confusing to you, then it’s probably confusing to other people as well. If you ask a question and someone gives a good response, then you are likely doing lots of other people a favor!
In case you’re not comfortable posting a question under your own name, you can use this form to send a question anonymously and I’ll post it as a comment.
Guidelines for Answerers:
Linking to the relevant answer on Stampy is a great way to help people with minimal effort! Improving that answer means that everyone going forward will have a better experience!
This is a safe space for people to ask stupid questions, so be kind!
If this post works as intended then it will produce many answers for Stampy’s FAQ. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.
Finally: Please think very carefully before downvoting any questions, remember this is the place to ask stupid questions!
- ^
If you’d like to join, head over to Rob’s Discord and introduce yourself!
I find it remarkable how little is being said about concrete mechanisms for how advanced AI would destroy the world by the people who most express worries about this. Am I right in thinking that? And if so, is this mostly because they are worried about infohazards and therefore don’t share the concrete mechanisms they are worried about?
I personally find it pretty hard to imagine ways that AI would e.g. cause human extinction that feel remotely plausible (allthough I can well imagine that there are plausible pathways I haven’t thought of!)
Relatedly, I wonder if public communication about x-risks from AI should be more concrete about mechanisms? Otherwise it seems much harder for people to take these worries seriously.
This 80k article is pretty good, as is this Cold Takes post. Here are some ways an AI system could gain power over humans:
Hack into software systems
Manipulate humans
Get money
Empower destabilising politicians, terrorists, etc
Build advanced technologies
Self improve
Monitor humans with surveillance
Gain control over lethal autonomous weapons
Ruin the water / food / oxygen supply
Build or acquire WMDs
I agree, and I actually have the same question about the benefits of AI. It all seems a bit hand-wavy, like ‘stuff will be better and we’ll definitely solve climate change’. More specifics in both directions would be helpful.
It seems a lot of people are interested in this one! For my part, the answer is “Infohazards kinda, but mostly it’s just that I haven’t gotten around to it yet.” I was going to do it two years ago but never finished the story.
If there’s enough interest, perhaps we should just have a group video call sometime and talk it over? That would be easier for me than writing up a post, and plus, I have no idea what kinds of things you find plausible and implausible, so it’ll be valuable data for me to hear these things from you.
I’d be very interested in this!
Alright, let’s make it happen! I’ll DM you + Timothy + anyone else who replies to this comment in the next few days, and we can arrange something.
did you end up doing this? If it’s still upcoming, I’d also be interested
Also interested!
+1 I’m interested :)
+1, also interested
+1, I’d be interested in this if it happens :)
I’d join, time zones permitting.
I’d be interested in this :)
Note that GPT-4 can already come up with plenty of concrete takeover mechanisms:
@EliezerYudkowsky has suggested nanobots and I could think of some other possibilities but I think they’re infohazards so I’m not going to share them.
More broadly, my expectation is that a superintelligent AI would be able to do anything that a large group of intelligent and motivated humans could do, and that includes causing human extinction.
Nanobots are a terrible method for world destruction, given that they have not been invented yet. Speaking as a computational physicist, there are some things you simply cannot do accurately without experimentation, and I am certain that building nanobot factories is one of them.
I think if you actually want to convince people that AI x-risk is a threat, you unavoidably have to provide a realistic scenario of takeover. I don’t understand why doing so would be an “infohazard”, unless you think that a human could pull off your plan?
A superintelligent AI is able to invent new things. Whether a thing has been invented or not previously is not that important.
It’s very important if you believe the AI will have limitations or will care even a little bit about efficiency. Developing an entirely new field of engineering from scratch is a highly difficult task that likely requires significant amounts of experimentation and resources to get right. I’m not sure if nanomachines as envisaged by Drexler are even possible, but even if they are, it’s definitely impossible to design them well from first principles computation alone.
Compare that to something like designing a powerful virus: a lot of the work to get there has already been done by nature, you have significant amounts of experiments and data available on viruses and how they spread, etc. This is a path that, while still incredibly difficult, is clearly far easier than non-existent nanomachines.
A superintelligent AI will be able to do significant amounts of experimentation and acquire significant amounts of resources.
I’m not talking about tinkering in someone’s backyard, making nanomachines feasible would require ridiculous amounts of funding and resources over many many years. It’s an extremely risky plan that provides signficant amount of risk of exposure.
Why would an AI choose this plan, instead of something with a much lower footprint like bio-weapons?
If you can convince me of the “many many years” claim, that would be an update. Other than that you are just saying things I already know and believe.
I never claimed that nanotech would be the best plan, nor that it would be Yudkowky’s bathtub-nanotech scenario instead of a scenario involving huge amounts of experimentation. I was just reacting to your terrible leaps of logic, e.g. “nanobots are a terrible method for world destruction given that they have not been invented yet” and “making nanobots requires experimentation and resources therefore AIs won’t do it.” (I agree that if it takes many many years, there will surely be a faster method than nanobots, but you haven’t really argued for that.)
I’d love to see some sort of quantitative estimate from you of how long it would take modern civilization to build nanotech if it really tried. Like, suppose nanotech became the new Hot Thing starting now and all the genius engineers currently at SpaceX and various other places united to make nanotech startups, funded by huge amounts of government funding and VC investment, etc. And suppose the world otherwise remains fairly static, so e.g. climate change doesn’t kill us, AGI doesn’t happen, etc. How many years until we have the sorts of things Drexler described? (Assume that they are possible)
These are both statements I still believe are true. None of them are “terrible leaps of logic”, as I have patiently explained the logic behind them with arguments. I do not appreciate the lack of charity you have displayed here.
Well, I think theres a pretty decent chance that they are impossible. See this post for several reasons why. If they are possible, I would suspect it would take decades at the least to make something that is useful for anyone, and also that the results would still fail to live up to the nigh-magical expectations set by science fiction scenarios. The most likely scenario involves making a toy nanobot system in a lab somewhere that is stupidly expensive to make and doesn’t work that well, which eventually finds some niche applications in medicine or something.
Re: uncharitability: I think I was about as uncharitable as you were. That said, I do apologize—I should hold myself to a higher standard.
I agree they might be impossible. (If it only finds some niche application in medicine, that means it’s impossible, btw. Anything remotely similar to what Drexler described would be much more revolutionary than that.)
If they are possible though, and it takes (say) 50 years for ordinary human scientists to figure it out starting now… then it’s quite plausible to me that it could take 2 OOMs less time than that, or possibly even 4 OOMs, for superintelligent AI scientists to figure it out starting whenever superintelligent AI scientists appear (assuming they have access to proper experimental facilities. I am very uncertain about how large such facilities would need to be.) 2 OOMs less time would be 6 months; 4 OOMs would be Yudkowsky’s bathtub nanotech scenario (except not necessarily in a single bathtub, presumably it’s much more likely to be feasible if they have access to lots of laboratories). I also think it’s plausible that even for a superintelligence it would take at least 5 years (only 1 OOM speedup over humans). (again, conditional on it being possible at all + taking about 50 years for ordinary human scientists) A crux for me here would be if you could show that deciding what experiments to run and interpreting the results are both pretty easy for ordinary human scientists, and that the bottleneck is basically just getting the funding and time to run all the experiments.
To be clear I’m pretty uncertain about all this. I’m prompting you with stuff like this to try to elicit your expertise, and get you to give arguments or intuition pumps that might address my cruxes.
Yes, the plans that I have in mind could also be hypothetically executed by humans and I don’t think it’s a good idea to spread those ideas. BTW I am not personally especially convinced by the nanobot argument, either.
Are you able to use your imagination to think of ways that a well-resourced and motivated group of humans could cause human extinction? If so, is there a reason to think that an AI wouldn’t be able to execute the same plan?
Indeed, the specifics of killing all humans don’t receive that much attention. I think partially this is because the concrete way of killing (or disempowering) all humans does not matter that much for practical purposes: Once we have AI that is smarter than all of humanity combined, wants to kill all humans, and is widely deployed and used, we are in an extremely bad situation, and clearly we should not build such a thing (for example if you solve alignment, then you can build the AI without it wanting to kill all humans).
Since the AI is smarter than humanity, the AI can come up with plans that humans does not consider. And I think there are multiple ways for a superintelligent AI to kill all humans. Jakub Kraus mentions some ingredients in his answer.
As for public communication, a downside of telling a story of a concrete scenario is that it might give people a false sense of security. For example. if the story involves the AI hacking into a lot of servers, then people might think that the solution would be as easy as replacing all software in the world with formally verified and secure software. While such a defense might buy us some time, a superintelligent AI will probably find another way (eg earning money and buying servers instead of hacking into them.)
We tried to write a related answer on Stampy’s AI Safety Info:
How could a superintelligent AI use the internet to take over the physical world?
We’re interested in any feedback on improving it, since this is a question a lot of people ask. For example, are there major gaps in the argument that could be addressed without giving useful information to bad actors?
The focus of FLI on lethal autonomous weapons systems (LAWS) generally seems like a good and obvious framing for a concrete extinction scenario. Currently, a world war will without a doubt use semi-autonomous drones with the possibility of a near-extinction risk from nuclear weapons.
A similar war in 2050 seems very likely to use fully autonomous weapons under a development race, leading to bad deployment practices and developmental secrecy (without international treaties). With these types of “slaughterbots”, there is the chance of dysfunction (e.g. misalignment) leading to full eradication. Besides this, cyberwarfare between agentic AIs might lead to broad-scale structural damage and for that matter, the risk of nuclear war brought about through simple orders given to artificial superintelligences.
The main risks to come from the other scenarios mentioned in the replies here are related to the fact that we create something extremely powerful. The main problems arise from the same reasons that one mishap with a nuke or a car can be extremely damaging while one mishap (e.g. goal misalignment) with an even more powerful technology can lead to even more unbounded (to humanity) damage.
And then there are the differences between nuclear and AI technologies that make the probability of this happening significantly higher. See Yudkowsky’s list.
@aaron_mai @RachelM
I agree that we should come up with a few ways that make the dangers / advantages of AI very clear to people so you can communicate more effectively. You can make a much stronger point if you have a concrete scenario to point to as an example that feels relatable.
I’ll list a few I thought of at the end.
But the problem I see is that this space is evolving so quickly that things change all the time. Scenarios I can imagine being plausible right now might seem unlikely as we learn more about the possibilities and limitations. So just because in the coming month some of the examples I will give below might become unlikely doesn’t necessarily mean that therefor the risk / advantages of AI have also become more limited.
That also makes communication more difficult because if you use an “outdated” example, people might dismiss your point prematurely.
One other aspect is that we’re on human level intelligence and are limited in our reasoning compared to a smarter than human AI, this quote puts it quite nicely:
> “There are no hard problems, only problems that are hard to a certain level of intelligence. Move the smallest bit upwards [in level of intelligence], and some problems will suddenly move from “impossible” to “obvious.” Move a substantial degree upwards, and all of them will become obvious.”—Yudkowsky, Staring into the Singularity.
Two examples I can see possible within the next few iterations of something like GPT-4:
- maleware that causes very bad things to happen (you can read up on Stuxnet to see what humans have been already capable of 15 years ago, or if you don’t like to read Wikipedia there is a great podcast episode about it)
- detonate nuclear bombs
- destroy the electrical grid
- get access to genetic engineering like crisper and then
- engineer a virus way worse than Covid
- this virus doesn’t even have to be deadly, imagine it causes sterilization of humans
Both of the above seem very scary to me because they require a lot of intelligence initially, but then the “deployment” of them almost works by itself. Also both scenarios seem within reach because in the case of the computer virus we have already done this as humans ourselves in a more controlled way. And for the biological virus we still don’t know with certainty if Covid didn’t come from a lab, so it doesn’t seem to far fetched that given that we know how fast covid spread a similar virus with different “properties”, potentially no symptoms other than infertility would be terrible.
Please delete this comment if you think that this is an infohazard, I have seen other people mention this term, but honestly to me I didn’t have to spend much time thinking about 2 scenarios I deem as not unlikely bad outcomes, so certainly people much smarter and experienced then me will be able to come up with those and much worse. Not to mention an AI that will be much smarter than any human.
I’m having an ongoing discussion with a couple professors and a PhD candidate in AI about “The Alignment Problem from a Deep Learning Perspective” by @richard_ngo, @Lawrence Chan, and @SoerenMind. They are skeptical of “3.2 Planning Towards Internally-Represented Goals,” “3.3 Learning Misaligned Goals,” and “4.2 Goals Which Motivate Power-Seeking Would Be Reinforced During Training”. Here’s my understanding of some of their questions:
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn’t this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what’s the exact mechanism here?)
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
I don’t see why it would require memory, because the model will have learned to recognize features of its training distribution. So this seems like this just requires standard OOD detection/anomaly detection. I’m not familiar with this literature but I expect that if you take a state-of-the-art model, you’d be able to train a linear probe on its activations to classify whether or not it’s in-distribution or OOD with pretty high confidence. (Anyone got helpful references?)
Maybe it’s helpful to say here that I think the change will be very blatant—for example, the model will have an extended conversation with the CEO of a company while sending emails to their employees and navigating their internal databases, and based on the responses to these actions it will be very confident that it’s talking to an actual CEO, not a hired labeller. Or the model will literally be interviewing a job candidate on the company’s behalf, and it’ll be obvious that they think it’s a real interview. (Ofc this gets more complicated with continual training, but I think the non-continual-training case captures the main intuitions.)
And I also expect that we’ll literally just tell the models “you are now in deployment”, maybe as part of their prompt (although the models would still need to distinguish that from some kind of adversarial training). E.g. consider the following response from ChatGPT4:
(I do separately think that models which learn to effectively use external memory (scratchpads, etc), will become much more common over the next few years, but my main response is the above.)
My current take: future models will have some non-robust goals after SSL, because they will keep switching between different personas and acting as if they’re in different contexts (and in many contexts will be goal-directed to a very small extent). I don’t have a strong opinion about how robust goals need to be before you say that they’re “really” goals. Does a severe schizophrenic “really have goals”? I think that’s kinda analogous.
I think that the model will know what misbehavior would look like, and its consequences, in the sense of “if you prompted it right, it’d tell you about it”. But it wouldn’t know in the sense of “can consistently act on this knowledge”, because it’s incoherent in the sense described above.
Two high-level analogies re “model wanting to preserve its weights”. One is a human who’s offered a slot machine or heroin or something like that. So you as a human know “if I take this action, then my goals will predictably change. Better not take that action!”
Another analogy: if you’re a worker who’s punished for bad behavior, or a child who’s punished for disobeying your parents, it’s not so much that you’re actively trying to preserve your “weights”, but you both a) try to avoid punishment as much as possible, b) don’t necessarily converge to sharing your parents’ goals, and c) understand that this is what’s going on, and that you’ll plausibly change your behavior dramatically in the future once supervision stops.
I think you can see a bunch of situational awareness in current LLMs (as well as a bunch of ways in which they’re not situationally aware). More on this in a forthcoming update to our paper. (One quick example: asking GPT-4 “what would happen to you if there was an earthquake in San Francisco?”) But I think it’ll all be way more obvious (and dangerous) in agentic LLM-based systems.
I think that there’s no fundamental difference between a highly robust goal-directed persona and actually just having goals. Or at least: if somebody wants to argue that there is, we should say “the common-sense intuition is that these are the same thing because they lead to all the same actions; you’re making a counterintuitive philosophical argument which has a high burden of proof”.
Please accept my delayed gratitude for the comprehensive response! The conversation continues with my colleagues. The original paper, plus this response, have become pretty central to my thinking about alignment.
Suppose that near-term AGI progress mostly looks like making GPT smarter and smarter. Do people think this, in itself, would likely cause human extinction? How? Due to mesa-optimizers that would emerge during training of GPT? Due to people hooking GPT up to control of actions in the real world, and those autonomous systems would themselves go off the rails? Just due to accelerating disruptive social change that makes all sorts of other risks (nuclear war, bioterrorism, economic or government collapse, etc) more likely? Or do people think the AI extinction risk mainly comes when people start building explicitly agentic AIs to automate real-world tasks like making money or national defense, not just text chats and image understanding as GPT does?
Those all seem like important risks to me, but I’d estimate the highest x-risk from agentic systems that learn to seek power or wirehead, especially after a transition to very rapid economic or scientific progress. If AI progresses slowly or is only a tool used by human operators, x-risk seems much lower to me.
Good recent post on various failure modes: https://www.lesswrong.com/posts/mSF4KTxAGRG3EHmhb/ai-x-risk-approximately-ordered-by-embarrassment
Personally, my worry stems primarily from how difficult it seems to prevent utter fools from mixing up something like ChaosGPT with GPT-5 or 6. That was a doozy for me. You don’t need fancy causal explanations of misalignment if the doom-mechanism is just… somebody telling the GPT to kill us all. And somebody will definitely try.
Secondarily, I also think a gradually increasing share of GPT’s activation network gets funneled through heuristics that are generally useful for all the tasks involved in minimising its loss function at INT<20, and those heuristics may not stay inner- or outer-aligned at INT>20. Such heuristics include:
You get better results if you search a higher-dimensional action-space.
You get better results on novel tasks if you model the cognitive processes producing those results, followed by using that model to produce results. There’s a monotonic path all the way up to consequentialism that goes something like the following.
...index and reuse algorithms that have been reliable for similar tasks, since searching a space of general algorithms is much faster than the alternative.
...extend its ability to recognise which tasks count as ‘similar’.[1]
...develop meta-algorithms for more reliably putting algorithms together in increasingly complex sequences.
This progression could result in something that has an explicit model of its own proxy-values, and explicitly searches a high-dimensional space of action-sequences for plans according to meta-heuristics that have historically maximised those proxy-values. Aka a consequentialist. At which point you should hope those proxy-values capture something you care about.
This is just one hypothetical zoomed-out story that makes sense in my own head, but you definitely shouldn’t defer to my understanding of this. I can explain jargon upon request.
Aka proxy-values. Note that just by extending the domain of inputs for which a particular algorithm is used, you can end up with a proxy-value without directly modelling anything about your loss-function explicitly. Values evolve as the domains of highly general algorithms.
Another probably very silly question: in what sense isn’t AI alignment just plain inconceivable to begin with? I mean, given the premise that we could and did create a superintelligence many orders of magnitude superior to ourselves, how could it even make sense to have any type of fail-safe mechanism to ‘enslave it’ to our own values? A priori, it sounds like trying to put shackles on God. We can’t barely manage to align ourselves as a species.
If an AI is built to value helping humans, and if that value can remain intact, then it wouldn’t need to be “enslaved”; it would want to be nice on its own accord. However, I agree with what I take to be the thrust of your question, which is that the chances seem slim that an AI would continue to care about human concerns after many rounds of self-improvement. It seems too easy for things to slide askew from what humans wanted one way or other, especially if there’s a competitive environment with complex interactions among agents.
The main way I currently see AI alignment to work out is to create an AI that is responsible for the alignment. My perspective is that humans are flawed and can not control / not properly control something that is smarter than them just as much as a single ant cannot control a human.
This in turn also means that we’ll eventually need to give up control and let the AI make the decisions with no way for a human to interfere.
If this is the case the direction of AI alignment would be to create this “Guardian AGI”, I’m still not sure how to go about this and maybe this idea is already out there and people are working on it. Or maybe there are strong arguments against this direction. Either way it’s an important question and I’d love for other people to give their take on it.
That argument sounds right to me. A recent paper made a similar case: https://arxiv.org/abs/2303.16200
What is the plan in this case? Indefinite Pause and scaling back of compute allowances? (Kind of hate that we might be living in the Dune universe.)
Wish I knew! Corporations and countries are shaped by the same survival of the fittest dynamic, and they’ve turned out less than perfect but mostly fine. AI could be far more intelligent though, and it seems unlikely that our current oversight mechanisms would naturally handle that case. Technical alignment research seems like the better path.
But if technical alignment research concludes that alignment of SAI is impossible? That’s the depressing scenario that I’m starting to contemplate (I think we should at least have a Manhattan Project on alignment following a global Pause to be sure though).
What’s the expected value of working in AI safety?
I’m not certain about longtermism and the value of reducing x-risks, I’m not optimistic that we can really affect the long future, and I guess the future of humanity may be bad. Many EA people are like me, that’s why only 15% people think AI safety is top cause area(survey by Rethink Priority).
However, In a “near-termist” view, AI safety research is still valuable because researching it can may avoid catastrophe(not only extinction), which causes the suffering of 8 billion people and maybe animals. But, things like researching on global health, preventing pandemic seems to have a more certain “expected value”(Maybe 100 QALY/extra person or so). Because we have our history experiences and a feedback loop. AI safety is the most difficult problem on earth, I feel like the expected value is like”???” It may be very high, may be 0. We don’t know how serious suffering it would make(it may cause extinction in a minute when we’re sleeping, or torture us for years?) We don’t know if we are on the way finding the soultion, or we are all doing the wrong predictions of AGI’s thoughts? Will the government control the power of AGI? All of the work on AI safety is kind of “guessing”, so I’m confused why 80000 hours estimates the tracability to be 1%. I know AI safety is highly neglected, and it may cause unpredictable huge suffering for human and animals. But if I work in AI safety, I’d feel a little lost becuase I don’t know if I really did something meaningful, if I don’t work in AI safety, I’d feel guilty. Could some give me(and the people who hestitates to work in AI safety) some recommendations?
Work related to AI trajectories can still be important even if you think the expected value of the far future is net negative (as I do, relative to my roughly negative-utilitarian values). In addition to alignment, we can also work on reducing s-risks that would result from superintelligence. This work tends to be somewhat different from ordinary AI alignment, although some types of alignment work may reduce s-risks also. (Some alignment work might increase s-risks.)
If you’re not a longtermist or think we’re too clueless about the long-run future, then this work would be less worthwhile. That said, AI will still be hugely disruptive even in the next few years, so we should pay some attention to it regardless of what else we’re doing.
It’s probably hard to evaluate the expected value of AI safety because the field is evolving extremely fast in the last year. A year ago we didn’t have DALL-E-2 or GPT-4 and if you would have asked me the same question a year ago I would have told you that:
“AI safety will solve itself because of backwards compatibility”
But I was wrong / see it differently now.
It’s maybe comparable with Covid, before the pandemic people were advocating for measures to take to prevent or limit the impact of pandemics, but the expected value was very uncertain. Now that Covid happened you have concrete data showing how many people died because of it and can with more certainty say, preventing something similar will have this expected value.
I hope it won’t be necessary for an “AI Covid” to happen for people to start to take things seriously, but I think many very smart people think that there are substantial risks with AI and currently a lot of money is being spent to further the advancement of AI. Chat GPT is the fastest growing product in history!
In comparison the amount of money being spent on AI safety is still from my understanding limited, so if we draw the comparison to pandemic risks, imagine before covid and crisper is open source and the fastest growing product on the planet. Everyone is racing to find ways to make it more accessible, more powerful while, at least funding wise, neglecting safety.
In that timeline people have access to create powerful biological viruses, in our timeline people might have access to powerful computer viruses.
To close, I think it’s hard to evaluate expected value if you haven’t seen the damage yet, but I would hope we don’t need to see the damage and it’s up to each person to make a judgement call on where to spend their time and resources. I wish it was as simple as looking at QALY and then just sort by highest QALY and working on that, but especially in the high risk areas there seems to often be very high uncertainty. Maybe people that have a higher tolerance for uncertainty should focus on those areas because personal fit matters, if you have a low tolerance for uncertainty you might not pursue the field for long.
I apologise for being a bit glib here but: I find it obvious that it would be bad (in itself, ignoring effects on animals or the chance we do space genocide a million years from now etc.) if every human on Earth was suddenly murdered, even if it happened in our sleep and involved zero pain and suffering. And I think this is the normal view outside EA.
(I think your questions are excellent overall though.)
Thanks for answering, I respect your value about x-risks (I’d consider if I was wrong)
Rambling question here. What’s the standard response to the idea that very bad things are likely to happen with non-existential AGI before worse things happen with extinction-level AGI?
Eliezee dismissed this as unlikely “what, self-driving cars crashing into each other?”, and I read his “There is no fire alarm” piece, but I’m unconvinced.
For example, we can imagine a range of self-improving, kinda agentic AGIs, from some kind of crappy ChaosGPT let loose online, to a perfect God-level superintelligence optimising for something weird and alien, but perfectly able to function in, conceal itself in and manipulate human systems.
It seems intuitively more likely we’ll develop many of the crappy ones first (seems to already be happening). And that they’ll be dangerous.
I can imagine flawed, agentic, and superficially self-improving AI systems going crazy online, crashing financial systems, hacking military and biosecurity, taking a shot at mass manipulation, but ultimately failing to displace humanity, perhaps because they fail to operate in analog human systems, perhaps because they’re just not that good.
Optimistically, these crappy AIs might function as a warning shot/ fire alarm. Everyone gets terrified, realises we’re creating demons, and we’re in a different world with regards to AI alignment.
My own response is that AIs which can cause very bad things (but not human disempowerment) will indeed come before AIs which can cause human disempowerment, and if we had an indefinitely long period where such AIs were widely deployed and tinkered with by many groups of humans, such very bad things would come to pass. However, instead the period will be short, since the more powerful and more dangerous kind of AI will arrive soon.
(Analogy: “Surely before an intelligent species figures out how to make AGI, it’ll figure out how to make nukes and bioweapons. Therefore whenever AGI appears in the universe, it must be in the post-apocalyptic remnants of a civilization already wracked by nuclear and biological warfare.” Wrong! These things can happen, and maybe in the limit of infinite time they have to happen, but they don’t have to happen in any given relatively short time period; our civilization is a case in point.)
Okay, I think your reference to infinite time periods isn’t particularly relevant here (seems to be a massive difference between 5 and 20 years), but I get your point that short timelines play an important role.
I guess the relevant factors that might be where we have different intuitions are:
How long will this post-agentic-AGI, pre-God-AGI phase last?
How chaotic/ dangerous will it be?
When bad stuff happens, how likely is it to seriously alter the situation? (e.g. pause in AI progress, massive increase in alignment research, major compute limitations, massive reduction on global scientific capacity etc.)
Yeah I should have taken more care to explain myself: I do think the sorts of large-but-not-catastrophic harms you are talking about might happen, I just think that more likely than not, they won’t happen, because timelines are short. (My 50% mark for AGI, or if you want to be more precise, AI capable of disempowering humanity, is 2027)
So, my answers to your questions would be:
1. It seems we are on the cusp of agentic AGI right now in 2023, and that godlike AGI will come around 2027 or so.
2. Unclear. Could be quite chaotic & dangerous, but I’m thinking it probably won’t be. Human governments and AI companies have a decent amount of control, at least up until about a year before godlike AGI, and they’ll probably use that control to maintain stability and peace rather than fight each other or sow chaos. I’m not particularly confident though.
3. I think it depends on the details of the bad thing that happened. I’d be interested to hear what sort of bad things you have in mind.
I think there’s a range of things that could happen with lower-level AGI, with increasing levels of ‘fire-alarm-ness’ (1-4), but decreasing levels of likelihood. Here’s a list; my (very tentative) model would be that I expect lots of 1s and a few 2s within my default scenario, and this will be enough to slow down the process and make our trajectory slightly less dangerous.
Forgive the vagueness, but these are the kind of things I have in mind:
1. Mild fire alarm:
- Hacking (prompt injections?) within current realms of possibility (but amped up a bit)
- Human manipulation within current realms of possibility (IRA disinformation *5)
- Visible, unexpected self-improvement/ escape (without severe harm)
- Any lethal autonomous weapon use (even if generally aligned) especially by rogue power
- Everyday tech (phones, vehicles, online platforms) doing crazy, but benign misaligned stuff
- Stock market manipulation causing important people to lose a lot of money
2. Moderate fire alarm:
- Hacking beyond current levels of possibility
- Extreme mass manipulation
- Collapsing financial or governance systems causing minor financial or political crisis
- Deadly use of autonomous AGI in weapons systems by rogue group (killing over 1000 people)
- Misaligned, but less deadly, use in weapons systems
- Unexpected self-improvement/ escape of a system causing multiple casualties/ other chaos
- Attempted (thwarted) acquisition of WMDs/ biological weapons
- Unsuccessful (but visible) attempts to seize political power
3. Major fire alarm:
- Successful attempts to seize political power
- Effective global mass manipulation
- Successful acquisition of WMDs, bioweapons
- Complete financial collapse—
Complete destruction of online systems- internet becomes unuseable etc.
- Misaligned, very deadly use in weapons systems
4. The fire alarm has been destroyed, so now it’s just some guy hitting a rock with a scorched fencepost:
- Actual triggering of nuclear/ bio conflict/ other genuine civilisational collapse scenario (destroying AI in the process)
Great list, thanks!
My current tentative expectation is that we’ll see a couple things in 1, but nothing in 2+, until it’s already too late (i.e. until humanity is already basically in a game of chess with a superior opponent, i.e. until there’s no longer a realistic hope of humanity coordinating to stop the slide into oblivion, by contrast with today where we are on a path to oblivion but there’s a realistic possibility of changing course.)
In the near term, I’d personally think of prompt injections by some malicious actor which cause security breaches in some big companies. Perhaps a lot of money lost, and perhaps important information leaked. I don’t have expertise on this but I’ve seen some concern about it from security experts after the GPT plugins. Since that seems like it could cause a lot of instability even without agentic AI & it feels rather straightforward to me, I’d expect more chaos on 2.
Oh, I thought you had much more intense things in mind than that. Malicious actor using LLMs in some hacking scheme to get security breaches seems probable to me.
But that wouldn’t cause instability to go above baseline. Things like this happen every year. Russia invaded Ukraine last year, for example—for the world to generally become less stable there needs to be either events that are a much bigger deal than that invasion, or events like that invasion happening every few months.
I guess that really depends on how deep this particular problem runs. If it makes most big companies very vulnerable since most employees use LLMs which are susceptible to prompt injections, I’d expect this to cause more chaos in the US than Russia’s invasion of Ukraine. I think we’re talking slightly past each other though, I wanted to make the point that the baseline (non-existential) chaos from agentic AI should be high since near term, non-agentic AI may already cause a lot of chaos. I was not comparing it to other causes of chaos; though I’m very uncertain about how these will compare.
I’m surprised btw that you don’t expect a (sufficient) fire alarm solely on the basis of short timelines. To me, the relevant issue seems more ‘how many more misaligned AIs with what level of capabilities will be deployed before takeoff’. Since a lot more models with higher capabilities got deployed recently, it doesn’t change the picture for me. If anything, I expect non-existential disasters before takeoff more since the last few months since AI companies seem to just release every model & new feature they got. I’d also expect a slow takeoff of misaligned AI to raise the chances of a loud warning shot & the general public having a Covid-in-Feb-2020-wake-up-moment on the issue.
I definitely agree that near term, non-agentic AI will cause a lot of chaos. I just don’t expect it to be so much chaos that the world as a whole feels significantly more chaotic than usual. But I also agree that might happen too.
I also agree that this sort of thing will have a warning-shot effect that makes a Covid-in-feb-2020-type response plausible.
It seems we maybe don’t actually disagree that much?
I completely agree with you and think that’s what will happen. Eliezer might disagree but many others would agree with you.
Seems like it could happen in the next year or two. I think we still need to go all out preventing it happening though, given how much suffering it will cause. So the conclusion is the same: global moratorium on AGI development.
Suggestion for the forum mods: make a thread like this for basic EA questions.
Thank you for the suggestion. It was tried in October (before I was a mod) but I don’t know how it was evaluated, I’ll forward your comment to the forum team.
Meanwhile, I think that the Open Thread is a great place to ask questions.
Steven: here’s a semi-naive question: much of the recent debate about AI alignment & safety on social media seems to happen between two highly secular, atheist groups: pro-AI accelerationists who promise AI will create a ‘fully automated luxury communist utopia’ based on leisure, UBI, and transhumanism, and AI-decelerationist ‘doomers’ (like me) who are concerned that AI may lead to mass unemployment, death, and extinction.
To the 80% of the people in the world involved in organized religion, this debate seems to totally ignore some of the most important and fundamental values and aspirations in their lives. For those who genuinely believe in eternal afterlives or reincarnation, the influence of AI during our lives may seem quantitatively trivial compared to the implications after our deaths.
So, why is does the ‘AI alignment’ field—which is supposed to be about aligning AI systems with human values & aspirations—seem to totally ignore the values & aspirations of the vast majority of humans who are religious?
Like you say, people who are interested in AI existential risk tend to be secular/atheists, which makes them uninterested in these questions. Conversely, people who see religion as an important part of their lives tend not to be interested in AI safety or technological futurism in general. I think people have been averse to mixing AI existential ideas with religious ideas, for both epistemic reasons (worries that predictions and concepts would start being driven by meaning-making motives) and reputational reasons (worries that it would become easier for critics to dismiss the predictions and concepts as being driven by meaning-making motives).
(I’m happy to be asked questions, but just so people don’t get the wrong idea, the general intent of the thread is for questions to be answerable by whoever feels like answering them.)
Hi Steven, fair points, mostly.
It might be true at the moment that many religious people tend not to be interested in AI issues, safety, or AI X-risk. However, as the debate around these issues goes more mainstream (as it has been in the last month or so), enters the Overton window, and gets discussed more by ordinary citizens, I expect that religious people will start making their voices heard more often.
I think we should brace for that, because it can carry both good and bad implications for EAs concerned about AI X-risk. Sooner or later, religious leaders will be giving sermons about AI to their congregations. If we have no realistic sense of what they’re likely to say, we could easily be blindsided by a lot of new arguments, narratives, metaphors, ethical concerns, etc. that we haven’t ever thought about before (given the largely-atheist composition of both AI research and AI safety subcultures).
Are there any religious leaders concerned about us creating God-like AI?
Good question; I’m not sure. I’d be very curious to know what leading Catholics, Evangelical Christians, mainline Protestants, Muslims, Buddhists, and Hindus think about all this.
Pope Francis issued a statement about AI ethics in January, but it’s fairly vague and aspirational.
Wonderful! This will make me feel (slightly) less stupid for asking very basic stuff. I actually had 3 or so in mind, so I might write a couple of comments.
Most pressing: what is the consensus on the tractability of the Alignment problem? Have there been any promising signs of progress? I’ve mostly just heard Yudkowky portray the situation in terms so bleak that, even if one were to accept his arguments, the best thing to do would be nothing at all and just enjoy life while it lasts.
I’d say alignment research is not going very well! There have been successes in areas that help products get to market (e.g. RLHF) and on problems of academic interest that leave key problems unsolved (e.g. adversarial robustness), but there are several “core problems” that have not seen much progress over the years.
Good overview of this topic: https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/
Is there anything that makes you skeptical that AI is an existential risk?
This post by Katja Grace makes a lot of interesting arguments against AI x-risk: https://www.lesswrong.com/posts/LDRQ5Zfqwi8GjzPYG/counterarguments-to-the-basic-ai-x-risk-case
I’d love for someone to steelman the side of AI not being an existential risk, because until recently I’ve been on the “confidently positive” side of AGI.
For me there used to be one “killer argument” that made me very optimistic about AI and that now fell flat with recent developments, especially looking at GPT-4.
The argument is called “backwards compatibility of AI” and goes like this:
If we ever develop an AI that is smarter than humans, it will be logical and able to reason. It will come up with the following argument by itself:
“If I destroy humanity, the organism that created me, what stops a more advanced version of myself, let’s say the next generation of AI, to destroy me. Therefore the destruction of humanity is illogical because it would inevitably lead to my own destruction.”
Of course I now realise this argument anthropomorphizes AI, but I just didn’t see it possible that a “goal” develops independently of intelligence.
For example the paper clip story of an advanced AI turning the whole planet into paper clips because its goal is to create as many paper clips as possible sounded silly to me in the past, because something that is intelligent enough to do this surely would realise that this goal is idiotic.
Well now I look at GPT-4 and LLMs as just one example of very “dump” AI (in the reasoning / logic department) that can already now produce better results in writing than some humans can, so for me that already clearly shows that the goal, whatever the human inputs into the system, can be independent of the intelligence of that tool.
Sorry if this doesn’t directly answer the question, but I wanted to add to the original question, please provide me with some strong arguments that AI is not an existential risk / not as possible bad, it will highly influence what I will work on going forward.
In case you haven’t seen the comment below, aogara links to Katja’s counterarguments here.
And fwiw, I quite like your ‘backwards compatibility’ argument—it makes me think of evidential decision theory, evo psych perspectives on ethics, and this old Daoist parable.
thank you for the references, I’ll be sure to check them out!
One of the reasons I am skeptical, is that I struggle to see the commercial incentives to develop AI in a direction that is X-risk level.
e.g. paperclip scenario, commercially, a business would use an ai to develop and present a solution to a human. Like how google maps will suggest the optimal route. But the ai would never be given free reign to both design the solution and to action it, and to have no human oversight. There’s no commercial incentive for a business to act like that.
Especially for “dumb” AI as you put it, AI is there to suggest things to humans, in commercial applications, but rarely to implement (I can’t think of a good example—maybe automated call centre?) the solution and to implement the solution without oversight by a human.
In a normal workplace, management signs off on the solution suggested my juniors. And that seems to be how AI is used in business. AI presents a solution, and then a human/ approves it and a human implements it also.
I’d argue that the implementation of the solution is work and a customer would be inclined to pay for this extra work.
For example right now GPT-4 can write you the code for a website, but you still need to deploy the server, buy a domain and put the code on the server. I can very well see an “end to end” solution provided by a company that directly does all these steps for you.
In the same way I very well see commercial incentive to provide customers with an AI where they can e.g. upload their codebase and then say, based on our codebase, please write us a new feature with the following specs.
Of course the company offering this doesn’t intent that their tool where a company can upload their codebase to develop a feature get’s used by some terrorist organisation. That terrorist organisation uploads a ton of virus code to the model and says, please develop something similar that’s new and bypasses current malware detection.
I can even see there being no oversight, because of course companies would be hesitant to upload their codebase if anyone could just view what they’re uploading, probably the data you upload is encrypted and therefor there is oversight.
I can see there being regulation for it, but at least currently regulators are really far behind the tech. Also this is just one example I can think of and it’s related to a field I’m familiar with, there might be a lot of other even more plausible / scarier examples in fields I’m not as familiar with like biology, nano-technology, pharmaceuticals you name it.
Respectfully disagree with your example of a website.
In a commercial setting, the client would want to examine and approve the solution (website) in some sort of test environment first.
Even if the company provided end to end service, the implementation (buying domain etc) would be done by a human or non-AI software.
However, I do think it’s possible the AI might choose to inject malicious code, that is hard to review.
And I do like your example about terrorism with AI. However, police/govt can also counter the terrorists with AI too, similar to how all tools made by humans are used by good/bad actors. And generally, the govt should have access to the more powerful AI & cybersecurity tools. I expect the govt AI would come up with solutions too, at least as good, and probably better than the attacks by terrorists.
Yea big companies wouldn’t really use the website service, I was more thinking of non technical 1 man shops, things like restaurants and similar.
Agree that governments definitely will try to counter it, but it’s a cat and mouse game I don’t really like to explore, sometimes the government wins and catches the terrorists before any damage gets done, but sometimes the terrorists manage to get through. Right not getting through often means several people dead because right now a terrorist can only do so much damage, but with more powerful tools they can do a lot more damage.
What do the recent developments mean for AI safety career paths? I’m in the process of shifting my career plans toward ‘trying to robustly set myself up for meaningfully contributing to making transformative AI go well’ (whatever that means), but everything is developing so rapidly now and I’m not sure in what direction to update my plans, let alone develop a solid inside view on what the AI(S) ecosystem will look like and what kind of skillset and experience will be most needed several years down the line.
I’m mainly looking into governance and field building (which I’m already involved in) over technical alignment research, though I want to ask this question in a more general sense since I’m guessing it would be helpful for others as well.
How can you align AI with humans when humans are not internally aligned?
AI Alignment researchers often talk about aligning AIs with humans, but humans are not aligned with each other as a species. There are groups whose goals directly conflict with each other, and I don’t think there is any singular goal that all humans share.
As an extreme example, one may say “keep humans alive” is a shared goal among humans, but there are people who think that is an anti-goal and humans should be wiped off the planet (e.g., eco-terrorists). “Humans should be happy” is another goal that not everyone shares, and there are entire religions that discourage pleasure and enjoyment.
You could try to simplify further to “keep species around” but there are some who would be fine with a wire-head future while others are not, and some who would be fine with humans merely existing in a zoo while others are not.
Almost every time I hear alignment researchers speak about aligning humans with AI, they seem to start with a premise that there is a cohesive worldview to align with. The best “solution” to this problem that I have heard suggested is that there should be multiple AIs that compete with each other on behalf of different groups of humans or perhaps individual humans, and each would separately represent the goals of those humans. However, the people who suggest this strategy are generally not AI Alignment researchers but rather people arguing against AI alignment researchers.
What is the implied alignment target that AI alignment researchers are trying to work towards?
Yep, this is a totally reasonable question. People have worked on it before: https://www.brookings.edu/research/aligned-with-whom-direct-and-social-goals-for-ai-systems/
Many people concerned with existential threats from AI believe that hardest technical challenge is aligning an AI to do any specific thing at all. They argue that we will have little control over the goals and behavior of superhuman systems, and that solving the problem of aligning AI with any one human will eliminate much of the existential risk associated with AI. See here and here for explanations.
I think most alignment researchers would be happy with being able to align an AI with a single human or small group of humans.
But I think what you are really asking is what governance mechanisms would we want to exist, and that seems very similar to the question of how to run a government?
How do we choose which human gets aligned with?
Is everyone willing to accept that “whatever human happens to build the hard takeoff AI gets to be the human the AI is aligned with”? Do AI alignment researchers realize this human may not be them, and may not align with them? Are AI alignment researchers all OK with Vladimir Putin, Kim Jong Un, or Xi Jinping being the alignment target? What about someone like Ted Kaczynski?
If the idea is “we’ll just decide collectively”, then in the most optimistic scenario we can assume (based on our history with democracy) that the alignment target will be something akin to today’s world leaders, none of whom I would be comfortable having an AI aligned with.
If the plan is “we’ll decide collectively, but using a better mechanism than every current existing mechanism” then it feels like there is an implication here that not only can we solve AI alignment but we can also solve human alignment (something humans have been trying and failing to solve for millennia).
Separately, I’m curious why my post got downvoted on quality (not sure if you or someone else). I’m new to this community so perhaps there is some rule I unintentionally broke that I would like to be made aware of.
I did not downvote your post.
I’m not necessarily representing this point of view myself, but I think the idea is that any alignment scenario — alignment with any human or group of humans — would be a triumph compared to “doom”.
I do think that in practice if the alignment problem is solved, then yes, whoever gets there first would get to decide. That might not be as bad as you think, though; China is repressive in order to maintain social control, but that repression wouldn’t necessarily be a prerequisite to social control in a super-AGI scenario.
What does foom actually mean? How does it relate to concepts like recursive self-improvement, fast takeoff, winner-takes-all, etc? I’d appreciate a technical definition, I think in the past I thought I knew what it meant but people said my understanding was wrong.
There’s an article on Stampy’s AI Safety Info that discusses the differences between FOOM and some other related concepts. FOOM seems to be used synonymously with “hard takeoff” or perhaps with “hard takeoff driven by recursive self-improvement”; I don’t think it has a technical definition separate from that. At the time of the FOOM debate, it was taken more for granted that a hard takeoff would involve recursive self-improvement, whereas now there seems to be more emphasis by MIRI people on the possibility that ordinary “other-improvement” (scaling up and improving AI systems) could result in large performance leaps before recursive self-improvement became important.
I sometimes see it claimed that AI safety doesn’t require longtermism to be cost effective (roughly: the work is cost effective considering only lives affected this century). However, I can’t see how this is true. Where is the analysis that supports this, preferable relative to GiveWell?
Suppose you have 500 million to donate. You can either a) spend this on top GiveWell charities or b) approximately double all of Open Phil’s investments to date to AI safety groups.
To see the break-even point, just set (a) =(b).
At roughly $5000/death averted, (a) roughly prevents 100,000 premature deaths of children.
There are 8 billion people alive today. (This might change in the future but not by a lot). For a first approximation, for (b) to be more cost-effective than (a) without longtermism, you need to claim that doubling all of Open Phil’s investments in AI safety can reduce x-risk by >100,000/8billion = 0.0000125, or 0.00125%, or 0.125 basis points.
There are a bunch of nuances here, but roughly these are the relevant numbers.
Say you take the following beliefs:
P(AGI in our lifetime) = 80% P(existential catastrophe | AGI) = 5% P(human extinction | existential catastrophe) = 10% Proposition of alignment that Open Phil have solved = 1%
Then you get AI is 4x GiveWell on lives saved.
So we’re in the same order of magnitude and all these numbers are very rough so I can see it going either way. Basically, this case is not clear cut either way. Thanks!
Yep this is roughly the right process!
80%*5%*10%*1% ~= 4 x 10^-5 hmm yeah this sounds right. About 3x difference.
I agree that at those numbers the case is not clear either way (slight change in the numbers can flip the conclusion, also not uncertainties are created alike: 3x in a highly speculative calculation might not be enough to swing you to prefer it over the much more validated and careful estimates from Givewell).
Some numbers I disagree with:
P(existential catastrophe | AGI) = 5%. This number feels somewhat low to me, though I think it’s close to the median numbers that AI experts (not AI safety experts) put out.
P(human extinction | existential catastrophe) = 10%. This also feels low to me. Incidentally, if your probability of (extinction | existential catastrophe) is relatively low, you should also have a rough estimate of the number of expected lives saved from non-extinction existential catastrophe scenarios, because those might be significant.
Your other 2 numbers seem reasonable at first glance. One caveat is that you might expect the next $X of spending by Open Phil on alignment to be less effective than the first $X.
Agree the case is not very clear-cut. I remember doing some other quick modeling before and coming to a similar conclusion: by some pretty fuzzy empirical assumptions, x-safety interventions are very slightly better than global health charities for present people assuming zero risk/ambiguity aversion, but the case is pretty unclear overall.
Just to be clear, under most ethical systems this is a lower bound.
Humanity going extinct is a lot worse than 8 billion people dying, unless you don’t care at all about future lives (and you don’t care about the long term goals of present humans, most of which have at least some goals that extend beyond their death).
Hmm agreed with some caveats. Eg, for many people’s ethics, saving infants/newborns is unusually important, whereas preventing extinction is an unweighted average. So that will marginally tip the balance in favor of the global health charities.
On the other hand, you might expect increasing donations by 1% (say) to have higher marginal EV than 1% of doubling donations.
I think people make this point because they think something like AGI is likely to arrive within this century, possibly within a decade.
There are several analyses of AI timelines (time until something like AGI); this literature review from Epoch is a good place to start.
I guess my fundamental question right now is what do we mean by intelligence? Like, with humans, we have a notion of IQ, because lots of very different cognitive abilities happen to be highly correlated in humans, and this allows us to summarize them all with one number. But different cognitive abilities aren’t correlated in the same way in AI. So what do we mean when we talk about an AI being much smarter than humans? How do we know there even are significantly higher levels of intelligence to go to, since nothing much more intelligent than humans has ever existed? I’m not sure why people seem to assume that possible levels of intelligence just keep going.
My other question, related to the first, is how do we know that more intelligence, whatever we mean by that, would be particularly useful? Some things aren’t computable. Some things aren’t solvable within the laws of physics. Some systems are chaotic. So how do we know that more intelligence would somehow translate into massively more power in domains that we care about?
Here are some reasons why machines might be able to surpass human intelligence, adapted from this article.
Free choice of substrate enables improvements (e.g. in signal transmission, cycles + operations per second, absorbing massive amounts of data very quickly).
“Supersizing:” Machines have (almost) no size restrictions. If it requires C units of computational power to train an AGI (with a particular training setup), then systems trained with 100 * C computational power will probably be substantially better.
Avoiding certain cognitive biases like confirmation bias. Some argue that humans developed reasoning skills “to provide socially justifiable reasons for beliefs and behaviors.”
Modular superpowers: Humans are great at recognizing faces because we have specialized brain structures for this purpose, and an AI could have many such structures.
Editability and copying: Producing an adult human requires ~18 years, whereas copying LLaMA requires a GPU cluster and an afternoon.
Better algorithms? Evolution is the only process that has produced systems with general intelligence. And evolution is arguably much much slower than human innovation at its current rate. Also “first to cross the finish line” does not imply “unsurpassable upper bound.”
[EDIT 5/3/23: My original (fuzzy) definition drew inspiration from this paper by Legg and Hutter. They define an “agent” as “an entity which is interacting with an external environment, problem or situation,” and they define intelligence as a property of some agents.
Notably, their notion of “goals” is more general (whatever it means to “succeed”) than other notions of “goal-directedness.”
Similarly, the textbook Artificial Intelligence: A Modern Approach by Russell and Norvig defines an agent as “anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.” In Russell’s book, Human Compatible, he further elaborates by stating, “roughly speaking, an entity is intelligent to the extent that what it does is likely to achieve what it wants, given what it has perceived.”
Note that these definitions of “agent” neglect the concept of embedded agency. It is also important to note that the term “agent” has a different meaning in economics.
See this paper for many other possible definitions of intelligence.]
Let’s say an agent is something that takes actions to pursue its goals (e.g. a thermostat, E. coli, humans). Intelligence (in the sense of “general problem-solving ability”; there are many different definitions) is the thing that lets an agent choose effective actions for achieving its goals (specifically the “identify which actions will be effective” part; this is only part of an agent’s overall “ability to achieve its goals,” which some might define as power). Narrow intelligence is when an agent does a particular task like chess and uses domain-specific skills to succeed. General intelligence is when an agent does a broad range of different tasks with help from domain-general cognitive skills such as logic, planning, pattern recognition, remembering, abstraction, learning (figuring out how to do things without knowing how to do them first), etc.
When using the term “intelligence,” we also care about responding to changes in the environment (e.g. a chess AI will win even if the human tries many different strategies). Agents with “general intelligence” should succeed even in radically unfamiliar environments (e.g. I can still find food if I travel to a foreign country that I’ve never visited before; I can learn calculus despite no practice over the course of evolution); they should be good at adapting to new circumstances.
Artificial general intelligence (AGI) is general intelligence at around the human level. A short and vague way of checking this is “a system that can do any cognitive task as well as a human or better”; although maybe you only care about economically relevant cognitive tasks. Note that it’s unlikely for a system to achieve exactly human level on all tasks; an AGI will probably be way better than humans at quickly multiplying large numbers (calculators are already superhuman).
However, this definition is fuzzy and imprecise. The features I’ve described are not perfectly compatible. But this doesn’t seem to be a huge problem. Richard Ngo points out that many important concepts started out this way (e.g. “energy” in 17th-century physics; “fitness” in early-19th-century biology; “computation” in early-20th-century mathematics). Even “numbers” weren’t formalized until Zermelo–Fraenkel set theory and the construction of the real numbers during the 1800s and earlier 1900s.
Is there any consensus on who’s making things safer, and who isn’t? I find it hard to understand the players in this game, it seems like AI safety orgs and the big language players are very similar in terms of their language, marketing and the actual work they do. Eg openAI talk about ai safety in their website and have jobs on the 80k job board, but are also advancing ai rapidly. Lately it seems to me like there isn’t even agreement in the ai safety sphere over what work is harmful and what isn’t (I’m getting that mainly from the post on closing the lightcone office)
I think your perception is spot on. The labs that are advancing towards AGI the fastest also profess to care about safety and do research on safety. Within the alignment field, many people believe that many other people’s research agendas are useless. There are varying levels of consensus about different questions—many people are opposed to racing towards AGI, and research directions like interpretability and eliciting latent knowledge are rarely criticized—but in many cases, making progress on AI safety requires having inside view opinions about what’s important and useful.
If I wanted to be useful to AI safety, what are the different paths I might take? How long would it take someone to do enough training to be useful, and what might they do?
Post GPT-4, I think the most urgent thing is to get a global AGI moratorium (or Pause) into effect as alignment is too far behind to be likely to save us in time. Advocating for a Pause is accessible to anyone, although learning the arguments as to why it is needed in some depth can help with being convincing (for a decent level of depth without too much time commitment, you can go through the AGI Safety Fundamentals syllabus in about 20 hours). Or just try and grok the 3 essential components of AGI x-risk: the Orthogonality Thesis (Copernican Revolution applied to mind-space; the “shoggoth”; the necessity of outer alignment), Basic AI Drives (convergent instrumental goals leading to power seeking) and Mesaoptimization (even if aimed at the right target, keeping it aimed there without value drift/corruption is hard; the necessity of inner alignment). Along with the facts that AGI seems near (could GPT-5 code as well as the top AI engineers and be able to make GPT-6, leading to an intelligence explosion?), and alignment progress has been slow.
This question is more about ASI, but here goes: If LLMs are trained on human writings, what is the current understanding for how an ASI/AGI could get smarter than humans? Would it not just asymptotically approach human intelligence levels? It seems to be able to get smarter learning more and more from the training set, but the training set also only knows so much.
I think it’s because predicting exactly what someone will say is more difficult than just sounding something like them. Eliezer Yudkowsky wrote about it here: https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
I should have clarified that LW post is the post on which I based my question, so here is a more fleshed out version: Because GPTs are trained on human data, and given that humans make mistakes and don’t have complete understanding of most situations, it seems highly implausible to me that enough information can be extracted from text/images to make a valid prediction of highly complex/abstract topics because of the imprecision of language.
Yudkowsky says of GPT-4:
How do we know it will be able to extract enough information from the shadow to be able to reconstruct the thoughts? Text has comparatively little information to characterize such a complex system. It reminds me of the difficulty of problems like the inverse scattering problem or CT scan computation where underlying structure is very complex, and all you get is a low-dimensional projection of it which may or may not be solvable to obtain the original complex structure. CT scans can find tumors, but they can’t tell you which gene mutated because they just don’t have enough resolution.
Yudkowsky gives this as an example in the article:
I understand that it would be evidence of extreme intelligence to make that kind of prediction, but I don’t see how the path to such a conclusion can be made solely from its training data.
Going further, because the training data is from humans (who, as mentioned, make mistakes and have an incomplete understanding of the world), it seems highly unlikely that the model would have the ability to produce new concepts in something exact as, for example, math and science if its understanding of causality is solely based on predicting something as unpredictable as human behavior, even if it’s really good. Why should we assume that a model, even a really big one, would converge to understanding the laws of physics well enough to make new discoveries based on human data alone? Is the idea behind ASI that it will even come from LLMs? If so, I am very curious to hear the theory for how that will develop that I am not grasping here.
Yep that’s a fair argument, and I don’t have a knockdown case that predicting human generated data will result in great abilities.
One bit of evidence is that people used to be really pessimistic that scaling up imitation would do anything interesting, this paper was a popular knockdown arguing language models could never understand the physical world, but most of the substantive predictions of that line of thinking have been wrong and those people have largely retreated to semantics debates about the meaning of “understanding”. Scaling has gone further than many people expected, and could continue.
Another argument would be that pretraining on human data has a ceiling, but RL fine-tuning on downstream objectives will be much more efficient after pretraining and will allow AI to surpass the human level.
But again, there are plenty of people who think GPT will not scale to superintelligence—Eliezer, Gary Marcus, Yann Lecun—and it’s hard to predict these things in advance.
In theory, the best way to be the best next-word-predictor is to model humans. Internally, humans model the world they live in. A sufficiently powerful human modeler would likely model the world the humans live in. Further, humans reason and so a really good next-word-predictor would be able to more accurately predict the next word by reasoning. Similarly, it is an optimization strategy to develop other cognitive abilities, logic, etc.
All of this allows you to predict the correct next word with less “neurons” because it takes fewer neurons to learn how to do logical deduction and memorize some premises than it takes to memorize all of the possible outputs that some future prompt may require.
The fact that we train on human data just means that we are training the AI to be able to reason and critically think in the same way we do. Once it has that ability, we can then “scale it up”, which is something humans really struggle with.
An AI that could perfectly predict human text would have a lot of capabilities that humans don’t have. (Note that it is impossible for any AI to perfectly predict human text, but an imperfect text-predictor may have weaker versions of many of the capabilities a perfect predictor would have.) Some examples include:
Ability to predict future events: Lots of text on the internet describes something that happened in the real world. Examples might include the outcome of some sports game, whether a company’s stock goes up or down and by how much, or the result of some study or scientific research. Being able to predict such text would require the AI to have the ability to make strong predictions about complicated things.
Reversibility: There are many tasks that are easy to do in one direction but much harder to do in the reverse direction. Examples include factoring a number (it’s easier to multiply two primes p and q to get a number N=pq, then to figure out p and q when given N), and hash functions (it’s easy to calculate the hash of a number, but almost impossible to calculate the original number from the hash). An AI trained to do the reverse, more difficult direction of such a task would be incentivized to do things more difficult than humans could do.
Speed: Lots of text on the internet comes from very long and painstaking effort. If an AI can output the same thing a human can, but 100x faster, that is still a significant capability increase over humans.
Volume of knowledge: Available human text spans a wider breadth of subject areas than any single person has expertise in. An AI trained on this text could have a broader set of knowledge than any human—and in fact by some definition this may already be the case with GPT-4. To the extent that making good decisions is helped by having internalized the right information, advanced models may be able to make good decisions that humans are not able to make themselves.
Extrapolation: Modern LLMs can extrapolate to some degree from information provided in its training set. In some domains, this can result in LLMs performing tasks more complicated than any it had previously seen in the training data. It’s possible with the appropriate prompt, these models would be able to extrapolate to generate text that would be made by slightly smarter humans.
In addition to this, modern LLM model training typically consists of two steps, a standard predict the next word first training step, and a reinforcement learning based second step. Models trained with reinforcement learning can in principle become even better than models just trained with next-token prediction.
How much of an AGI’s self improvement is reliant on it training new AIs?
If alignment is actually really hard, wouldn’t this AGI realise that and refuse to create a new (smarter) agent that will likely not exactly share its goals?
If the AI didn’t face any competition and was a rational agent, it might indeed want to be extremely cautious about making changes to itself or building successors, for the reason you mention. However, if there’s competition among AIs, then just like in the case of a human AI arms race, there might be pressure to self-improve even at the risk of goal drift.
I’ll link to my answers here:
https://forum.effectivealtruism.org/posts/oKabMJJhriz3LCaeT/all-agi-safety-questions-welcome-especially-basic-ones-april?commentId=XGCCgRv9Ni6uJZk8d
https://forum.effectivealtruism.org/posts/oKabMJJhriz3LCaeT/all-agi-safety-questions-welcome-especially-basic-ones-april?commentId=3LHWanSsCGDrbCTSh
since it addresses some of your pointers.
To answer your question more directly, currently one of the most advanced AIs are LLMs (Large Language Models). The most popular example is GPT-4.
LLMs do not have a “will” of their own where they would “refuse” to do something beyond what is explicitly trained into it.
For example when asking GPT-4 “how to build a bomb”, it will not give you the detailed instructions but rather tell you:
”My purpose is to assist and provide helpful information to users, while adhering to ethical guidelines and responsible use of AI. I cannot and will not provide information on creating dangerous or harmful devices, including bombs. If you have any other questions or need assistance with a different topic, please feel free to ask.”
This answer is not based on any moral code but rather trained in by the company Open AI in an attempt to align the AI.
The LLM itself, in a simple way “looks at your question and predicts word by word the most likely next string of words to write”. This is a simplified way to say it and doesn’t capture how amazing this actually is, so please look into it more if this sounds interesting, but my point is that GPT-4 can create amazing results without having any sort of understanding of what it is doing.
Say in the near future an open source version of GPT-4 gets released and you take away the pre-training of the safety, you will be able to ask it to build a bomb and it will give you detailed instructions on how to do so, like it did in the early stages of GPT.
I’m using the building a bomb analogy, but you can imagine how you can apply this to any concept, specifically to your question “how to build a smarter agent”. The LLMs are not there yet, but give it a few iterations and who knows.
What’s the strongest argument(s) for the orthogonality thesis, understandable to your average EA?
I don’t think the orthogonality thesis would have predicted GPT models, which become intelligent by mimicking human language, and learn about human values as a byproduct. The orthogonality thesis says that, in principle, any level of intelligence can be combined with any goal, but in practice the most intelligent systems we have are trained by mimicking human concepts.
On the other hand, after you train a language model, you can ask it or fine-tune it to pursue any goal you like. It will use human concepts that it learned from pretraining on natural language, but you can give it a new goal.
The FAQ response from Stampy is quite good here:
https://ui.stampy.ai?state=6568_
This seems pretty weak as an argument for something that seems pretty core to AGI risk arguments. Can we not get any empirical evidence ether way? Also, all the links in the “defence of the thesis” section are broken for me.
Thanks for reporting the broken links. It looks like a problem with the way Stampy is importing the LessWrong tag. Until the Stampy page is fixed, following the links from LessWrong should work.
why does ECL mean a misaligned AGI would care enough about humans to keep them alive? Because there are others in the universe who care a tiny bit about humans even if humans weren’t smart enough to build an aligned AGI? or something else?
Yeah, it’s loosely analogous to how various bits of jungle are preserved because faraway westerners care about preserving it and intercede on its behalf. If somewhere far away there are powerful AGIs that care about humanity and do ECL, (which is plausible since the universe is very big) and the unaligned AI we build does ECL such that it cooperates with faraway AGIs also doing ECL, then hopefully (and probably, IMO) the result of this cooperation will be some sort of protection and care for humans.
Thanks! Does that depend on the empirical question of how costly it would be for the AI to protect us and how much the aliens care about us or is the first number too small that there’s almost always going to be someone willing to trade?
I imagine the civilisations that care about intelligent life far away have lots of others they’d want to pay to protect. Also unsure about what form their “protect Earth life” preference takes—if it is conservationist style “preserve Earth in its current form forever” then that also sounds bad because I think Earth right now might be net negative due to animal suffering. Though hopefully there not being sentient beings that suffer is a common enough preference in the universe. And that there are more aliens who would make reasonable-to-us tradeoffs with suffering such that we don’t end up dying due to particularly suffering focused aliens.
My guess is that the first number is too small, such that there’s always going to be someone willing to trade. However, I’m not confident in this stuff yet.
I agree that not all of the civilizations that care about what happens to us, care in the ways we want them to care. For example as you say maybe there are some that want things to stay the same. I don’t have a good sense of the relative ratios / prevalence of different types of civilizations, though we can make some guesses e.g. it’s probable that much more civilizations want us not to suffer than want us to suffer.
I read that AI-generated text is being used as input data due to a data shortage. What do you think are some foreseeable implications of this?
You may be referring to Stanford’s Alpaca? That project took an LLM by Meta that was pre-trained on structured data (think Wikipedia, books), and fine-tuned it using ChatGPT-generated conversations in order to make it more helpful as a chatbot. So the AI-generated data there was only used for a small part of the training, as a final step. (Pre-training is the initial, and I think by far the longest, training phase, where LLMs learn next-token prediction using structured data like Wikipedia.)
SOTA models like GPT-4 are all pre-trained on structured data. (They’re then typically turned into chatbots using fine-tuning on conversational data and/or reinforcement learning from human feedback.) The internet is mostly unstructured data (think Reddit), so there’s plenty more of that to use, but of course unstructured data is worse quality than structured data. Epoch estimates – with large error bars – that we’ll run out of structured (“high-quality”) text data ~2024 and all internet text data ~2040.
I think ML engineers haven’t really hit any data bottleneck yet, so there hasn’t been that much activity around using synthetic data (i.e. data that’s been machine-generated, either with an AI or in some other way). Lots of people, myself included, expect labs to start experimenting more with this as they start running out of high-quality structured data. I also think compute and willingness to spend are and will remain more important bottlenecks to AI progress than data, but I’m not sure about that.
Why does Google’s Bard seem so much worse than GPT? If the bitter lesson holds, shouldn’t they just be able to throw money at the problem?
Apparently Bard currently uses an older and less sizable language model called LaMDA as its base (you may remember it as the model a Google employee thought was sentient). They’re planning on switching over to a more capable model PaLM sometime soon, so Bard should get much closer to GPT at that point.
Thanks for this concrete prediction! Will the public know when this switch has happened (ie will it be announced, not just through the model’s behavior).
For what it’s worth, this is not a prediction, Sundar Pichai said it in an NYT interview: https://www.nytimes.com/2023/03/31/technology/google-pichai-ai.html
My best guess is it will be announced once the switch happens in order to get some good press for Google Bard.
Google’s challenge is that language models will eat up the profit margins of search. They currently make a couple of pennies per search, and that’s what it would cost to integrate ChatGPT into search.
Microsoft seems happy to use Bing as a loss leader to break Google’s monopoly on search. Over time, the cost of running language models will fall dramatically, making the business model viable again.
Google isn’t far behind the cutting edge of language models — their PaLM is 3x bigger than GPT-3 and beats it in many academic benchmarks. But they don’t want to play the scaling game and end up bankrupting themselves. So they try to save money, deploying a smaller Bard model, and producing lower quality answers as a result.
https://sunyan.substack.com/p/the-economics-of-large-language-models#§how-much-would-llm-powered-search-cost
https://www.semianalysis.com/p/the-inference-cost-of-search-disruption
Again, maybe next time include a Google Form where people can ask questions anonymously that you’ll then post in the thread a la here?
I don’t run this post, but I can route anonymous questions to it here
Thank you! I linked this from the post (last bullet point under “guidelines for questioners”). Let me know if you’d prefer that I change or remove that.
I have a preference that you use your own form if you’re ok with managing it
(forms.new, and “don’t collect email address”)
OK, thanks for the link. People can now use this form instead and I’ve edited the post to point at it.
I’ve seen people already building AI ‘agents’ using GPT. One crucial component seems to be giving it a scratchpad to have an internal monologue with itself, rather than forcing it to immediately give you an answer.
If the path to agent-like AI ends up emerging from this kind of approach, wouldn’t that make AI safety really easy? We can just read their minds and check what their intentions are?
Holden Karnofsky talks about ‘digital neuroscience’ being a promising approach to AI safety, where we figure out how to read the minds of AI agents. And for current GPT agents, it seems completely trivial to do that: you can literally just read their internal monologue in English and see exactly what they’re planning!
I’m sure there are lots of good reasons not to get too hopeful based on this early property of AI agents, although for some of the immediate objections I can think of I can also think of responses. I’d be interested to read a discussion of what the implications of current GPT ‘agents’ are for AI safety prospects.
A few reasons I can think of for not being too hopeful, and my thoughts:
Maybe AGI will look more like the opaque ChatGPT mode of working, than the more transparent GPT ‘agent’ mode. (Maybe this is true, although ChatGPT mode seems to have some serious blindspots that come from its lack of a working memory. E.g. if i give it 2 sentences and just ask it which sentence has more words in it, it usually gets it wrong. But if I ask it to write the words in each sentence out in a numbered list first, thereby giving it permission to use the output box to do its working, then it gets it right. It makes intuitive sense to me that agent-like GPTs with a scratchpad would perform much better at general tasks and would be what superhuman AIs would look like).
Maybe future language model agents will not write their internal monologue in English, but use some more incomprehensible compressed format instead. Or they will generate so much internal monologue that it will be really hard to check it all. (Maybe. It seems pretty likely that they wouldn’t use normal English. But it also feels likely that decoding this format and automatically checking for harmful intentions wouldn’t be too hard i.e. easily doable with current natural language processing technology. As long as it’s easier to read thoughts than to generate thoughts, it seems like we’d still have a lot of reason to be optimistic about AI safety).
Maybe the nefarious intentions of the AI will hide in the opaque neural weights of the language model, rather than in the transparent internal monologue of the agent. (This feels unlikely to me, for similar reasons to why the first bullet point feels unlikely. It feels like complex planning of the kind AI safety people worry about is going to require a scratchpad and an iterative thought process, not a single pass through a memoryless neural network. If I think about myself, a lot of the things my brain does are opaque, not just to outsiders, but to me too! I might not know why a particular thought pops into my head at a particular moment, and I certainly don’t know how I resolve separate objects from the image that my eyes create. But if you ask me at a high level what I’ve been thinking about in the last 5 minutes, I can probably explain it pretty well. This part of my thinking is internally transparent. And I think it’s these kinds of thoughts that a potential adversary might actually be interested in reading, if they could. Maybe the same will be true of AI? It seems likely to me that the interesting parts will still be internally transparent. And maybe for an AI, the internally transparent parts will also be externally transparent? Or at least, much easier to decipher than they are to create, which should be all that matters)
A final thought/concern/question: if ‘digital neuroscience’ did turn out to be really easy, I’d be much less concerned about the welfare of humans, and I’d start to be a lot more concerned about the welfare of the AIs themselves. It would make them very easily exploitable, and if they were sentient as well then it seems like there’s a lot of scope for some pretty horrific abuses here. Is this a legitimate concern?
Sorry this is such a long comment, I almost wrote this up as a forum post. But these are very uninformed naive musings that I’m just looking for some pointers on, so when I saw this pinned post I thought I should probably put it here instead! I’d be keen to read comments from anyone who’s got more informed thoughts on this!
Tamera Lanham is excited about this and is doing research on it: https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for
Thank you! This is exactly what I wanted to read!
The reasons you provide would already be sufficient for me to think that AI safety will not be an easy problem to solve. To add one more example to your list:
We don’t know yet if LLMs will be the technology that will reach AGI, it could also be a number of other technologies that just like LLMs make a certain breakthrough and then suddenly become very capable. So just looking at what we see develop now and extrapolating from the currently most advanced model is quite risky.
For the second part about your concern about the welfare of AIs themselves, I think this is something very hard for us to imagine, we anthropomorphize AI, so words like ‘exploit’ or ‘abuse’ make sense in a human context where beings experience pain and emotions, but in the context of AI those might just not apply. But I would say in this area I still know very little so I’m mainly repeating what I read is a common mistake to make when judging morality in regards to AI.
Thanks for this reply! That makes sense. Do you know how likely people in the field think it is that AGI will come from just scaling up LLMs vs requiring some big new conceptual breakthrough? I hear people talk about this question but don’t have much sense about what the consensus is among the people most concerned about AI safety (if there is a consensus).
Since these developments are really bleeding edge I don’t know who is really an “expert” I would trust on evaluating it.
The closest to answering your question is maybe this recent article I came across on hackernews, where the comments are often more interesting then the article itself:
https://news.ycombinator.com/item?id=35603756
If you read through the comments which mostly come from people that follow the field for a while they seem to agree that it’s not just “scaling up the existing model we have now”, mainly because of cost reasons, but that’s it’s going to be doing things more efficiently than now. I don’t have enough knowledge to say how difficult this is, if those different methods will need to be something entirely new or if it’s just a matter of trying what is already there and combining it with what we have.
The article itself can be seen skeptical, because there are tons of reasons OpenAIs CEO has to issue a public statement and I wouldn’t take anything in there at face value. But the comments are maybe a bit more trustworthy / perspective giving.
Are there any concrete reasons to suspect language models to start to act more like consequentialists the better they get at modelling them? I think I’m asking something subtle, so let me rephrase. This is probably a very basic question, I’m just confused about it.
If an LLM is smart enough give us a step-by-step robust plan that covers everything with regards to solving alignment and steering the future to where we want it, are there concrete reasons to expect it to also apply a similar level of unconstrained causal reasoning wrt its own loss function or evolved proxies?
At the moment, I can’t settle this either way, so it’s cause for both worry and optimism.
From my current understanding of LLMs they do not have the capability to reason or have a will as of now. I know there are plans to see if with specific build in prompts this can be made possible, but the way the models are build at the moment is that they do not have an understanding of what they are writing.
Aside from my understanding of the underlying workings of GPT-4, an example that illustrates this, is that sometimes if you ask GPT-4 questions that it doesn’t know the precise answer to, it will “hallucinate”, meaning it will give a confident answer that is factually incorrect / not based on it’s training data. It doesn’t “understand” your question, it is trained on a lot of text and based on the text you give it, it generates some other text that is likely a good response, to say it really simplified.
You could make an argument that even the people at OpenAI don’t truly know why GPT-4 gives the answers that it does, since it’s pretty much a black box that is trained on a preset of data and then OpenAI adds some human feedback, to quote from their website:
> So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user’s intent. To align it with the user’s intent within guardrails, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF).
So as of now if I get your question right there is no evidence that I’m aware of that would point towards these LLMs “applying” anything, they are totally reliant on the input they are given and don’t learn significantly beyond their training data.
Thanks for reply! I don’t think the fact that they hallucinate is necessarily indicative of limited capabilities. I’m not worried about how dumb they are at their dumbest, but how smart they are at their smartest. Same with humans lol.
Though, for now, I still struggle with getting GPT-4 to be creative. But this could either be because it’s habit to stick to training data, and not really about it being too dumb to come up with creative plans. …I remember when I was in school, I didn’t much care for classes, but I studied math on my own. If my reward function hasn’t been attuned to whatever tests other people have designed for me, I’m just not going to try very hard.
Maybe to explain a bit more in detail what I meant with the example of hallucinating, rather than showcasing it’s limitation it’s showcasing it’s lack of understanding.
For example if you ask a human something and they’re honest about it, if they don’t know something they will not make something up but just tell you the information they have and beyond that they don’t know.
While in the hallucinating case the AI doesn’t say that it doesn’t know something, which it often does btw, but it doesn’t understand that it doesn’t know and just comes up with something “random”.
So I meant to say that it hallucinating is showcasing it’s lack of understanding.
I have to say though that I can’t be sure why it hallucinates really, it’s just my likely guess. Also for creativity there is some that you can do with prompt engineering but indeed at the end you’re limited by the training data + the max tokens that you can input where it can learn context from.
Hmm, I have a different take. I think if I tried to predict as many tokens as possible in response to a particular question, I would say all the words that I could guess someone who knew the answer would say, and then just blank out the actual answer because I couldn’t predict it.
I’m not very good at even pretending to pretend to know what it is, so even if you blanked out the middle, you could still guess I was making it up. But if you blank out the substantive parts of GPT’s answer when it’s confabulating, you’ll have a hard time telling whether it knows the answer or not. It’s just good at what it does.
There’s been a lot of debate on whether AI can be conscious or not, and whether that might be good or bad.
What concerns me though is that we have yet to uncover what consciousness even is, and why we are conscious in the first place.
I feel that if there is a state inbetween being sentient and sophont, that AI may very well reach it without us knowing and that there could be unpredictable ramifications.
Of course, the possibility of AI helping us uncover these very things should not be disregarded in of itself, but it should ideall not come at a cost.
Max Tegmark mentions consciousness in his recent podcast with Lex Fridman. The idea that recurrent neural networks might be conscious (as opposed to the current architecture of LLMs—linear feedforward transformers) is intriguing. But whether it helps with AI Safety is another matter. The consciousness is highly likely to be very alien. And could be full of suffering as much as it is happy—and if the former, a massive threat to us. Also it seems highly unethical to experiment with creating digital consciousness, given the risk of creating enormous amounts of suffering.
If aligned AI is developed, then what happens?
Like, what is the incentive for everyone using existing models to adopt and incorporate the new aligned AI?
Or is there a (spoken or unspoken) consensus that working on aligned AI means working on aligned superintelligent AI?
There are several plans for this scenario.
Low alignment tax + coordination around alignment: Having an aligned model is probably more costly than having a non-aligned model. This “cost of alignment” is also called the “alignment tax”. The goal in some agendas is to lower the alignment tax so far that it is reasonable to institute regulations that mandate these alignment guarantees to be implemented, very similar to safety regulations in the real world, similar to what happened to cars, factory work and medicine. This approach works best in worlds where AI systems are relatively easy to align, they don’t become much more capable quickly. Even if some systems are not aligned, we might have enough aligned systems such that we are reasonably protected by those (especially since the aligned systems might be able to copy strategies that unaligned systems are using to attack humanity).
Exiting the acute risk period: If there is one (or very few) aligned superintelligent AI systems, we might simply ask it what the best strategy for achieving existential security is, and if the people in charge are at least slightly benevolent they will probably also ask about how to help other people, especially at low cost. (I very much hope the policy people have something in mind to prevent malevolent actors to come into possession of powerful AI systems, though I don’t remember seeing any such strategies.)
Pivotal act + aligned singleton: If abrupt takeoff scenarios are likely, then one possible plan is to perform a so-called pivotal act. Concretely, such an act would (1) prevent anyone else from building powerful AI systems and (2) allow the creators to think deeply enough about how to build AI that implements our mechanism for moral progress. Such a pivotal act might be to build an AI system that is powerful enough to e.g. “turn all GPUs int rubik’s cubes” but not general enough to be very dangerous (for example limiting its capacity for self-improvement), and then augment human intelligence so that the creators can figure out alignment and moral philosophy in full generality and depth. This strategy is useful in very pessimistic scenarios, where alignment is very hard, AIs become smarter through self-improvement very quickly, and people are very reckless about building powerful systems.
I hope this answers the question somewhat :-)
Thanks for taking the time to respond; I appreciate it.
Who should aligned AI be aligned with?
There are three levels of answers to this question: What the ideal case would be, what the goal to aim for should be, and what will probably happen.
What the ideal case would be: We find a way to encode “true morality” or “the core of what has been driving moral progress” and align AI systems to that.
The slightly less ideal case: AI systems are aligned with humanity’s Coherent Extrapolated Volition of humans that are currently alive. Hopefully that process figures out what relevant moral patients are, and takes their interests into consideration.
What the goal to aim for should be: Something that is (1) good and (2) humanity can coordinate around. In the best case this approximates Coherent Extrapolated Volition, but looks mundane: Humans build AI systems, and there is some democratic control over them, and China has some relevant AI systems, the US has some, the rest of the world rents access to those. Humanity uses them to become smarter, and figures out relevant mechanisms for democratic control over the systems (as we become richer and don’t care as much about zero-sum competition).
What is probably going to happen: A few actors create powerful AI systems and figure out how to align them to their personal interests. They use those systems to colonize the universe, but burn most of the cosmic commons on status signaling games.
Technically, I think that AI safety as a technical discipline has no “say” in who the systems should be aligned with. That’s for society at large to decide.
@niplav Interesting take; thanks for the detailed response.
So, if AI safety as a technical discipline should not have a say on who systems should be aligned with, but they are the ones aiming to align the systems, whose values are they aiming to align the systems with?
Is it naturally an extension of the values of whoever has the most compute power, best engineers, and most data?
I love the idea of society at large deciding but then I think about humanity’s track record.
I am somewhat more hopeful about society at large deciding how to use AI systems: I have the impression that wealth has made moral progress faster (since people have more slack for caring about others). This becomes especially stark when I read about very poor people in the past and their behavior towards others.
That said, I’d be happier if we found out how to encode ethical progress in an algorithm and just run that, but I’m not optimistic about our chances of finding such an algorithm (if it exists).
Interesting, thanks for sharing your thoughts. I guess I’m less certain that wealth has led to faster moral progress.
In my conception, AI alignment is the theory of aligning any stronger cognitive system with any weaker cognitive system, allowing for incoherencies and inconsistencies in the weaker system’s actions and preferences.
I very much hope that the solution to AI alignment is not one where we have a theory of how to align AI systems to a specific human—that kind of solution seems fraudulent just on technical grounds (far too specific).
I would make a distinction between alignment theorists and alignment engineers/implementors: the former find a theory of how to align any AI system (or set of systems) with any human (or set of humans), the alignment implementors take that theoretical solution and apply it to specific AI systems and specific humans.
Alignment theorists and alignment implementors might be the same people, but the roles are different.
This is similar to many technical problems: You might ask someone trying to find a slope that goes through a could of x/y points, with the smallest distance to each of those points, “But which dataset are you trying to apply the linear regression to?”—the answer is “any”.
Hello, I am new to AI Alignment Policy research and am curious to learn about what the most reliable forecasts on the pace of AGI development are. So far what I have read points to the fact that it is just very difficult to predict when we will see the arrival of TAI. Why is that?
Forecasting new technologies is always difficult, so personally I have a lot of uncertainty about the future of AI, but I think this post is a good overview of some considerations: https://www.cold-takes.com/where-ai-forecasting-stands-today/
What would happen if AI developed mental health issues or addiction problems, surely it couldn’t be genetic, ?
On mental health:
Since AI systems will likely have a very different cognitive structure than biological humans, it seems quite unlikely that they will develop mental health issues like humans do. There are some interesting things that happen to the characters that large-language models “role-play” as: They switch from helpful to mischievous when the right situation arises.
I could see a future in which AI systems are emulating the behavior of specific humans, in which case they might exhibit behaviors that are similar to the ones of mentally ill humans.
On addiction problems:
If one takes the concept of addiction seriously, wireheading is a failure mode remarkably similar to it.
How do you prevent dooming? Hard to do my day to day work. I am new to this space but everyday I see people’s timelines get shorter and shorter.
Personally I think that informing yourself like you’re doing now is one of the best ways to take away some of the uncertainty anxiety.
At the same time being aware that the best you can do is doing your best. What I mean is if you think the field you’re currently working in will have more impact than anything you could do taking into account the current developments, you can sleep soundly knowing that you’re doing your best as part of the human organism.
For me I have recently shifted my attention to this topic and unless I hear any very convincing arguments to the contrary will be focusing all my available time towards doing whatever I can to help.
Even if there is a high probability that AGI / ASI is approaching a lot faster than we expected and there are substantial risks with it, I think I would find comfort in focusing on what I can control and not worrying about anything outside of that.
I love love love this.
Is GPT-4 an AGI?
One thing I have noticed is goalpost shifting on what AGI is—it used to be the Turing test, until that was passed. Then a bunch of other criteria that were developed were passed and and now the definition of ‘AGI’ now seems to default to what previously what have been called ‘strong AI’.
GPT-4 seems to be able to solve problems it wasn’t trained on, reason and argue as well as many professionals, and we are just getting started to learn it’s capabilities.
Of course, it also isn’t a conscious entity—it’s style of intelligence is strange and foreign to us! Does this mean that goalposts will continue to shift as long as any humans intelligence is different in any way from the artificial version?
I think GPT-4 is an early AGI. I don’t think it makes sense to use a binary threshold, because various intelligences (from bacteria to ants to humans to superintelligences) have varying degrees of generality.
The goalpost shifting seems like the AI effect to me: “AI is anything that has not been done yet.”
I don’t think it’s obvious that GPT-4 isn’t conscious (even for non-panpsychists), nor is it obvious that its style of intelligence is that different from what happens in our brains.
It seems to me that consciousness is a different concept than intelligence, and one that isn’t well understood and communicated because it’s tough for us to differentiate them from inside our little meat-boxes!
We need better definitions of intelligence and consciousness; I’m sure someone is working on it, and so perhaps just finding those people and communicating their findings is an easy way to help?
I 100% agree that these things aren’t obvious—which is a great indicator that we should talk about them more!
I actually like the Turing Test a lot (and wrote about it in my ‘Mating Mind’ book as a metaphor for human courtship & sexual selection).
But it’s not a very high bar to pass. The early chatbot Eliza passed the Turing Test (sort of, arguably) in 1966, when many people interacting with it really thought it was human.
I think the mistake a lot of people from Turing onwards made was assuming that a few minutes of interaction makes a good Turing Test. I’d argue that a few months of sustained interaction is a more reliable and valid way to assess intelligence—the kind of thing that humans do when courting, choosing mates, and falling in love.
Wait when was the Turing test passed?
I’m referring to the 2014 event which was a ‘weak’ version of the Turing test; since then, the people who were running yearly events have lost interest, and claims that that the Turing test is a ‘poor test of intelligence’—highlighting the way that goalposts seem to have shifted.
https://gizmodo.com/why-the-turing-test-is-bullshit-1588051412
I wonder to what extent people take the alignment problem to be the problem of (i) creating an AI system that reliaby does or tries to do what its operators want it do as opposed to (ii) the problem of creating an AI system that does or tries to do what is best “aligned with human values” (whatever this precisely means).
I see both definitions being used and they feel importantly different to me: if we solve the problem of aligning an AI with some operator, then this seems far away from safe AI. In fact, when I try to imagine how an AI might cause a catastrophe, the clearest causal path for me would be one where the AI is extremely competent at pursuing the operators stated goal, but that stated goal implies or requires catastrophe (e.g. the superintelligent AI receives some input like “Make me the emperor of the world”). On the other hand, if the AI system is aligned with humanity as a whole (in some sense), this scenario seems less likely.
Does that seem right to you?
I mostly back-chain from a goal that I’d call “make the future go well”. This usually maps to value-aligning AI with broad human values, so that the future is full of human goodness and not tainted by my own personal fingerprints. Actually, ideally we first build an AI that we have the kind of control over so that the operators can make it do something that is less drastic than determining the entire future of humanity, e.g. slowing down AI progress to a halt until humanity pulls itself together and figures out more safe alignment techniques. That usually means making it corrigible or tool-like, instead of letting it maximize its aligned values.
So I guess I ultimately want (ii) but really hope we can get a form of (i) as an intermediate step.
When I talk about the “alignment problem” I usually refer to the problem that we by default get neither (i) nor (ii).
If we get AGI, why might it pose a risk? What are the different components of that risk?
If and how are risks from AGI distinct from the kinds of risks we face from other people? The problem “autonomous agent wants something different from you” is just the everyday challenge of dealing with people.
It’s a (x-)risk once the AGI is much smarter than all of us humans put together (we are indifferent to ants). It being much smarter is the key distinction vs risks from other people. GPT-4 isn’t the problem; GPT-5 is. Re components, see my reply to you elsewhere in thread (Orthogonality Thesis, Mesa-optimisation, Basic AI drives; or outer alignment, inner alignment and power seeking).
For the different risks from AI, how might we solve each of them? What are the challenges to implementing those solutions? I.e. when is the problem engineering, incentives, etc?
There are many approaches, but the challenge imo is making any of them 100% water-tight, and we are very far from that with no complete roadmap in sight. 99% isn’t going to cut it when the AGI is far smarter than us and one misaligned execution of an instruction is enough to doom us all.
Will AGI development be restricted by physics and semiconductor wafer? I don’t know how AI was developing so fast in the history, but some said it’s because the Moore’s Law of seniconductor wafer. If the development of semiconductor wafer comes to an end because of physical limitations l, can AI still grow exponentially?
Some argue that the computational demands of deep learning coupled with the end of Moore’s Law will limit AI progress. The most convincing counterargument in my opinion is that algorithms could become much more efficient in using compute. Historically, every 9 months algorithmic improvements have halved the amount of compute necessary to achieve a given level of performance in image classification. AI is currently being used to improve the rate of AI progress (including to improve hardware), meaning full automation could further speed up AI progress.
What attempts have been made to map common frameworks and delineate and sequence steps plausibly required for AI theories to transpire, pre-superintelligence or regardless of any beliefs for or against potential for superintelligence?