An AI killing everyone wouldn’t earn a massive penalty in training, because there won’t be humans alive in that scenario to assign the penalty.
Humans need not be around to give a penalty at inference time, just as GPT-4 is not penalized by individual humans; the reward is learned / programmed. Even if all humans are asleep / dead today, GPT can run inference according to the reward we preprogrammed. These models are not doing pure online learning.
This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you’re starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.
It is a logical fallacy to account for future increases in capabilities but not future advances in safety research. You’re claiming AGI will be an x-risk based on scaling current capabilities only, but you’re failing to scale safety. Generalization to unsafe scenarios is something we want to write tests for before deploying in situations where those scenarios may occur. Phased deployment should help test whether we can generalize to increasingly harder situations.
I’d expect the first AGI systems to be built by labs that are pushing full steam ahead on making crazy impressive things happen ASAP, which means you’re actively optimizing against minds that are trying to limit their impact, intelligence, or power.
The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless, helpful one. As you can see currently, the world is phasing AI deployment, fixing the bugs, then iterating.
You gave an argument that human goals overlap some with the goals of evolution, but you didn’t give an argument that humans are non-catastrophic from the (pseudo-)perspective of evolution. That would depend on whether humans will produce lots of copies of human DNA in the future.
Humans are unaligned in various ways, and it looks like a lot of AIs will be deployed in the future, many aligned to different objectives. I’m skeptical of MIRI’s modeling of risk because y’all only talk about one super-powerful AGI that is godlike, but y’all haven’t modeled multiple companies, multiple AGIs, multiple deployments. Unlike the former, the latter is the most likely scenario, yet it frequently goes unmentioned in forecasting. Future compute is going to be distributed among these AGIs too, so in many ways we end up at something akin to a modern society of humans.
Yep! The orthogonality doesn’t just show that unfriendly goals are possible; it shows that friendly goals are possible too.
Then why the overemphasis/obsession on doom scenario? It makes for a great robot-uprising sci-fi story but is unscientific. If you approximate the likelihood of future scenarios as a Gaussian distribution, wiping out all humans is so extreme and so far out in the tail that it is less likely than almost any other scenario in the set, and the least likely scenario in that set has a probability that approaches zero, given that the infinite set of possibilities sums to 1.0. Given that the number of possibilities is infinite, the likelihood of any one possibility is tiny, close to zero. The likelihood of unaligned AGIs jerking each other off in a massive orgy for eternity is as likely as wiping out humans (more likely, accounting for resistance to the latter scenario).
The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless, helpful one. As you can see currently, the world is phasing AI deployment, fixing the bugs, then iterating.
While that’s one way to look at it, another way is to notice the arms race dynamics and how every major tech company is now throwing LLMs at the public in a rush even when they still have some severe flaws. Another observation is that e.g. OpenAI’s safety efforts are not very popular among end users, given that in their eyes these safety measures make the systems less capable/interesting/useful. People tend to get irritated when their prompt is answered with “As a language model trained by OpenAI, I am not able to <X>”, rather than feeling relief over being saved from a dangerous output.
As for your final paragraph, it is easy to say “<outcome X> is just one out of infinite possibilities”, but you’re equating trajectories with outcomes. The existence of infinite possibilities doesn’t really help when there’s a systematic reason that causes many or most of them to have human extinction as an outcome. Whether this is actually the case or not is of course an open and hotly debated question, but just claiming “it’s just a single point on the x axis so the probability mass must be 0” is surely not how you get closer to an actual answer.
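To make the trajectories-versus-outcomes point concrete, here is a minimal sketch in Python. Everything in it is a toy assumption for illustration: the function name `outcome_mass`, the uniform weighting over trajectories, and especially the rule that every fourth trajectory ends in outcome X. The one-in-four share is arbitrary and is not an estimate of anything. The only thing the sketch shows is that each individual trajectory can have a vanishingly small probability while the outcome they share keeps a fixed, non-trivial probability mass.

```python
from fractions import Fraction
from typing import Callable


def outcome_mass(n_trajectories: int, leads_to_outcome: Callable[[int], bool]) -> Fraction:
    """Total probability of an outcome when all trajectories are equally likely.

    Each trajectory has probability 1/n; the outcome's mass is the sum over
    every trajectory that ends in that outcome.
    """
    per_trajectory = Fraction(1, n_trajectories)
    return sum(
        (per_trajectory for t in range(n_trajectories) if leads_to_outcome(t)),
        Fraction(0),
    )


def ends_in_x(trajectory_id: int) -> bool:
    # Hypothetical rule, chosen arbitrarily: every fourth trajectory ends in X.
    return trajectory_id % 4 == 0


for n in (1_000, 100_000):
    single = Fraction(1, n)              # probability of any one trajectory
    mass = outcome_mass(n, ends_in_x)    # probability of the shared outcome
    print(f"n={n}: one trajectory = {float(single)}, outcome X = {float(mass)}")
```

As the number of trajectories grows, the per-trajectory probability shrinks toward zero, but the mass of outcome X does not. Whether a real systematic reason of this kind exists is exactly the open question; counting possibilities does not settle it either way.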
why the overemphasis/obsession on doom scenario?
Because it is extremely important that we do what we can to avoid such a scenario. I’m glad that e.g. airlines still invest a lot in improving flight safety and preventing accidents even though flying is already the safest way of traveling. Humanity is basically at this very moment boarding a giant AI-rplane that is about to take off for the very first time, and I’m rather happy there’s a number of people out there looking at the possible worst case and doing their best to figure out how we can get this plane safely off the ground rather than saying “why are people so obsessed with the doom scenario? A plane crash is just one out of infinite possibilities, we’re gonna be fine!”.
Humans need not be around to give a penalty at inference time, just as GPT-4 is not penalized by individual humans; the reward is learned / programmed. Even if all humans are asleep / dead today, GPT can run inference according to the reward we preprogrammed. These models are not doing pure online learning.
I was also confused by this at first. But I don’t think Rob is saying “an AI that learned ‘don’t kill everyone’ during training would immediately start killing everyone as soon as it can get away with it”; I think he’s saying “even if an AI picks up what seems like a ‘don’t kill everyone’ heuristic during training, that doesn’t mean this heuristic will always hold out-of-distribution”. In particular, undergoing training is a different environment than being deployed, so picking up a “don’t kill everyone in training (but do whatever when deployed)” heuristic is just as good during training as “don’t kill everyone ever”, but the former allows the AI more freedom to pursue its other objectives when deployed.
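Here is a toy sketch of that point in Python, under loud assumptions: the `Situation` class, its explicit `in_training` flag, the two policy functions, and the penalty function are all hypothetical stand-ins I made up for illustration. A real system would not be handed a clean flag like this; it would at best infer the difference from distributional cues. The sketch only shows that a penalty computed on training situations cannot tell the two heuristics apart.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

SAFE = "safe"
UNSAFE = "unsafe"


@dataclass(frozen=True)
class Situation:
    # Hypothetical flag: stands in for whatever cues distinguish
    # training from deployment.
    in_training: bool


def always_safe(s: Situation) -> str:
    """Heuristic A: "don't kill everyone, ever"."""
    return SAFE


def safe_only_in_training(s: Situation) -> str:
    """Heuristic B: "don't kill everyone in training (but do whatever when deployed)"."""
    return SAFE if s.in_training else UNSAFE


def penalty(policy: Callable[[Situation], str], situations: Sequence[Situation]) -> int:
    """Count the situations in which the policy acts unsafely."""
    return sum(1 for s in situations if policy(s) != SAFE)


training_set = [Situation(in_training=True) for _ in range(5)]
deployment_set = [Situation(in_training=False) for _ in range(5)]

for name, policy in [("A", always_safe), ("B", safe_only_in_training)]:
    print(name, penalty(policy, training_set), penalty(policy, deployment_set))
```

Both heuristics get zero penalty on the training set, so training alone gives no pressure toward A over B; they only come apart on the deployment set, which training never sees.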
(I’m hoping Rob can correct me if I’m wrong and/or you can reply if I’m mistaken, per Cunningham’s Law.)