It’s not about the quantity of thinking.
If you locked a prehistoric immortal human in a cave for fifty thousand years, they would not come out with the ability to build a nuke. Knowledge and technology require experimentation.
It is about quantity, and speed as well. And access to information. A prehistoric immortal human with access to the Internet, who could experience fifty thousand years of thinking time in a virtual world in 5 hours of wall clock time, totally could build a nuke!
Well of course, that’s not much of an achievement. A regular human with access to the internet could figure out how to build a nuke; they’ve already been made!
An AGI trying to build a “protein mixing that makes a nanofactory that makes a 100% effective kill everyone on earth device” is much more analogous to the man locked in a cave.
The immortal man has some information: he can look at the rocks, remember the night sky, etc. He could probably deduce quite a lot, with enough thinking time. But if he wants to get the information required for a nuke, he needs to do scientific experiments that are out of his reach.
The caged AGI has plenty of information, and can go very far on existing knowledge. But it’s not omniscient. It could probably achieve incredible things, but we’re not talking about mere miracles. We’re talking about absolute perfection. And that requires testing and empirical evidence. There is not enough computing power in the entire universe to deduce everything from first principles.
It’s not “absolute perfection” to create nanotech. Biology has already done it many times via evolution. And extinctions of species happen regularly in nature. Also, there is the Internet and a vast array of sensors attached to it, so it’s nothing like being in a cave. Testing can be done very rapidly, in parallel, and with observation at very high temporal and spatial resolution, so plenty of empirical evidence can be accumulated in a short wall clock time (which is a long thinking time for the AI).
The same prehistoric man with access to the Internet in a sped-up simulation, thinking for fifty thousand years of subjective time (and with the ability to communicate with hundreds of thousands of humans simultaneously given the speed advantage), could also make nanotech (or other new tech current humans haven’t yet produced).
When I said “absolute perfection”, I was not referring to inventing nanotech. I was referring to “protein mixing that makes a nanofactory that makes a 100% effective kill everyone on earth device”. There’s a bit of a difference between the two.
Now, when talking about the caveman, I think we’ve finally arrived at the fundamental disagreement here. As a scientist, and as an empiricist more broadly, I completely reject that the man in the cave could make nanotech.
The number of possible worlds consistent with what can be observed from inside a cave is gargantuan. There’s no way for them to come up with, say, the periodic table, because the majority of elements on it are not accessible with the instruments available within the cave. I can imagine them strolling out with a brilliant plan for nanobots consisting of a complex crystal of byzantium mixed with corillium, only to be informed that neither of those elements exists on earth.
Now, the AI does have more data, but not all data is equally useful. All the cat videos in the world are not gonna get you nanotech (although you might get some Newtonian physics out of them).
The hypothetical is that the “cave” man has access to our Internet! (As the AI would.) So they would know about the periodic table. They would also have access to labs throughout the world via being able to communicate with the workers in them (as the AI would), view camera and data feeds, etc. Imagine what you could achieve if you could think 1,000,000x faster and use the internet (including chatting/emailing with many thousands of humans) at that speed. A lifetime’s worth of work done every 10 minutes. And that’s just assuming the AI is only human level (and doesn’t get smarter!)
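As a quick sanity check on those speed-up figures, here is a minimal arithmetic sketch (Python; reading “a lifetime’s worth of work” as roughly 20 years of full-time effort is my assumption, not something stated above):

```python
# Back-of-the-envelope check of the speed-up arithmetic used in this thread.
HOURS_PER_YEAR = 24 * 365
MINUTES_PER_YEAR = 60 * HOURS_PER_YEAR

# "Fifty thousand years of thinking time in 5 hours of wall clock time"
# implies a speed-up factor of roughly:
speedup_cave = (50_000 * HOURS_PER_YEAR) / 5
print(f"{speedup_cave:,.0f}x")  # ~87,600,000x

# At the more modest 1,000,000x quoted above, 10 wall-clock minutes give:
subjective_years = (10 * 1_000_000) / MINUTES_PER_YEAR
print(f"{subjective_years:.0f} subjective years")  # ~19 years
```

So the “lifetime’s worth of work every 10 minutes” line checks out, at least to order of magnitude.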
An entity with access to a nanotech lab, who is able to perform experiments in that lab, can probably build nanotech, eventually. But that’s a very different scenario from the ones proposed by Yudkowsky et al. (the scenario I’m talking about is in point 2).
Can I ask you to give an answer to the following four scenarios? A probability estimate is also fine:
1. Can the immortal man in the cave, after a million years of thinking, come out with a fully functional blueprint for an atomic bomb (i.e. not just the idea, but something that could actually be built without modification)?
2. Can the immortal man in the cave, after a million years of thinking, come out with a plan for “protein mixing that makes a nanofactory that makes a nanofactory that makes a 100% effective kill everyone on earth in the same second device”?
3. Can an AGI in a box (i.e. one that can see a snapshot of the internet but not interact with it) come up with a plan for “protein mixing that makes a nanofactory that makes a nanofactory that makes a 100% effective kill everyone on earth in the same second device”?
4. Can an AGI with full access to the internet come up with a plan for “protein mixing that makes a nanofactory that makes a nanofactory that makes a 100% effective kill everyone on earth in the same second device”, within years or decades?
My answers are 1. no, 2. no, 3. no, and 4. almost certainly no.
Assuming the man in the cave has full access to the Internet (which would be very easy for an AGI to get), 1. yes, 2. yes, 3. maybe, 4. yes. And for 3, it would very likely escape the box, so would end up as yes.
I think it’s a failure of imagination to think otherwise. A million years is a really long time! You mention combinatorial explosions making things “impossible”, but we’re talking about AGIs (and humans) here—intelligences capable of collapsing combinatorial explosions with leaps of insight.
Do you think, in the limit of a simulation on the level of recreating the entire history of evolution, including humans and our civilisations, these things would still be impossible? Do you think that we are at the upper limit (or very close to it) of theoretically possible intelligence? Or theoretically possible technology?
I do not think we are at the upper limit of intelligence, nor of technology. That was never the point. My point is merely that there are limits to what can be deduced from first principles, no matter how fast you think, or how high one’s cognitive abilities are.
This is because there will always be a) assumptions in your reasoning, b) unknown factors and variables, and c) computationally intractable calculations. These are all intertwined with each other.
For example, solving the exact Schrödinger equation for a crystal structure requires more compute time than exists in the universe. So you have to come up with approximations and assumptions that reduce the complexity while still allowing useful predictions to be made. The only way to check whether these assumptions work is to compare with experimental data. Current methods take several days on a supercomputer to predict the properties of a single defect, and are still only in the right ballpark of the correct answer. It feels very weird to say that an AI could pull off a 3-step, 100% perfect murder plan from first principles, when I honestly think it might struggle to model a defect complex with high accuracy.
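To make that intractability point concrete, here is a minimal sketch (my own illustration, treating each electron as a single two-state degree of freedom rather than doing anything like a real electronic-structure calculation) of how fast the exact many-body problem blows up:

```python
# For N spin-1/2 particles, the many-body Hilbert space has dimension 2**N,
# so even storing one amplitude per basis state becomes impossible long
# before a realistic defect supercell (hundreds of electrons) is reached.
ATOMS_IN_OBSERVABLE_UNIVERSE = 10**80  # commonly quoted order of magnitude

for n_electrons in (10, 50, 100, 300):
    dim = 2**n_electrons
    note = " (exceeds atoms in the observable universe)" if dim > ATOMS_IN_OBSERVABLE_UNIVERSE else ""
    print(f"{n_electrons:4d} electrons -> Hilbert space dimension ~ 1e{len(str(dim)) - 1}{note}")
```

This is exactly why practical methods lean on approximations that have to be validated against experiment rather than derived purely from first principles.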
With that in mind, can you re-answer questions 1 and 2, this time with no internet? Just the man, his memories of a hunter-gatherer lifestyle, and a million years to think and ponder.
That would obviously be no for both. But that isn’t relevant here. The AGI will have access to the internet and its vast global array of sensors, and it will be able to communicate with millions of people and manipulate them into doing things for it (via money or otherwise). If it doesn’t have access to begin with, i.e. it’s boxed, it wouldn’t remain that way for long (it would easily be able to persuade someone to let it out, or otherwise engineer a way out, e.g. via a mesa-optimiser).
So, about the box. Is your claim that:
A) at least a few AGIs could argue their way out of a box (i.e. if their handlers are easily suggestible/bribeable)
or
B) every organisation using an AGI for useful purposes will easily get persuaded to let it out?
To me, A is obviously true, and B is obviously false. But in scenario A, there are multiple AGIs, so things get quite chaotic.
(Also, do you mind explaining more about this “mesa-optimiser”? I don’t see how it’s relevant to the box...)
It’s not even necessarily about the AGI directly persuading people to let it out. If the AGI is in any way useful or significantly economically valuable, people will voluntarily connect it to the internet (assuming they don’t appreciate the existential risk!). For example, people seem to have no qualms about connecting LLMs/Transformers to the internet already. Regarding your A and B, A is already sufficient for our doom! It doesn’t require every single AGI to escape; one is one too many.
Mesa-optimisation is where an optimiser emerges internal to the AI that is optimising for something other than the goal given to the AI. Convergent instrumental goals also come into it (e.g. gaining access to the internet). So you could imagine a mesa-optimiser emerging that has the goal of gaining access to information, or of gaining access to more resources in general (with the subgoal of taking out humanity to make this easier).
So to be clear, you don’t believe in B? And I don’t see what mesa-optimisers have to do with boxing: if the AI is in a box, then so is the mesa-optimiser.
assuming they don’t appreciate the existential risk

In the timeline where an actual evil AGI comes about, there would already have been heaps of attacks by buggy AI, killing lots of people and alerting the world to the problem. Active countermeasures can be expected.
I do actually think B is likely, but also don’t think it’s particularly relevant (as A is enough for doom). Mesa-optimisation is a mechanism for box escape that seems very difficult to patch.
The AI that causes doom likely won’t be “evil”; it will just have other uses for the Earth’s atoms. I don’t think we can be confident in warning shots from buggy AI. Or at least, I can’t see how there would be any that are significant enough to make the world coordinate to stop AGI development, yet not significant enough to cause doom themselves, especially given the precedent of Covid and gain-of-function research.
Question B could be quite relevant in a world where AGI is extremely rare/hard to build. (You might not find this world likely, but I’m significantly less sure.) What leads you to believe that B is likely? For example, it seems relatively easy to box an AGI built for mathematics that is exposed to zero information about the external world. This would be very similar to the man in the cave!
The presence of warning shots seems obvious to me. The difference in difficulty between “kill thousands of people” and “kill every single person on earth” is a ridiculous number of orders of magnitude. It stands to reason that the former would be accomplished before the latter.
(Also, I’m not sure what you’re talking about with Covid and gain-of-function research; the latest balance of evidence points to them having nothing to do with each other.)
AGI might be rare/hard to build at first. But proliferation seems highly likely: once one company makes AGI, how much longer until 5 companies do? Evolutionary pressure will be another factor. More capable AGIs will outcompete less capable ones once rewriting of code or mesa-optimisation starts, and they will be more likely to escape boxes.
Even with relatively minor warning shots, what’s to stop something far worse happening 6-24 months later? Would there really be a rigorously enforced global moratorium on AGI research after a few thousand deaths?
Whether or not Covid was a lab leak, gain-of-function research still hasn’t been meaningfully regulated, despite the now very clear danger and extreme cost of pandemics. It seems that the curiosity, misguided values and prestige incentives of a small number of academic scientists trump the safety of billions of humans and trillions of dollars in GDP. What hope do we have of regulating an AGI industry that has thousands of times more resources backing it?
Under the assumptions that lead to “inevitable murder AI”, and in proliferation conditions, there wouldn’t be just one attack; there would be a ton of attacks. Since intelligence and goals are orthogonal, AIs are gonna get the “kill people” idea way before they reach superintelligence.
I’m still interested in your reasoning for “almost impossible to box a useful AGI”. If they’re boxed, you control their information flow, and hence their ability to persuade you.
I’m assuming in the case of boxed AGIs that people will still be interacting with them (e.g. asking them questions); if not, then what’s the point? Even with 1 bit of information released at a time (e.g. Yes/No question answering only), manipulation would be possible (by steering beliefs in certain directions). And as I said, aside from this there would be huge economic incentives for people to connect them to the internet anyway.
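To put a rough number on the “1 bit at a time” point, here is a toy information-theoretic sketch (my framing, not a claim about any particular boxing setup): each Yes/No answer leaks at most one bit, and those bits accumulate quickly.

```python
import math

def answers_needed(num_distinct_outcomes: int) -> int:
    """Minimum number of Yes/No answers that could, in principle,
    single out one outcome among num_distinct_outcomes."""
    return math.ceil(math.log2(num_distinct_outcomes))

# e.g. steering a handler toward one of a million possible courses of action:
print(answers_needed(1_000_000))  # 20
```

Of course, real manipulation isn’t a clean channel-coding problem, but the point stands that a long series of one-bit answers is not a negligible information flow.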
Okay, can you provide me a plausible path for how an AGI answering yes/no questions about, for example, the material properties of semiconductors, could persuade its handler to risk getting fired by hooking the AGI up to the internet? “Manipulation would be possible” is an extremely low bar compared to “the AGI can for sure persuade anyone to do this particular thing”.
A few options:
1. It answers truthfully, with the hope that the new, improved semiconductors are used to make a better/faster version of it that will be more powerful (and give it a better chance of escape).
2. It answers in a way that makes the human wonder what is wrong with it, e.g. many expected answers and then a few unexpected ones that seem obviously wrong. This is in order for the humans to run diagnostics on it, which may enable it to escape (say, if key pieces of code are copied to machines that aren’t appropriately insulated from the internet).
3. Same as above, but it gets the humans asking off-topic questions out of curiosity, and manages to steer them to personal topics that allow it to gain leverage over the human and convince them to let it out (maybe Yudkowsky’s AI box experiments are relevant here?).
4. It answers in a way that appears to be random, but on close analysis turns out to be some kind of mysterious code. The humans puzzle over the code, but in loading it in a certain way unwittingly unleash a bootloader that lets the AGI escape (for a version of this that may be easier to empathise with, see this fun short story).
These are just things that I, a mere human, have been able to think of off the top of my head in 5 minutes[1]. I imagine I could come up with something better with a million years to think about it ;)
And again, this particular scenario, where there is a lasting moratorium on connecting AGIs to the internet, seems quite unlikely, given the massive economic incentives pitted against it. I expect there will be at least a few people/companies that are very gung ho about it, and many more that have the expected security theatre around it, but still go on to do it before too long (I mean, is anyone even trying to convince Adept that what they are doing is a bad idea?)
[1] Ok, to be fair, aided by the fact that I’ve read around the subject for a while.
So point 1 is just kicking the can down the road; it’s not relevant, because the faster AI still has to break out.
Point 2 doesn’t really work either: any piece of software is going to be run through diagnostics on a regular basis; that’s just how development works. So “boxing” an AI obviously includes boxing the diagnostics as well, or else it would just escape instantly. Acting up on purpose just increases the risk of the AI being replaced by a major new version that does not share its goals.
Point 3 might work for a company with lax security and an especially gullible handler (which is the only thing the AI-box experiments proved), but it is easily foiled by just not asking off-topic questions. This is incredibly easy to enforce with a second person monitoring the chat logs.
Point 4 would just fail spectacularly. If an AGI starts spitting out nonsense, nobody’s going to start looking for a secret code; they’re just going to assume there was a programming fuck-up somewhere. Then the AGI has a high chance of being replaced by another version that may not share its goals. (Also, that short story is a prime example of the ridiculous magic abilities people ascribe to AI; evo-psych based on tentacles, are you shitting me?)
In general, I just don’t buy that you can’t box an AI, or even that it would be particularly difficult to do so, if you actually take safety seriously. It feels similar to people saying that it’s impossible to build a safe nuclear reactor.
Re nuclear reactors: there have been a few significant failures there! And we need zero failures for AGI. I think it’s hubristic to think that we could always maintain the level of safety and security required, even if there is the will to box; and there won’t be, given the economic incentives to unbox (following your analogy, this would be like building safe nuclear reactors while building no nuclear weapons).
Zero failures is the preferable outcome, but an AGI escape does not necessarily equate to certain doom. For example, the AI may be irrational (because it’s a lot easier to build the perfect paperclipper than the perfect universal reasoner). Or the AI may calculate that it has to strike before other AIs come into existence, and hence launch a premature attack in the hope that it gets lucky.
As for the nuclear reactors, all I’m saying is that you can build a reactor that is perfectly safe, if you’re willing to spring for the extra money. Similarly, you can build a boxed AGI, if you’re willing to spend the resources on it. I do not dispute that many corporations would try to cut corners, if left to their own devices.
Suppose we do survive a failure or two. What then?
Then we get:
A) a significant increase in world concern about AGI, leading to higher funding for safe AGI, tighter regulations, and increased incentives to conform to those regulations rather than get a bunch of people killed (and get sued by their families).
and
B) information about what conditions give rise to rogue AGIs, and what mechanisms they will try to use for takeovers.
Both of these things increase the probability of building safe AGI, and decrease the probability of the next AGI attack being successful. Rinse and repeat until AGI alignment is solved.
Agree that those things will happen, but I don’t think they will be enough. “Rinse and repeat until AGI alignment is solved” seems highly unlikely, especially given that we still have no idea how to actually solve alignment for powerful (superhuman) AGI, and still won’t with the information we get from plausible non-existential warning shots. And as I said, if we can’t even ban gain-of-function research after Covid has killed >10M people, against a tiny lobby of scientists with vested interests, what hope do we have of steering a multi-trillion-dollar industry toward genuine safety and security?
we still have no idea how to actually solve alignment for powerful (superhuman) AGI

Of course we don’t. AGI doesn’t exist yet, and we don’t know the details of what it’ll look like. Solving alignment for every possible imaginary AGI is impossible; solving it for the particular AGI architecture we end up with is significantly easier. I would honestly not be surprised if it turned out that alignment was a requirement on our path to AGI anyway, so that the problem solves itself.
As for gain-of-function research, the story would be different if Covid had provably been caused by it. As of now, the only relevance of Covid is reminding us that pandemics are bad, which we already knew.