To qualify, this would have to be a moratorium or pause on nuclear arms before the powerful nations had doomsday-sized arsenals. The powerful making it expensive for poor nations to get nukes—though several did—is different. And notably, I wonder how well it would have gone if the powerful nations had no nukes of their own. Trying to ban others from having AGI—when those others have nukes and their own chip fabs—would be the same situation. Not only will you fail, you will eventually, if you don't build your own AGI, lose everything. Same if you have no nukes.
What data is that? A model misunderstanding “rules” on an edge case isn’t misalignment. Especially when double generation usually works. The sun rising has every prior sunrise as a prior. What empirical data would let someone conclude AGI is an existential risk justifying international agreements? Some measurement or numbers.
Yes, and they said this about nukes and built thousands.
Yes, to maximize profit. Pledging to go to zero is not the same thing.
You seem to dismiss the claim that AI is an existential risk. If that’s correct, perhaps we should start closer to the beginning, rather than debating global response, and ask you to explain why you disagree with such a large consensus of experts that this risk exists.
I don’t disagree. I don’t see how it’s different from nuclear weapons. Many, many experts are also saying this.
Nobody denies nuclear weapons are an existential risk. And every control around their use is just probability-based; there is absolutely nothing stopping a number of routes from ending the richest civilizations. Multiple individuals appear to have the power to do it at any given time, and every form of interlock and safety mechanism has a method of failure or bypass.
Survival to this point was just probability. Over an infinite timescale the nukes will fly.
The point is that it was completely and totally intractable to stop the powerful from getting nukes. SALT was the powerful tiring of paying the maintenance bills and wanting to save money on MAD. And key smaller countries—Ukraine and Taiwan—have strategic reasons to regret their choice to give up their nuclear arms. It is possible that if the choice comes up again, future smaller countries will choose to ignore the consequences and build nuclear arsenals. (Ukraine's first opportunity will be when this war ends and it can start producing plutonium. Taiwan's chance is when China begins construction of the landing ships.)
So you’re debating something that isn’t going to happen without a series of extremely improbable events happening simultaneously.
If you start thinking about practical interlocks around AI systems, you end up with principles similar to what protects nukes, albeit with some differences. Low-level controllers running simple software holding the actual authority, air gaps—there are some similarities.
Also, unlike nukes, a single AI escaping doesn’t end the world. It has to escape, and there must be an environment that supports its plans. It is possible for humans to prepare for this and to make the environment inhospitable to rogue AGIs: heavy use of air gaps, formally verified software, careful monitoring and tracking of high-end compute hardware, and a certain minimum amount of human supervision for robots working on large-scale tasks.
This is much more feasible than “put the genie away”, which is what a pause is demanding.
You are arguing impossibilities despite a reference class with reasonably close analogues that happened. Honestly tell me people thought the NPT was plausible when it was proposed, and I’ll listen when you say this is implausible.
In fact, there is appetite for fairly strong reactions, and if we’re the ones who are concerned about the risks, folding before we even get to the table isn’t a good way to get anything done.
“despite a reference class with reasonably close analogues that happened”
I am saying the common facts that we both have access to do not support your point of view. It never happened. There are no cases of “very powerful, short term useful, profitable or military technologies” that were effectively banned in the last 150 years.
You have to go back to the 1240s to find a reference class match.
These strongly worded statements I just made are trivial for you to disprove. Find a counterexample. I am quite confident and will bet up to $1000 you cannot.
You’ve made some strong points, but I think they go too far.
The world banned CFCs, which were critical for a huge range of applications. They were a short-term useful, profitable technology, and they had to be replaced entirely with different and more expensive alternatives.
The world has banned human cloning, via a UN declaration, despite the promise of such work for both scientific and medical usage.
Neither of these is exactly what you’re thinking of, and I think both technically qualify under the description you provided, if you wanted to ask a third party to judge whether they match. (Don’t feel any pressure to do so—this is the kind of bet that is unresolvable because it’s not precise enough to make everyone happy about any resolution.)
However, I also think that what we’re looking to do in ensuring only robustly safe AI systems via a moratorium on untested and by-default-unsafe systems is less ambitious or devastating to applications than a full ban on the technology, which is what your current analogy requires. Of course, the “very powerful, short term useful, profitable or military technolog[y]” of AI is only those things if it’s actually safe—otherwise it’s not any of those things, it’s just a complex form of Russian roulette on a civilizational scale. On the other hand, if anyone builds safe and economically beneficial AGI, I’m all for it—but the bar for proving safety is higher than anything anyone currently suggests is feasible, and until that changes, safe strong AI is a pipe-dream.
“On the other hand, if anyone builds safe and economically beneficial AGI, I’m all for it—but the bar for proving safety is higher than anything anyone currently suggests is feasible, and until that changes, safe strong AI is a pipe-dream.”
??? David, do you have any experience with
(1) engineering
(2) embedded safety compliant systems
(3) AI?
Note that Mobileye has a very strong proposal for autonomous car safety; I mention it because it’s one of the theoretically best ones.
You can go watch their videos on it, but it’s simple: 3 parallel solvers, each using a completely different input (camera, lidar, imaging radar). If any solver perceives a collision, the system acts to prevent that collision. So hitting a collidable object requires all three solvers to fail, which happens with probability pFail^3. That is unlikely; most of the remaining failures are going to be where the system is coupled together.
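As a back-of-the-envelope illustration (the per-solver failure rate here is an assumed, illustrative number, and the calculation assumes the three solvers fail independently):

$$P(\text{all three solvers fail on the same input}) \;=\; p_{\text{fail}}^{3}, \qquad \text{e.g. } p_{\text{fail}} = 0.01 \;\Rightarrow\; p_{\text{fail}}^{3} = 10^{-6}.$$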
Similar techniques scale to superintelligent AI.
You can go play with it right now, even write your own Python script and do it yourself.
Suppose you want an LLM to obey an arbitrary list of “rules”.
You have the LLM generate output, and you measure how often, in testing and in production, it violates the rules.
Say pFail is 0.1. Then you add another stage: have the LLM check its own output for a rule violation, and don’t send it to the user if a violation is found.
Say the pFail on that stage is 0.2.
Therefore the overall system will fail 0.1 × 0.2 = 2% of the time.
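Here is a minimal toy sketch of that two-stage setup, with the stage failure rates simulated rather than measured; the function names and the rates (0.1 and 0.2, from the example above) are placeholders, not a real model API.

```python
import random

# Toy simulation of the two-stage generate-then-check pipeline described above.
P_VIOLATE = 0.1   # chance a raw generation violates the rules
P_MISS = 0.2      # chance the self-check misses an existing violation

def generate_output(prompt: str) -> tuple[str, bool]:
    """Return (output_text, violates_rules); the violation flag is simulated."""
    return f"response to: {prompt}", random.random() < P_VIOLATE

def passes_self_check(violates: bool) -> bool:
    """True if the output is allowed through to the user."""
    if not violates:
        return True                   # clean output always passes
    return random.random() < P_MISS   # a violation sneaks through 20% of the time

def pipeline(prompt: str) -> bool:
    """Run one request; return True if a rule violation reached the user."""
    _output, violates = generate_output(prompt)
    return passes_self_check(violates) and violates

# Empirical end-to-end failure rate; should converge to 0.1 * 0.2 = 0.02.
trials = 100_000
failures = sum(pipeline("some task") for _ in range(trials))
print(f"observed failure rate: {failures / trials:.4f}")
```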
Maybe good enough for current uses, but not good enough to run a steel mill. A robot making an error 2% of the time will probably break itself and cost more service-worker time than having a human operator would.
So you add stages. You create training environments where you model the steel mill, you add more stages of error checking, and you keep going until, empirically, your design's failure rate meets spec.
This is standard engineering practice. No “AI alignment experts” needed; any ‘real’ engineer knows this.
One of the critical things is that your test environment needs to reflect reality. There is a lot involved in this, but the part crucial to AI is immutable model weights. Both when you are validating the model and when it’s used in the real world, it’s immutable. No learning, no going out of control.
Another aspect is controlling state buildup. Many of the most notorious software failures—see the Patriot missile, see the Therac-25—happened because state accumulated at runtime. You can prevent this; fresh prompts when using GPT-4 are one obvious way. Limiting what information the model has to operate on reduces how often it fails, both in production and in testing.
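A minimal sketch of that discipline, assuming a generic chat-style inference function; call_model and the system prompt below are placeholders, not any particular vendor's API:

```python
# A sketch of "no accumulated state": every task is answered from a fresh,
# fixed context. Nothing from earlier calls is carried over, and the weights
# are frozen, so runtime state cannot build up between requests.
FIXED_SYSTEM_PROMPT = "Follow the task rules. Output only the requested artifact."

def run_task(call_model, task_input: str) -> str:
    """call_model is a placeholder for whatever inference function is in use."""
    messages = [
        {"role": "system", "content": FIXED_SYSTEM_PROMPT},
        {"role": "user", "content": task_input},  # no chat history, no memory of prior tasks
    ]
    return call_model(messages)
```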
A superintelligent system can be restricted the same way. While it may be far past human ability, we test it in ways we can verify, we check the distribution of the inputs to make sure they were reflected in the test environment—that is, the real-world input could have been generated in test—and it’s superintelligent because it generates the right answer almost every time, with an error rate well below a human’s.
I think the cognitive error here is that everyone imagines an “ASI” or “AGI” as “like you or me but waaaay smarter”. That baggage brings in a bunch of elements humans have but an AI system does not need to do its job: memory for an inner monologue or persistent chain of thought, continuity of existence, online learning, long-term goals.
You need 0 of those to automate most jobs or make complex decisions that humans cannot make accurately.
I disagree with a lot of particulars here, but don’t want to engage beyond this response, because your post feels like it’s not about the substantive topic any more; it’s just trying to mock an assumed/claimed lack of understanding on my part. (Which would be far worse to have done if it were correct.)
That said, if you want to know more about my background, I’m eminently google-able, and while you clearly have more background in embedded safety compliant systems, I think you’re wrong on almost all of the details of what you wrote as it applies to AGI.
Regarding your analogy to Mobileye’s approach, I’ve certainly read the papers and had long conversations with people at Mobileye about their safety systems. That’s why I think it’s fair to say that you’re fundamentally mischaracterizing the idea of “Responsibility-Sensitive Safety”—it’s not about collision avoidance per se, it’s about not being responsible for accidents, in ways that greatly reduce the probability of such accidents. This is critical for understanding what it does and does not guarantee. More critically, for AI systems, this class of safety guarantee doesn’t work, because you need a complete domain model as well as a complete failure-mode model in order to implement a similar failsafe. I even wrote, back in 2018, about how RSS could be extended and why it isn’t applicable to AGI—but found that many of my ideas were anticipated by Christiano’s 2016 work (which that post is one small part of), and had been further refined in the context of AGI since then.
So I described scaling stateless microservices to control AGI. This is how current models work, this is how CAIS works, and this is how tech-company stacks work.
I mentioned an in-distribution detector as a filter, and empirical measurement of system safety.
I have mentioned this to safety researchers at OpenAI. The one I talked to on the EleutherAI Discord didn’t know of a flaw.
Why won’t this work? It’s theoretically strong, simple, and close to current techniques. Can you name or link one actual objection? Eliezer was unable to do so.
The only objection I have heard is “humans will be tempted to build unsafe systems”. Maybe so, but the unsafe ones will have measurably lower performance than this design, for a reason that I will assume you know. So humans will only build a few, and if they cannot escape the lab because the model needs thousands of current-gen accelerator cards to think, then…
If the action space is small enough that what you want the system not to be able to do can be described programmatically in terms of its outputs, and your threat model is complete, it works fine.
Ok, in my initial reply I missed something. In your words, what kinds of tasks do you believe you cannot accomplish with restricted models like the ones I described?
When you say the “threat model has to be complete”, what did you have in mind specifically?
These are restricted models: they get a prompt from an authorized user plus context in a human-parsable format, and they emit a human-parsable output. This scales from very large to very small tasks, so long as the task can be checked for correctness, ideally in simulation.
With this context, what are your concerns? Why must we be frightened enough to pause everything?
For individual tasks, sure, you can implement verifiers, though I think it quickly becomes unwieldy; there’s no in-principle reason we cannot do this. But you cannot create AGI with a restricted model—we cannot define the space of what outputs we want, otherwise it’s by definition a narrow AI.
What’s GPT-4? Because it can generate outputs that are sometimes correct on new tasks (“write me a program that computes X”), it’s general, even if “compute X” is made of two common subcomponents the model saw many times in training.
GPT-4 is perfectly safe if you run it on local hardware with a local terminal. The “space of outputs” is “text to the terminal”. As long as you don’t leave a security vulnerability where that text stream can cause commands to execute on the host PC, that’s it; that’s all it can do.
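To make the containment claim concrete, here is an illustrative loop for a locally hosted model, where generate() is a placeholder for whatever local inference call is used; the point is simply that the output path ends at print():

```python
# Illustrative containment loop for a locally hosted model: the model's output
# is only ever printed to the terminal. Nothing here calls a shell, writes files,
# or opens sockets, so "text on the screen" is the entire action space.
def repl(generate):
    """generate is a placeholder for a local text-in, text-out inference call."""
    while True:
        prompt = input("> ")
        if prompt.strip().lower() in {"quit", "exit"}:
            break
        print(generate(prompt))  # display only: never exec(), eval(), or subprocess
```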
Consider that “a robot tethered to a mount” could do general tasks the same way. Same idea—it’s a general system, but its command stream can’t reach anything except the tethered robot, because that’s where the wires go.
You also verified the commands empirically. It’s not that you know any given robotic action or text output is good; it’s that you benchmarked the model and it has a certain pFail on training inputs.
I agree this is not as much generality as humans have. It’s not a narrow AI, though the “in-distribution detector”—a measure of how similar the current task and input are to the training set—essentially narrows your AI system from a general one toward a narrow one, depending on your tolerances.
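A minimal sketch of such a detector, using distance to the nearest validation-set embeddings as the score; the embedding representation, the neighbour count, and the threshold are all assumed, illustrative choices that would need calibration:

```python
import numpy as np

class InDistributionGate:
    """Gate the model: only act on inputs that look like the validated test distribution.

    The score is the mean distance to the k nearest validation-set embeddings;
    the embedding function, k, and the threshold are stand-ins to be calibrated
    on held-out data.
    """
    def __init__(self, val_embeddings: np.ndarray, threshold: float, k: int = 10):
        self.val_embeddings = val_embeddings  # shape (N, d)
        self.threshold = threshold
        self.k = k

    def score(self, x_embedding: np.ndarray) -> float:
        dists = np.linalg.norm(self.val_embeddings - x_embedding, axis=1)
        return float(np.sort(dists)[: self.k].mean())

    def allow(self, x_embedding: np.ndarray) -> bool:
        return self.score(x_embedding) <= self.threshold

# Usage: if the gate rejects an input, don't let the model act; stop safely or
# hand off instead of running on out-of-distribution input.
```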
For tasks where you can’t shut the system down when the input state leaves the distribution—say a robotic surgeon, where you need it to keep trying as best it can—you would use electromechanical interlocks, the same as the interlocks that have prevented exposure to radiation for 50 years. You tether the surgical robotic equipment, restrict its network links, etc., so that the number of people it can kill is at most one (the patient).
Right. And this scales to fairly massive tasks. “Design a medium body airliner to these specs” is completely doable. Or an entire chip in one step.
The model doesn’t get to collaborate with future versions of itself, because it doesn’t know who is checking the end product for security vulnerabilities, and to be caught is death. We could give a model thousands of tasks of this complexity level and check for sabotage before giving the model one real task. It gets no context differentiating the real task from the test tasks, and has no memory of any of the prior steps.
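A sketch of the blinded-evaluation harness this implies; call_model and check_for_sabotage are placeholders, and the only load-bearing idea is that audit tasks and the real task go through an identical, stateless interface:

```python
import random

def run_blinded_batch(call_model, real_task: str, audit_tasks: list[str],
                      check_for_sabotage) -> str | None:
    """Submit audit tasks and the real task through one identical, stateless interface.

    call_model and check_for_sabotage are placeholders; the inputs carry no
    signal about which task is the real one, and each call has no memory of
    the others.
    """
    tasks = [("real", real_task)] + [("audit", t) for t in audit_tasks]
    random.shuffle(tasks)  # ordering gives the model no information either

    real_output = None
    for kind, task in tasks:
        output = call_model(task)  # fresh context per call
        if kind == "audit" and check_for_sabotage(task, output):
            raise RuntimeError("sabotage detected on an audit task; output rejected")
        if kind == "real":
            real_output = output
    return real_output
```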
And it scales up and down the tree: the robots building the aircraft are kept from planning sabotage by similar limits, and so on and so forth.
Your plan to deal with bad models is to use your restricted models to manufacture the weapons needed to fight them, and to optimize their engagements.
This, I think, is a grounded and realistic view of how to win this. Asking for pauses is not.
You’re misinterpreting what a moratorium would involve. I think you should read my post, where I outlined what I think a reasonable pathway would be—not stopping completely forever, but a negotiated agreement about how to restrict more powerful and by-default dangerous systems, and therefore only allowing those that are shown to be safe.
Edit to add: “unlike nukes a single AI escaping doesn’t end the world” ← Disagree on both fronts. A single nuclear weapon won’t destroy the world, while a single misaligned and malign superintelligent AI, if created and let loose, almost certainly will—it doesn’t need a hospitable environment.
So there is one model that might work, drawn from nukes. You know about PALs (permissive action links) and the weak-link/strong-link design methodology? This is technology for reducing rogue use of nuclear warheads, and it was shared with the USSR so that they could choose to make their own warheads safe from unauthorized use.
Major AI labs could design software frameworks and tooling that make AI models, even ASI-capability-level models, less likely to escape or misbehave, and release that tooling.
It would be voluntary compliance, but, like the Linux kernel, it might in practice be used by almost everyone.
As for the second point, no. Your argument has a hidden assumption that is not supported by evidence or credible AI scientists.
The evidence is that models exhibiting human-scale abilities need human-scale (within an order of magnitude) compute and memory. The physical hardware racks to support this are enormous and not available outside AI labs. Were we to restrict the retail sale of certain kinds of training accelerator chips, and especially high-bandwidth interconnects, we could limit the places human-level-and-above AI could exist to data centers at known addresses.
Your hidden assumption is optimizations, but the problem is that if you consider not just “AGI” but “ASI”, the amount of hardware needed to support superhuman-level cognition probably scales nonlinearly with capability.
If you wanted a model that could find an action with a better expected value than a human-level model’s 90 percent of the time (so the model is 10 times smarter in utility), it probably needs more than 10 times the compute. Returns are probably logarithmic: to find a better action 90 percent of the time, you need to explore a vastly larger possibility space, and you need the compute and memory to do that.
This is probably provable as a theorem, but the science isn’t there yet.
If correct, ASI is actually easily contained. Just write down where every cluster of 10,000+ H100s is located, or find them by IR signature or power consumption. If you suspect a rogue ASI has escaped, those are the places you check.
This is what I mean by controlling the environment. Real-time auditing of AI accelerator clusters—what model is running, who is paying for it, what’s their license number, etc.—would decrease progress very little while making escapes difficult.
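As a sketch of what one such audit entry might contain (the field names here are illustrative, not a real registration schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClusterAuditRecord:
    """Illustrative audit-log entry for a registered accelerator cluster."""
    cluster_id: str          # physical site / data-center identifier
    accelerator_count: int   # e.g. number of H100-class cards installed
    model_identifier: str    # name or hash of the model currently running
    operator: str            # who is paying for the run
    license_number: str      # registration / license id, as proposed above
    logged_at: datetime      # when this snapshot was taken
```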
If hacking and escapes turn out to be a threat, air gaps and ASIC hardware firewalls to prevent this are the next level of security to add.
The difference is that major labs would not be decelerated at all. There is no pause. They just have to spend, in parallel, a trivial amount of money complying with the registration and logging requirements.