I disagree with a lot of particulars here, but don’t want to engage beyond this response because your post feels like it’s not about the substantive topic any more, it’s just trying to mock an assumed / claimed lack of understanding on my part. (Which would be far worse to have done if it were correct.)
That said, if you want to know more about my background, I’m eminently google-able, and while you clearly have more background in embedded safety-compliant systems, I think you’re wrong on almost all of the details of what you wrote as it applies to AGI.
Regarding your analogy to Mobileye’s approach, I’ve certainly read the papers and had long conversations with people at Mobileye (including at least one former employee of theirs) about their safety systems. That’s why I think it’s fair to say that you’re fundamentally mischaracterizing the idea of “Responsibility-Sensitive Safety”: it’s not about collision avoidance per se, it’s about not being responsible for accidents, in ways that greatly reduce the probability of such accidents. This is critical for understanding what it does and does not guarantee. More critically, for AI systems this class of safety guarantee doesn’t work, because you need a complete domain model as well as a complete failure-mode model in order to implement a similar failsafe. I even wrote, back in 2018, about how RSS could be extended and why it’s not applicable to AGI, but found that many of my ideas had been anticipated by Christiano’s 2016 work (of which that post is one small part) and had been further refined in the context of AGI since then.
So I described scaling stateless microservices to control AGI. This is how current models work, how CAIS works, and how tech company stacks work.
I mentioned an in-distribution detector as a filter and as an empirical measure of system safety.
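Concretely, the shape of what I mean looks something like the sketch below. Every name in it is a placeholder for illustration, not anyone’s real API; it just shows a stateless service gated by an in-distribution score, with each decision recorded so safety can be measured after the fact.

```python
# Illustrative sketch only: stateless request handling gated by an
# in-distribution score. All functions and names here are placeholders.
from dataclasses import dataclass
import random

@dataclass
class Verdict:
    allowed: bool
    score: float
    output: str | None = None

THRESHOLD = 0.9  # would be tuned on held-out data

def in_distribution_score(prompt: str, context: str) -> float:
    return random.random()  # stand-in for the real detector

def call_model(prompt: str, context: str) -> str:
    return f"answer to: {prompt}"  # stand-in for one stateless model call

def handle_request(prompt: str, context: str) -> Verdict:
    """Stateless: nothing persists between calls; refusals fall back to a human."""
    score = in_distribution_score(prompt, context)
    if score < THRESHOLD:
        return Verdict(allowed=False, score=score)
    return Verdict(allowed=True, score=score, output=call_model(prompt, context))
```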
I have mentioned this to safety researchers at OpenAI. The one I talked to on the EleutherAI Discord didn’t know of a flaw.
Why won’t this work? It’s theoretically strong, simple, and close to current techniques. Can you name or link one actual objection? Eliezer was unable to do so.
The only objection I have heard is “humans will be tempted to build unsafe systems”. Maybe so, but the unsafe ones will have measurably lower performance than this design, for a reason I will assume you know. So humans will only build a few, and if those cannot escape the lab because the model needs thousands of current-gen accelerator cards to think, then....
If your action space is small enough that everything you want the model not to be able to do can be described programmatically in terms of its outputs, and your threat model is complete, it works fine.
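As a sketch of what “described programmatically in terms of its outputs” could look like (the command names and bounds below are invented for illustration, not from anyone’s actual system): the model’s raw text is parsed into a command, and anything outside the whitelist never reaches an actuator.

```python
# Illustrative output whitelist: only known commands with in-range parameters pass.
import json

ALLOWED_COMMANDS = {
    "move_arm": {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 1.0)},  # metres
    "set_gripper": {"force": (0.0, 10.0)},                              # newtons
}

def validate(raw_output: str) -> dict | None:
    """Return the parsed command if it lies inside the allowed action space, else None."""
    try:
        cmd = json.loads(raw_output)
        name, args = cmd["command"], cmd["args"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    bounds = ALLOWED_COMMANDS.get(name)
    if bounds is None or not isinstance(args, dict) or set(args) != set(bounds):
        return None
    for key, (lo, hi) in bounds.items():
        if not isinstance(args[key], (int, float)) or not (lo <= args[key] <= hi):
            return None
    return cmd

# This passes, while free-form text or an out-of-range force does not.
print(validate('{"command": "set_gripper", "args": {"force": 2.5}}'))
```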
OK, in my initial reply I missed something.
In your words, what kinds of tasks do you believe you cannot accomplish with restricted models like the ones I described?
When you say the “threat model has to be complete”, what did you have in mind specifically?
These are restricted models: they get a prompt from an authorized user plus context in a human-parsable format, and they emit a human-parsable output. This scales from very large to very small tasks, so long as the task can be checked for correctness, ideally in simulation.
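A minimal sketch of that check-in-simulation loop; the proposer and simulator below are stand-ins, not a real model or simulator API.

```python
# Only accept an output that the independent checker signs off on.
def propose(prompt: str, context: str, seed: int) -> str:
    """Stand-in for one fresh, stateless model call."""
    return f"candidate-design-{seed}"

def passes_simulation(candidate: str) -> bool:
    """Stand-in for the correctness check (unit tests, physics sim, etc.)."""
    return candidate.endswith("3")

def solve(prompt: str, context: str, attempts: int = 10) -> str | None:
    for seed in range(attempts):
        candidate = propose(prompt, context, seed)
        if passes_simulation(candidate):
            return candidate
    return None  # escalate to a human instead of shipping an unchecked answer

print(solve("design a bracket to these specs", "load: 40 kN"))  # candidate-design-3
```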
With this context, what are your concerns? Why must we be frightened enough to pause everything?
For individual tasks, sure, you can implement verifiers, though I think it quickly becomes unwieldy; there’s no in-principle reason we cannot do this. But you cannot create AGI with a restricted model: we cannot define the space of outputs we want, otherwise it’s by definition a narrow AI.
What’s GPT-4?
Because it can generate outputs that are sometimes correct on new tasks (“write me a program that computes X”), it’s general, even if “compute X” is made of two common subcomponents the model saw many times in training.
GPT-4 is perfectly safe if you run it on local hardware with a local terminal. The “space of outputs” is “text to the terminal”. As long as you don’t leave a security vulnerability where that text stream can cause commands to execute on the host PC, that’s it; that’s all it can do.
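The boundary in question is just this, in miniature (illustrative only):

```python
# The model's reply is only ever rendered as text, never evaluated or handed to a shell.
def display_reply(model_reply: str) -> None:
    print(model_reply)  # safe: the text is treated purely as data

# The vulnerability being warned about is anything like
#   subprocess.run(model_reply, shell=True)
# i.e. a path where the text stream can cause commands to execute on the host PC.
```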
Consider that “a robot tethered to a mount” could do general tasks the same way. Same idea: it’s a general system, but its command stream can’t reach anything except the tethered robot, because that’s where the wires go.
You also verified the commands empirically. It’s not that you know any given robotic action or text output is good; it’s that you benchmarked the model and it has a certain pFail on training-distribution inputs.
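That benchmark is an ordinary measurement. As a sketch (the numbers are invented), you run the model on N in-distribution cases with known answers and report the observed failure rate with a one-sided upper bound:

```python
# Observed failure rate on in-distribution test cases, plus a one-sided Hoeffding margin.
import math

def pfail_upper_bound(failures: int, n: int, confidence: float = 0.999) -> float:
    observed = failures / n
    margin = math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2.0 * n))
    return min(1.0, observed + margin)

# e.g. 12 failures in 100,000 in-distribution trials:
print(pfail_upper_bound(12, 100_000))  # ≈ 0.006 at 99.9% confidence
```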
I agree this is not as much generality as humans have. It’s not a narrow AI, though the “in-distribution detector” (a measure of how similar the current task and input are to the training set) essentially narrows your AI system from a general one to a narrow one, depending on your tolerances.
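One plausible way to implement such a detector (my assumption about the mechanism, not a spec): embed the incoming task, compare it to the training set, and treat anything whose nearest training example is too dissimilar as out of distribution.

```python
# Nearest-neighbour cosine similarity to the training set as an in-distribution score.
import numpy as np

def build_index(train_embeddings: np.ndarray) -> np.ndarray:
    """Normalise the training embeddings once so a lookup is just a dot product."""
    return train_embeddings / np.linalg.norm(train_embeddings, axis=1, keepdims=True)

def in_distribution(index: np.ndarray, query: np.ndarray, min_similarity: float = 0.8) -> bool:
    q = query / np.linalg.norm(query)
    nearest = float(np.max(index @ q))  # best cosine similarity to any training point
    return nearest >= min_similarity

# Tightening min_similarity narrows the system; loosening it keeps it more general.
rng = np.random.default_rng(0)
index = build_index(rng.normal(size=(10_000, 256)))
print(in_distribution(index, rng.normal(size=256)))
```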
For tasks where you can’t shut the system down when the input state leaves distribution (say a robotic surgeon: you need it to keep trying as best it can), you would use electromechanical interlocks, the same as the interlocks that have prevented exposure to radiation for the last 50 years. You tether the surgical robot’s equipment, restrict its network links, etc., so that the number of people it can kill is at most 1 (the patient).
Right. And this scales to fairly massive tasks. “Design a medium body airliner to these specs” is completely doable. Or an entire chip in one step.
The model doesn’t get to collaborate with future versions of itself because it doesn’t know who is checking the end product for security vulnerabilities, and to be caught is death. We could give a model thousands of tasks at this complexity level and check for sabotage before giving the model one real task. It gets no context differentiating the real task from the test tasks, and has no memory of any of the prior steps.
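A sketch of that blinding scheme, with the task content and the audit as stand-ins: the real task is shuffled in with the decoys, every call is stateless, and the real output is only released if every audited output comes back clean.

```python
# Blinded evaluation: decoy tasks are indistinguishable from the real one.
import random

def run_stateless(task: str) -> str:
    return f"output for: {task}"      # stand-in for one fresh, memoryless model call

def audit_for_sabotage(output: str) -> bool:
    return "backdoor" not in output   # stand-in for the human/tool security review

def blinded_run(real_task: str, decoy_tasks: list[str]) -> str | None:
    tasks = decoy_tasks + [real_task]
    random.shuffle(tasks)             # nothing marks the real task as real
    results = {t: run_stateless(t) for t in tasks}
    if all(audit_for_sabotage(o) for o in results.values()):
        return results[real_task]
    return None                       # any detected sabotage burns the whole batch
```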
And, see, it scales up and down the tree: the robots building the aircraft don’t get to plan their sabotage, due to similar limits, and so on and so forth.
Your plan to deal with bad models is to use your restricted models to manufacture the weapons needed to fight them, and to optimize their engagements.
This, I think, is a grounded and realistic view of how to win this. Asking for pauses is not.