For example, it might be possible to use proof assistants to help solve the AI ‘alignment problem’ by creating AI systems that we can prove have certain properties we think are required for the AI system to reliably do what we want it to do.
I don’t think this is particularly impactful, primarily because I don’t see a path by which it would have an impact, and I haven’t seen anyone make a good case for this particular path to impact.
(It’s hard to argue a negative, but if I had to try, I’d point out that if we want proofs, we would probably do those via math, which works at a much higher level of abstraction and so takes much less work / effort; formal verification seems good for catching bugs in your implementations of ideas, which is not the core of the AI risk problem.)
However, it is plausibly still worthwhile becoming an expert on formal verification because of the potential applications to cybersecurity. (Though it seems like in that case you should just become an expert on cybersecurity.)
As someone working at the intersection of these subjects, I tend to agree with your conclusion, and with your next comment to Arden describing the design-implementation relationship.
Edit 19 Feb 2022: I want to clarify my position, namely, that I don’t see formal verification as a promising career path. As for what I write below, I both don’t believe it is a very practical suggestion, and I am not at all sold on AI safety.
However, while thinking about this, I did come up with a (very rough) idea for AI alignment, where formal verification could play a significant role. One scenario for AGI takeoff, or for solving AI alignment, is to do it inductively—that is, each generation of agents designs the next generation, which should be more sophisticated (and hopefully still aligned). Perhaps one plan to achieve this is as follows (I’m not claiming that any step is easy or even plausible):
1. Formally define what it means for an agent to be aligned, in such a way that subsequent agents designed by this agent are also aligned.
2. Build your first generation of AI agents (which should be as lean and simple as possible, to make the next step easier).
3. Let a (perhaps computer-assisted) human prove that the first generation of AI is aligned in the formal sense of step 1.
4. Then, once you deploy the first generation of agents, it is their job to formally prove that further agents designed by them are aligned as well. Hopefully, since they are very intelligent, and plausibly good at manipulating the previous formal proofs, they can find such proofs. Since the proofs are formal, humans can trust and verify them (for example using traditional formal proof checkers), despite not being able to come up with the proofs themselves.
This plan has many pitfalls (for example, each step may turn out to be extremely hard to carry out, or maybe your definition of alignment will be so strict that the agents won’t be able to construct any new and interesting aligned agents); however, it is a possible way to be certain of having aligned AI.
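To make the intended logical structure concrete, here is a minimal Lean 4 sketch of the induction this plan relies on (Lean chosen purely because the thread is about proof assistants). Every name in it (Agent, SafeSpec, designsNext, generation, every_generation_aligned) is a hypothetical placeholder rather than a real formalization, and this only shows the shape of the argument.

```lean
-- Minimal sketch of the plan's inductive shape. All names here are
-- hypothetical placeholders; the plan's real difficulty lives entirely
-- inside the hypotheses assumed below.

-- Generation 0 is the human-built system (step 2); generation n+1 is
-- whatever generation n designs.
def generation {Agent : Type} (designsNext : Agent → Agent) (first : Agent) :
    Nat → Agent
  | 0     => first
  | n + 1 => designsNext (generation designsNext first n)

theorem every_generation_aligned {Agent : Type}
    -- Hypothetical placeholders: SafeSpec is step 1's formal alignment
    -- property, designsNext is how a generation designs its successor,
    -- and first is the human-built first generation (step 2).
    (SafeSpec : Agent → Prop) (designsNext : Agent → Agent) (first : Agent)
    -- Step 3: humans (perhaps computer-assisted) prove the base case.
    (base : SafeSpec first)
    -- Step 4: each generation discharges this obligation for its successor.
    (preserved : ∀ a, SafeSpec a → SafeSpec (designsNext a)) :
    ∀ n, SafeSpec (generation designsNext first n) := by
  intro n
  induction n with
  | zero => exact base
  | succ n ih => exact preserved _ ih
```

All of the plan’s difficulty is hidden in what SafeSpec says, in the human proof of the base case, and in each generation’s ability to produce a checkable proof of the preservation obligation for its successor; the induction itself is the easy part.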
I agree (with the caveat that I have much less relevant domain knowledge than Rohin, so you should presumably give less weight to my view).
Several years ago, I attended a small conference on ‘formal mathematics’, i.e. proof assistants, formal verification, and related issues. As far as I remember, all of the real-world applications mentioned there were of the type “catching bugs in your implementations of ideas”. For example, I remember someone saying they had verified software used in air traffic control. This does suggest formal verification can be useful in real-world safety applications. However, I’d guess (very speculative) that these kinds of applications won’t benefit much from further research, and wouldn’t need to be done by someone with an understanding of EA—they seem relatively standard and easy to delegate, and at least in principle doable with current tools.
Thanks Max—I’ll pass this on!
Hi Rohin,
Thanks for this comment. I don’t know a lot about this area, so I’m not confident here. But I would have thought that it would sometimes be important for making safe and beneficial AI to be able to prove that systems actually exhibit certain properties when implemented.
I guess I think this first because bugs seem capable of being big deals in this context (maybe I’m wrong there?), and because it seems like there could be some instances where it’s more feasible to use proof assistants than math to prove that a system has a property.
Curious to hear if/why you disagree!
I would have thought that it would sometimes be important for making safe and beneficial AI to be able to prove that systems actually exhibit certain properties when implemented.
We can decompose this into two parts:
1. Proving that the system that we design has certain properties
2. Proving that the system that we implement matches the design (and so has the same properties)
1 is usually done by math-style proofs, which are several orders of magnitude easier to do than direct formal verification of the system in a proof assistant without having first done the math-style proof.
2 is done by formal verification, where for complex enough systems the specification for the formal verification often comes from the output of a math proof.
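As a toy Lean 4 illustration of this split (all names are made up for the example, and nothing here is about real AI systems): a property is proved about an abstract “design”, a separately written “implementation” is proved to match the design, and the property then carries over.

```lean
-- Toy illustration only; every name is hypothetical.

-- The "design": an abstract description of the computation we intend.
def designedDouble (n : Nat) : Nat := n + n

-- Part 1: a math-style property of the design.
theorem designedDouble_even (n : Nat) : designedDouble n % 2 = 0 := by
  unfold designedDouble
  omega

-- The "implementation": a different way of computing the same thing,
-- standing in for the real, messier implementation.
def implementedDouble (n : Nat) : Nat := 2 * n

-- Part 2: the implementation matches the design ...
theorem implemented_matches_design (n : Nat) :
    implementedDouble n = designedDouble n := by
  unfold implementedDouble designedDouble
  omega

-- ... and therefore inherits the property proved about the design.
theorem implementedDouble_even (n : Nat) : implementedDouble n % 2 = 0 := by
  rw [implemented_matches_design]
  exact designedDouble_even n
```

Here the matching proof is trivial; in real systems that step is where formal verification does its work, with the specification it checks against often coming from the math-style proof, as described above.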
I guess I think this first because bugs seem capable of being big deals in this context
I’m arguing that after you’ve done 1, even if there’s a failure from not having done 2, it’s very unlikely to cause x-risk via the usual mechanism of an AI system adversarially optimizing against humans. (Maybe it causes x-risk in that due to a bug the computer system says “call Russia” and that gets translated to “launch all the nukes”, or something like that, but that’s not specific to AI alignment, and I think it’s pretty unlikely.)
Like, idk. I struggle to actually think of a bug in implementation that would lead to a powerful AI system optimizing against us, when without that bug it would have been fine. Even if you accidentally put a negative sign on a reward function, I expect that this would be caught long before the AI system was a threat.
I realize this isn’t a super compelling response, but it’s hard to argue against this because it’s hard to prove a negative.
there could be some instances where it’s more feasible to use proof assistants than math to prove that a system has a property.
Proof assistants are based on math. Any time a proof assistant proves something, it can produce a “transcript” that is a formal math proof of that thing.
Now you might hope that proof assistants can do things faster than humans, because they’re automated. This isn’t true—usually the automation is things like “please just prove for me that 2*x is larger than x, I don’t want to have to write the details myself”, or “please fill out and prove the base case of this induction argument”, where a standard math proof wouldn’t even note the detail.
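For concreteness, this is roughly what that level of automation looks like in a proof assistant (Lean 4 here, purely as an illustration): the “2*x is larger than x” detail is discharged by a single call to a decision procedure, the kind of step a written math proof would not even mention.

```lean
-- A detail a written math proof would silently skip, discharged by the
-- omega decision procedure for linear arithmetic instead of being spelled out.
example (x : Nat) (h : 0 < x) : x < 2 * x := by
  omega
```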
Sometimes a proof assistant can do better than humans, when the proof of a fact is small but deeply unintuitive, such that brute-force search is actually better than fine-tuned human intuition. I know of one such case, though I’m failing to find a link for it. But this is by far the exception, not the rule.
(There are some proofs, most famously the four-color theorem about map coloring, where part of the proof was done by a special-purpose computer program searching over a space of possibilities. I’m not counting these, as this feels like mathematicians doing a math proof and finding a subpart that they delegated to a machine.)
EDIT: I should note that one use case that seems plausible to me is to use formal verification techniques to verify learned specifications, or specifications that change based on the weights of some neural net, but I’d be pretty surprised if this was done using proof assistants (as opposed to other techniques in formal verification).
Thanks so much Rohin for this explanation. It sounds somewhat persuasive to me, but I don’t feel in a position to have a good judgement on the matter. I’ll pass this on to our AI specialists to see what they think!
Chiming in on this very late. (I worked on formal verification research using proof assistants for a sizable part of undergrad.)
- Given the stakes, it seems like it could be important to formally verify 1 after the math-proof step. Math proofs are erroneous a non-trivial fraction of the time.
- While I agree that proof assistants right now are much slower than doing math proofs yourself, verification is a pretty immature field. I can imagine them improving to the point where they actually are better to use than doing math proofs yourself, and I don’t think this would be the worst thing to invest in.
- I’m somewhat unsure about the extent to which we’ll be able to cleanly decompose 1. and 2. in the systems we design, though I haven’t thought about it much.
- A lot of the formal verification work on proof assistants seems to me like it’s also work that could apply to verifying learned specifications? E.g. I’m imagining that this process would be automated, and the automation used could look a lot like the parts of proof assistants that automate proofs.