Hi Rohin,
Thanks for this comment. I don’t know a lot about this area, so I’m not confident here. But I would have thought that it would sometimes be important for making safe and beneficial AI to be able to prove that systems actually exhibit certain properties when implemented.
I guess I think this first because bugs seem capable of being big deals in this context (maybe I’m wrong there?), and because it seems like there could be some instances where it’s more feasible to use proof assistants than math to prove that a system has a property.
Curious to hear if/why you disagree!
> I would have thought that it would sometimes be important for making safe and beneficial AI to be able to prove that systems actually exhibit certain properties when implemented.
We can decompose this into two parts:
1. Proving that the system that we design has certain properties
2. Proving that the system that we implement matches the design (and so has the same properties)
Part 1 is usually done with math-style proofs, which are several orders of magnitude easier than directly formally verifying the system in a proof assistant without having first done the math-style proof.
Part 2 is done by formal verification, where for sufficiently complex systems the specification being verified often comes from the output of a math proof.
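To make the two parts concrete, here is a minimal toy sketch in Lean 4 (the names `double_spec` and `double_impl` are hypothetical, invented purely for illustration, and the `omega` step assumes a reasonably recent toolchain):

```lean
-- Part 1 lives at the level of the design: a math-level description of the
-- function we intend, plus a proof that it has the property we care about.
def double_spec (n : Nat) : Nat := 2 * n

theorem spec_is_even (n : Nat) : ∃ k, double_spec n = 2 * k :=
  ⟨n, rfl⟩  -- the designed function always returns an even number

-- Part 2 is about the implementation: show it matches the design, so that it
-- inherits the property proved above.
def double_impl (n : Nat) : Nat := n + n

theorem impl_matches_spec (n : Nat) : double_impl n = double_spec n := by
  unfold double_impl double_spec
  omega  -- the shipped code computes exactly the designed function
```

In a real system the implementation would be actual deployed code and part 2 would be a large formal-verification effort, but the division of labour is the same.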
> I guess I think this first because bugs seem capable of being big deals in this context
I’m arguing that after you’ve done 1, even if there’s a failure from not having done 2, it’s very unlikely to cause x-risk via the usual mechanism of an AI system adversarially optimizing against humans. (Maybe it causes x-risk in that due to a bug the computer system says “call Russia” and that gets translated to “launch all the nukes”, or something like that, but that’s not specific to AI alignment, and I think it’s pretty unlikely.)
Like, idk. I struggle to actually think of a bug in implementation that would lead to a powerful AI system optimizing against us, when without that bug it would have been fine. Even if you accidentally put a negative sign on a reward function, I expect that this would be caught long before the AI system was a threat.
I realize this isn’t a super compelling response, but it’s hard to argue against this because it’s hard to prove a negative.
> there could be some instances where it’s more feasible to use proof assistants than math to prove that a system has a property.
Proof assistants are based on math. Any time a proof assistant proves something, it can produce a “transcript” that is a formal math proof of that thing.
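For instance, here is a toy Lean 4 illustration of what that transcript looks like (the theorem itself is arbitrary):

```lean
-- The human-facing tactic script:
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p := by
  intro h
  exact ⟨h.right, h.left⟩

-- Behind the scenes the assistant has elaborated the script into an explicit
-- proof term (roughly `fun p q h => ⟨h.right, h.left⟩`); that term is the
-- formal math proof, and `#print` displays it.
#print and_swap
```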
Now you might hope that proof assistants can do things faster than humans, because they’re automated. This isn’t true—usually the automation is things like “please just prove for me that 2*x is larger than x, I don’t want to have to write the details myself”, or “please fill out and prove the base case of this induction argument”, where a standard math proof wouldn’t even note the detail.
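As a toy illustration of the kind of detail this automation handles (Lean 4 again; `repeatZeros` is a made-up function just for the example, and `omega` assumes a reasonably recent toolchain):

```lean
-- "Please just prove for me that 2*x is larger than x" (for positive x):
-- the `omega` decision procedure discharges routine linear arithmetic.
example (x : Nat) (h : 0 < x) : x < 2 * x := by
  omega

-- "Please fill out and prove the base case of this induction argument":
-- the `zero` case below is closed by `simp` with no hand-written detail,
-- the kind of step a paper proof would not even mention.
def repeatZeros : Nat → List Nat
  | 0 => []
  | n + 1 => 0 :: repeatZeros n

theorem length_repeatZeros (n : Nat) : (repeatZeros n).length = n := by
  induction n with
  | zero => simp [repeatZeros]
  | succ k ih => simp [repeatZeros, ih]
```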
Sometimes a proof assistant can do better than humans, when the proof of a fact is small but deeply unintuitive, such that brute-force search is actually better than finetuned human intuition. I know of one such case, though I’m failing to find a link for it. But this is by far the exception, not the rule.
(There are some proofs, most famously the four-color theorem, where part of the proof was done by a special-purpose computer program searching over a space of possibilities. I’m not counting these, as this feels like mathematicians doing a math proof and finding a subpart that they delegated to a machine.)
EDIT: I should note that one use case that seems plausible to me is to use formal verification techniques to verify learned specifications, or specifications that change based on the weights of some neural net, but I’d be pretty surprised if this was done using proof assistants (as opposed to other techniques in formal verification).
Thanks so much Rohin for this explanation. It sounds somewhat persuasive to me, but I don’t feel in a position to have a good judgement on the matter. I’ll pass this on to our AI specialists to see what they think!
Chiming in on this very late. (I worked on formal verification research using proof assistants for a sizable part of undergrad.)
- Given the stakes, it seems like it could be important to formally verify part 1 after the math-proof step. Math proofs are erroneous a non-trivial fraction of the time.
- While I agree that proof assistants right now are much slower than doing math proofs yourself, verification is a pretty immature field. I can imagine them improving to the point where they actually become better to use than doing math proofs yourself, and I don’t think this would be the worst thing to invest in.
- I’m somewhat unsure about the extent to which we’ll be able to cleanly decompose 1. and 2. in the systems we design, though I haven’t thought about it much.
- A lot of the formal verification work on proof assistants seems to me like it’s also work that could apply to verifying learned specifications? E.g. I’m imagining that this process would be automated, and the automation used could look a lot like the parts of proof assistants that automate proofs.