“If we had a fast, fool-proof way to analyze machine blueprints and confirm that they’re safe to implement, then we could trust the design without needing to trust the designer. But no such method exists.”
Are you expecting a general solution with a low false negative rate? Isn’t this doable if the designs are simple enough to understand fully or fall within a well-known category that we do have a method to check for? We’d just reject any design we couldn’t understand or verify.
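The policy being proposed here, a maximally conservative reviewer that accepts only designs it can fully understand or mechanically verify, can be sketched in a few lines. This is purely illustrative and not from the conversation; the field names and categories are hypothetical.

```python
# Hypothetical sketch of a conservative blueprint-review policy:
# accept a design only if it is simple enough to be fully understood,
# or falls within a well-known category we already know how to check.
# Category names and blueprint fields are made up for illustration.

KNOWN_CHECKABLE_CATEGORIES = {"bridge", "water_pump"}

def review(blueprint: dict) -> bool:
    """Return True only for designs we can understand or verify."""
    if blueprint.get("fully_understood"):
        return True
    if blueprint.get("category") in KNOWN_CHECKABLE_CATEGORIES:
        return True
    # Default stance: reject anything we cannot understand or verify.
    return False
```

Note the asymmetry this buys: such a policy drives the false negative rate (approving an unsafe design) toward zero at the cost of a high false positive rate, since it rejects every safe-but-opaque design along with the dangerous ones.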
Also, why does it need to be fast, and how fast? To not give up your edge to others who are taking more risk?
“Are you expecting a general solution with a low false negative rate? Isn’t this doable if the designs are simple enough to understand fully or fall within a well-known category that we do have a method to check for?”
I don’t know of a way to save the world using only blueprints that, e.g., a human could confirm (in a reasonable length of time) are a safe way to save the world, in the face of superintelligent optimization to manipulate the human.
From my perspective, the point here is that human checking might add a bit of extra safety or usefulness, but the main challenge is to get the AGI to want to help with the intended task. If the AGI is adversarial, you’ve already failed.
“Also, why does it need to be fast, and how fast? To not give up your edge to others who are taking more risk?”
Yes. My guess would be that the first AGI project will have less than five years to save the world (before a less cautious project destroys it), and more than three months. Time is likely to be of the essence, and I quickly become more pessimistic about save-the-world-with-AGI plans as they start taking more than e.g. one year in expectation.