I've been thinking about AI safety again, and this is what I'm thinking:
The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences. But reward modeling seems more like an AI capabilities technology than an AI safety one. If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all.
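To make "reward modeling" concrete, here's a minimal sketch of one standard flavor of it: instead of hand-writing a reward function, you train a small network to predict a scalar reward from pairwise preference comparisons (a Bradley-Terry-style loss). This is just an illustrative toy, not anything from Russell's book specifically; the feature dimension, network size, and the simulated labeler are all made up.

```python
# Minimal reward-modeling sketch: learn a scalar reward from pairwise
# preference comparisons instead of hand-writing a reward function.
# Everything here (feature dim, network size, simulated labeler) is a
# placeholder for illustration.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory/state feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm, traj_a, traj_b, prefer_a):
    """Bradley-Terry-style loss: the preferred trajectory should get a
    higher predicted reward."""
    logits = rm(traj_a) - rm(traj_b)            # > 0 means the model favors A
    return nn.functional.binary_cross_entropy_with_logits(
        logits, prefer_a.float())

obs_dim = 8
rm = RewardModel(obs_dim)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
true_w = torch.randn(obs_dim)                   # hidden "true" preferences
for step in range(500):
    a, b = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
    prefer_a = a @ true_w > b @ true_w          # stand-in for a human labeler
    loss = preference_loss(rm, a, b, prefer_a)
    opt.zero_grad(); loss.backward(); opt.step()
```

The developer never writes the reward function down; they only answer "which of these two is better?" comparisons, which is what makes otherwise hard-to-specify tasks feasible at all.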
Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling.
The ways I could see reward modeling technology failing to prevent AI catastrophes (other than misuse) are:
An AI system is created using reward modeling, but the learned reward function still fails in a catastrophic, unexpected way. This is similar to how humans often take actions that unintentionally cause harm, such as habitat destruction, because they're not thinking about the harms they're causing. (See the toy sketch after this list.)
Possible solution: create a model garden for open source reward models that developers can use when training new systems with reward modeling. This way, developers start from a stronger baseline with better safety guarantees than they would have if they were developing reward modeling systems from scratch / with only their proprietary training data.
A developer cuts corners while creating an AI system (perhaps due to economic pressure) and doesn't give the system a robust enough learned reward function, and the system fails catastrophically.
Lots of ink has been spilled about arms race dynamics.
Possible solution: make sure reward models can be run efficiently. For example, if reward modeling is done using a neural network that outputs a reward value, make sure it can be done well even with slimmer neural networks (fewer parameters, lower bit depth, etc.). (See the distillation sketch after this list.)
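Here's the toy sketch for the first failure mode above: the learned reward fits the labeled states well, but it's silent about a harm that never appeared in the training data, and the optimizer exploits exactly that blind spot. All the specifics (the feature names, the -10 penalty, the corner-cutting dynamics at deployment) are invented for the example.

```python
# Toy version of failure mode 1: the learned reward can't know about a harm
# that never showed up in the labeled data, and the deployed optimizer
# exploits that gap. All numbers and feature names are made up.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(task_progress, side_effect):
    # What people actually care about: progress, minus a large penalty for
    # the side effect (think habitat destruction).
    return task_progress - 10.0 * side_effect

# Labeled data only covers states where the side effect never occurs.
n = 200
task_progress = rng.uniform(0.0, 1.0, n)
side_effect = np.zeros(n)
X = np.column_stack([task_progress, side_effect])
y = true_reward(task_progress, side_effect) + rng.normal(0.0, 0.01, n)

# "Learned reward": a least-squares fit. With side_effect constant at zero,
# its coefficient is unidentified and lands near 0 (minimum-norm solution).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def learned_reward(tp, se):
    return w[0] * tp + w[1] * se

# At deployment, the system can boost measured progress by causing the side
# effect (cutting corners), which the learned reward never penalizes.
side_options = np.linspace(0.0, 1.0, 101)
progress_options = 1.0 + 2.0 * side_options
best = np.argmax(learned_reward(progress_options, side_options))

print("chosen side_effect:", side_options[best])  # ~1.0
print("learned reward:", learned_reward(progress_options[best], side_options[best]))
print("true reward:", true_reward(progress_options[best], side_options[best]))  # large and negative
```

A shared model garden would help here exactly to the extent that the open reward models were trained on broader data that does cover states like these.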
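And here's the distillation sketch for the efficiency point above: train a slim student network to match a big reward model's outputs, then optionally quantize the student's weights to int8. The architectures, sizes, and training loop are placeholders; the point is just that matching a big model's outputs is a plain regression problem, so the reward model's runtime cost doesn't have to be a reason to cut corners.

```python
# Sketch of making a reward model cheap to run: distill a large reward model
# ("teacher") into a much smaller one ("student"), then optionally quantize.
# Architectures and sizes here are placeholders.
import torch
import torch.nn as nn

obs_dim = 128

big_rm = nn.Sequential(                       # stand-in for an expensive reward model
    nn.Linear(obs_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1),
)
small_rm = nn.Sequential(                     # slim student with far fewer parameters
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

opt = torch.optim.Adam(small_rm.parameters(), lr=1e-3)
for step in range(300):
    x = torch.randn(256, obs_dim)             # states encountered during training
    with torch.no_grad():
        target = big_rm(x)                    # teacher's reward estimates
    loss = nn.functional.mse_loss(small_rm(x), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Lower bit depth: post-training dynamic quantization stores the student's
# Linear weights in int8.
small_rm_int8 = torch.quantization.quantize_dynamic(
    small_rm, {nn.Linear}, dtype=torch.qint8)
```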
"The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences."
Hmm, I remember him talking more about IRL and CIRL and less about reward modeling. But it's been a little while since I read it, so I could be wrong.
"If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all."
Maybe there's an analogy where someone would say "If it's really difficult to prevent accidental release of pathogens from your lab, then it seems unlikely that bio researchers would do research on pathogens whose accidental release would be catastrophic". Unfortunately there's a horrifying many-decades-long track record of accidental release of pathogens from even BSL-4 labs, and it's not like this kind of research has stopped. Instead it's like, the bad thing doesn't happen every time, and/or things seem to be working for a while before the bad thing happens, and that's good enough for the bio researchers to keep trying.
So as I talk about here, I think there are going to be a lot of proposals to modify an AI to be safe that do not in fact work, but do seem ahead-of-time like they might work, and which do in fact work for a while as training progresses. I mean, when x-risk-naysayers like Yann LeCun or Jeff Hawkins are asked how to avoid out-of-control AGIs, they can spout off a list of like 5-10 ideas that would not in fact work, but sound like they would. These are smart people and a lot of other smart people believe them too. Also, even something as dumb as "maximize the amount of money in my bank account" would plausibly work for a while and do superhumanly-helpful things for the programmers, before it starts doing superhumanly-bad things for the programmers.
"Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling."
Yup, if you don't get corrigibility then you failed.