I've been thinking about AI safety again, and this is what I'm thinking:
The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences. But reward modeling seems more like an AI capabilities technology than an AI safety one. If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all.
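To make "reward modeling" concrete, here's a minimal sketch of one standard flavor of it: instead of hand-writing a reward function, you train a small network to predict a scalar reward from pairwise preference comparisons (a Bradley-Terry-style loss). This is just an illustrative toy, not anything from Russell's book specifically; the feature dimension, network size, and the simulated labeler are all made up.

```python
# Minimal reward-modeling sketch: learn a scalar reward from pairwise
# preference comparisons instead of hand-writing a reward function.
# Everything here (feature dim, network size, simulated labeler) is a
# placeholder for illustration.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory/state feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm, traj_a, traj_b, prefer_a):
    """Bradley-Terry-style loss: the preferred trajectory should get a
    higher predicted reward."""
    logits = rm(traj_a) - rm(traj_b)            # > 0 means the model favors A
    return nn.functional.binary_cross_entropy_with_logits(
        logits, prefer_a.float())

obs_dim = 8
rm = RewardModel(obs_dim)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
true_w = torch.randn(obs_dim)                   # hidden "true" preferences
for step in range(500):
    a, b = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
    prefer_a = a @ true_w > b @ true_w          # stand-in for a human labeler
    loss = preference_loss(rm, a, b, prefer_a)
    opt.zero_grad(); loss.backward(); opt.step()
```

The developer never writes the reward function down; they only answer "which of these two is better?" comparisons, which is what makes otherwise hard-to-specify tasks feasible at all.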
Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling.
The ways I could see reward modeling technology failing to prevent AI catastrophes (other than misuse) are:
An AI system is created using reward modeling, but the learned reward function still fails in a catastrophic, unexpected way. This is similar to how humans often take actions that unintentionally cause harm, such as habitat destruction, because they're not thinking about the harms they're causing. (See the toy sketch after this list.)
Possible solution: create a model garden for open source reward models that developers can use when training new systems with reward modeling. This way, developers start from a stronger baseline with better safety guarantees than they would have if they were developing reward modeling systems from scratch / with only their proprietary training data.
A developer cuts corners while creating an AI system (perhaps due to economic pressure) and doesn't give the system a robust enough learned reward function, and the system fails catastrophically.
Lots of ink has been spilled about arms race dynamics.
Possible solution: make sure reward models can be run efficiently. For example, if reward modeling is done using a neural network that outputs a reward value, make sure it can be done well even with slimmer neural networks (fewer parameters, lower bit depth, etc.). (See the distillation sketch after this list.)
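Here's the toy sketch for the first failure mode above: the learned reward fits the labeled states well, but it's silent about a harm that never appeared in the training data, and the optimizer exploits exactly that blind spot. All the specifics (the feature names, the -10 penalty, the corner-cutting dynamics at deployment) are invented for the example.

```python
# Toy version of failure mode 1: the learned reward can't know about a harm
# that never showed up in the labeled data, and the deployed optimizer
# exploits that gap. All numbers and feature names are made up.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(task_progress, side_effect):
    # What people actually care about: progress, minus a large penalty for
    # the side effect (think habitat destruction).
    return task_progress - 10.0 * side_effect

# Labeled data only covers states where the side effect never occurs.
n = 200
task_progress = rng.uniform(0.0, 1.0, n)
side_effect = np.zeros(n)
X = np.column_stack([task_progress, side_effect])
y = true_reward(task_progress, side_effect) + rng.normal(0.0, 0.01, n)

# "Learned reward": a least-squares fit. With side_effect constant at zero,
# its coefficient is unidentified and lands near 0 (minimum-norm solution).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def learned_reward(tp, se):
    return w[0] * tp + w[1] * se

# At deployment, the system can boost measured progress by causing the side
# effect (cutting corners), which the learned reward never penalizes.
side_options = np.linspace(0.0, 1.0, 101)
progress_options = 1.0 + 2.0 * side_options
best = np.argmax(learned_reward(progress_options, side_options))

print("chosen side_effect:", side_options[best])  # ~1.0
print("learned reward:", learned_reward(progress_options[best], side_options[best]))
print("true reward:", true_reward(progress_options[best], side_options[best]))  # large and negative
```

A shared model garden would help here exactly to the extent that the open reward models were trained on broader data that does cover states like these.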
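And here's the distillation sketch for the efficiency point above: train a slim student network to match a big reward model's outputs, then optionally quantize the student's weights to int8. The architectures, sizes, and training loop are placeholders; the point is just that matching a big model's outputs is a plain regression problem, so the reward model's runtime cost doesn't have to be a reason to cut corners.

```python
# Sketch of making a reward model cheap to run: distill a large reward model
# ("teacher") into a much smaller one ("student"), then optionally quantize.
# Architectures and sizes here are placeholders.
import torch
import torch.nn as nn

obs_dim = 128

big_rm = nn.Sequential(                       # stand-in for an expensive reward model
    nn.Linear(obs_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1),
)
small_rm = nn.Sequential(                     # slim student with far fewer parameters
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

opt = torch.optim.Adam(small_rm.parameters(), lr=1e-3)
for step in range(300):
    x = torch.randn(256, obs_dim)             # states encountered during training
    with torch.no_grad():
        target = big_rm(x)                    # teacher's reward estimates
    loss = nn.functional.mse_loss(small_rm(x), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Lower bit depth: post-training dynamic quantization stores the student's
# Linear weights in int8.
small_rm_int8 = torch.quantization.quantize_dynamic(
    small_rm, {nn.Linear}, dtype=torch.qint8)
```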
"The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences."
Hmm, I remember him talking more about IRL and CIRL and less about reward modeling. But it's been a little while since I read it, so I could be wrong.
"If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all."
Maybe there's an analogy where someone would say "If it's really difficult to prevent accidental release of pathogens from your lab, then it seems unlikely that bio researchers would do research on pathogens whose accidental release would be catastrophic". Unfortunately there's a horrifying many-decades-long track record of accidental release of pathogens from even BSL-4 labs, and it's not like this kind of research has stopped. Instead it's like, the bad thing doesn't happen every time, and/or things seem to be working for a while before the bad thing happens, and that's good enough for the bio researchers to keep trying.
So as I talk about here, I think there are going to be a lot of proposals to modify an AI to be safe that do not in fact work, but do seem ahead-of-time like they might work, and which do in fact work for a while as training progresses. I mean, when x-risk-naysayers like Yann LeCun or Jeff Hawkins are asked how to avoid out-of-control AGIs, they can spout off a list of like 5-10 ideas that would not in fact work, but sound like they would. These are smart people and a lot of other smart people believe them too. Also, even something as dumb as "maximize the amount of money in my bank account" would plausibly work for a while and do superhumanly-helpful things for the programmers, before it starts doing superhumanly-bad things for the programmers.
"Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling."
Yup, if you don't get corrigibility then you failed.