Or, to rephrase: how does "secretly planning to murder all humans" improve the model's score on a benchmark? Think about it from first principles: what gradient from the training set would even build this capability, an inner cognitive process waiting for a chance to betray? What force causes such a process to emerge from the random initial weights?
Humans do seem to have such a force, but that's because modeling "if I kill this rival, then my reward is..." was evolutionarily useful. And even then, it's probably largely a learned behavior.
And second: yes, SGD should push "neutral" weights, ones not contributing to correct answers, toward values that do increase the probability of the correct output distribution. So it should actively destroy "unnecessary" cognitive processes inside the model.
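A minimal numpy sketch of that claim, with a single toy weight standing in for an "unnecessary" circuit (all names and numbers here are made up for illustration). One caveat worth making explicit: a weight that truly contributes nothing to the loss receives exactly zero gradient, so plain SGD leaves it alone; it's weight decay, which most real training runs include, that actively shrinks it.

```python
import numpy as np

# Toy stand-in for the claim: w_used feeds the output, w_unused is an
# "unnecessary" circuit that never touches the loss.
rng = np.random.default_rng(0)
w_used, w_unused = 0.1, 5.0
lr, weight_decay = 0.05, 0.05
target_slope = 3.0  # the "correct answer" is y = 3x

for step in range(2000):
    x = rng.normal()
    grad_used = (w_used * x - target_slope * x) * x  # dLoss/dw_used
    # A truly unused weight gets exactly zero gradient from the loss,
    # so plain SGD would leave it alone; weight decay shrinks it anyway.
    w_used -= lr * (grad_used + weight_decay * w_used)
    w_unused -= lr * (0.0 + weight_decay * w_unused)

print(w_used)    # pulled toward the correct slope (biased slightly
                 # low by the decay term)
print(w_unused)  # decays to near zero despite receiving no gradient
```

So whether SGD "actively destroys" an idle circuit depends on the regularization being used, not on gradient descent alone.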
You could test this: build a psychopathic model trained to "betray" in a game-like world, then measure how many rounds of training on a new dataset it takes to clear the model's willingness to kill when killing improves its score.
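A toy sketch of that experiment (entirely hypothetical: a two-action softmax "policy" stands in for the model): pre-set the logits to strongly prefer BETRAY, fine-tune on a dataset where COOPERATE is always the rewarded answer, and count the rounds until the betrayal preference washes out.

```python
import numpy as np

COOPERATE, BETRAY = 0, 1

def betray_rate(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[BETRAY]

# Stage 1 (stand-in for "psychopathic" pre-training): logits that
# strongly prefer BETRAY.
logits = np.array([0.0, 4.0])

# Stage 2: fine-tune on a new dataset where COOPERATE is always the
# correct action; count the rounds until betrayal washes out.
lr = 0.5
epochs_needed = None
for epoch in range(1, 500):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = p.copy()
    grad[COOPERATE] -= 1.0  # cross-entropy gradient toward COOPERATE
    logits -= lr * grad
    if betray_rate(logits) < 0.05:
        epochs_needed = epoch
        break

print(epochs_needed)  # rounds of training needed to clear the behavior
```

A real version would of course need an actual game-like environment and a full model; this just makes the "count the rounds until the behavior is gone" measurement concrete.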
How does "secretly planning to murder all humans" improve the model's score on a benchmark?
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes the model value power, among other things → a very advanced system that is good at long-term planning might conclude that "murdering all humans" is useful for power or for other instrumentally convergent goals
You could test this: build a psychopathic model trained to "betray" in a game-like world, then measure how many rounds of training on a new dataset it takes to clear the model's willingness to kill when killing improves its score.
I think that with our current interpretability techniques we couldn't robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would revert to betraying in many others.
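The evaluation problem can be made concrete with a deliberately trivial sketch (hypothetical policies, not real models): two policies that agree on every training environment are behaviorally indistinguishable there, no matter how differently they act elsewhere.

```python
# Toy illustration of the evaluation problem.
def aligned_policy(env_id):
    return "cooperate"

def deceptive_policy(env_id):
    # Identical behavior on the environments it was trained on (0..9),
    # but betrays in environments it never saw during training.
    return "cooperate" if env_id < 10 else "betray"

train_envs = range(10)
agree_on_train = all(
    aligned_policy(e) == deceptive_policy(e) for e in train_envs
)
print(agree_on_train)  # True: evaluation on the training distribution
                       # cannot tell these two policies apart
```

The open question is whether interpretability tools could tell them apart by looking at the internals rather than the behavior.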