Amber Dawn comments on Where I’m at with AI risk: convinced of danger but not (yet) of doom

Amber Dawn 3 Apr 2023 14:59 UTC
2 points
0 ∶ 0
The point about inserting a negative sign is good (though then there’s a question of who inserts that sign—homicidal maniacs? Is it put in by accident?)

Re the colour example, this seems disanalogous (unless I misunderstand) because if the AI is correctly identifying human-aligned actions and performing them, that means it understands our goals to the extent we care about. Like maybe not all of them, and maybe it doesn’t understand all the abstruse reasons why we care about various things, but I don’t really care about that (I care about people not being killed or harmed).
- Chris Leong 3 Apr 2023 16:37 UTC
  2 points
  0 ∶ 0
  There was an example where some group accidentally did a large training run in which the AI was trained to be maximally offensive rather than minimally offensive.
  
  Actually, rereading, I don’t really know where I was going with the color example. I think I probably messed up, as you said.
  - Larks 3 Apr 2023 19:05 UTC
    2 points
    0 ∶ 0
    You could also imagine a situation where a property is defined by a PCA component, and so isn’t robust to inversion, because PCA components are only unique up to multiplication by a nonzero scalar (for unit-norm components, that means up to sign).
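
To make the PCA point concrete, here is a minimal sketch in Python with NumPy. The toy data, the variable names, and the idea of treating the projection onto the first component as a "property score" are all assumptions made for illustration, not anything from a real system. It shows that `v` and `-v` are equally valid first principal components, so whichever item scores highest under one sign scores lowest under the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): 200 points, 3 features,
# with most of the variance along the first feature.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)  # PCA works on centered data

# PCA via eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top_eigval = eigvals[-1]
v = eigvecs[:, -1]  # first principal component, unit norm


def is_top_component(u):
    """True if u is an eigenvector of cov for the largest eigenvalue."""
    return bool(np.allclose(cov @ u, top_eigval * u))


def explained_variance(u):
    """Sample variance of the data projected onto direction u."""
    return float(np.var(Xc @ u, ddof=1))


# Hypothetical "property score": projection onto the first component.
scores_pos = Xc @ v
scores_neg = Xc @ (-v)

print("v is a valid first component: ", is_top_component(v))
print("-v is a valid first component:", is_top_component(-v))
print("same explained variance:      ",
      np.isclose(explained_variance(v), explained_variance(-v)))
print("top item under v is bottom item under -v:",
      int(np.argmax(scores_pos)) == int(np.argmin(scores_neg)))
```

Nothing in the data picks out which sign is "right", so which one a given library returns is a matter of convention. If a downstream objective maximizes the score, the choice of sign decides whether the property is being maximized or minimized.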