Insofar as I understand this, it seems false. Like, if I designed a driverless car, it could plausibly identify things in its environment reliably, such as dogs, other cars, and pedestrians. Is this what you mean by 'point out'? It's true that it would learn what these are from sense data and reward, but I don't see why that means such a system couldn't reliably identify actual objects in the real world.
Surface level: the AI literally just identifies roads and cars from their appearance in the sensor data; these are basically immediate predictions from computer vision. Basically, this is the current state of driverless cars[1].
Latent model: the AI creates and maintains a robust model of the world. Instead of cars and roads appearing out of nowhere and having to be dealt with on the spot [2], the model explains how roads and cars are related, why cars use roads, how roads are laid out, and how drivers think and behave while driving. (A rough sketch of this distinction follows the examples below.)
The AI could notice that the driver it is following is sober and competent, yet is slowing down in a controlled way for no apparent reason, and decide that this is a sign of an incident ahead. This would depend on an accurate model of the other driver's skill and a theory of mind about them.
Driving along, the AI could notice an anomalous number of cars turning off onto a side street. With an accurate model of the road system, maintained with live sensory data, and by working out which cars are through traffic versus local traffic (from the cars' models, the drivers' behavior, the neighborhood, and socio-economic cues), it could infer that there is a slowdown or road closure ahead and decide to follow the anomalously turning cars.
Having to turn at a poorly designed, treacherous intersection where visibility is blocked, the AI would assess traffic and move forward assertively to become visible and gain information, accepting some necessary exposure to itself. How assertive to be would depend on an assessment of current traffic and on the local culture of driver norms.
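To make the surface-level vs. latent-model distinction concrete, here's a rough sketch in Python. Everything in it (class names, fields, structure) is invented for illustration and isn't taken from any real driving stack; it's just one way the two levels could be organized.

```python
from dataclasses import dataclass, field

# Purely illustrative: names and fields are invented for this comment,
# not taken from any real self-driving system.

@dataclass
class SurfaceOutput:
    """Per-frame perception: what the computer vision sees right now."""
    detected_objects: list      # e.g. [("car", box), ("pedestrian", box), ...]
    short_horizon_tracks: list  # crude extrapolations a few seconds ahead

@dataclass
class DriverModel:
    """Crude 'theory of mind' for one nearby driver."""
    estimated_skill: float  # 0..1, inferred from lane-keeping, smoothness
    inferred_intent: str    # e.g. "through traffic", "local", "reacting to something ahead"

@dataclass
class LatentWorldModel:
    """Persistent state maintained across frames, not recomputed from scratch."""
    road_graph: dict     # how roads connect; suspected closures
    traffic_state: dict  # congestion estimates per road segment
    nearby_drivers: dict = field(default_factory=dict)  # driver id -> DriverModel

    def update(self, frame: SurfaceOutput) -> None:
        # Fuse the new frame into the persistent model: update driver models,
        # flag anomalies (e.g. many cars turning onto one side street),
        # and revise beliefs about what's happening up the road.
        ...
```

The point of the sketch is just that the latent model persists and explains, while the surface output is recomputed from whatever the sensors show.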
Like, in theory, you can see that this kind of understanding would be desirable to have.
And, in theory (while unlikely to be realized for some time), all of the above functionality/behavior is possible using reinforcement-style learning.
So, for complex systems on other tasks (or for actual automated driving, literally), you can imagine how this functionality could lead to sophisticated, deep "latent" models of the world, through which the AI gains knowledge of human theory of mind, the construction of infrastructure, economic and societal conditions, and their patterns and limitations. Or something.
Note that the situation with driverless cars (read: vaporware) should be an update for AI safety people, or at least something they respond to.
Note that the above isn't entirely true: in addition to using computer vision, the AI probably keeps state variables, such as upcoming road dimensions, road surface, general traffic, and previously seen cars, and this could be sophisticated. Looking at Tesla as of 2019, though, it seems unimpressive: Teslas will full-on drive through red lights, and into trailers and fire trucks, which suggests "surface level" functionality, or that the latent structures mentioned are primitive. (I don't know much about SOTA automated driving.)
Again, as per the first footnote, it's more that the model produces shallow predictions of the next 15 seconds, rather than literally just reacting to what appears on the screen.
The implicit claim in this part of the argument seems to be that the rate at which all AI systems will attempt to fool human operators attempting to align them is high enough that we can never have (much?) confidence that a system is aligned. But this seems to be asserted rather than argued for. In AI training, we could punish systems strongly for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?
The model that AI safety people have is something like this:
It seems like you don't need a 99% chance, or even a 1% chance, for this to be a big problem. If it happens just once, and the AI is in a position to exploit it, that seems dangerous. The model is like "Ice-Nine", I guess.
To AI safety people, it seems like systems are being built with additional complexity and functionality all the time, and there's no way of knowing which new system is dangerous, in terms of capability or "alignment", what "percentage" (1% or 99%) might apply to its alignment, or even whether this "percentage", or this model of risk, is the right way of thinking about the problem for new systems.
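To put toy numbers on the "it only has to happen once" point above: even if the per-system chance of successful deception is small, the chance that at least one of many systems pulls it off can still be large. (The probabilities and counts below are made-up inputs for the arithmetic, not estimates.)

```python
# Toy arithmetic, not an estimate: if each of n independently trained/deployed
# systems has probability p of successfully deceiving its operators, the chance
# that at least one succeeds is 1 - (1 - p)**n.
def p_at_least_one(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(p_at_least_one(0.001, 1000))  # ~0.63 even at a 0.1% per-system chance
print(p_at_least_one(0.01, 1000))   # ~0.99996 at a 1% per-system chance
```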
Ugh, I’m defending Yudkowsky on the EA forum.
I don’t really want to read the OP; maybe if there is time I’ll come back and try to make some more points.