Briefly how I’ve updated since ChatGPT

I’m laying out my thoughts in order to get people thinking about these points and perhaps correct me. I definitely don’t endorse deferring to anything I say, and I would write this differently if I thought people were likely to do so.

  1. OpenAI’s model of “deploy as early as possible in order to extend the timeline between when the world takes it seriously and when humans are no longer in control” seems less crazy to me.

    1. I think ChatGPT has made it a lot easier for me personally to think concretely about the issue and identify exactly what the key bottlenecks are.

    2. To the counterargument “but they’ve spurred other companies to catch up,” I would say that this was going to happen whenever an equivalent AI was released, and I’m unsure whether we’re more doomed in the world where this happened now, versus later when there’s a greater overhang of background technology and compute.

    3. I’m not advocating specifically for or against any deployment schedule; I just think it’s important that this model be viewed as not crazy, so that it’s adequately considered in relevant discussions.

  2. Why will LLMs develop agency? My default explanation used to involve fancy causal stories about monotonically learning better and better search heuristics, and heuristics for searching over heuristics. While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.

  3. “The public” seems to be much more receptive than I previously thought, both with respect to Eliezer and to the idea that AI could be existentially dangerous. This is good! But we’re still at the beginning, where we are seeing the response from the people who are most receptive to the idea, and we’ve not yet reached the inevitable stage of political polarisation.

  4. Why doom? Companies and the open source community will continue to experiment with recursive LLMs, and end up with better and better simulations of entire research societies (a network epistemologist’s dream). This creates a “meta-architectures overhang” which will amplify the capabilities of any new releases of base-level LLMs. As these are open sourced or made available via API, somebody somewhere will plainly tell them to recursively self-improve. No complicated story about instrumental convergence is needed.

    1. AI will not stay in a box (because humans didn’t try to put it into one in the first place). AI will not become an agent by accident (because humans will make it into one first). And if AI destroys the world, it’s as likely to be by human instruction as by instrumentally convergent reasons inherent to the AI itself. Oops.

    2. The recursive LLM thing is also something I’m exploring for alignment purposes. If the path towards extreme intelligence is to build up LLM-based research societies, we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you to misaligned intentions at every step. It’s much harder to deceive when successful attempts depend on coordination.

  5. Lastly, AIs may soon be sentient, and people will torture them because people like doing that.

    1. I think it’s likely that there will be a window where some AIs are conscious (e.g. uploads), but not yet powerful enough to resist what a human might do to them.

    2. In that world, as long as those AIs are available worldwide, there’s a non-trivial population of humans who would derive sadistic pleasure from anonymously torturing them.[1] AIs process information extremely fast, and unlike with farm animals, you can torture them to death an arbitrary number of times.[2]

    3. To prevent this, it seems imperative that, for the AIs that are most likely to be “torturable”,

      1. they are never open-sourced,

      2. API access points are controlled for human sentiment,

      3. interactions with them are never anonymous,

      4. and they can be directly trained/instructed to exit a situation (and the IP could be timed out) when they detect ill-intent.

  1. Note that if it’s an AI trained to imitate humans, showing signs of distress may not be correlated with how they actually suffer. But given that I’m currently very uncertain about how they would suffer, it seems foolish not to take maximal precautions to not expose them to the entire population of sadists on the planet.

  2. If that’s how it’s gonna play out, I’d rather we all die before then.