“Intro to brain-like-AGI safety” series—halfway point!

For those who aren’t regular readers of the Alignment Forum or LessWrong: I’ve been writing a post series, “Intro to Brain-Like-AGI Safety”. I’ve been posting once a week on Wednesdays since January, and I’m now about halfway done (7 of 15 posts)!

Quick summary of the series so far

The series assumes no background; you can jump right into Post #1 to get answers to your burning questions like “what does ‘brain-like’ mean?”, “what does ‘AGI’ mean?”, “who cares?”, and “no seriously, why are we talking about this right now, when AGI doesn’t exist yet, and may not for the foreseeable future, and there’s no particular reason to expect that it would be ‘brain-like’ anyway, whatever that means?”. You’ll also find answers to eight popular objections to the idea that AGI may cause human extinction. :-)

Then in Posts #2 and #3, I dive into neuroscience by hypothesizing a big-picture framework for how and where learning algorithms fit into the brain. (By “learning algorithms”, I mean within-lifetime learning algorithms, not evolution-as-a-learning-algorithm.)

Also in those two posts, you’ll find my take on brain-like-AGI timelines. Don’t get your hopes up: I didn’t put in a probability distribution for exactly when I think brain-like AGI will arrive. I was mostly interested in arguing the narrower point that brain-like AGI is at least plausible within the next few decades, contra the somewhat common opinion in neuroscience / CogSci that the human brain is so ridiculously complicated that brain-like AGI is definitely hundreds of years away.

Posts #4–#7 continue the neuroscience discussion with my hypothesized big picture of how motivation and goal-seeking work in the human brain. It sorta winds up being a weird variant of actor-critic model-based reinforcement learning. This topic is especially important for brain-like-AGI safety, as the biggest accident risk comes from AGIs whose motivation is pointed in an unintended direction.
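If you haven’t run into “actor-critic model-based RL” before, here’s a minimal, generic toy in Python to convey the flavor: a “world model” predicts consequences, a “critic” learns how good states are, and an “actor” is nudged toward actions that turn out better than the critic expected. Everything in it (the gridworld, the actions, the parameter values) is invented purely for illustration; it is not the brain model from the series.

```python
# Generic toy of actor-critic model-based RL (illustrative only).
import random

N_STATES, GOAL = 11, 10        # hypothetical 1-D gridworld: states 0..10
ACTIONS = [-1, +1]
ALPHA, GAMMA = 0.1, 0.9

def world_model(state, action):
    """Predict the next state. For brevity this 'model' is exact and also
    serves as the environment; a real agent would learn it from experience."""
    return max(0, min(GOAL, state + action))

critic = [0.0] * N_STATES                                     # value estimate per state
actor = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]  # action preferences

for episode in range(300):
    s = 0
    for _ in range(200):                       # cap episode length
        # Actor: mostly pick the preferred action, sometimes explore.
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:                                  # break ties randomly
            a = max(ACTIONS, key=lambda x: (actor[s][x], random.random()))
        s_next = world_model(s, a)             # predicted consequence
        r = 1.0 if s_next == GOAL else 0.0
        # Critic: temporal-difference (TD) update of the state-value estimate.
        td_error = r + GAMMA * critic[s_next] - critic[s]
        critic[s] += ALPHA * td_error
        # Actor: reinforce actions that beat the critic's expectation.
        actor[s][a] += ALPHA * td_error
        s = s_next
        if s == GOAL:
            break

print("Learned state values:", [round(v, 2) for v in critic])
```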

Teaser for the rest of the series to come

…And that’s what I’ve published so far. The rest of the series (Posts #8–#15) will have somewhat less neuroscience, and somewhat more direct discussion of AGI safety, including the alignment problem (finally!), wireheading, interpretability, motivation-sculpting, conservatism, and much more. I’ll close the series with a list of open questions, avenues for further research, and advice for getting involved in the field!

General comments

In total, the series will probably be about as long as a 250-page book, but I tried to make it easy to skim and skip around; in particular, every post starts with a summary and table of contents.

I note that the “AGI alignment problem” is often conceptualized as involving (1) an AGI and (2) a thing-that-we’re-trying-to-align-the-AGI-to, typically “human values”, whatever that means. Correspondingly, we wind up with two reasons why AGI-concerned people sometimes become interested in neuroscience: (1) to better understand how the AGI might work, and (2) to better understand “human values”. This post series is 100% about the 1st thing, not the 2nd. However, people interested in the 2nd thing can read it anyway; maybe you’ll find something interesting. (Sorry, you won’t find an answer to the question “What are human values?”, because I don’t know.)

How to follow & discuss the series

You can follow new posts via RSS or Twitter, or just check the series page every Wednesday (I hope!). I’ll also cross-post on the EA Forum again when the series is done.

Happy to talk more—you can comment here, or at the posts themselves, or by email, etc. :-)
