There’s probably some analogy here to ‘inner alignment’ versus ‘outer alignment’ in the AI safety literature, but I find these two terms so vague, confusing, and poorly defined that I can’t see which of them corresponds to what, exactly, in my gene/brain alignment analogy; any guidance on that would be appreciated.
The following table (copied from here) is my attempt to clear things up. I think there are two stories we can tell.
The left column is the Risks From Learned Optimization (2019) model. We’re drawing an analogy between the ML learning algorithm and evolution-as-a-learning-algorithm.
The right column is a model that I prefer. We’re drawing an analogy between the ML learning algorithm and within-lifetime learning.
Your OP talks plenty about both evolutionary learning and within-lifetime learning, so ¯\_(ツ)_/¯
However, gene/brain alignment requires a staggering amount of trial-and-error experimentation – and there are no shortcuts to getting adaptive reward functions … This lesson should make us very cautious about the prospects for AI alignment with humans.
Hmm, from my perspective, this observation doesn’t really provide any evidence either way. Evolution solves every problem via a staggering amount of trial-and-error experimentation! Whether the problem would be straightforward for a human engineer, or extraordinarily hard for a human engineer, it doesn’t matter: evolution is definitely going to solve it via a staggering amount of trial-and-error experimentation either way!
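(To make that concrete, here’s a deliberately silly sketch, entirely of my own invention: one generic mutate-and-select loop, pointed at an “easy” objective and at a “harder” one. The loop has no idea which is which, and no engineering insight goes into either run; all the work is blind trial and error.)

```python
import random

def evolve(fitness, genome_len=40, generations=2000, seed=0):
    """One generic mutate-and-select loop, applied identically to any objective."""
    rng = random.Random(seed)
    genome = [rng.randint(0, 1) for _ in range(genome_len)]
    for _ in range(generations):
        # Blind mutation: flip one random bit (no foresight, no gradients, no design).
        child = genome[:]
        child[rng.randrange(genome_len)] ^= 1
        # Selection: keep the child only if it's at least as fit as the parent.
        if fitness(child) >= fitness(genome):
            genome = child
    return genome

# An "easy" objective (count the 1s)...
def easy(genome):
    return sum(genome)

# ...and a "harder" one (longest run of consecutive 1s). The outer loop is the
# same staggering pile of trial and error either way.
def harder(genome):
    best = run = 0
    for bit in genome:
        run = run + 1 if bit else 0
        best = max(best, run)
    return best

print("easy objective, best score:  ", easy(evolve(easy)))
print("harder objective, best score:", harder(evolve(harder)))
```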
If you’re making the weaker claim that there may be no shortcuts to getting adaptive reward functions, then I agree. I think it’s an open question.
I personally am spending most of my days on the project of trying to figure out reward circuitry that would lead to aligned AI, and I think I’m gradually making research progress, but I am very open-minded to the possibility that there just isn’t any good solution to be found.
These motivational conflicts aren’t typically resolved by some ‘master utility function’ that weighs up all relevant inputs, but simply by the relative strengths of different behavioral priorities (e.g. hunger vs. fear).
I’m not sure what distinction you’re trying to draw in this sentence. If I’m deciding whether to watch TV versus go to the gym, it’s a decision that impacts lots of things—hunger, thirst, body temperature, energy reserves, social interactions, etc. But at the end of the day, I’m going to do one thing or the other. Therefore there has to be some “all things considered” final common pathway for a possible-course-of-action being worth doing or not worth doing, right? I don’t endorse the term “utility function” for that pathway, for various reasons, but whatever we call it, it does need to “weigh up all relevant inputs” in a certain sense, right? (I usually just call it “reward”, although that term needs a whole bunch of elaboration & caveats too.)
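(Here’s a toy sketch of what I mean by that final common pathway; the drives, weights, and numbers are all made up, and it’s not meant as a model of actual brain circuitry, just of the logical point that however many inputs weigh in, they get funneled into one scalar per candidate course of action.)

```python
# Toy sketch: several separate drives, one final common pathway. Every drive,
# weight, and number here is made up purely for illustration.

def hunger_drive(state, action):
    return state["hunger"] * (1.0 if action == "cook dinner" else -0.2)

def fatigue_drive(state, action):
    return state["fatigue"] * (1.0 if action == "watch TV" else -0.5)

def exercise_guilt_drive(state, action):
    return state["guilt"] * (1.0 if action == "go to the gym" else 0.0)

DRIVES = [hunger_drive, fatigue_drive, exercise_guilt_drive]

def final_common_pathway(state, action):
    """Call it 'reward' or whatever: each drive votes from its own narrow
    perspective, and the votes get funneled into one scalar per action,
    because in the end only one action gets taken."""
    return sum(drive(state, action) for drive in DRIVES)

state = {"hunger": 0.6, "fatigue": 0.9, "guilt": 0.4}
actions = ["watch TV", "go to the gym", "cook dinner"]
print(max(actions, key=lambda a: final_common_pathway(state, a)))  # one winner
```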
These Blank Slate models tend to posit a kind of neo-Behaviorist view of learning, in which a few crude, simple reward functions guide the acquisition of all cognitive, emotional, and motivational content in the human mind.
I’m not sure exactly who you’re referring to, but insofar as some shard theory discussions are downstream of my blog posts, I would like to state for the record that I don’t think the human “reward function” (or “reward circuitry” or whatever we call it) is “a few” or “crude” and I’m quite sure that I’ve never described it that way. I think the reward circuitry is quite complicated.
More specifically, I wrote here: “…To be sure, that’s an incomplete accounting of the functions of one little cell group among many dozens (or even hundreds?) in the hypothalamus. So yes, these things are complicated! But they’re not hopelessly complicated. Keep in mind, after all, the entire brain and body needs to get built by a mere 25,000 genes. My current low-confidence feeling is that reasonably-comprehensive pseudocode for the human hypothalamus would be maybe a few thousand lines long. Certainly not millions.”
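(For a sense of what I have in mind by “pseudocode for the hypothalamus”, here’s a flavor-of-the-idea sketch; the cell groups, trigger conditions, and thresholds below are invented placeholders, not neuroscience claims. The point is just that it’s lots of small special-purpose rules: complicated, but not hopelessly complicated.)

```python
# Flavor-of-the-idea sketch only: the cell groups, trigger conditions, and
# thresholds here are invented placeholders, not claims about real wiring.

def hunger_cell_group(state):
    # One little cell group with one narrow job: energy deficit -> seek food.
    if state["leptin"] < 0.3 and state["stomach_fullness"] < 0.5:
        return {"drive": "seek_food", "valence_bias": -0.2}
    return None

def thermoregulation_cell_group(state):
    # Another narrow job: core temperature out of range -> corrective behavior.
    if state["core_temp"] > 37.5:
        return {"drive": "seek_cooling", "valence_bias": -0.1}
    if state["core_temp"] < 36.5:
        return {"drive": "seek_warmth", "valence_bias": -0.1}
    return None

# ...and so on for dozens or hundreds of cell groups: a few thousand lines of
# simple rules like these, not millions.
CELL_GROUPS = [hunger_cell_group, thermoregulation_cell_group]

def hypothalamus_step(state):
    outputs = [group(state) for group in CELL_GROUPS]
    return [out for out in outputs if out is not None]

print(hypothalamus_step({"leptin": 0.2, "stomach_fullness": 0.4, "core_temp": 36.9}))
```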
You might also be interested in this discussion, where I was taking “your side” of a debate on how complicated the reward circuitry is. We specifically discussed habitat-related evolutionary aesthetics in humans, and I was on the “yes it is a real thing that evolved and is in the genome” side of the debate, and the person I was arguing against (Jacob Cannell) was on the “no it isn’t” side of the debate.
You might also be interested in my post Heritability, Behaviorism, and Within-Lifetime RL if you haven’t already seen it.
It’s tempting to think of the human brain as one general-purpose cognitive organ, but evolutionary psychologists have found it much more fruitful to analyze brains as collections of distinct ‘psychological adaptations’ that serve different functions. Many of these psychological adaptations take the form of evolved motivations, emotions, preferences, values, adaptive biases, and fast-and-frugal heuristics, rather than general-purpose learning mechanisms or information-processing systems.
I think “the human brain as one general-purpose cognitive organ” is a crazy thing to believe, and if anyone actually believes that, I join you in disagreeing with them. For example, part of the medulla regulates your heart rate; that’s the only thing it does, and the only thing it can do, and it would be crazy to describe that as a “general-purpose cognitive” capability.
That said, I imagine that there are at least a few things that you would classify as “psychological adaptations” whereas I would want to explain them in other ways. For example, if humans all have pretty similar within-lifetime learning algorithms, with pretty similar reward circuitry, and they all grow up in pretty similar environments (in certain respects), then maybe they’re going to wind up learning similar things (in some cases), and those things can even wind up reliably in the same part of the cortex.
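(Here’s a toy illustration of that last point; the environment, reward, and learning setup are all invented for the sketch. Two “individuals” share the same learning algorithm, the same reward signal, and the same environment, but have different random experience; they still reliably end up with the same learned behavior, even though that behavior was never directly specified anywhere.)

```python
import random

# Tiny made-up "environment": states 0..5 on a line, with reward for being at
# state 5. Everything here is invented just to illustrate the point.
N_STATES, GOAL, ACTIONS = 6, 5, (-1, +1)

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def train(seed, episodes=2000, alpha=0.2, gamma=0.9, eps=0.3):
    """Same learning algorithm, same reward function for every 'individual';
    only the random experience (seed) differs."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES)
        for _ in range(20):
            a_idx = rng.randrange(2) if rng.random() < eps else q[s].index(max(q[s]))
            s2, r = step(s, ACTIONS[a_idx])
            q[s][a_idx] += alpha * (r + gamma * max(q[s2]) - q[s][a_idx])
            s = s2
    # The learned "content": which action each individual prefers in each state.
    return [q[s].index(max(q[s])) for s in range(N_STATES)]

# Two "individuals" with different random experience converge on the same
# learned behavior (always move toward the reward), which was never specified.
print(train(seed=1))
print(train(seed=2))
```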
What counts as success or failure? You have no idea. You have to make guesses about what counts as ‘reward’ or ‘punishment’, by wiring up your perceptual systems in a way that assigns a valence (positive or negative) to each situation that seems like it might be important to survival or reproduction.
It’s probably worth noting that I agree with this paragraph, but in my mind it would be referring to the “perceptual systems” of the hypothalamus & brainstem, not the thalamocortical ones. For vision, that would be mainly the superior colliculus / optic tectum. For example, the mouse superior colliculus innately detects expanding dark blobs in the upper field of view (a cue for incoming birds of prey) and triggers a scamper-away reflex (along with, presumably, negative valence), and I think the human superior colliculus has analogous heuristic detectors tuned to scuttling spiders & slithering snakes & human faces and a number of other things like that. And I think the information content / recipes necessary to detect those things is coming straight from the genome. (The literature on all these things is a bit of a mess in some cases, but I’m happy to discuss why I believe that in more detail.)
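(Here’s a cartoon version of the kind of genetically-specified detector I have in mind; the thresholds, the toy “frames”, and the code structure are made up for illustration, not a model of the actual superior colliculus. The point is how little machinery an innate “expanding dark blob in the upper visual field” detector needs.)

```python
# Cartoon version of an innate looming detector: dark blob in the upper
# visual field getting bigger -> negative valence + flee. The threshold
# values and the toy "frames" are made up for illustration.

def dark_area_upper_field(frame, dark_thresh=0.3):
    """Count dark pixels in the upper half of a frame (2D list of brightness 0..1)."""
    upper = frame[: len(frame) // 2]
    return sum(1 for row in upper for px in row if px < dark_thresh)

def looming_reflex(prev_frame, curr_frame, growth_thresh=1.5):
    """Hard-coded heuristic, no learning involved: if the dark blob overhead
    is expanding fast enough, emit a flee command and negative valence."""
    before = dark_area_upper_field(prev_frame)
    after = dark_area_upper_field(curr_frame)
    if before > 0 and after / before >= growth_thresh:
        return {"action": "scamper_away", "valence": -1.0}
    return None

# A small dark blob overhead in frame 1 that has expanded by frame 2:
frame1 = [[1.0] * 8 for _ in range(8)]
frame2 = [[1.0] * 8 for _ in range(8)]
frame1[1][3] = frame1[1][4] = 0.1
for r in range(0, 3):
    for c in range(2, 6):
        frame2[r][c] = 0.1

print(looming_reflex(frame1, frame2))  # -> {'action': 'scamper_away', 'valence': -1.0}
```

(Again, that’s a cartoon; the real circuitry presumably does something far more robust, but a basic recipe like that can plausibly be specified directly by the genome.)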