Hi, I’m Steve Byrnes, an AGI safety / AI alignment researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html for a summary of my research and a sorted list of my writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
If you want to say “it’s a black box but the box has a ‘gradient’ output channel in addition to the ‘next-token-probability-distribution’ output channel”, then I have no objection.
If you want to say “…and those two output channels are sufficient for safe & beneficial AGI”, then you can say that too, although I happen to disagree.
If you want to say “we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs”, then I’m open-minded and interested in details.
If you want to say “we can’t understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!”, then yeah duh.
But your OP said “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost”, and used the term “white box”. That’s the part that strikes me as crazy. To be charitable, I don’t think those words are communicating the message that you had intended to communicate.
For example, find a random software engineer on the street, and ask them: “If I give you a 1-terabyte compiled executable binary, and you can do whatever you want with that file on your home computer, would you describe it as closer to ‘white box’ or ‘black box’?” I predict most people would say “closer to black box”, even though they can look at all the bits and step through the execution and run decompilation tools etc. if they want. Likewise you can ask them whether it’s possible to “analyze” that binary “at essentially no cost”. I predict most people would say “no”.
I was reading it as a kinda disjunctive argument. If Nora says that a pause is bad because of A and B, either of which is sufficient on its own from her perspective, then you could say “A isn’t cruxy for her” (because B is sufficient) or you could say “B isn’t cruxy for her” (because A is sufficient). Really, neither of those claims is accurate.
Oh well, whatever, I agree with you that the OP could have been clearer.
If you desperately wish we had more time to work on alignment, but also think a pause won’t make that happen or would have larger countervailing costs, then that would lead to an attitude like: “If only we had more time! But alas, a pause would only make things worse. Let’s talk about other ideas…” For my part, I definitely say things like that (see here).
However, Nora has sections claiming “alignment is doing pretty well” and “alignment optimism”, so I think it’s self-consistent for her to not express that kind of mood.
I have a vague impression—I forget from where and it may well be false—that Nora has read some of my AI alignment research, and that she thinks of it as not entirely pointless. If so, then when I say “pre-2020 MIRI (esp. Abram & Eliezer) deserve some share of the credit for my thinking”, that’s meaningful, because there is in fact some nonzero credit to be given. Conversely, if you (or anyone) don’t know anything about my AI alignment research, or think it’s dumb, then you should ignore that part of my comment; it’s not offering any evidence. It would just be saying that useless research can sometimes lead to further useless research, which is obvious! :)
I probably think less of current “empirical” research than you do, because I don’t think AGI will look and act and be built just like today’s LLMs but better / larger. I expect highly-alignment-relevant differences between here and there, including (among other things) reinforcement learning being involved in a much more central way than it is today (where it mostly appears as RLHF fine-tuning). This is a big topic where I think reasonable people disagree, and maybe this comment section isn’t a great place to hash it out. ¯\_(ツ)_/¯
My own research doesn’t involve LLMs and could have been done in 2017, but I’m not sure I would call it “purely conceptual”—it involves a lot of stuff like scrutinizing data tables in experimental neuroscience papers. The ELK research project led by Paul Christiano also could have been done in 2017, as far as I can tell, but lots of people seem to think it’s worthwhile; do you? (Paul is a coinventor of RLHF.)
By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.
Suppose you walk down a street, and unbeknownst to you, you’re walking by a dumpster that has a suitcase full of millions of dollars. There’s a sense in which you “can”, “at essentially no cost”, walk over and take the money. But you don’t know that you should, so you don’t. All the value is in the knowledge.
A trained model is like a computer program with a billion unlabeled parameters and no documentation. Being able to view the code is helpful but doesn’t make it “white box”. Saying it’s “essentially no cost” to “analyze” a trained model is just crazy. I’m pretty sure you have met people doing mechanistic interpretability, right? It’s not trivial. They spend months on their projects. The thing you said is just so crazy that I have to assume I’m misunderstanding you. Can you clarify?
Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
I feel like you’re trying to round these three things into a “yay versus boo” axis, and then come down on the side of “boo”. I think we can try to do better than that.
One can make certain general claims about learning algorithms that are true and for which evolution provides as good an example as any. One can also make other claims that are true for evolution and false for other learning algorithms. And then we can argue about which category future AGI will be in. I think we should be open to that kind of dialog, and it involves talking about evolution.
For the third one, there’s an argument like:
“Maybe the AI will really want something-or-other to happen in the future, and try to make it happen, including by long-term planning—y’know, the way some humans really want to break out of prison, or the way Elon Musk really wants to go to Mars. Maybe the AIs have other desires and do other things too, but that’s not too relevant to what I’m saying. Next, there are a lot of reasons to think that “AIs that really want something-or-other to happen in the future” will show up sooner or later, e.g. the fact that smart people have been trying to build them since the dawn of AI and continuing through today. And if we get such AIs, and they’re very smart and competent, it has similar relevant consequences as “rigid utility maximizing consequentialists”—particularly power-seeking / instrumental convergence, and not pursuing plans that have obvious and effective countermeasures.”
Do you buy that argument? If so, I think some discussions of “rigid utility maximizing consequentialists” can be useful. I also think that some such discussions can lead to conclusions that do not necessarily transfer to more realistic AGIs (see here). So again, I think we should avoid yay-versus-boo thinking.
The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed.
I think that part of the blog post you linked was being facetious. IIUC they had some undisclosed research program involving Haskell for a few years, and then they partly but not entirely wound it down when it wasn’t going as well as they had hoped. But they have also been doing other things too the whole time, like their agent foundations team. (I have no personal knowledge beyond reading the newsletters etc.)
For example, FWIW, I have personally found MIRI employee Abram Demski’s blog posts (including pre-2020) to be very helpful to my thinking about AGI alignment.
Anyway, your more general claim in this section seems to be: Given current levels of capabilities, there is no more alignment research to be done. We’re tapped out. The well is dry. The only possible thing left to do is twiddle our thumbs and wait for more capable models to come out.
Is that really your belief? Do you look at literally everything on alignmentforum etc. as total garbage? Obviously I have a COI but I happen to think there is lots of alignment work yet to do that would be helpful and does not need newly-advanced capabilities to happen.
Nothing in this comment should be construed as “all things considered we should be for or against the pause”—as it happens I’m weakly against the pause too—these are narrower points. :)
I think the attitude most people (including me) have is: “If we want to do technical work to reduce AI x-risk, then we should NOT be working on any technical problems that will almost definitely get solved “by default”, e.g. because they’re straightforward and lots of people are already working on them and mostly succeeding, or because there’s no way to make powerful AGI except via first solving those problems, etc.”.
Then I would rephrase your original question as: “OK, if we shouldn’t be working on those types of technical problems above … then are there any technical problems left that we should be working on?”
And my answer is: Yes! There are lots!
GPT-4 is more in line with the intentions of the user than GPT-3 and this is mainly due to more compute.
GPT-4 is not all that capable—and in particular, not capable enough to constitute an x-risk. For example, I can NOT take 1000 copies of GPT-4, ask each one to start a company, give each of them some seed money, and each will brainstorm company ideas, and start talking to potential customers, and researching the competitive landscape, and hiring people, filing paperwork, iterating product ideas, etc. etc. That’s way beyond GPT-4.
But there will eventually be some future AI that can do that kind of thing.
And when there is, then I’m very interested in exactly what that AI will be “trying” / “motivated” to do. (Hopefully not “self-replicate around the internet, gradually win allies and consolidate power, and eventually launch a coup against humanity”!)
Personally, I happen to think that this kind of future AI will NOT look very much like LLM+RLHF—see my post where I come out as an “LLM plateau-ist”. So I can’t really speak from my own inside view here. But among the people who think that future LLM+RLHF+AutoGPT version N could do that kind of thing, I think most of them are not very optimistic that we can trust such AIs to not launch a coup against humanity, solely on the basis of RLHF seeming to make AIs more helpful and docile right now.
In principle, there seem to be numerous ways that RLHF can go wrong, and there are some reasons to think that future more capable models will have alignment-related failure modes that current models don’t, which are inherent to the way that RLHF works, and thus which can’t be fixed by just doing more RLHF with a more capable base model. For example, you can tell a story like Ajeya’s “training game” in the context of RLHF.
So we need to figure out if RLHF is or isn’t a solution that will continue to work all the way through the time when we will have the kind of future agentic situationally-aware AI that poses an x-risk. And if it doesn’t, then we need to figure out what else to do instead. I think we should be starting work on that right now, because there are reasons to think it’s a very hard technical problem, and will remain very hard even in the future when we have misaligned systems right in front of us to run tests on.
I’m not sure I agree that nobody is working on the problem of which reward functions make AIs that are honest and cooperative.
Oh, I was talking about model-based RL. You’re talking about LLM+RLHF, which is a different AI architecture. These days, LLM+RLHF is so much in the news that people sometimes forget that other types of AI exist at all. But really, model-based RL remains a reasonably active field, and was more so in the recent past and might be again in the future. Famous examples of model-based RL include MuZero, AlphaStar, OpenAI Five, etc. All of those projects were laser-focused on making agents that were effective at winning games. They sure weren’t trying to make agents that viewed kindness as an end in itself.
In the case of figuring out how model-based RL works in the brain, here I’m intimately familiar with the literature, and I can vouch that there is dramatically more work and interest tackling the question of “how does the brain reward signal update the trained model?” than the question of “how is the brain reward signal calculated in the first place?” This is especially true among the AI-adjacent neuroscientists with a knack for algorithms.
There’s a school of thought that academics travel much much more than optimal or healthy. See Cal Newport’s Deep Work, where he cites a claim that it’s “typical for junior faculty to travel twelve to twenty-four times a year”, and compares that to Radhika Nagpal’s blog post The Awesomest 7-Year Postdoc or: How I Learned to Stop Worrying and Love the Tenure-Track Faculty Life which says:
I travel at most 5 times a year. This includes: all invited lectures, all NSF/Darpa investigator or panel meetings, conferences, special workshops, etc. Typically it looks something like this: I do one or two invited lectures at places where I really like the people, I go one full week to a main conference, I do maybe one NSF/Darpa event, and I reserve one wildcard to attend something I really care about (e.g. the Grace Hopper Conference, or a workshop on a special topic). It is *not easy* to say no that often, especially when the invitations are so attractive, or when the people asking are so ungraceful in accepting no for an answer. But when I didn’t have this limit I noticed other things. Like how exhausted and unhappy I was, how I got sick a lot, how it affected my kids and my husband, and how when I stopped traveling I had so much more time to pay real attention to my research and my amazing students.
The author of that post wound up getting tenure at Harvard on schedule, and then getting further promoted to full professor unusually fast.
Anyway, for my part, I have little kids, so traveling is a burden on my family, in addition to sucking up a surprising amount of time / energy (much more than you would think just the nominal travel time, because there’s also booking & planning, packing & unpacking, catching up on chores and sleep and taking notes afterwards, etc.). Big opportunity cost.
If you publish it, a third party could make a small tweak and apply for a patent. If you patent it, a third party could make a small tweak and apply for a patent. What do you see as the difference? Or sorry if I’m misunderstanding the rules.
In theory, publishing X and patenting X are both equally valid ways to prevent other people from patenting X. Does it not work that way in practice?
Could be wrong, but I had the impression that software companies have historically amassed patents NOT because patenting X is the best way to prevent another company from patenting the exact same thing X or things very similar to X, but rather because “the best defense is a good offense”, and if I have a dubious software patent on X and you have a dubious software patent on Y then we can have a balance of terror so that neither wants to sue the other. (That wouldn’t help defend Microsoft against patent trolls but would help defend Microsoft against Oracle etc.)
Possibly related? “Technical Disclosure Commons”, IBM Technical Disclosure Bulletin, Defensive Publication.
IANAL.
I define “alignment” as “the AI is trying to do things that the AI designer had intended for the AI to be trying to do”, see here for discussion.
If you define “capabilities” as “anything that would make an AI more useful / desirable to a person or company”, then alignment research would be by definition a subset of capabilities research.
But it’s a very small subset!
Examples of things that constitute capabilities progress but not alignment progress include: faster and better and more and cheaper chips (and other related hardware like interconnects), the development of CUDA, PyTorch, etc., the invention of BatchNorm and Xavier initialization and adam optimizers and Transformers, etc. etc.
Here’s a concrete example. Suppose future AGI winds up working somewhat like human brain within-lifetime learning (which I claim is in the broad category of model-based reinforcement learning (RL).) A key ingredient in model-based RL is the reward function, which in the human brain case loosely corresponds to “innate drives”, like pain being bad (other things equal), eating when hungry being good, and hundreds more things like that.

If future AI works like that, then future AI programmers can put whatever innate drives they want into their future AIs. So there’s a technical problem of “what innate drives / reward function (if any) would lead to AIs that are honest, cooperative, kind, etc.?” And this problem is not only currently unsolved, but almost nobody is working on it.

Is solving this problem necessary to create economically-useful powerful AGIs? Unfortunately, it is not!! Just look at human high-functioning sociopaths. If we made AGIs like that, we could get extremely competent agents—agents that can make and execute complicated plans, figure things out, do science, invent tools to solve their problems, etc.—with none of the machinery that gives humans an intrinsic tendency to compassion and morality. Such AGIs would nevertheless be very profitable to use … for exactly as long as they can be successfully prevented from breaking free and pursuing their own interests, in which case we’re in big trouble. (By analogy, human slaves are likewise not “aligned” with their masters but still economically useful.) I have much more discussion and elaboration of all this stuff here.
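To make the structural point concrete, here’s a minimal toy sketch in Python (my own illustration; names like `make_reward_fn` and the `world_model.imagine` call are invented stand-ins, not anyone’s actual system). The point is just that in model-based RL the reward function is a separate, designer-chosen ingredient, so “which innate drives (if any) yield honest, cooperative, kind agents?” is a well-posed open question about what to put in that slot:

```python
# Toy sketch only (hypothetical names): the reward function is a pluggable slot,
# separate from the world model and the planner.

from typing import Callable, Dict, List

State = Dict[str, float]              # toy stand-in for a world-state representation
RewardFn = Callable[[State], float]

def make_reward_fn(drives: Dict[str, Callable[[State], float]],
                   weights: Dict[str, float]) -> RewardFn:
    """Combine several 'innate drive' terms into one reward signal."""
    def reward(state: State) -> float:
        return sum(weights[name] * drive(state) for name, drive in drives.items())
    return reward

# Toy drives, loosely analogous to "pain is bad" and "eating when hungry is good":
drives = {
    "avoid_damage": lambda s: -s.get("damage", 0.0),
    "satiety":      lambda s: -abs(s.get("hunger", 0.0)),
}
reward_fn = make_reward_fn(drives, weights={"avoid_damage": 2.0, "satiety": 1.0})

def choose_plan(world_model, state: State,
                candidate_plans: List[list], reward_fn: RewardFn):
    """Cartoon model-based planner: score imagined rollouts with the reward function.
    `world_model.imagine` is a made-up stand-in for "predict the future states"."""
    def score(plan):
        return sum(reward_fn(s) for s in world_model.imagine(state, plan))
    return max(candidate_plans, key=score)
```

The unsolved part is what goes into `drives`, not the plumbing around it.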
At the same time, I think Eliezer made a really strong (and well-argued) point that if we believe in epiphenomenalism then we have no reason to believe that our reports of consciousness have any connection to the phenomenon of consciousness. I haven’t seen this point made so clearly elsewhere.
Chalmers here says something like that (“It is certainly at least strange to suggest that consciousness plays no causal role in my utterances of ‘I am conscious’. Some have suggested more strongly that this rules out any knowledge of consciousness… The oddness of epiphenomenalism is exacerbated by the fact that the relationship between consciousness and reports about consciousness seems to be something of a lucky coincidence, on the epiphenomenalist view …”)
I liked how Eliezer made that argument more grounded and rigorous by presenting it in the context of Bayesian epistemology.
Also, that Chalmers section that I excerpted above is intermingled with a bunch of counterarguments, and Chalmers eventually says “I think that there is no knockdown objection to epiphenomenalism here.” I think Chalmers loses points for that—I think the arguments are totally knockdown and the counterarguments are galaxy-brained copium, and I say kudos to Eliezer for just outright stating that.
I also think that getting the right answer in a controversial domain and stating it clearly is much more important and praiseworthy than being original (in this context). So even if Eliezer’s point is already in the literature, I don’t care. (The fact that you can’t get academic publications and tenure from that activity is one of many big systematic problems with academia, IMO.)
(I am not a philosopher of consciousness.)
I would have liked this article much more if the title had been “The 25 researchers who have published the largest number of academic articles on existential risk”, or something like that.
The current title (“The top 25 existential risk researchers based on publication count”) seems to insinuate that this criterion is reasonable in the context of figuring out who are the “Top 25 existential risk researchers” full stop, which it’s not, for reasons pointed out in other comments.
I have some interest in cluster B personality disorders, on the theory that something(s) in human brains makes people tend to be nice to their friends and family, and whatever that thing is, it would be nice to understand it better because maybe we can put something like it into future AIs, assuming those future AIs have a sufficiently similar high-level architecture to the human brain, which I think is plausible.
And whatever that thing is, it evidently isn’t working in the normal way in cluster B personality disorder people, so maybe better understanding the brain mechanisms behind cluster B personality disorders would get a “foot in the door” in understanding that thing.
Sigh. This comment won’t be very helpful. Here’s where I’m coming from. I have particular beliefs about how social instincts need to work (short version), beliefs which I think we mostly don’t share—so an “explanation” that would satisfy you would probably bounce off me and vice-versa. (I’m happy to work on reconciling if you think it’s a good use of your time.) If it helps, I continue to be pretty happy about the ASPD theory I suggested here, with the caveat that I now think that it’s only an explanation of a subset of ASPD cases. I’m pretty confused on borderline, and I’m at a total loss on narcissism. There’s obviously loads of literature on borderline & narcissism, and I can’t tell you concretely any new studies or analysis that I wish existed but don’t yet. But anyway, if you’re aware of gaps in the literature on cluster B stuff, I’m generally happy for them to be filled. And I think there’s a particular shortage of “grand theorizing” on what’s going on mechanistically in narcissism (or at least, I’ve been unable to find any in my brief search). (In general, I find that “grand theorizing” is almost always helpfully thought-provoking, even if it’s often wrong.)
Are we talking about in the debate, or in long-form good-faith discussion?
For the latter, it’s obviously worth talking about, and I talk about it myself plenty. Holden’s post AI Could Defeat All Of Us Combined is pretty good, and the new Lunar Society podcast interview of Carl Shulman is extremely good on this topic (the relevant part is mostly the second episode [it was such a long interview they split it into 2 parts]).
For the former, i.e. in the context of a debate, the point is not to hash out particular details and intervention points, but rather just to argue that this is a thing worth consideration at all. And in that case, I usually say something like:
The path we’re heading down is to eventually make AIs that are like a new intelligent species on our planet, and able to do everything that humans can do—understand what’s going on, creatively solve problems, take initiative, get stuff done, make plans, pivot when the plans fail, invent new tools to solve their problems, etc.—but with various advantages over humans like speed and the ability to copy themselves.
Nobody currently has a great plan to figure out whether such AIs have our best interests at heart. We can ask the AI, but it will probably just say “yes”, and we won’t know if it’s lying.
The path we’re heading down is to eventually wind up with billions or trillions of such AIs, with billions or trillions of robot bodies spread all around the world.
It seems pretty obvious to me that by the time we get to that point—and indeed probably much much earlier—human extinction should be at least on the table as a possibility.
Oh I also just have to share this hilarious quote from Joe Carlsmith:
I remember looking at some farmland out the window of a bus, and wondering: am I supposed to think that this will all be compute clusters or something? I remember looking at a church and thinking: am I supposed to imagine robots tearing this church apart? I remember a late night at the Future of Humanity Institute office (I ended up working there in 2017-18), asking someone passing through the kitchen how to imagine the AI killing us; he turned to me, pale in the fluorescent light, and said “whirling knives.”
Thanks!
we need good clear scenarios of how exactly step by step this happens
Hmm, depending on what you mean by “this”, I think there are some tricky communication issues that come up here, see for example this Rob Miles video.
On top of that, obviously this kind of debate format is generally terrible for communicating anything of substance and nuance.
Melanie seemed either (a) uninformed of the key arguments (she just needs to listen to one of Yampolskiy’s recent podcast interviews to get a good accessible summary). Or (b) refused to engage with such arguments.
Melanie is definitely aware of things like orthogonality thesis etc.—you can read her Quanta Magazine article for example. Here’s a twitter thread where I was talking with her about it.
Munk AI debate: confusions and possible cruxes
In this post the criticizer gave the criticizee an opportunity to reply in-line in the published post—in effect, the criticizee was offered the last word. I thought that was super classy, and I’m proud to have stolen that idea on two occasions (1,2).
If anyone’s interested, the relevant part of my email was:
…
You can leave google docs margin comments if you want, and:
If I’m just straight-up wrong about something, or putting words in your mouth, then I’ll just correct the text before publication.
If you leave a google docs comment that’s more like a counter-argument, and I’m not immediately convinced, I’d probably copy what you wrote into an in-text reply box—just like the gray boxes here: [link] So you get to have the last word if you want, although I might still re-reply in the comments.
You can also / alternatively obviously leave comments on the published lesswrong post like normal.
If you would like to leave pre-publication feedback, but don’t expect to get around to it “soon” (say, the next 3 weeks), let me know and I’ll hold off publication.
(In the LW/EAF post editor, the inline reply-boxes are secretly just 1×1 tables.)
Another super classy move: I once wrote a criticism post, and the person I criticized retweeted it. (Without even dunking on it!) (The classy person here was Robin Hanson.) I’m proud to say that I’ve stolen that one too, although I guess not every time.
There’s probably some analogy here to ‘inner alignment’ versus ‘outer alignment’ in the AI safety literature, but I find these two terms so vague, confusing, and poorly defined that I can’t see which of them corresponds to what, exactly, in my gene/brain alignment analogy; any guidance on that would be appreciated.
The following table is my attempt to clear things up. I think there are two stories we can tell.
The left column is the Risks From Learned Optimization (2019) model. We’re drawing an analogy between the ML learning algorithm and evolution-as-a-learning-algorithm.
The right column is a model that I prefer. We’re drawing an analogy between the ML learning algorithm and within-lifetime learning.
(↑ table is from here)
Your OP talks plenty about both evolutionary learning and within-lifetime learning, so ¯\_(ツ)_/¯
However, gene/brain alignment requires a staggering amount of trial-and-error experimentation – and there are no shortcuts to getting adaptive reward functions … This lesson should make us very cautious about the prospects for AI alignment with humans.
Hmm, from my perspective, this observation doesn’t really provide any evidence either way. Evolution solves every problem via a staggering amount of trial-and-error experimentation! Whether the problem would be straightforward for a human engineer, or extraordinarily hard for a human engineer, it doesn’t matter, evolution is definitely going to solve it via a staggering amount of trial-and-error experimentation either way!
If you’re making the weaker claim that there may be no shortcuts to getting adaptive reward functions, then I agree. I think it’s an open question.
I personally am spending most of my days on the project of trying to figure out reward circuitry that would lead to aligned AI, and I think I’m gradually making research progress, but I am very open-minded to the possibility that there just isn’t any good solution to be found.
These motivational conflicts aren’t typically resolved by some ‘master utility function’ that weighs up all relevant inputs, but simply by the relative strengths of different behavioral priorities (e.g. hunger vs. fear).
I’m not sure what distinction you’re trying to draw in this sentence. If I’m deciding whether to watch TV versus go to the gym, it’s a decision that impacts lots of things—hunger, thirst, body temperature, energy reserves, social interactions, etc. But at the end of the day, I’m going to do one thing or the other. Therefore there has to be some “all things considered” final common pathway for a possible-course-of-action being worth doing or not worth doing, right? I don’t endorse the term “utility function” for that pathway, for various reasons, but whatever we call it, it does need to “weigh up all relevant inputs” in a certain sense, right? (I usually just call it “reward”, although that term needs a whole bunch of elaboration & caveats too.)
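As a toy illustration (my own, with made-up numbers and names): however many drives weigh in on TV-versus-gym, picking one action means each candidate option ultimately gets a single “all things considered” valuation:

```python
# Toy illustration (made-up numbers and names): each drive appraises each option,
# but choosing requires one scalar valuation per option in the end.

def option_value(option, appraisals):
    """Sum each drive's appraisal of this option into a single number."""
    return sum(appraise(option) for appraise in appraisals.values())

appraisals = {
    "hunger":  lambda o: 0.3 if o == "watch TV" else -0.1,
    "fatigue": lambda o: 0.5 if o == "watch TV" else -0.4,
    "social":  lambda o: 0.0 if o == "watch TV" else 0.6,
}
options = ["watch TV", "go to the gym"]
choice = max(options, key=lambda o: option_value(o, appraisals))  # "watch TV" wins here
```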
These Blank Slate models tend to posit a kind of neo-Behaviorist view of learning, in which a few crude, simple reward functions guide the acquisition of all cognitive, emotional, and motivational content in the human mind.
I’m not sure exactly who you’re referring to, but insofar as some shard theory discussions are downstream of my blog posts, I would like to state for the record that I don’t think the human “reward function” (or “reward circuitry” or whatever we call it) is “a few” or “crude” and I’m quite sure that I’ve never described it that way. I think the reward circuitry is quite complicated.
More specifically, I wrote here: “…To be sure, that’s an incomplete accounting of the functions of one little cell group among many dozens (or even hundreds?) in the hypothalamus. So yes, these things are complicated! But they’re not hopelessly complicated. Keep in mind, after all, the entire brain and body needs to get built by a mere 25,000 genes. My current low-confidence feeling is that reasonably-comprehensive pseudocode for the human hypothalamus would be maybe a few thousand lines long. Certainly not millions.”
You might also be interested in this discussion, where I was taking “your side” of a debate on how complicated the reward circuitry is. We specifically discussed habitat-related evolutionary aesthetics in humans, and I was on the “yes it is a real thing that evolved and is in the genome” side of the debate, and the person I was arguing against (Jacob Cannell) was on the “no it isn’t” side of the debate.
You might also be interested in my post Heritability, Behaviorism, and Within-Lifetime RL if you haven’t already seen it.
It’s tempting to think of the human brain as one general-purpose cognitive organ, but evolutionary psychologists have found it much more fruitful to analyze brains as collections of distinct ‘psychological adaptations’ that serve different functions. Many of these psychological adaptations take the form of evolved motivations, emotions, preferences, values, adaptive biases, and fast-and-frugal heuristics, rather than general-purpose learning mechanisms or information-processing systems.
I think “the human brain as one general-purpose cognitive organ” is a crazy thing to believe, and if anyone actually believes that I join you in disagreeing. For example, part of the medulla regulates your heart rate, and that’s the only thing it does, and the only thing it can do, and it would be crazy to describe that as a “general-purpose cognitive” capability.
That said, I imagine that there are at least a few things that you would classify as “psychological adaptations” whereas I would want to explain them in other ways, e.g. if humans all have pretty similar within-lifetime learning algorithms, with pretty similar reward circuitry, and they all grow up in pretty similar environments (in certain respects), then maybe they’re going to wind up learning similar things (in some cases), and those things can even wind up reliably in the same part of the cortex.
What counts as success or failure? You have no idea. You have to make guesses about what counts as ‘reward’ or ‘punishment’, by wiring up your perceptual systems in a way that assigns a valence (positive or negative) to each situation that seems like it might be important to survival or reproduction.
It’s probably worth noting that I agree with this paragraph, but in my mind it would be referring to the “perceptual systems” of the hypothalamus & brainstem, not the thalamocortical ones. For vision, that would be mainly the superior colliculus / optic tectum. For example, the mouse superior colliculus innately detects expanding dark blobs in the upper FOV (which triggers on incoming birds of prey) and triggers a scamper-away reflex (along with presumably negative valence), and I think the human superior colliculus has analogous heuristic detectors tuned to scuttling spiders & slithering snakes & human faces and a number of other things like that. And I think the information content / recipes necessary to detect those things is coming straight from the genome. (The literature on all these things is a bit of a mess in some cases, but I’m happy to discuss why I believe that in more detail.)
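Here’s a cartoon of the kind of genome-specified detector I have in mind (my own toy formalization, with invented thresholds and names, not taken from any paper): an expanding dark blob in the upper visual field gets wired straight to negative valence and an escape reflex:

```python
# Cartoon detector (invented thresholds; illustrative only): negative valence and a
# reflex are triggered by an expanding dark blob in the upper half of the visual field.

import numpy as np

def looming_threat_reflex(prev_frame: np.ndarray, frame: np.ndarray,
                          dark_thresh: float = 0.3, growth_thresh: float = 1.5):
    """Given two consecutive grayscale frames in [0, 1], return (valence, reflex)."""
    h = frame.shape[0] // 2
    dark_prev = (prev_frame[:h] < dark_thresh).sum()   # dark pixels, upper field of view
    dark_now = (frame[:h] < dark_thresh).sum()
    if dark_prev > 0 and dark_now / dark_prev > growth_thresh:
        return -1.0, "scamper_away"                    # innate negative valence + reflex
    return 0.0, None
```

The point is that a recipe roughly this small could plausibly be specified directly by the genome, without any within-lifetime learning.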
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all—it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.
Do you see what I mean?