Hi, I’m Steve Byrnes, an AGI safety / AI alignment researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
Hi, I’m an AI alignment technical researcher who mostly works independently, and I’m in the market for a new productivity coach / accountability buddy, to chat with periodically (I’ve been doing one ≈20-minute meeting every 2 weeks) about work habits, and set goals, and so on. I’m open to either paying fair market rate, or to a reciprocal arrangement where we trade advice and promises etc. I slightly prefer someone not directly involved in AI alignment (since I don’t want us to get nerd-sniped into object-level discussions), but whatever, that’s not a hard requirement. You can reply here, or DM or email me. :)
Update: I’m all set now.
“Artificial General Intelligence”: an extremely brief FAQ
Some (problematic) aesthetics of what constitutes good work in academia
Humans are less than maximally aligned with each other (e.g. we care less about the welfare of a random stranger than about our own welfare), and humans are also less than maximally misaligned with each other (e.g. most people don’t feel a sadistic desire for random strangers to suffer). I hope that everyone can agree about both those obvious things.
That still leaves the question of where we are on the vast spectrum in between those two extremes. But I think your claim “humans are largely misaligned with each other” is not meaningful enough to argue about. What percentage is “largely”, and how do we even measure that?
Anyway, I am concerned that future AIs will be more misaligned with random humans than random humans are with each other, and that this difference will have important bad consequences, and I also think there are other disanalogies / reasons-for-concern as well. But this is supposed to be a post about terminology so maybe we shouldn’t get into that kind of stuff here.
My terminology would be that (2) is “ambitious value learning” and (1) is “misaligned AI that cooperates with humans because it views cooperating-with-humans to be in its own strategic / selfish best interest”.
I strongly vote against calling (1) “aligned”. If you think we can have a good future by ensuring that it is always in the strategic / selfish best interest of AIs to be nice to humans, then I happen to disagree but it’s a perfectly reasonable position to be arguing, and if you used the word “misaligned” for those AIs (e.g. if you say “alignment is unnecessary”), I think it would be viewed as a helpful and clarifying way to describe your position, and not as a reductio or concession.
For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.” (And if the AI is not the kind of thing for which the words “trying” and “cares about” are applicable in the first place, then the AI is neither aligned nor misaligned, and also I’d claim it’s not an x-risk in any case.) More caveats in a thing I wrote here:
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.
May I ask, what is your position on creating artificial consciousness?
Do you see digital suffering as a risk? If so, should we be careful to avoid creating AC?
I think the word “we” is hiding a lot of complexity here, like saying “should we decommission all the world’s nuclear weapons?” Well, that sounds nice, but how exactly? If I could wave a magic wand and nobody ever builds conscious AIs, I would think seriously about it, although I don’t know what I would decide; it depends on details I think. Back in the real world, I think that we’re eventually going to get conscious AIs whether that’s a good idea or not. There are surely interventions that will buy time until that happens, but preventing it forever and ever seems infeasible to me. Scientific knowledge tends to get out and accumulate, sooner or later, IMO. “Forever” is a very very long time.
The last time I wrote about my opinions is here.
Do you see digital suffering as a risk?
Yes. The main way I think about that is: I think eventually AIs will be in charge, so the goal is to wind up with AIs that tend to be nice to other AIs. This challenge is somewhat related to the challenge of winding up with AIs that are nice to humans. So preventing digital suffering winds up closely entangled with the alignment problem, which is my area of research. That’s not in itself a reason for optimism, of course.
We might also get a “singleton” world where there is effectively one and only one powerful AI in the world (or many copies of the same AI pursuing the same goals) which would alleviate some or maybe all of that concern. I currently think an eventual “singleton” world is very likely, although I seem to be very much in the minority on that.
Sorry if I missed it, but is there some part of this post where you suggest specific concrete interventions / actions that you think would be helpful?
Mark Solms thinks he understands how to make artificial consciousness (I think everything he says on the topic is wrong), and his book Hidden Spring has an interesting discussion (in chapter 12) on the “oh jeez now what” question. I mostly disagree with what he says about that too, but I find it to be an interesting case-study of someone grappling with the question.
In short, he suggests turning off the sentient machine, then registering a patent for making conscious machines, and assigning that patent to a nonprofit like maybe Future of Life Institute, and then
organise a symposium in which leading scientists and philosophers and other stakeholders are invited to consider the implications, and to make recommendations concerning the way forward, including whether and when and under what conditions the sentient machine should be switched on again – and possibly developed further. Hopefully this will lead to the drawing up of a set of broader guidelines and constraints upon the future development, exploitation and proliferation of sentient AI in general.
He also has a strongly-worded defense of his figuring out how consciousness works and publishing it, on the grounds that if he didn’t, someone else would.
I am not claiming analogies have no place in AI risk discussions. I’ve certainly used them a number of times myself.
Yes you have!—including just two paragraphs earlier in that very comment, i.e. you are using the analogy “future AI is very much like today’s LLMs but better”. :)
Cf. what I called “left-column thinking” in the diagram here.
For all we know, future AIs could be trained in an entirely different way from LLMs, in which case the way that “LLMs are already being trained” would be pretty irrelevant in a discussion of AI risk. That’s actually my own guess, but obviously nobody knows for sure either way. :)
It is certainly far from obvious: for example, devastating as the COVID-19 pandemic was, I don’t think anyone believes that 10,000 random re-rolls of the COVID-19 pandemic would lead to at least one existential catastrophe. The COVID-19 pandemic just was not the sort of thing to pose a meaningful threat of existential catastrophe, so if natural pandemics are meant to go beyond the threat posed by the recent COVID-19 pandemic, Ord really should tell us how they do so.
This seems very misleading. We know that COVID-19 has <<5% IFR. Presumably the concern is that some natural pandemics may be much much more virulent than COVID-19 was. So it’s important that the thing we imagine is “10,000 random re-rolls in which there is a natural pandemic”, NOT “10,000 random re-rolls of COVID-19 in particular”. And then we can ask questions like “How many of those 10,000 natural pandemics have >50% IFR? Or >90%? And what would we expect to happen in those cases?” I don’t know what the answers are, but that’s a much more helpful starting point I think.
We discussed the risk of ‘do-it-yourself’ science in Part 10 of this series. There, we saw that a paper by David Sarapong and colleagues laments “Sensational and alarmist headlines about DiY science” which “argue that the practice could serve as a context for inducing rogue science which could potentially lead to a ‘zombie apocalypse’.” These experts find little empirical support for any such claims.
Maybe this is addressed in Part 10, but this paragraph seems misleading insofar as Ord is talking about risk by 2100, and a major part of the story is that DIY biology in, say, 2085 may be importantly different and more dangerous than DIY biology in 2023, because the science and tech keeps advancing and improving each year.
Needless to say, even if we could be 100% certain that DIY biology in 2085 will be super dangerous, there obviously would not be any “empirical support” for that, because 2085 hasn’t happened yet. It’s just not the kind of thing that presents empirical evidence for us to use. We have to do the best we can without it. The linked paper does not seem to discuss that issue at all, unless I missed it.
(I have a similar complaint about the discussion of Soviet bioweapons in Section 4: running a bioweapons program with 2024 science & technology is presumably quite different than running a bioweapons program with 1985 science & technology, and running one in 2085 would be quite different yet again.)
(Recently I’ve been using “AI safety” and “AI x-safety” interchangeably when I want to refer to the “overarching” project of making the AI transition go well, but I’m open to being convinced that we should come up with another term for this.)
I’ve been using the term “Safe And Beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the overarching “go well” project, and “AGI safety” as the part where we try to make AGIs that don’t accidentally [i.e. accidentally from the human supervisors’ / programmers’ perspective] kill everyone, and (following common usage according to OP) “Alignment” for “The AGI is trying to do things that the AGI designer had intended for it to be trying to do”.
(I didn’t make up the term “Safe and Beneficial AGI”. I think I got it from Future of Life Institute. Maybe they in turn got it from somewhere else, I dunno.)
(See also: my post Safety ≠ alignment (but they’re close!))
See also a thing I wrote here:
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.
This kinda overlaps with (2), but the end of 2035 is 12 years away. A lot can happen in 12 years! If we look back to 12 years ago, it was December 2011. AlexNet had not come out yet, neural nets were a backwater within AI, a neural network with 10 layers and 60M parameters was considered groundbreakingly deep and massive, the idea of using GPUs in AI was revolutionary, tensorflow was still years away, doing even very simple image classification tasks would continue to be treated as a funny joke for several more years (literally—this comic is from 2014!), I don’t think anyone was dreaming of AI that could pass a 2nd-grade science quiz or draw a recognizable picture without handholding, GANs had not been invented, nor transformers, nor deep RL, etc. etc., I think.
So “AGI by 2035” isn’t like “wow that could only happen if we’re already almost there”, instead it leaves tons of time for like a whole different subfield of AI to develop from almost nothing.
(I’m making a case against being confidently skeptical about AGI by 2035, not a case for confidently expecting AGI by 2035.)
That might be true in the very short term but I don’t believe it in general. For example, how many reporters were on the Ukraine beat before Russia invaded in February 2022? And how many reporters were on the Ukraine beat after Russia invaded? Probably a lot more, right?
Thanks for the comment!
I think we should imagine two scenarios, one where I see the demonic possession people as being “on my team” and the other where I see them as being “against my team”.
To elaborate, here’s yet another example: Concerned Climate Scientist Alice responding to statements by environmentalists of the Gaia / naturalness / hippy-type tradition. Alice probably thinks that a lot of their beliefs are utterly nuts. But it’s pretty plausible that she sees them as kinda “on her side” from a vibes perspective. (Hmm, actually, also imagine this is 20 years ago; I think there’s been something of a tribal split between pro-tech environmentalists and anti-tech environmentalists since then.) So Alice would probably make somewhat diplomatic statements, emphasizing areas of agreement, etc. Maybe she would say “I think they have the right idea about deforestation and many other things, although I come at it from a more scientific perspective. I don’t think we should take the Gaia idea too literally. But anyway, everyone agrees that there’s an environmental crisis here…” or something like that.
In your demon example, imagine someone saying “I think it’s really great to see so many people questioning the narrative that the police are always perfect. I don’t think demonic possession is the problem, but y’know why so many people keep talking about demonic possession? It’s because they can see there’s a problem, and they’re angry, and they have every right to be angry because there is in fact a problem. And that problem is police corruption…”.
So finally back to the AI example, I claim there’s a strong undercurrent of “The people talking about AI x-risk, they suck, those people are not on my team.” And if there wasn’t that undercurrent, I think most of the x-risk-doesn’t-exist people would have at worst mixed feelings about the x-risk discourse. Maybe they’d be vaguely happy that there are all these new anti-AI vibes going around, and they would try to redirect those vibes in the directions that they believe to be actually productive, as in the above examples: “I think it’s really great to see people across society questioning the narrative that AI is always a force for good and tech companies are always a force for good. They’re absolutely right to question that narrative; that narrative is wrong and dangerous! Now, on this specific question, I don’t think future AI x-risk is anything to worry about, but let’s talk about AI companies stomping on copyright law…”
Very different vibe, right? Much less aggressive trashing of AI x-risk than what we actually see from some people.
To be clear, in a perfect world, people would ignore vibes and stay on-topic and at the object level, and Alice would just straightforwardly say “My opinion is that Gaia is pseudoscientific nonsense” instead of sanewashing it and immediately changing the subject, and ditto with the demon person and the other imaginary people above. I’m just saying what often happens in practice.
Back to your example, I think it’s far from obvious that the number of articles about police corruption is going to go down in absolute numbers, although it obviously goes down as a fraction of police articles. It’s also far from obvious IMO that this situation will make it harder rather than easier to get anti-corruption laws passed, or to fundraise.
“X distracts from Y” as a thinly-disguised fight over group status / politics
- I suggest spending a few minutes pondering what to do if crazy people (perhaps just walking by) decide to “join” the protest. Y’know, SF gonna SF.
- FYI at a firm I used to work at, once there was a group protesting us out front. Management sent an email that day suggesting that people leave out a side door. So I did. I wasn’t thinking too hard about it, and I don’t know how many people at the firm overall did the same.
(I have no personal experience with protests, feel free to ignore.)
-
In your hypothetical, if Meta says “OK you win, you’re right, we’ll henceforth take steps to actually cure cancer”, onlookers would assume that this is a sensible response, i.e. that Meta is responding appropriately to the complaint. If the protester then gets back on the news the following week and says “no no no this is making things even worse”, I think onlookers would be very confused and say “what the heck is wrong with that protester?”
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all: it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.
Do you see what I mean?
If you want to say “it’s a black box but the box has a “gradient” output channel in addition to the “next-token-probability-distribution” output channel”, then I have no objection.
If you want to say “...and those two output channels are sufficient for safe & beneficial AGI”, then you can say that too, although I happen to disagree.
If you want to say “we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs”, then I’m open-minded and interested in details.
If you want to say “we can’t understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!”, then yeah duh.
But your OP said “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost”, and used the term “white box”. That’s the part that strikes me as crazy. To be charitable, I don’t think those words are communicating the message that you had intended to communicate.
For example, find a random software engineer on the street, and ask them: “if I give you a 1-terabyte compiled executable binary, and you can do whatever you want with that file on your home computer, would you describe it as closer to “white box” or “black box”?”. I predict most people would say “closer to black box”, even though they can look at all the bits and step through the execution and run decompilation tools etc. if they want. Likewise you can ask them whether it’s possible to “analyze” that binary “at essentially no cost”. I predict most people would say “no”.
I haven’t done any surveys or anything, but that seems very inaccurate to me. I would have guessed that >90% of “people in AI safety” are either strongly expecting that transformers (or diffusion models) will be the major underpinning of AGI, or at least they’re acting as if they strongly expect that. (I’m including LLMs + scaffolding and so on in this category.)
For example: people seem very happy to make guesses about what tasks the first AGIs will be better and worse at doing based on current LLM capabilities; and people seem very happy to make guesses about how much compute the first AGIs will require based on current LLM compute requirements; and people seem very happy to make guesses about which companies are likely to develop AGIs based on which companies are best at training LLMs today; and people seem very happy to make guesses about AGI UIs based on the particular LLM interface of “context window → output token”; etc. etc. This kind of thing happens constantly, and sometimes I feel like I’m the only one who even notices. It drives me nuts.