Hi, I’m Steve Byrnes, an AGI safety / AI alignment researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html for a summary of my research and a sorted list of my writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
Munk AI debate: confusions and possible cruxes
Changing the world through slack & hobbies
“X distracts from Y” as a thinly-disguised fight over group status / politics
What does it take to defend the world against out-of-control AGIs?
Some (problematic) aesthetics of what constitutes good work in academia
Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
I feel like you’re trying to round these three things into a “yay versus boo” axis, and then come down on the side of “boo”. I think we can try to do better than that.
One can make certain general claims about learning algorithms that are true and for which evolution provides as good an example as any. One can also make other claims that are true for evolution and false for other learning algorithms. And then we can argue about which category future AGI will be in. I think we should be open to that kind of dialog, and it involves talking about evolution.
For the third one, there’s an argument like:
“Maybe the AI will really want something-or-other to happen in the future, and try to make it happen, including by long-term planning (y’know, the way some humans really want to break out of prison, or the way Elon Musk really wants to go to Mars). Maybe the AIs have other desires and do other things too, but that’s not too relevant to what I’m saying. Next, there are a lot of reasons to think that “AIs that really want something-or-other to happen in the future” will show up sooner or later, e.g. the fact that smart people have been trying to build them since the dawn of AI and continuing through today. And if we get such AIs, and they’re very smart and competent, it has similar relevant consequences as “rigid utility maximizing consequentialists”, particularly power-seeking / instrumental convergence, and not pursuing plans that have obvious and effective countermeasures.”
Do you buy that argument? If so, I think some discussions of “rigid utility maximizing consequentialists” can be useful. I also think that some such discussions can lead to conclusions that do not necessarily transfer to more realistic AGIs (see here). So again, I think we should avoid yay-versus-boo thinking.
The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed.
I think that part of the blog post you linked was being facetious. IIUC they had some undisclosed research program involving Haskell for a few years, and then they partly but not entirely wound it down when it wasn’t going as well as they had hoped. But they have also been doing other things too the whole time, like their agent foundations team. (I have no personal knowledge beyond reading the newsletters etc.)
For example, FWIW, I have personally found MIRI employee Abram Demski’s blog posts (including pre-2020) to be very helpful to my thinking about AGI alignment.
Anyway, your more general claim in this section seems to be: Given current levels of capabilities, there is no more alignment research to be done. We’re tapped out. The well is dry. The only possible thing left to do is twiddle our thumbs and wait for more capable models to come out.
Is that really your belief? Do you look at literally everything on alignmentforum etc. as total garbage? Obviously I have a COI but I happen to think there is lots of alignment work yet to do that would be helpful and does not need newly-advanced capabilities to happen.
Nothing in this comment should be construed as “all things considered we should be for or against the pause”—as it happens I’m weakly against the pause too—these are narrower points. :)
Thanks for the comment!
I think we should imagine two scenarios, one where I see the demonic possession people as being “on my team” and the other where I see them as being “against my team”.
To elaborate, here’s yet another example: Concerned Climate Scientist Alice responding to statements by environmentalists of the Gaia / naturalness / hippy-type tradition. Alice probably thinks that a lot of their beliefs are utterly nuts. But it’s pretty plausible that she sees them as kinda “on her side” from a vibes perspective. (Hmm, actually, also imagine this is 20 years ago; I think there’s been something of a tribal split between pro-tech environmentalists and anti-tech environmentalists since then.) So Alice would probably make somewhat diplomatic statements, emphasizing areas of agreement, etc. Maybe she would say “I think they have the right idea about deforestation and many other things, although I come at it from a more scientific perspective. I don’t think we should take the Gaia idea too literally. But anyway, everyone agrees that there’s an environmental crisis here…” or something like that.
In your demon example, imagine someone saying “I think it’s really great to see so many people questioning the narrative that the police are always perfect. I don’t think demonic possession is the problem, but y’know why so many people keep talking about demonic possession? It’s because they can see there’s a problem, and they’re angry, and they have every right to be angry because there is in fact a problem. And that problem is police corruption…”.
So finally back to the AI example, I claim there’s a strong undercurrent of “The people talking about AI x-risk, they suck, those people are not on my team.” And if there weren’t that undercurrent, I think most of the x-risk-doesn’t-exist people would have at worst mixed feelings about the x-risk discourse. Maybe they would be vaguely happy that there are all these new anti-AI vibes going around, and they would try to redirect those vibes in the directions that they believe to be actually productive, as in the above examples: “I think it’s really great to see people across society questioning the narrative that AI is always a force for good and tech companies are always a force for good. They’re absolutely right to question that narrative; that narrative is wrong and dangerous! Now, on this specific question, I don’t think future AI x-risk is anything to worry about, but let’s talk about AI companies stomping on copyright law…”
Very different vibe, right? Much less aggressive trashing of AI x-risk than what we actually see from some people.
To be clear, in a perfect world, people would ignore vibes and stay on-topic and at the object level, and Alice would just straightforwardly say “My opinion is that Gaia is pseudoscientific nonsense” instead of sanewashing it and immediately changing the subject, and ditto with the demon person and the other imaginary people above. I’m just saying what often happens in practice.
Back to your example, I think it’s far from obvious that the number of articles about police corruption is going to go down in absolute numbers, although it obviously goes down as a fraction of police articles. It’s also far from obvious IMO that this situation will make it harder rather than easier to get anti-corruption laws passed, or to fundraise.
It is certainly far from obvious: for example, devastating as the COVID-19 pandemic was, I don’t think anyone believes that 10,000 random re-rolls of the COVID-19 pandemic would lead to at least one existential catastrophe. The COVID-19 pandemic just was not the sort of thing to pose a meaningful threat of existential catastrophe, so if natural pandemics are meant to go beyond the threat posed by the recent COVID-19 pandemic, Ord really should tell us how they do so.
This seems very misleading. We know that COVID-19 has <<5% IFR. Presumably the concern is that some natural pandemics may be much much more virulent than COVID-19 was. So it’s important that the thing we imagine is “10,000 random re-rolls in which there is a natural pandemic”, NOT “10,000 random re-rolls of COVID-19 in particular”. And then we can ask questions like “How many of those 10,000 natural pandemics have >50% IFR? Or >90%? And what would we expect to happen in those cases?” I don’t know what the answers are, but that’s a much more helpful starting point I think.
We discussed the risk of ‘do-it-yourself’ science in Part 10 of this series. There, we saw that a paper by David Sarapong and colleagues laments “Sensational and alarmist headlines about DiY science” which “argue that the practice could serve as a context for inducing rogue science which could potentially lead to a ‘zombie apocalypse’.” These experts find little empirical support for any such claims.
Maybe this is addressed in Part 10, but this paragraph seems misleading insofar as Ord is talking about risk by 2100, and a major part of the story is that DIY biology in, say, 2085 may be importantly different and more dangerous than DIY biology in 2023, because the science and tech keeps advancing and improving each year.
Needless to say, even if we could be 100% certain that DIY biology in 2085 will be super dangerous, there obviously would not be any “empirical support” for that, because 2085 hasn’t happened yet. It’s just not the kind of thing that presents empirical evidence for us to use. We have to do the best we can without it. The linked paper does not seem to discuss that issue at all, unless I missed it.
(I have a similar complaint about the discussion of Soviet bioweapons in Section 4: running a bioweapons program with 2024 science & technology is presumably quite different than running a bioweapons program with 1985 science & technology, and running one in 2085 would be quite different yet again.)
The quote above is an excerpt from here, and immediately after listing those four points, Eliezer says “But there are further reasons why the above problem might be difficult to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort…”.
Again, this remark seems explicitly to assume that the AI is maximising some kind of reward function. Humans often act not as maximisers but as satisficers, choosing an outcome that is good enough rather than searching for the best possible outcome. Often humans also act on the basis of habit or following simple rules of thumb, and are often risk averse. As such, I believe that to assume that an AI agent would be necessarily maximising its reward is to make fairly strong assumptions about the nature of the AI in question. Absent these assumptions, it is not obvious why an AI would necessarily have any particular reason to usurp humanity.
Imagine that, when you wake up tomorrow morning, you will have acquired a magical ability to reach in and modify your own brain connections however you like.
Over breakfast, you start thinking about how frustrating it is that you’re in debt, and feeling annoyed at yourself that you’ve been spending so much money impulse-buying in-app purchases in Farmville. So you open up your new brain-editing console, look up which neocortical generative models were active the last few times you made a Farmville in-app purchase, and lower their prominence, just a bit.
Then you take a shower, and start thinking about the documentary you saw last night about gestation crates. ‘Man, I’m never going to eat pork again!’ you say to yourself. But you’ve said that many times before, and it’s never stuck. So after the shower, you open up your new brain-editing console, and pull up that memory of the gestation crate documentary and the way you felt after watching it, and set that memory and emotion to activate loudly every time you feel tempted to eat pork, for the rest of your life.
Do you see the direction that things are going? As time goes on, if an agent has the power of both meta-cognition and self-modification, any one of its human-like goals (quasi-goals which are context-dependent, self-contradictory, satisficing, etc.) can gradually transform itself into a utility-function-like goal (which is self-consistent, all-consuming, maximizing)! To be explicit: during the little bits of time when one particular goal happens to be salient and determining behavior, the agent may be motivated to “fix” any part of itself that gets in the way of that goal, until bit by bit, that one goal gradually cements its control over the whole system.
Moreover, if the agent does gradually self-modify from human-like quasi-goals to an all-consuming utility-function-like goal, then I would think it’s very difficult to predict exactly what goal it will wind up having. And most goals have problematic convergent instrumental sub-goals that could make them into x-risks.
...Well, at least, I find this a plausible argument, and don’t see any straightforward way to reliably avoid this kind of goal-transformation. But obviously this is super weird and hard to think about and I’m not very confident. :-)
(I think I stole this line of thought from Eliezer Yudkowsky but can’t find the reference.)
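A toy sketch of the dynamic described above, under made-up assumptions: there are a handful of quasi-goals with equal starting weights, a goal becomes salient with probability proportional to its weight, and whenever a goal is salient it nudges its own weight up by a fixed fraction (the "brain-editing console" step). The function name, the parameters, and the update rule are all hypothetical choices for illustration, not a model of any real agent.

```python
import random


def simulate_goal_lock_in(n_goals=5, steps=2000, boost=0.1, seed=0):
    """Toy model: whichever goal is salient strengthens itself a little.

    All parameters are illustrative assumptions, not estimates of anything.
    Returns the final normalized goal weights.
    """
    rng = random.Random(seed)
    # Start with equally weighted, context-dependent quasi-goals.
    weights = [1.0 / n_goals] * n_goals
    for _ in range(steps):
        # A goal becomes salient with probability proportional to its weight.
        salient = rng.choices(range(n_goals), weights=weights, k=1)[0]
        # While salient, it "edits the brain" to boost its own prominence.
        weights[salient] *= 1.0 + boost
        # Renormalize so the weights stay a probability distribution.
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


if __name__ == "__main__":
    final = simulate_goal_lock_in()
    print([round(w, 3) for w in final])
```

In this sketch the small self-boosts compound, so one goal tends to end up with most of the weight. Which goal wins depends on the random seed, which loosely mirrors the point that the final goal would be hard to predict in advance.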
Everything up to here is actually just one of several lines of thought that lead to the conclusion that we might well get an AGI that is trying to maximize a reward.
Another line of thought is what Rohin said: We’ve been using reward functions since forever, so it’s quite possible that we’ll keep doing so.
Another line of thought is: We humans actually have explicit real-world goals, like curing Alzheimer’s and solving climate change etc. And generally the best way to achieve goals is to have an agent seeking them.
Another line of thought is: Different people will try to make AGIs in different ways, and it’s a big world, and (eventually by default) there will be very low barriers-to-entry in building AGIs. So (again by default) sooner or later someone will make an explicitly-goal-seeking AGI, even if thoughtful AGI experts pronounce that doing so is a terrible idea.
I think things like “If we see Sign X of misalignment from the AI, we should shut it down and retrain” comprise a small fraction of AI safety research, and I think even that small fraction consists primarily of stating extremely obvious ideas (let’s use honeypots! let’s do sandbox tests! let’s use interpretability! etc.) and exploring whether or not they would work, rather than stating non-obvious ideas. The horse has long ago left the barn on “the idea of sandbox testing and honeypots” being somewhere in an LLM’s training data!
I think a much larger fraction of AI safety research is geared towards thinking about how to make the AI not misaligned in the first place. So if the AI is scheming against us, reading those posts won’t be very helpful to it, because those ideas have evidently already failed.
I also think you’re understating how secrecy would inhibit progress. And we need progress, if we are to succeed at the goal of knowing how to make an AI that’s not misaligned in the first place.
In fact, even in the “If we see Sign X of misalignment from the AI, we should shut it down and retrain” type of research, I would strongly vote for open-and-therefore-better research (that the AI can also see) versus closed-and-therefore-probably-worse research (that the AI can’t see). For example, really good interpretability could be robust enough that it still works even if the AI has read the same articles as the programmers, and bad interpretability won’t work even if the AI hasn’t.
So I think this article is focusing on a niche benefit of secrecy that seems very unlikely to outweigh the cost.
But meanwhile a very big and real secrecy-related problem is the kind of conventional AGI-related infohazards that safety researchers talk about all the time, i.e. people don’t want to publicly share ideas that would make AGI happen sooner. For example, lots of people disagree with Eliezer Yudkowsky about important aspects of AGI doom, and it’s not getting resolved because Eliezer is not sharing important parts of his beliefs that he sees as sensitive. Ditto with me for sure, ditto with lots of people I’ve talked to.
Would this problem be solvable with a giant closed Manhattan Project thing like you talked about? I dunno. The Manhattan Project itself had a bunch of USSR spies in it. Not exactly reassuring! OTOH I’m biased because I like living in Boston and don’t want to move to a barbed-wire-enclosed base in the desert :-P
I regularly write posts on lesswrong (and cross-post when applicable to alignmentforum). Am I a blogger? I certainly describe myself that way. But I get a strong impression from the Effective Ideas website that this doesn’t count. (You can correct me if I’m wrong.)
I guess the question is: do we think of lesswrong as a “blogging platform” akin to substack? Or do we think of it as a “community forum” akin to hacker news? (Or both!)
The same question, of course, applies to people who “blog” exclusively on EA Forum!
You might say: Maybe my lesswrong posts don’t constitute a proper “blog” because people can’t see just my posts, separated from everyone else’s lesswrong posts? Ah, but they can! Not only that, they can also view just my posts on my solo RSS feed, or my solo twitter, or an index of my posts on my personal website!
For my part, I find lesswrong to be a nice “blogging platform”, and have not so far felt tempted to set up a separate substack / wordpress / whatever. If I did, I would probably wind up cross-posting to lesswrong anyway, and the end result would just be a split-up comment section and more hassle posting and editing, with no appreciable upside, it seems to me. However, maybe I’d do it anyway, if eligibility for this giant prize is on the line. Is it?
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all: it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.
Do you see what I mean?
My terminology would be that (2) is “ambitious value learning” and (1) is “misaligned AI that cooperates with humans because it views cooperating-with-humans to be in its own strategic / selfish best interest”.
I strongly vote against calling (1) “aligned”. If you think we can have a good future by ensuring that it is always in the strategic / selfish best interest of AIs to be nice to humans, then I happen to disagree but it’s a perfectly reasonable position to be arguing, and if you used the word “misaligned” for those AIs (e.g. if you say “alignment is unnecessary”), I think it would be viewed as a helpful and clarifying way to describe your position, and not as a reductio or concession.
For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.” (And if the AI is not the kind of thing for which the words “trying” and “cares about” are applicable in the first place, then the AI is neither aligned nor misaligned, and also I’d claim it’s not an x-risk in any case.) More caveats in a thing I wrote here:
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.
In your hypothetical, if Meta says “OK you win, you’re right, we’ll henceforth take steps to actually cure cancer”, onlookers would assume that this is a sensible response, i.e. that Meta is responding appropriately to the complaint. If the protester then gets back on the news the following week and says “no no no this is making things even worse”, I think onlookers would be very confused and say “what the heck is wrong with that protester?”
A nice short argument that a sufficiently intelligent AGI would have the power to usurp humanity is Scott Alexander’s Superintelligence FAQ Section 3.1.
This kinda overlaps with (2), but the end of 2035 is 12 years away. A lot can happen in 12 years! If we look back to 12 years ago, it was December 2011. AlexNet had not come out yet; neural nets were a backwater within AI; a neural network with 10 layers and 60M parameters was considered groundbreakingly deep and massive; the idea of using GPUs in AI was revolutionary; TensorFlow was still years away; doing even very simple image classification tasks would continue to be treated as a funny joke for several more years (literally: this comic is from 2014!); I don’t think anyone was dreaming of AI that could pass a 2nd-grade science quiz or draw a recognizable picture without handholding; GANs had not been invented, nor transformers, nor deep RL, etc. etc., I think.
So “AGI by 2035” isn’t like “wow that could only happen if we’re already almost there”, instead it leaves tons of time for like a whole different subfield of AI to develop from almost nothing.
(I’m making a case against being confidently skeptical about AGI by 2035, not a case for confidently expecting AGI by 2035.)
You’re entitled to disagree with short-timelines people (and I do too) but I don’t like the use of the word “hype” here (and “purely hype” is even worse); it seems inaccurate, and kinda an accusation of bad faith. “Hype” typically means that Person X is promoting a product, that they benefit from the success of that product, and that they are probably exaggerating the impressiveness of that product in bad faith (or at least, with a self-serving bias). None of those applies to Greg here, AFAICT. Instead, you can just say “he’s wrong” etc.
OTOH, I am (or I guess was?) a professional physicist, and when I read Rationality A-Z, I found that Yudkowsky was always reaching exactly the same conclusions as me whenever he talked about physics, including areas where (IMO) the physics literature itself is a mess—not only interpretations of QM, but also how to think about entropy & the 2nd law of thermodynamics, and, umm, I thought there was a third thing too but I forget.
That increased my respect for him quite a bit.
And who the heck am I? Granted, I can’t out-credential Scott Aaronson in QM. But FWIW, hmm let’s see, I had the highest physics GPA in my Harvard undergrad class and got the highest preliminary-exam score in my UC Berkeley physics grad school class, and I’ve played a major role in designing I think 5 different atomic interferometers (including an atomic clock) for various different applications, and in particular I was always in charge of all the QM calculations related to estimating their performance, and also I once did a semester-long (unpublished) research project on quantum computing with superconducting qubits, and also I have made lots of neat wikipedia QM diagrams and explanations including a pedagogical introduction to density matrices and mixed states.
I don’t recall feeling strongly that literally every word Yudkowsky wrote about physics was correct, more like “he basically figured out the right idea, despite not being a physicist, even in areas where physicists who are devoting their career to that particular topic are all over the place”. In particular, I don’t remember exactly what Yudkowsky wrote about the no-communication theorem. But I for one absolutely understand mixed states, and that doesn’t prevent me from being a pro-MWI extremist like Yudkowsky.
I feel like that guy’s got a LOT of chutzpah to not-quite-say-outright-but-very-strongly-suggest that the Effective Altruism movement is a group of people who don’t care about the Global South. :-P
More seriously, I think we’re in a funny situation where maybe there are these tradeoffs in the abstract, but they don’t seem to come up in practice.
Like in the abstract, the very best longtermist intervention could be terrible for people today. But in practice, I would argue that most if not all current longtermist cause areas (pandemic prevention, AI risk, preventing nuclear war, etc.) are plausibly a very good use of philanthropic effort even if you only care about people alive today (including children).
Or, in the abstract, AI risk and malaria are competing for philanthropic funds. But in practice, a lot of the same people seem to care about both, including many of the people that the article (selectively) quotes. …And meanwhile most people in the world care about neither.
I mean, there could still be an interesting article about how there are these theoretical tradeoffs between present and future generations. But it’s misleading to name names and suggest that those people would gleefully make those tradeoffs, even if it involves torturing people alive today or whatever. Unless, of course, there’s actual evidence that they would do that. (The other strong possibility is, if actually faced with those tradeoffs in real life, they would say, “Uh, well, I guess that’s my stop, this is where I jump off the longtermist train!!”).
Anyway, I found the article extremely misleading and annoying. For example, the author led off with a quote where Jaan Tallinn says directly that climate change might be an existential risk (via a runaway scenario), and then two paragraphs later the author is asking “why does Tallinn think that climate change isn’t an existential risk?” Huh?? The article could have equally well said that Jaan Tallinn believes that climate change is “very plausibly an existential risk”, and Jaan Tallinn is the co-founder of an organization that does climate change outreach among other things, and while climate change isn’t a principal focus of current longtermist philanthropy, well, it’s not like climate change is a principal focus of current cancer research philanthropy either! And anyway it does come up to a reasonable extent, with healthy discussions focusing in particular on whether there are especially tractable and neglected things to do.
So anyway, I found the article very misleading.
(I agree with Rohin that if people are being intimidated, silenced, or cancelled, then that would be a very bad thing.)