Hi I’m Steve Byrnes, an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html
I really liked this!!!
Since you asked for feedback, here’s a little suggestion, take it or leave it: I found a couple things at the end slightly out-of-place, in particular “If you choose to tackle the problem of nuclear security, what angle can you attack the problem from that will give you the most fulfillment?” and “Do any problems present even bigger risks than nuclear war?”
Immediately after such an experience, I think the narrator would not be thinking about the option of not bothering to work on nuclear security because other causes are more important, nor thinking about their own fulfillment. If other causes came to mind, I imagine it would be along the lines of “if I somehow manage to stop the nuclear war, what other potential catastrophes are waiting in the wings, ready to strike anytime in the months and years after that—and this time with no reset button?”
Or if you want it to fit better as written now, then shortly after the narrator snaps back to age 18 the text could say something along the lines of “You know about chaos theory and the butterfly effect; this will be a new re-roll of history, and there might not be a nuclear war this time around. Maybe last time was a fluke?” Then that might remove some of the single-minded urgency that I would otherwise expect the narrator to feel, and thus it would become a bit more plausible that the narrator might work on pandemics or whatever.
(Maybe that “new re-roll of history” idea is what you had in mind? Whereas I was imagining the Groundhog Day / Edge of Tomorrow / Terminator trope where the narrator knows 100% for sure that there will be a nuclear war on this specific hour of this specific day, if the narrator doesn’t heroically stop it.)
(I’m not a writer, don’t trust my judgment.)
Hmm, yeah, I guess you’re right about that.
Oh, you said “evolution-type optimization”, so I figured you were thinking of the case where the inner/outer distinction is clear cut. If you don’t think the inner/outer distinction will be clear cut, then I’d question whether you actually disagree with the post :) See the section defining what I’m arguing against, in particular the “inner as AGI” discussion.
Nah, I’m pretty sure the difference there is “Steve thinks that Jacob is way overestimating the difficulty of humans building AGI-capable learning algorithms by writing source code”, rather than “Steve thinks that Jacob is way underestimating the difficulty of computationally recapitulating the process of human brain evolution”.
For example, for the situation that you’re talking about (I called it “Case 2” in my post) I wrote “It seems highly implausible that the programmers would just sit around for months and years and decades on end, waiting patiently for the outer algorithm to edit the inner algorithm, one excruciatingly-slow step at a time. I think the programmers would inspect the results of each episode, generate hypotheses for how to improve the algorithm, run small tests, etc.” If the programmers did just sit around for years not looking at the intermediate training results, yes I expect the project would still succeed sooner or later. I just very strongly expect that they wouldn’t sit around doing nothing.
AlphaGo has a human-created optimizer, namely MCTS. Normally people don’t use the term “mesa-optimizer” for human-created optimizers.
Then maybe you’ll say “OK there’s a human-created search-based consequentialist planner, but the inner loop of that planner is a trained ResNet, and how do you know that there isn’t also a search-based consequentialist planner inside each single run through the ResNet?”
Admittedly, I can’t prove that there isn’t. I suspect that there isn’t, because there seems to be no incentive for that (there’s already a search-based consequentialist planner!), and also because I don’t think ResNets are up to such a complicated task.
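(To make the distinction concrete, here’s a minimal sketch, definitely not AlphaGo’s actual algorithm, of what I mean: the consequentialist search loop is ordinary human-written source code, and the trained network only shows up as a subroutine that scores positions. The `Game` interface and `value_net` are hypothetical stand-ins.)

```python
import random

def rollout_search(game, state, value_net, n_simulations=20, depth=5):
    """Pick the move whose simulated continuations score best under value_net."""
    best_move, best_score = None, float("-inf")
    for move in game.legal_moves(state):
        total = 0.0
        for _ in range(n_simulations):
            s = game.apply(state, move)
            # Roll the position forward a few plies with random play
            # (this whole loop is the hand-written "planner" part)...
            for _ in range(depth):
                if game.is_terminal(s):
                    break
                s = game.apply(s, random.choice(game.legal_moves(s)))
            # ...then ask the learned network how good the resulting position looks.
            total += value_net(s)
        score = total / n_simulations
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```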
I find most justifications and arguments made in favor of a timeline of less than 50 years to be rather unconvincing.
If we don’t have convincing evidence in favor of a timeline <50 years, and we also don’t have convincing evidence in favor of a timeline ≥50 years, then we just have to say that this is a question on which we don’t have convincing evidence of anything in particular. But we still have to take whatever evidence we have and make the best decisions we can. ¯\_(ツ)_/¯
(You don’t say this explicitly but your wording kinda implies that ≥50 years is the default, and we need convincing evidence to change our mind away from that default. If so, I would ask why we should take ≥50 years to be the default. Or sorry if I’m putting words in your mouth.)
I am simply not able to understand why we are significantly closer to AGI today than we were in 1950s
Lots of ingredients go into AGI, including (1) algorithms, (2) lots of inexpensive chips that can do lots of calculations per second, (3) technology for fast communication between these chips, (4) infrastructure for managing large jobs on compute clusters, (5) frameworks and expertise in parallelizing algorithms, (6) general willingness to spend millions of dollars and roll custom ASICs to run a learning algorithm, (7) coding and debugging tools and optimizing compilers, etc. Even if you believe that you’ve made no progress whatsoever on algorithms since the 1950s, we’ve made massive progress in the other categories. I think that alone puts us “significantly closer to AGI today than we were in the 1950s”: once we get the algorithms, at least everything else will be ready to go, and that wasn’t true in the 1950s, right?
But I would also strongly disagree with the idea that we’ve made no progress whatsoever on algorithms since the 1950s. Even if you think that GPT-3 and AlphaGo have absolutely nothing whatsoever to do with AGI algorithms (which strikes me as an implausibly strong statement, although I would endorse much weaker versions of that statement), that’s far from the only strand of research in AI, let alone neuroscience. For example, there’s a (IMO plausible) argument that PGMs and causal diagrams will be more important to AGI than deep neural networks are. But that would still imply that we’ve learned AGI-relevant things about algorithms since the 1950s. Or as another example, there’s a (IMO misleading) argument that the brain is horrifically complicated and we still have centuries of work ahead of us in understanding how it works. But even people who strongly endorse that claim wouldn’t also say that we’ve made “no progress whatsoever” in understanding brain algorithms since the 1950s.
Sorry if I’m misunderstanding.
isn’t there an infinite degree of freedom associated with a continuous function?
I’m a bit confused by this; are you saying that the only possible AGI algorithm is “the exact algorithm that the human brain runs”? The brain is wired up by a finite number of genes, right?
most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization
GPT-3 is of that form, but AlphaGo/MuZero isn’t (I would argue).
I’m not sure how to settle whether your statement about “most contemporary progress” is right or wrong. I guess we could count how many papers use model-free RL vs model-based RL, or something? Well anyway, given that I haven’t done anything like that, I wouldn’t feel comfortable making any confident statement here. Of course you may know more than me! :-)
If we forget about “contemporary progress” and focus on “path to AGI”, I have a post, Against evolution as an analogy for how humans will create AGI, arguing against what (I think) you’re implying, for what it’s worth.
Ideally we’d want a method for identifying valence which is more mechanistic than mine. In the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made.
Yeah I dunno, I have some general thoughts about what valence looks like in the vertebrate brain (e.g. this is related, and this) but I’m still fuzzy in places and am not ready to offer any nice buttoned-up theory. “Valence in arbitrary algorithms” is obviously even harder by far. :-)
Have you read https://www.cold-takes.com/where-ai-forecasting-stands-today/ ?
I do agree that there are many good reasons to think that AI practitioners are not AI forecasting experts, such as the fact that they’re, um, obviously not—they generally have no training in it and have spent almost no time on it, and indeed they give very different answers to seemingly-equivalent timelines questions phrased differently. This is a reason to discount the timelines that come from AI practitioner surveys, in favor of whatever other forecasting methods / heuristics you can come up with. It’s not per se a reason to think “definitely no AGI in the next 50 years”.
Well, maybe I should just ask: What probability would you assign to the statement “50 years from today, we will have AGI”? A couple examples:
If you think the probability is <90%, and your intention here is to argue against people who think it should be >90%, well I would join you in arguing against those people too. This kind of technological forecasting is very hard and we should all be pretty humble & uncertain here. (Incidentally, if this is who you’re arguing against, I bet that you’re arguing against fewer people than you imagine.)
If you think the probability is <10%, and your intention here is to argue against people who think it should be >10%, then that’s quite a different matter, and I would strongly disagree with you, and I would be very curious how you came to be so confident. I mean, a lot can happen in 50 years, right? What’s the argument?
Let’s say a human writes code more-or-less equivalent to the evolved “code” in the human genome. Presumably the resulting human-brain-like algorithm would have valence, right? But it’s not a mesa-optimizer, it’s just an optimizer. Unless you want to say that the human programmers are the base optimizer? But if you say that, well, every optimization algorithm known to humanity would become a “mesa-optimizer”, since they tend to be implemented by human programmers, right? So that would entail the term “mesa-optimizer” kinda losing all meaning, I think. Sorry if I’m misunderstanding.
Addendum: In the other direction, one could point out that the authors were searching for “an approximation of an approximation of a neuron”, not “an approximation of a neuron”. (insight stolen from here.) Their ground truth was a fancier neuron model, not a real neuron. Even the fancier model is a simplification of real life. For example, if I recall correctly, neurons have been observed to do funny things like store state variables via changes in gene expression. Even the fancier model wouldn’t capture that. As in my parent comment, I think these kinds of things are highly relevant to simulating worms, and not terribly relevant to reverse-engineering the algorithms underlying human intelligence.
It’s possible much of that supposed additional complexity isn’t useful
Yup! That’s where I’d put my money.
It’s a foregone conclusion that a real-world system has tons of complexity that is not related to the useful functions that the system performs. Consider, for example, the silicon transistors that comprise digital chips—“the useful function that they perform” is a little story involving words like “ON” and “OFF”, but “the real-world transistor” needs three equations involving 22 parameters, to a first approximation!
By the same token, my favorite paper on the algorithmic role of dendritic computation has them basically implementing a simple set of ANDs and ORs on incoming signals. It’s quite likely that dendrites do other things too besides what’s in that one paper, but I think that example is suggestive.
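(If it helps, here’s a toy gloss of that kind of dendritic computation, my own illustration rather than the paper’s actual model: each dendritic branch acts as a logic gate over its own synaptic inputs, and the soma combines the branch outputs. All names here are hypothetical.)

```python
def branch_and(inputs):
    # A branch that only "fires" if all of its synaptic inputs are active
    return int(all(inputs))

def branch_or(inputs):
    # A branch that "fires" if any of its synaptic inputs is active
    return int(any(inputs))

def neuron_output(branch_a_inputs, branch_b_inputs):
    # Soma ORs together two branches, each of which ANDs its own inputs
    return branch_or([branch_and(branch_a_inputs), branch_and(branch_b_inputs)])

print(neuron_output([1, 1], [0, 1]))  # 1: branch A's inputs are all active
print(neuron_output([1, 0], [0, 1]))  # 0: neither branch has all its inputs active
```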
Caveat: I’m mainly thinking of the complexity of understanding the neuronal algorithms involved in “human intelligence” (e.g. common sense, science, language, etc.), which (I claim) are mainly in the cortex and thalamus. I think those algorithms need to be built out of really specific and legible operations, and such operations are unlikely to line up with the full complexity of the input-output behavior of neurons. I think the claim “the useful function that a neuron performs is simpler than the neuron itself” is always true, but it’s very strongly true for “human intelligence” related algorithms, whereas it’s less true in other contexts, including probably some brainstem circuits, and the neurons in microscopic worms. It seems to me that microscopic worms just don’t have enough neurons to not squeeze out useful functionality from every squiggle in their neurons’ input-output relations. And moreover here we’re not talking about massive intricate beautifully-orchestrated learning algorithms, but rather things like “do this behavior a bit less often when the temperature is low” etc. See my post Building brain-inspired AGI is infinitely easier than understanding the brain for more discussion kinda related to this.
See here, the first post is a video of a research meeting where he talks dismissively about Stuart Russell’s argument, and then the ensuing forum discussion features a lot of posts by me trying to sell everyone on AI risk :-P
(Other context here.)
There was a 2020 documentary We Need To Talk About AI. All-star lineup of interviewees! Stuart Russell, Roman Yampolskiy, Max Tegmark, Sam Harris, Jurgen Schmidhuber, …. I’ve seen it, but it appears to be pretty obscure, AFAICT.
I happened to watch the 2020 Melissa McCarthy film Superintelligence yesterday. It’s umm, not what you’re looking for. The superintelligent AI’s story arc was a mix of 20% arguably-plausible things that experts say about superintelligent AGI, and 80% deliberately absurd things for comedy. I doubt it made anyone in the audience think very hard about anything in particular. (I did like it as a romantic comedy :-P )
There’s some potential tension between “things that make for a good movie” and “realistic”, I think.
I saw Jeff Hawkins mention (in some online video) that someone had sent Human Compatible to him unsolicited but he didn’t say who. And then (separately) a bit later the mystery was resolved: I saw some EA-affiliated person or institution mention that they had sent Human Compatible to a bunch of AI researchers. But I can’t remember where I saw that, or who it was. :-(
No I don’t think we’ve met! In 2016 I was a professional physicist living in Boston. I’m not sure if I would have even known what “EA” stood for in 2016. :-)
It also seems like the technical problem does get easier in expectation if you have more than one shot. By contrast, I claim, many of the Moloch-style problems get harder.
I agree. But maybe I would have said “less hard” rather than “easier” to better convey a certain mood :-P
It does seem like within technical AI safety research the best work seems to shift away from Agent Foundations type of work and towards neural-nets-specific work.
I’m not sure what your model is here.
Maybe a useful framing is “alignment tax”: if it’s possible to make an AI that can do some task X unsafely with a certain amount of time/money/testing/research/compute/whatever, then how much extra time/money/etc. would it take to make an AI that can do task X safely? That’s the alignment tax.
The goal is for the alignment tax to be as close as possible to 0%. (It’s never going to be exactly 0%.)
In the fast-takeoff unipolar case, we want a low alignment tax because some organizations will be paying the alignment tax and others won’t, and we want one of the former to win the race, not one of the latter.
In the slow-takeoff multipolar case, we want a low alignment tax because we’re asking organizations to make tradeoffs for safety, and if that’s a very big ask, we’re less likely to succeed. If the alignment tax is 1%, we might actually succeed. Remember that there are many reasons organizations are incentivized to make safe AIs, not least because they want the AIs to stay under their control and do the things they want them to do, not to mention legal risks, reputation risks, employees who care about their children, etc. etc. So if all we’re asking is for them to spend 1% more training time, maybe they all will. If instead we’re asking them all to spend 100× more compute plus an extra 3 years of pre-deployment test protocols, well, that’s much less promising.
So either way, we want a low alignment tax.
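(In case the arithmetic helps, here’s the toy calculation I have in mind, with made-up numbers matching the examples above:)

```python
def alignment_tax(cost_unsafe, cost_safe):
    # Extra fractional cost of doing the same task safely (0.0 = no tax)
    return cost_safe / cost_unsafe - 1.0

print(alignment_tax(100, 101))    # 0.01 -> a 1% tax; plausibly everyone pays it
print(alignment_tax(100, 10000))  # 99.0 -> a 9900% tax; much less promising
```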
OK, now let’s get back to what you wrote.
I think maybe your model is:
“If Agent Foundations research pans out at all, it would pan out by discovering a high-alignment-tax method of making AGI”
(You can correct me if I’m misunderstanding.)
If we accept that premise, then I can see where you’re coming from. This would be almost definitely useless in a multipolar slow-takeoff world, and merely “probably useless” in a unipolar fast-takeoff world. (In the latter case, there’s at least a prayer of a chance that the safe actors will be so far ahead of the unsafe actors that the former can pay the tax and win the race anyway.)
But I’m not sure that I believe the premise. Or at least I’m pretty unsure. I am not myself an Agent Foundations researcher, but I don’t imagine that Agent Foundations researchers would agree with the premise that high-alignment-tax AGI is the best that they’re hoping for in their research.
Oh, hmmm, the other possibility is that you’re mentally lumping together “multipolar slow-takeoff AGI” with “prosaic AGI” and with “short timelines”. These are indeed often lumped together, even if they’re different things. Anyway, I would certainly agree that both “prosaic AGI” and “short timelines” would make Agent Foundations research less promising compared to neural-net-specific work.
I think that “AI alignment research right now” is a top priority in unipolar fast-takeoff worlds, and it’s also a top priority in multipolar slow-takeoff worlds. (It’s certainly not the only thing to do—e.g. there’s multipolar-specific work to do, like the links in Jonas’s answer on this page, or here etc.)
(COI note: I myself am doing “AI alignment research right now” :-P )
First of all, in the big picture, right now humanity is simultaneously pursuing many quite different research programs towards AGI (I listed a dozen or so here (see Appendix)). If more than one of them is viable (and I think that’s likely), then in a perfect world we would figure out which of them has the best hope of leading to Safe And Beneficial AGI, and differentially accelerate that one (and/or differentially decelerate the others). This isn’t happening today—that’s not how most researchers are deciding what AI capabilities research to do, and it’s not how most funding sources are deciding what AI capabilities research to fund. Could it happen in the future? Yes, I think so! But only if...
AI alignment researchers figure out which of these AGI-relevant research programs is more or less promising for safety,
…and broadly communicate that information to experts, using legible arguments…
…and do it way in advance of any of those research programs getting anywhere close to AGI.
The last one is especially important. If some AI research program has already gotten to the point of super-powerful proto-AGI source code published on GitHub, there’s no way you’re going to stop people from using and improving it. Whereas if the research program is still very early-stage and theoretical, and needs many decades of intense work and dozens more revolutionary insights to really start getting powerful, then we have a shot at this kind of differential technological development strategy being viable.
(By the same token, maybe it will turn out that there’s no way to develop safe AGI, and we want to globally ban AGI development. I think if a ban were possible at all, it would only be possible if we got started when we’re still very far from being able to build AGI.)
So for example, if it’s possible to build a “prosaic” AGI using deep neural networks, nobody knows whether it would be possible to control and use it safely. There are some kinda-illegible intuitive arguments on both sides. Nobody really knows. People are working on clarifying this question, and I think they’re making some progress, and I’m saying that it would be really good if they could figure it out one way or the other ASAP.
Second of all, slow takeoff doesn’t necessarily mean that we can just wait and solve the alignment problem later. Sometimes you can have software right in front of you, and it’s not doing what you want it to do, but you still don’t know how to fix it. The alignment problem could be like that.
One way to think about it is: How slow is slow takeoff, versus how long does it take to solve the alignment problem? We don’t know.
Also, how much longer would it take, once somebody develops best practices to solve the alignment problem, for all relevant actors to reach a consensus that following those best practices is a good idea and in their self-interest? That step could add on years, or even decades—as they say, “science progresses one funeral at a time”, and standards committees work at a glacial pace, to say nothing of government regulation, to say nothing of global treaties.
Anyway, if “slow takeoff” is 100 years, OK fine, that’s slow enough. If “slow takeoff” is ten years, maybe that’s slow enough if the alignment problem happens to have a straightforward, costless, highly-legible and intuitive, scalable solution that somebody immediately discovers. Much more likely, I think we would need to be thinking about the alignment problem in advance.
For more detailed discussion, I have my own slow-takeoff AGI doom scenario here. :-P
(not an expert) My impression is that a perfectly secure OS doesn’t buy you much if you use insecure applications on an insecure network etc.
Also, if you think about classified work, the productivity tradeoff is massive: you can’t use your personal computer while working on the project, you can’t use any of your favorite software while working on the project, you can’t use an internet-connected computer while working on the project, you can’t have your cell phone in your pocket while talking about the project, you can’t talk to people about the project over normal phone lines and emails… And then of course viruses get into air-gapped classified networks within hours anyway. :-P
Not that we can’t or shouldn’t buy better security, I’m just slightly skeptical of specifically focusing on building a new low-level foundation rather than doing all the normal stuff really well, like network traffic monitoring, vetting applications and workflows, anti-spearphishing training, etc. etc. Well, I guess you’ll say, “we should do both”. Sure. I guess I just assume that the other things would rapidly become the weakest link.
In terms of low-level security, my old company has a big line of business designing the chips themselves to be more secure; they spun out Dover Microsystems to sell that particular technology to commercial (as opposed to military) customers. Just FYI, that’s one thing I happen to be familiar with. Actually I guess it’s not that relevant.
Hmm, I guess I wasn’t being very careful. Insofar as “helping future humans” is a different thing than “helping living humans”, it means that we could be in a situation where the interventions that are optimal for the former are very-sub-optimal (or even negative-value) for the latter. But it doesn’t mean we must be in that situation, and in fact I think we’re not.
I guess if you think: (1) finding good longtermist interventions is generally hard because predicting the far-future is hard, but (2) “preventing extinction (or AI s-risks) in the next 50 years” is an exception to that rule; (3) that category happens to be very beneficial for people alive today too; (4) it’s not like we’ve exhausted every intervention in that category and we’re scraping the bottom of the barrel for other things … If you believe all those things, then in that case, it’s not really surprising if we’re in a situation where the tradeoffs are weak-to-nonexistent. Maybe I’m oversimplifying, but something like that I guess?
I suspect that if someone had an idea for an intervention that they thought was super great and cost-effective for future generations and awful for people alive today, well, they would probably post that idea on EA Forum just like anything else, and then people would have a lively debate about it. I mean, maybe there are such things… just nothing springs to my mind.
I feel like that guy’s got a LOT of chutzpah to not-quite-say-outright-but-very-strongly-suggest that the Effective Altruism movement is a group of people who don’t care about the Global South. :-P
More seriously, I think we’re in a funny situation where maybe there are these tradeoffs in the abstract, but they don’t seem to come up in practice.
Like in the abstract, the very best longtermist intervention could be terrible for people today. But in practice, I would argue that most if not all current longtermist cause areas (pandemic prevention, AI risk, preventing nuclear war, etc.) are plausibly a very good use of philanthropic effort even if you only care about people alive today (including children).
Or, in the abstract, AI risk and malaria are competing for philanthropic funds. But in practice, a lot of the same people seem to care about both, including many of the people that the article (selectively) quotes. …And meanwhile most people in the world care about neither.
I mean, there could still be an interesting article about how there are these theoretical tradeoffs between present and future generations. But it’s misleading to name names and suggest that those people would gleefully make those tradeoffs, even if it involves torturing people alive today or whatever. Unless, of course, there’s actual evidence that they would do that. (The other strong possibility is, if actually faced with those tradeoffs in real life, they would say, “Uh, well, I guess that’s my stop, this is where I jump off the longtermist train!!”).
Anyway, I found the article extremely misleading and annoying. For example, the author led off with a quote where Jaan Tallinn says directly that climate change might be an existential risk (via a runaway scenario), and then two paragraphs later the author is asking “why does Tallinn think that climate change isn’t an existential risk?” Huh?? The article could have equally well said that Jaan Tallinn believes that climate change is “very plausibly an existential risk”, and Jaan Tallinn is the co-founder of an organization that does climate change outreach among other things, and while climate change isn’t a principal focus of current longtermist philanthropy, well, it’s not like climate change is a principal focus of current cancer research philanthropy either! And anyway it does come up to a reasonable extent, with healthy discussions focusing in particular on whether there are especially tractable and neglected things to do.
So anyway, I found the article very misleading.
(I agree with Rohin that if people are being intimidated, silenced, or cancelled, then that would be a very bad thing.)
Just one guy, but I have no idea how I would have gotten into AGI safety if not for LW … I had a full-time job and young kids and not-obviously-related credentials. But I could just come out of nowhere in 2019 and start writing LW blog posts and comments, and I got lots of great feedback, and everyone was really nice. I’m full-time now, here’s my writings, I guess you can decide whether they’re any good :-P