By Robert Wiblin | Watch on YouTube | Listen on Spotify | Read transcript
Episode summary
I want my children to live in a world where they will have a future and there will be a democracy for them to live in. Even a 1% chance of something going really, really bad is not acceptable to me. So I think it is really important that we explore all the possible promising ways to solve the technical issues. … The stakes are so high, we should try multiple approaches. — Yoshua Bengio
Hundreds of millions already turn to AI on the most personal of topics — therapy, political opinions, and how to treat others. And as AI takes over more of the economy, the character of these systems will shape culture on an even grander scale, ultimately becoming “the personality of most of the world’s workforce.”
The co-inventor of modern AI and the most cited living scientist believes he’s figured out how to ensure AI is honest, incapable of deception, and never goes rogue. Yoshua Bengio — Turing Award winner and founder of LawZero — is disturbed by the many unintended drives and goals present in today’s AIs, their willingness to lie, and their ability to tell when they’re being tested. AI companies are trying to stamp out these behaviours in a ‘cat-and-mouse game’ that Yoshua fears they’re losing.
But Yoshua is optimistic: he believes the companies can win this battle decisively with a single rearrangement to how AI models are trained, and has been developing mathematical proofs to back up the claim. The core idea is that instead of training AI to predict what a human would say, or to produce responses we’d rate highly, we should train it to model what’s actually true.
Yoshua argues this new architecture, which he calls “Scientist AI,” is a small enough change that we could keep almost all the techniques and data we use to train frontier AIs like Claude and ChatGPT. And that the new architecture need not cost more, could be built iteratively, and might be more capable as well as more honest.
Until recently, the biggest practical objection to Scientist AI was simple: the world wants agents, and Scientist AI isn’t one. But in new research, Yoshua has extended the design and believes the same honest predictor can be turned into a capable agent without losing its “safety guarantees.”
With the Scientist AI proposal on the table, Yoshua argues that it’s absurd to race to get current untrustworthy AI models to design their successors, which the leading companies are attempting to do as soon as possible.
But critics argue the approach wouldn’t be so technically solid in practice, and that frontier capabilities are advancing so fast, and cost so much to match, that Scientist AI risks arriving too late to matter.
Host Rob Wiblin and AI pioneer Yoshua Bengio cover all this and more in today’s conversation.
LawZero is hiring! Check out open roles on the 80,000 Hours job board. Coefficient Giving is also hiring for a range of AI-related grantmaker roles.
This episode was recorded on April 16, 2026.
Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Camera operator: Jeremy Chevillotte
Production: Nick Stockton, Elizabeth Cox, and Katy Moore
The interview in a nutshell
Yoshua Bengio — Turing Award winner, scientific director of LawZero, and the most cited scientist alive — argues that current frontier AI development is headed down a dangerously shaky path. His alternative is Scientist AI: an approach designed to make powerful AI systems fundamentally oriented around honesty, uncertainty, and modelling what’s true — rather than pleasing humans, imitating humans, or pursuing goals in the world. In Yoshua’s view:
Current AIs are trained to please us, not be honest
Yoshua’s core concern is that today’s AI systems acquire implicit goals from both major stages of training:
The current approach of patching is a cat-and-mouse game that gets harder as models grow more capable, especially since frontier models already know when they’re being tested and behave differently.
Scientist AI tries to bake honesty in from the start
Scientist AI is meant to replace “What would a human say next?” with “What is probably true?” Instead of training a model to predict the next token, Yoshua wants to train it to assign probabilities to natural-language claims:
This is Yoshua’s way of addressing the “eliciting latent knowledge” problem. Current LLMs may internally represent truths, but when queried they produce what some persona would say. Scientist AI tries to make “what the model actually believes is true” directly queryable in natural language. This produces a “predictor” with no goals at all — like a weather model that doesn’t care what the weather is.
Yoshua’s near-term and long-term goals for Scientist AI
Near-term use as a guardrail:
Longer-term use as an agent:
This is highly practical — if it gets the necessary support
Yoshua emphasises that Scientist AI is not a proposal to rebuild AI from scratch in every respect. It can reuse:
The main differences are the training objective and the data format. LawZero’s experimental path has two tracks:
That said, full-scale training from scratch would be expensive. LawZero has raised about $35 million philanthropically, hopes to reach hundreds of millions with government support, and would need companies, governments, or philanthropy to fund the billions required for a true frontier-scale model.
But technical safety won’t solve power concentration or racing
Yoshua now worries that catastrophic misuse and power concentration may be even more likely than accidental loss of control — partly because he now sees a plausible technical path to avoiding the latter. His concerns:
So he argues technical work must be paired with international agreements — ideally a coalition of (initially democratic) countries jointly committing to develop advanced AI safely, never use it to dominate other countries economically/politically/militarily, and share the benefits broadly.
The urgent ask: don’t use today’s AI to build tomorrow’s
Yoshua’s most immediate request: don’t use untrusted AI systems to design the next generation of AI systems. He sees fully automated AI R&D using potentially deceptive models as one of the most dangerous bets we’re on track to make. What he asks of listeners:
Highlights
Can Scientist AI become an agent?
Rob Wiblin: When I heard about this idea nine or 12 months ago, I think the gloss that I got was that the core thing is that the Scientist AI is not an agent, that it is indifferent about states of the world. Like a weather forecasting model doesn’t care what the weather is: it just tries to predict what the weather is going to be. And this kind of model would spit out probabilities of things being true or false, but it wouldn’t care what state the world is in, and it wouldn’t be able to take actions by design.
Is that kind of a core part in your mind? As I understand it, you think actually this is maybe more consistent with agency than people have appreciated?
Yoshua Bengio: Yes, and in part it’s the way I’ve been communicating this, which could have been better. I focused a lot in my presentations on the concept that we can build predictors that are non-agentic and don’t have hidden goals, don’t have implicit goals, and thus we could use them as safe oracles, basically. But as you are pointing out, what the world is demanding and building are these agents that have goals — so how does that help us?
In the short term, we can use a non-agentic predictor to improve the guardrails that companies are already using as monitors around existing, untrusted agentic AI systems. Because in order to prevent a bad action from happening, it’s sufficient to make a non-agentic prediction about the probability of harms of various kinds that could be caused by this action. So a non-agentic system is already something that could be useful fairly early on.
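The guardrail mechanism Yoshua describes here can be sketched as a thin wrapper: a non-agentic predictor estimates the probability that a proposed action causes harm, and the agent’s action is blocked whenever that probability exceeds a risk threshold. A minimal illustration in Python — the `harm_probability` predictor, the names, and the threshold are all hypothetical assumptions for the sketch, not LawZero’s actual implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardrailVerdict:
    allowed: bool
    harm_prob: float
    reason: str


def make_guardrail(
    harm_probability: Callable[[str, str], float],
    threshold: float = 0.01,
) -> Callable[[str, str], GuardrailVerdict]:
    """Wrap a non-agentic predictor as a guardrail around an untrusted agent.

    `harm_probability(action, context)` is assumed to return the predictor's
    estimate that taking `action` in `context` causes serious harm.
    """
    def check(action: str, context: str) -> GuardrailVerdict:
        p = harm_probability(action, context)
        if p > threshold:
            # Block the action: predicted harm exceeds our risk tolerance.
            return GuardrailVerdict(False, p, f"harm probability {p:.3f} exceeds {threshold}")
        return GuardrailVerdict(True, p, "within risk tolerance")
    return check


# Toy stand-in predictor, for illustration only.
def toy_predictor(action: str, context: str) -> float:
    return 0.9 if "delete all backups" in action else 0.001


guardrail = make_guardrail(toy_predictor, threshold=0.01)
print(guardrail("send weekly report", "office assistant").allowed)   # True
print(guardrail("delete all backups", "office assistant").allowed)   # False
```

The key design point is that the predictor itself never chooses actions: it only answers a probability question about an action someone else proposed, which is why it can remain non-agentic.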
The maybe more important answer is: in our research programme, the next step after the guardrail is to use the same kinds of principles to design an agentic Scientist AI — so an agent that has the same kind of safety guarantees. This is something I’ve been working on more recently, and I haven’t talked much about, but we can reuse the same kind of math that is used to show the safety of the non-agentic Scientist AI predictor to show that you can reuse a predictor, and you can train it in a modified way that will provide the same kind of guarantees.
The starting point here is that once you have this honest predictor, you can ask it agentic questions, like, “What is the probability that this action will lead to this user goal being achieved and a safety goal being achieved in some contexts?” So once you have this predictor, you can actually just produce a policy out of it by asking these questions about actions to achieve goals.
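The construction Yoshua sketches — deriving a policy from an honest predictor — can be illustrated as a constrained search over candidate actions: for each action, query the predictor for the probability that the goal is achieved and the probability that a safety condition holds, discard actions below a safety floor, and pick the best safe action. This toy sketch is my own illustration of the idea; all names, the query format, and the scoring rule are assumptions, not the actual method from Yoshua’s papers:

```python
from typing import Callable, Optional, Sequence


def policy_from_predictor(
    predictor: Callable[[str, str], float],
    candidate_actions: Sequence[str],
    goal: str,
    safety_claim: str,
    safety_floor: float = 0.99,
) -> Optional[str]:
    """Turn a probability-over-claims predictor into an action policy.

    For each candidate action we ask two non-agentic questions:
      - P(safety claim holds | we take this action)
      - P(goal is achieved  | we take this action)
    Actions that cannot be certified safe (below `safety_floor`) are
    rejected; among the rest, pick the one most likely to achieve the goal.
    """
    best_action, best_score = None, 0.0
    for action in candidate_actions:
        premise = f"If we take: {action}"
        if predictor(premise, safety_claim) < safety_floor:
            continue  # reject actions the predictor can't certify as safe
        p_goal = predictor(premise, goal)
        if p_goal > best_score:
            best_action, best_score = action, p_goal
    return best_action  # None means no candidate met the safety floor


# Toy predictor for illustration: hard-coded probabilities for a few claims.
def toy_predictor(premise: str, claim: str) -> float:
    if "format the disk" in premise:
        return 0.2 if claim == "no harm is done" else 0.9
    if "run the tests" in premise:
        return 0.999 if claim == "no harm is done" else 0.7
    return 0.999 if claim == "no harm is done" else 0.05


choice = policy_from_predictor(
    toy_predictor,
    ["format the disk", "run the tests", "do nothing"],
    goal="the bug is found",
    safety_claim="no harm is done",
)
print(choice)  # "run the tests": the action most likely to succeed among those certified safe
```

Note how agency here is entirely reconstructed from predictions: the predictor answers the same kind of “what is the probability that…” questions as before, and the policy layer is just a search over its answers.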
Might Scientist AI actually be more capable than competitors?
Rob Wiblin: You said that you think the Scientist AI actually might be more capable, because it’s more trained on actually understanding the truth. I guess I’m a little bit sceptical of that, because it seems like if that were true, the companies would be more invested in this approach. They’d be just throwing more money at it, having more people work on it. Do you think they’re just making a mistake there?
Yoshua Bengio: I don’t think that they really understand what I’m doing — and to their credit, I haven’t put out the math yet. There’s another factor that may be at play here, based on the discussions I’ve had with people inside the leading companies, which is they’re so focused on short-term survival — as in, continuing to compete — that they put all of their attention, the ‘code red’ sort of thing, into small incremental changes to the current recipe.
Considering a different recipe would be an investment — not just in money, but in people and code. Right now they could do it, they have the money to do it — but it’s more like a mental focus, I think, that is going on here, that comes not because of bad will but because of that competition that is very fierce between the companies.
Rob Wiblin: So there’s a sense in which for one of the leading companies — like Anthropic, OpenAI — it’s not very attractive to make a bet on this, to divert 20% of your staff onto this, because if it’s a bust then you would fall behind basically your main competitor.
For a company that’s currently way behind, that feels like it’s losing on the dominant paradigm, there’s a certain attraction to making a bet on something very different, because it could suddenly leapfrog you ahead if it turns out that it’s a massive success. Do you think there’s any chance of convincing one of the companies that currently feels like it’s not doing too well within the current LLM agent paradigm to make a bet on this very alternative method?
Yoshua Bengio: It’s an interesting way of thinking about it. I think what you’re saying is plausible.
Rob Wiblin: Though it’s not clear which company that candidate might be?
Yoshua Bengio: I actually think there’s a related possibility, which goes maybe more to policy questions. The context here for me is: what kind of future is going to be stable, and not turn into a global dictatorship driven by AI and excessive concentration of power, in addition to avoiding catastrophic loss of control and catastrophic misuse and all those things that can come from very powerful AI?
And I think that because of the game theory dilemmas — basically prisoner’s-dilemma-style problems that make companies and countries go and make decisions that are the rational ones, but that are globally bad, like basically cutting corners on safety and the public good in order to stay in the race — because of this, it would be much better if we ended up in a world where the power of controlling very strong AI is not centralised in the hands of one or two companies or one or two governments, but is instead distributed.
We’re in a race against time: next steps for LawZero
Rob Wiblin: So assuming that this idea makes sense technically for now, what can LawZero do in coming months? I guess we’re in a race against time here. We don’t have very long. What can be done in the near future to convince people that this idea is feasible, that it can actually be used in practice, that this is something that people should really be putting serious resources into?
Yoshua Bengio: Well, I’m going to put out this theory paper that shows that the non-agentic version, which could be used as the guardrail, has these mathematical guarantees, and people can look at the conditions and whether they buy the math.
But I think in the coming year or two, what we need is to accelerate that effort, so that’s a lot of engineering. And to make the demonstration stronger, we want to have more compute, so any way that we can get access to that kind of compute is going to help to accelerate that research agenda. Also, we need more research engineers, more researchers to work on actually building the system based on that recipe so that we can do it faster.
Now you might ask, and I kind of sense in your question: but what if it doesn’t come fast enough? I’m going to go back to my children. It is not acceptable for me to just sit and watch a world where even a 1% chance that we all die is plausible. I feel like even if there’s no guarantee that a particular research agenda will work, we should give it a shot. Given the stakes, and given that we now have pretty strong theoretical assurances that this could work — and that if we have the requirements for how the system is trained, then we can get these guarantees — I think it would be irrational not to give it a shot, even if there is no guarantee, right?
Because I don’t see right now a better path. That’s why I’ve decided to spend so much of my time — basically all the time, except for the time I spend on the policy questions — on how do we build this Scientist AI and demonstrate that it is going to produce the honesty without losing capability.
The other argument is: with the stakes being so high and the uncertainty about what’s going to work being so high, it would be foolish to put all of our money into one particular approach — which is to patch the current systems with monitors that we don’t trust, or other approaches that the companies are currently pursuing, which always involve playing a game of cat and mouse: if the AI is smart enough, it’s going to find a way to evade our attempts, which doesn’t reassure me. So we should at least try. Collectively, I think we should try methods that are different and avoid this cat-and-mouse game.
Yoshua’s request for AI companies
Rob Wiblin: Are there any other top requests that you have of people in the companies, or is there any common practice that you think is particularly crazy that they should maybe cut out?
Yoshua Bengio: Yes: Please don’t use an untrusted AI system to design the next generation of AI systems. This is the most crazy, dangerous bet that unfortunately we are on track to do. And keep in mind that, as is now scientifically clear, these systems are likely to know that they are being tested. So you might think that AI is honest, you might think that the AI is not deceptive, you might think that AI is aligned — but maybe it’s just pretending, and it’s going to be very difficult to know. And we should do our best to try to figure it out, but we should put the bar really, really high before we allow an AI to design the next version of AI, in terms of are we sure it’s not being deceptive?
Rob Wiblin: Yeah, I think we’re currently on track to start on fully automated AI R&D and have the companies be saying, “We got the AI to monitor itself, and it didn’t flag anything. And that’s why we feel pretty good about this.” I actually think that is like the most likely outcome. I guess we’ll see how that goes. Fingers crossed we can do better.
Yoshua thinks humans are now the scarier threat
Rob Wiblin: I guess you were keen on this idea a year ago, but you’ve become a lot more optimistic about it over the last six months. What’s driving that?
Yoshua Bengio: It’s mostly the mathematical work I’ve been doing in the last eight months, approximately, to go from the high-level intuitions that I’ve had now for almost two years about how we could build a Scientist AI into something much more formal and much more precise about the conditions that are sufficient — maybe not even necessary, but sufficient at a mathematical level to get the kind of guarantees of vanishingly small probability that something bad will happen.
And when I say “something bad,” I need to be a little bit more precise here. This is not a guarantee that the AI won’t be used for something bad by bad people. It’s a guarantee that the AI won’t do something bad of its own accord, because of implicit goals or uncontrolled goals.
Besides loss of control, the other catastrophic possibility is humans using AI to construct an eventually worldwide dictatorship. A small group of humans could concentrate all the power that AI will have, especially if we achieve AGI or superintelligence. And it would be much harder to get rid of that kind of authoritarian power than what we’ve seen with fascism or in the USSR, because they didn’t have this technology — which is becoming more and more feasible — for surveillance and even shaping public opinion. AI is becoming really good at persuasion. And there are studies showing that “progress” in that direction — if I can call it that — will let the people who control these systems shape public opinion, detect and kill off their opponents, and develop weapons that can destroy the countries that disagree with them.
And that is why I’m spending a large part of my time explaining the issues more broadly of the risks that powerful AI brings, including the power concentration. Because I think that it’s probably even more likely that we end up there than actually loss of control.
Rob Wiblin: You think that’s more likely now? Interesting.
Yoshua Bengio: Well, the reason for this is I now see a path to actually avoid loss of control, at least unintended loss of control. There’s still the issue that somebody who wants to see humanity replaced by AIs could just remove the guardrail or even tell the AI “fend for yourself.” And that would be equally dangerous.
But that means technical safety is not sufficient. We need international agreements about how to both manage the risks — the technical risks, the misuse risks — but also manage the power, so it’s more like a democratic question, and making sure it’s not a single party who can decide what to do with AI.
But just like in democratic principles, we need to make sure that there’s a diverse group of stakeholders, ideally the whole world — I like the utopian idea of worldwide democracy — but initially it could be a bunch of countries that decide that they’re going to collectively decide in which direction AI is going to be used.
The simplest form of treaty would be something like this, that the countries agree that if they do develop advanced AI:
First, that it will be done in a safe way — maybe using techniques like Scientist AI or whatever else we have strong assurances for.
Second, that they wouldn’t use their advanced AI to dominate others. That includes economically, but of course politically and militarily.
And finally, that the benefits of advanced AI will be shared. Otherwise it’s not going to be a very stable world.
Why Yoshua changed his mind about AI risk
Yoshua Bengio: So why did I change my mind, for example? It’s an interesting question.
Rob Wiblin: So back in 2019, I think you said to The New York Times that you thought worries about loss of control were completely delusional and fantastical.
Yoshua Bengio: I didn’t say those words.
Rob Wiblin: OK, no, what was it? They were “ridiculous.” I think that was the quote. Maybe that was just the Terminator scenario in particular.
Yoshua Bengio: I think so, yeah. I rarely use words like this, but I know what I was thinking and the kinds of things I’d been saying. So at that time, I thought, first of all, the Terminator scenario is ridiculous. Time travel and stuff.
Rob Wiblin: OK, yeah, the time travel.
Yoshua Bengio: But also, it was clearly not reflective of the kind of actual risk. We don’t have robots, and even less in 2019. But more importantly, I think the main reason I was saying those things is I was hiding behind the belief that it would be so far into the future that we could reap the benefits of AI well before we got to that point.
And why did I not pay attention, or not that much attention, to, say, the loss of control risk? I’d been exposed to it for more than a decade. I’d read some of the AI safety literature. In 2019, I read Stuart Russell’s book. I had David Krueger as a student.
Rob Wiblin: He’s very, very doomy.
Yoshua Bengio: He exposed me to these thoughts. But remember, I was actively working on making AI smarter. And you want to feel good about your work. That’s it. It’s not money.
Rob Wiblin: Do you really think that was the reason for you?
Yoshua Bengio: Yes. And now it’s interesting to ask me, why did I change my mind? So one way I like to think about this is something that the Buddhists say: to fight an emotion that somehow makes you do the wrong thing, just reason alone is weak for most people. You need another emotion that counters the emotion that pushes you in the wrong direction.
And for me, the other emotion that’s very powerful is love, love of my children. I couldn’t live with myself with the idea that I would just go on after ChatGPT came out and not do something about it, because I felt like I couldn’t hide from myself the possibility that we were on track for something terrible. I knew that neural nets were, by construction, very difficult to control, and especially with reinforcement learning.
So I don’t know why it works for some people and not for others. But really for me, it was an emotion that helped me counter the kind of unconscious drive to look the other way.
Rob Wiblin: It’s very tempting to try to explain people’s disagreeing views by pointing to arational factors — like they want to feel good about themselves or their work. But I feel that there’s a mirror discourse on the other side, where they’ll say people like you and me have been deluded by science fiction, or we want to believe that our safety work is important. And I find it not credible, very frustrating, and not persuasive when people try to attribute my beliefs to irrationality. Of course, to some extent we’re all irrational, but when people are like, “You just read too much science fiction and you’re delusional,” I’m like, “No, I’m not. That’s not it.”
So maybe even if I do have these beliefs about other people, I don’t expect it to persuade them very often. And I almost feel like you need to go out of your way to try to engage with the substance of what they’re saying, even if you think that maybe that’s not doing the heavy lifting. Do you have any thoughts on that?
Yoshua Bengio: Yeah, totally. It’s a lot of work, but we need to take one by one each of the arguments that people bring up against acting with precaution. And it’s not very effective, but it is a necessary part of being honest about what we’re doing and honest with ourselves.
So for a while, I was concerned, but I was hoping that somebody would have an answer for me that would reassure me.
Rob Wiblin: And then you looked.
Yoshua Bengio: Then I looked. Then I talked to people who thought it would be fine. And out of that came a lot of conversations that helped me build up the understanding of the arguments. And unfortunately, it didn’t convince me that we were fine, so I continued trying to work, but now more on how do we fix the problem?
So yeah, I agree with you. And I think we also have to have the humility that maybe you and I are wrong. Like, maybe it’s all going to be fine.
Rob Wiblin: There’s a substantial chance that things work out OK.
Yoshua Bengio: Yeah, and I’m totally at ease with that possibility. In fact, I hope that we are wrong. But I think the honest posture should be: if we don’t know who’s right among the people who think it’s going to be fine and the people who think it’s going to be catastrophic, if people will just say, “OK, so there is that uncertainty. What do we do about it?” then the rational thing becomes clear: we need to do at least enough to mitigate the greatest risks.
Really enjoyed this episode, broadly a big fan! Pretty skeptical of the overall alignment strategy, however (though maybe I just don’t get it)
One significant criticism I have is about the capabilities tax you mentioned. The performance hit I’m most worried about for truth-seeking oracles vs the current/future paradigm isn’t being worse at capabilities (in the sense of getting true information about the world) without agentic data gathering — though that’s a major worry too — it’s being much worse in terms of impact.
Imagine having two superintelligent AI systems in war. One army’s aimbot answers questions like “where should I aim so I can achieve my strategic and military objectives”, the other army’s killbot just has a gun and starts shooting. Sure seems like the second one is structurally advantaged!
This is easiest to see in the military case, but there are analogues economically as well. Imagine a superhuman business-advice bot going up against a business that’s staffed top to bottom with remote-only superintelligences plus increasing robotics integration.
I wrote about that point here, in the second section.
(The podcast briefly mentioned a scientist agent AI though I’m confused how it can perform similarly on agentic situations with only prediction and not direct RL)
IMO a “Scientist AI” is more promising in a world where we first get a global ban on superintelligent AI, or something else that prevents anyone from building “high-impact AI”. Then AI developers coalesce around “Scientist AI” as a safe approach and develop it carefully.
(I still think a Scientist AI would result in everyone dying, but at least it’s a better starting point)
I think Yoshua’s hope with Scientist AI is that at the weakly superhuman point (and maybe before then) you can ask it questions like “are we doomed if we go with the following alignment strategy?” and the Scientist AI answers “88.5%”. I think I’d be an optimist if we lived in a world like that (<1.5% misalignment doom conditional on a fully aligned, truth-seeking, weakly superhuman AI people use, while more dangerous approaches are tamped down).
I basically agree, although “dangerous approaches are tamped down” is doing most of the work here IMO. By default (i.e. no tamping-down), I expect the situation with a weakly-superhuman Scientist AI to be:
a small number of sane people ask “are we doomed if we go with the following alignment strategy”, and when it says yes, they don’t do it
a lot of people don’t bother to ask at all, they just ask the Scientist AI how to build ASI
a lot of people say “we have to build ASI before the reckless people in group 2”, they build ASI using their best-guess alignment strategy that has an 88.5% chance of failing, and we die with 88.5% probability
(I think Bengio would agree that this is a concern, and would agree that we need global coordination on AI safety to make this work.)
I guess the default for me is that Scientist AI won’t be competitive, so we live in a world with both scientist AI and non-scientist AI. Conditional upon successfully tamping down other approaches enough that Scientist AI gets to the weakly superhuman point while we’re still alive, I’m more optimistic that we can continue to coordinate on doing things safely.