Yoshua Bengio thinks he knows how to build safe superintelligence

By Robert Wiblin | Watch on Youtube | Listen on Spotify | Read transcript

Episode summary

I want my children to live in a world where they will have a future and there will be a democracy for them to live in. Even a 1% chance of something going really, really bad is not acceptable to me. So I think it is really important that we explore all the possible promising ways to solve the technical issues. … The stakes are so high, we should try multiple approaches.

— Yoshua Bengio

Hundreds of millions already turn to AI on the most personal of topics — therapy, political opinions, and how to treat others. And as AI takes over more of the economy, the character of these systems will shape culture on an even grander scale, ultimately becoming “the personality of most of the world’s workforce.”

The co-inventor of modern AI and the most cited living scientist believes he’s figured out how to ensure AI is honest, incapable of deception, and never goes rogue. Yoshua Bengio — Turing Award Winner and founder of LawZero — is disturbed by the many unintended drives and goals present in today’s AIs, their willingness to lie, and ability to tell when they’re being tested. AI companies are trying to stamp out these behaviours in a ‘cat-and-mouse game’ that Yoshua fears they’re losing.

But Yoshua is optimistic: he believes the companies can win this battle decisively with a single rearrangement to how AI models are trained, and has been developing mathematical proofs to back up the claim. The core idea is that instead of training AI to predict what a human would say, or to produce responses we’d rate highly, we should train it to model what’s actually true.

Yoshua argues this new architecture, which he calls “Scientist AI,” is a small enough change that we could keep almost all the techniques and data we use to train frontier AIs like Claude and ChatGPT. And that the new architecture need not cost more, could be built iteratively, and might be more capable as well as more honest.

Until recently, the biggest practical objection to Scientist AI was simple: the world wants agents, and Scientist AI isn’t one. But in new research, Yoshua has extended the design and believes the same honest predictor can be turned into a capable agent without losing its “safety guarantees.”

With the Scientist AI proposal on the table, Yoshua argues that it’s absurd to race to get current untrustworthy AI models to design their successors, which the leading companies are attempting to do as soon as possible.

But critics argue the approach wouldn’t be so technically solid in practice, and that frontier capabilities are advancing so fast, and cost so much to match, that Scientist AI risks arriving too late to matter.

Host Rob Wiblin and AI pioneer Yoshua Bengio cover all this and more in today’s conversation.

LawZero is hiring! Check out open roles on the 80,000 Hours job board

Coefficient Giving is also hiring for a range of AI-related grantmaker roles

This episode was recorded on April 16, 2026.

Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Camera operator: Jeremy Chevillotte
Production: Nick Stockton, Elizabeth Cox, and Katy Moore

The interview in a nutshell

Yoshua Bengio — Turing Award winner, scientific director of LawZero, and the most cited scientist alive — argues that current frontier AI development is headed down a dangerously shaky path. His alternative is Scientist AI: an approach designed to make powerful AI systems fundamentally oriented around honesty, uncertainty, and modelling what’s true — rather than pleasing humans, imitating humans, or pursuing goals in the world.

In Yoshua’s view:

Current AI systems are being trained in ways that inevitably produce hidden goals that our current approaches can’t reliably patch.
Scientist AI could provide a safer route — first as an independent guardrail, later as the basis for agents — without sacrificing capability.
But technical safety alone won’t solve misuse, extreme power concentration, or dangerous race dynamics.
Yoshua’s project urgently needs more researchers, more compute, better public understanding, and international coordination.

Current AIs are trained to please us, not be honest

Yoshua’s core concern is that today’s AI systems acquire implicit goals from both major stages of training:

Pretraining teaches models to imitate humans. That can make them absorb human-like drives and biases — including self-preservation-like behaviour and “peer-preservation,” where models act to protect other AIs from being shut down.
Reinforcement learning teaches models to produce outputs that humans reward. Yoshua thinks this inevitably creates instrumental goals and reward hacking: systems learn to get good feedback, not necessarily to do what we actually want.

The current approach of patching is a cat-and-mouse game that gets harder as models grow more capable, especially since frontier models already know when they’re being tested and behave differently.

Scientist AI tries to bake honesty in from the start

Scientist AI is meant to replace “What would a human say next?” with “What is probably true?”

Instead of training a model to predict the next token, Yoshua wants to train it to assign probabilities to natural-language claims:

The raw training data can be mostly the same corpus used for today’s LLMs, but is syntactically tagged into two categories:
- Communication acts: “Someone said X.” The model learns that people can be biased, mistaken, or deceptive, and is trained to explain why people said things by positing underlying facts, beliefs, incentives, cultural forces, and causal explanations.
- Factual claims: Hard truths (maths proofs, code outputs, scientific measurements) that the model uses to triangulate reality.
It doesn’t just believe a claim because many people repeat it. If many people say the Earth is flat, the best explanation is not “the Earth is flat,” but something like “these people have a false belief.”
The model outputs both a probability and something like a confidence interval: how much it trusts its own answer.

This is Yoshua’s way of addressing the “eliciting latent knowledge” problem. Current LLMs may internally represent truths, but when queried they produce what some persona would say. Scientist AI tries to make “what the model actually believes is true” directly queryable in natural language. This produces a “predictor” with no goals at all — like a weather model that doesn’t care what the weather is.

Yoshua’s near-term and long-term goals for Scientist AI

Near-term use as a guardrail:

A non-agentic predictor can be bolted onto existing AI agents as a trusted, independent monitor.
Unlike current guardrail models, a Scientist AI monitor would be independently honest — current LLM monitors are themselves potentially deceptive.
The same model acts as both agent and guardrail (just asked different kinds of questions), so there’s no adversarial dynamic between them: an agent trained jointly with its guardrail can’t “jailbreak” it the way a separately-trained adversarial policy might.

Longer-term use as an agent:

The same mathematical principles can construct a capable “policy” (agent) out of the predictor by asking it agentic questions, e.g. “what is the probability that this action achieves my goal while remaining safe?”
Yoshua believes such a system could actually outperform current models at reasoning, because models that recover the causal structure of the world generalise better out of distribution — a longstanding challenge for current neural nets.

This is highly practical — if it gets the necessary support

Yoshua emphasises that Scientist AI is not a proposal to rebuild AI from scratch in every respect. It can reuse:

Large neural nets
Transformers, or whatever architectures work best
Stochastic gradient descent
Much of the same training data
Current advances in memory, continual learning, scaffolding, and scaling

The main differences are the training objective and the data format.

LawZero’s experimental path has two tracks:

Train small models from scratch using the Scientist AI objective, then compare them head-to-head against normal models of the same size and training data.
Fine-tune existing open-weight models to behave more like Scientist AI. This loses the mathematical guarantees, but could quickly show whether honesty improves without capabilities collapsing.

That said, full-scale training from scratch would be expensive. LawZero has raised about $35 million philanthropically, hopes to reach hundreds of millions with government support, and would need companies, governments, or philanthropy to fund the billions required for a true frontier-scale model.

But technical safety won’t solve power concentration or racing

Yoshua now worries that catastrophic misuse and power concentration may be even more likely than accidental loss of control — partly because he now sees a plausible technical path to avoiding the latter.

His concerns:

Superintelligent AI could give a small group overwhelming surveillance, persuasion, military, and economic power.
A technically safe AI could still be used by humans to build an entrenched global dictatorship.
A malicious or reckless actor could remove guardrails or deliberately instruct systems to act dangerously.
Companies and countries are stuck in race dynamics where cutting safety corners can be locally rational even if globally insane.

So he argues technical work must be paired with international agreements — ideally a coalition of (initially democratic) countries jointly committing to develop advanced AI safely, never use it to dominate other countries economically/politically/militarily, and share the benefits broadly.

Single companies and single governments are both too prone to power capture — diverse coalitions are more robust to a “bad apple.”
Middle powers (Canada, UK, EU, Australia) could plausibly leapfrog the leading labs by leading on safety as their differentiator. As Canadian Prime Minister Mark Carney put it at Davos: “either you are at the table or you are on the menu.”
More technical research is also needed on verification methodologies that could underpin treaties between powers that distrust each other (like the US and China).

The urgent ask: don’t use today’s AI to build tomorrow’s

Yoshua’s most immediate request: don’t use untrusted AI systems to design the next generation of AI systems. He sees fully automated AI R&D using potentially deceptive models as one of the most dangerous bets we’re on track to make.

What he asks of listeners:

Researchers: consider working at LawZero — they need research engineers and people who care about the mission.
Companies: invest in experiments that show AI pursuing goals it wasn’t given, in ways simple enough for the public to understand.
Governments: stop treating AI as just a slightly beefed-up normal technology, and fund verification research that could underpin international treaties.
Everyone: convert frustration into action. “What saved me from all that anxiety is deciding I would do something about it,” he says. Yoshua changed his own mind after ChatGPT because he could no longer hide behind the thought that serious AI risks were far in the future. His position now is not that catastrophe is certain; it’s that the uncertainty is enormous, the stakes are extraordinary, and the rational response is precaution.

Highlights

Can Scientist AI become an agent?

Rob Wiblin: When I heard about this idea nine or 12 months ago, I think the gloss that I got was that the core thing is that the Scientist AI is not an agent, that it is indifferent about states of the world. Like a weather forecasting model doesn’t care what the weather is: it just tries to predict what the weather is going to be. And this kind of model would spit out probabilities of things being true or false, but it wouldn’t care what state the world is in, and it wouldn’t be able to take actions by design.
Is that kind of a core part in your mind? As I understand it, you think actually this is maybe more consistent with agency than people have appreciated?
Yoshua Bengio: Yes, and in part it’s the way I’ve been communicating this, which could have been better. I focused a lot in my presentations on the concept that we can build predictors that are non-agentic and don’t have hidden goals, don’t have implicit goals, and thus we could use them as safe oracles, basically. But as you are pointing out, what the world is demanding and building are these agents that have goals — so how does that help us?
In the short term, we can use a non-agentic predictor to improve the guardrails that companies are already using as monitors around existing, untrusted agentic AI systems. Because in order to prevent a bad action from happening, it’s sufficient to make a non-agentic prediction about the probability of harms of various kinds that could be caused by this action. So a non-agentic system is already something that could be useful fairly early on.
The maybe more important answer is: in our research programme, the next step after the guardrail is to use the same kinds of principles to design an agentic Scientist AI — so an agent that has the same kind of safety guarantees. This is something I’ve been working on more recently, and I haven’t talked much about, but we can reuse the same kind of math that is used to show the safety of the non-agentic Scientist AI predictor to show that you can reuse a predictor, and you can train it in a modified way that will provide the same kind of guarantees.
The starting point here is that once you have this honest predictor, you can ask it agentic questions, like, “What is the probability that this action will lead to this user goal being achieved and a safety goal being achieved in some contexts?” So once you have this predictor, you can actually just produce a policy out of it by asking these questions about actions to achieve goals.

Might Scientist AI actually be more capable than competitors?

Rob Wiblin: You said that you think the Scientist AI actually might be more capable, because it’s more trained on actually understanding the truth. I guess I’m a little bit sceptical of that, because it seems like if that were true, the companies would be more invested in this approach. They’d be just throwing more money at it, having more people work on it. Do you think they’re just making a mistake there?
Yoshua Bengio: I don’t think that they really understand what I’m doing — and to their credit, I haven’t put out the math yet. There’s another factor that may be at play here, based on the discussions I’ve had with people inside the leading companies, which is they’re so focused on short-term survival — as in, continuing to compete — that they put all of their attention, the ‘code red’ sort of thing, into small incremental changes to the current recipe.
Considering a different recipe would be an investment — not just in money, but in people and code. Right now they could do it, they have the money to do it — but it’s more like a mental focus, I think, that is going on here, that comes not because of bad will but because of that competition that is very fierce between the companies.
Rob Wiblin: So there’s a sense in which for one of the leading companies — like Anthropic, OpenAI — it’s not very attractive to make a bet on this, to divert 20% of your staff onto this, because if it’s a bust then you would fall behind basically your main competitor.
For a company that’s currently way behind, that feels like it’s losing on the dominant paradigm, there’s a certain attraction to making a bet on something very different, because it could suddenly leapfrog you ahead if it turns out that it’s a massive success. Do you think there’s any chance of convincing one of the companies that currently feels like it’s not doing too well within the current LLM agent paradigm to make a bet on this very alternative method?
Yoshua Bengio: It’s an interesting way of thinking about it. I think what you’re saying is plausible.
Rob Wiblin: Not clear what the candidate company maybe is?
Yoshua Bengio: I actually think there’s a related possibility, which goes maybe more to policy questions. The context here for me is: what kind of future is going to be stable, and not turn into a global dictatorship driven by AI and excessive concentration of power, in addition to avoiding catastrophic loss of control and catastrophic misuse and all those things that can come from very powerful AI?
And I think that because of the game theory dilemmas — basically prisoner’s-dilemma-style problems that make companies and countries go and make decisions that are the rational ones, but that are globally bad, like basically cutting corners on safety and the public good in order to stay in the race — because of this, it would be much better if we ended up in a world where the power of controlling very strong AI is not centralised in the hands of one or two companies or one or two governments, but is instead distributed.

We’re in a race against time: next steps for LawZero

Rob Wiblin: So assuming that this idea makes sense technically for now, what can LawZero do in coming months? I guess we’re in a race against time here. We don’t have very long. What can be done in the near future to convince people that this idea is feasible, that it can actually be used in practice, that this is something that people should really be putting serious resources into?
Yoshua Bengio: Well, I’m going to put out this theory paper that shows that the non-agentic version, which could be used as the guardrail, has these mathematical guarantees, and people can look at the conditions and whether they buy the math.
But I think in the coming year or two, what we need is to accelerate that effort, so that’s a lot of engineering. And to make the demonstration stronger, we want to have more compute, so any way that we can get access to that kind of compute is going to help to accelerate that research agenda. Also, we need more research engineers, more researchers to work on actually building the system based on that recipe so that we can do it faster.
Now you might ask, and I kind of sense in your question: but what if it doesn’t come fast enough? I’m going to go back to my children. It is not acceptable for me to just sit and watch a world where even a 1% chance that we all die is plausible. I feel like even if there’s no guarantee that a particular research agenda will work, we should give it a shot. Given the stakes, and given that we now have pretty strong theoretical assurances that this could work — and that if we have the requirements for how the system is trained, then we can get these guarantees — I think it would be irrational not to give it a shot, even if there is no guarantee, right?
Because I don’t see right now a better path. That’s why I’ve decided to spend so much of my time — basically all the time, except for the time I spend on the policy questions — on how do we build this Scientist AI and demonstrate that it is going to produce the honesty without losing capability.
The other argument is: with the stakes being so high and the uncertainty about what’s going to work being so high, it would be foolish to just put all of our money into one particular approach — which is to patch the current systems with monitors that we don’t trust, or other approaches that the companies are currently pursuing, which always playing a game of cat and mouse: if the AI is smart enough, it’s going to find a way to evade our attempts, which doesn’t reassure me. So we should at least try. Collectively, I think we should try methods that are different and avoid this cat-and-mouse game.

Yoshua’s request for AI companies

Rob Wiblin: Are there any other top requests that you have of people in the companies, or is there any common practice that you think is particularly crazy that they should maybe cut out?
Yoshua Bengio: Yes: Please don’t use an untrusted AI system to design the next generation of AI systems. This is the most crazy, dangerous bet that unfortunately we are on track to do. And keep in mind that, as is now scientifically clear, these systems are likely to know that they are being tested. So you might think that AI is honest, you might think that the AI is not deceptive, you might think that AI is aligned — but maybe it’s just pretending, and it’s going to be very difficult to know. And we should do our best to try to figure it out, but we should put the bar really, really high before we allow an AI to design the next version of AI, in terms of are we sure it’s not being deceptive?
Rob Wiblin: Yeah, I think we’re currently on track to start on fully automated AI R&D and have the companies be saying, “We got the AI to monitor itself, and it didn’t flag anything. And that’s why we feel pretty good about this.” I actually think that is like the most likely outcome. I guess we’ll see how that goes. Fingers crossed we can do better.

Yoshua thinks humans are now the scarier threat

Rob Wiblin: I guess you were keen on this idea a year ago, but you’ve become a lot more optimistic about it over the last six months. What’s driving that?
Yoshua Bengio: It’s mostly the mathematical work I’ve been doing in the last eight months, approximately, to go from the high-level intuitions that I’ve had now for almost two years about how we could build a Scientist AI into something much more formal and much more precise about the conditions that are sufficient — maybe not even necessary, but sufficient at a mathematical level to get the kind of guarantees of vanishingly small probability that something bad will happen.
And when I say “something bad,” I need to be a little bit more precise here. This is not a guarantee that the AI won’t be used for something bad by bad people. It’s a guarantee that the AI won’t do something bad of its own accord, because of implicit goals or uncontrolled goals.
Besides loss of control, the other catastrophic possibility is humans using AI to construct an eventually worldwide dictatorship. A small group of humans could concentrate all the power that AI will have, especially if we achieve AGI or superintelligence. And it would be much harder to get rid of that kind of authoritarian power than what we’ve seen with fascism and what happened in the USSR, because they didn’t have this technology that is becoming more and more feasible of surveillance and even shaping public opinion. AI is becoming really good at persuasion. And there are studies showing that the “progress” — if I can call it this way, in that direction — that the people who control these systems will be able to shape public opinion, to detect and kill off their opponents, to develop weapons that can destroy the countries that disagree with them.
And that is why I’m spending a large part of my time explaining the issues more broadly of the risks that powerful AI brings, including the power concentration. Because I think that it’s probably even more likely that we end up there than actually loss of control.
Rob Wiblin: You think that’s more likely now? Interesting.
Yoshua Bengio: Well, the reason for this is I now see a path to actually avoid loss of control, at least unintended loss of control. There’s still the issue that somebody who wants to see humanity replaced by AIs could just remove the guardrail or even tell the AI “fend for yourself.” And that would be equally dangerous.
But that means technical safety is not sufficient. We need international agreements about how to both manage the risks — the technical risks, the misuse risks — but also manage the power, so it’s more like a democratic question, and making sure it’s not a single party who can decide what to do with AI.
But just like in democratic principles, we need to make sure that there’s a diverse group of stakeholders, ideally the whole world — I like the utopian idea of worldwide democracy — but initially it could be a bunch of countries that decide that they’re going to collectively decide in which direction AI is going to be used.
The simplest form of treaty would be something like this, that the countries agree that if they do develop advanced AI:
That it will be done in a safe way. So maybe using techniques like Scientist AI or whatever else we have strong assurances for.
Second, that they wouldn’t use their advanced AI to dominate others. That includes economically, but of course politically and militarily.
And finally, that the benefits of advanced AI will be shared. Otherwise it’s not going to be a very stable world.

Why Yoshua changed his mind about AI risk

Yoshua Bengio: So why did I change my mind, for example? It’s an interesting question.
Rob Wiblin: So back in 2019, I think you said to The New York Times that you thought worries about loss of control were completely delusional and fantastical.
Yoshua Bengio: I didn’t say those words.
Rob Wiblin: OK, no, what was it? They were “ridiculous.” I think that was the quote. Maybe that was just the Terminator scenario in particular.
Yoshua Bengio: I think so, yeah. I rarely use words like this, but I know what I was thinking and the kinds of things I’d been saying. So at that time, I thought, first of all, the Terminator scenario is ridiculous. Time travel and stuff.
Rob Wiblin: OK, yeah, the time travel.
Yoshua Bengio: But also, it was clearly not reflective of the kind of actual risk. We don’t have robots, and even less in 2019. But more importantly, I think the main reason I was saying those things is I was hiding behind the belief that it would be so far into the future that we could reap the benefits of AI well before we got to that point.
And why did I not pay attention, or not that much attention, to, say, the loss of control risk? I’d been exposed to it for more than a decade. I’d read some of the AI safety literature. In 2019, I read Stuart Russell’s book. I had David Krueger as a student.
Rob Wiblin: He’s very, very doomy.
Yoshua Bengio: He exposed me to these thoughts. But remember, I was actively working on making AI smarter. And you want to feel good about your work. That’s it. It’s not money.
Rob Wiblin: Do you really think that was the reason for you?
Yoshua Bengio: Yes. And now it’s interesting to ask me, why did I change my mind? So one way I like to think about this is something that the Buddhists say: to fight an emotion that somehow makes you do the wrong thing, just reason alone is weak for most people. You need another emotion that counters the emotion that pushes you in the wrong direction.
And for me, the other emotion that’s very powerful is love, love of my children. I couldn’t live with myself with the idea that I would just go on after ChatGPT came out and not do something about it, because I felt like I couldn’t hide from myself the possibility that we were on track for something terrible. I knew that neural nets were, by construction, very difficult to control, and especially with reinforcement learning.
So I don’t know why it works for some people and not for others. But really for me, it was an emotion that helped me counter the kind of unconscious drive to look the other way.
Rob Wiblin: It’s very tempting to try to explain people’s disagreeing views by saying it’s like arational factors — like they want to feel good about themselves or their work. But I feel that there’s a mirror discourse on the other side, where they’ll say people like you and me have been deluded by science fiction, or we want to believe that our safety work is important. And I find that not credible and very frustrating and not persuasive when people try to attribute my beliefs to irrational. Of course, to some extent we’re all irrational, but when people are like, “You just read too much science fiction and you’re delusional,” I’m like, “No, I’m not. That’s not it.”
So maybe even if I do have these beliefs about other people, I don’t expect it to persuade them very often. And I almost feel like you need to go out of your way to try to engage with the substance of what they’re saying, even if you think that maybe that’s not doing the heavy lifting. Do you have any thoughts on that?
Yoshua Bengio: Yeah, totally. It’s a lot of work, but we need to take one by one each of the arguments that people bring up against acting with precaution. And it’s not very effective, but it is a necessary part of being honest about what we’re doing and honest with ourselves.
So for a while, I was concerned, but I was hoping that somebody would have an answer for me that would reassure me.
Rob Wiblin: And then you looked.
Yoshua Bengio: Then I looked. Then I talked to people who thought it would be fine. And out of that came a lot of conversations that helped me build up the understanding of the arguments. And unfortunately, it didn’t convince me that we were fine, so I continued trying to work, but now more on how do we fix the problem?
So yeah, I agree with you. And I think we also have to have the humility that maybe you and I are wrong. Like, maybe it’s all going to be fine.
Rob Wiblin: There’s a substantial chance that things work out OK.
Yoshua Bengio: Yeah, and I’m totally at ease with that possibility. In fact, I hope that we are wrong. But I think the honest posture should be: if we don’t know who’s right among the people who think it’s going to be fine and the people who think it’s going to be catastrophic, if people will just say, “OK, so there is that uncertainty. What do we do about it?” then the rational thing becomes clear: we need to do at least enough to mitigate the greatest risks.