Yay, I was really looking forward to this! <3
My first question [meant to open a friendly conversation even though it is phrased in a direct way] is “why do you think this won’t kill us all?”
Specifically, it sounds like you’re doing a really good job creating an AI that is capable of planning through complicated vague problems. That’s exactly what we’re afraid of, no?
“Our mission is to automate and scale open-ended reasoning” (ref)
My next questions would depend on your answer here, but I’ll guess a few follow-ups in sub-comments.
Epistemic Status: I have no idea what I’m talking about, just trying to form initial opinions
I’d say what we’re afraid of is that we’ll have AI systems that are capable of sophisticated planning but that we don’t know how to channel those capabilities into aligned thinking on vague complicated problems. Ought’s work is about avoiding this outcome.
At this point we could chat about why it’s plausible that we’ll have such capable but unaligned AI systems, or about how Ought’s work is aimed at reducing the risk of such systems. The former isn’t specific to Ought, so I’ll point to Ajeya’s post Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.
I just want to highlight the key assumption Ajeya’s argument rests on: The system is end-to-end optimized on a feedback signal (generally from human evaluations), i.e. all its compute is optimizing a signal that has no way to separate “fake it while in training” from “have the right intent”, and so can lead to catastrophic outcomes when the system is deployed.
How does Ought’s work help avoid that outcome?
We’re breaking down complex reasoning into processes with parts that are not jointly end-to-end optimized. This makes it possible to use smaller models for individual parts, makes the computation more transparent, and makes it easier to verify that the parts are indeed implementing the function that we (or future models) think they’re implementing.
You can think of it as interpretability-by-construction: Instead of training a model end-to-end and then trying to see what circuits it learned and whether they’re implementing the right thing, take smaller models that you know are implementing the right thing and compose them (with AI help) into larger systems that are correct not primarily based on empirical performance but based on a priori reasoning.
This is complementary to traditional bottom-up interpretability work: The more decomposition can limit the amount of black-box compute and uninterpretable intermediate state, the less weight rests on circuits-style interpretability and ELK-style proposals.
We don’t think we’ll be able to fully avoid end-to-end training (it’s ML’s magic juice, after all), but we think that reducing it is helpful even on the margin. From our post on supervising process, which has a lot more detail on the points in this comment: “Inner alignment failures are most likely in cases where models don’t just know a few facts we don’t but can hide extensive knowledge from us, akin to developing new branches of science that we can’t follow. With limited compute and limited neural memory, the risk is lower.”
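To make the shape of this more concrete, here is a minimal Python sketch of what “compose small parts without joint end-to-end optimization” could look like. It is purely illustrative, not Ought’s actual code: the sub-models are placeholder lambdas and every name is hypothetical. The structural point is that each step’s input and output is explicit text that a person (or another model) can check on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins for small, separately trained or prompted models.
# Each handles one narrow sub-task; none are optimized end-to-end against
# the quality of the final answer.
SubModel = Callable[[str], str]

@dataclass
class Step:
    name: str        # human-readable description of the sub-task
    model: SubModel  # the small model assigned to it

def run_decomposed(task: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Run each sub-task separately, keeping every intermediate result.

    Because each step's input and output is explicit text, the parts can be
    verified individually instead of only judging one end-to-end output.
    """
    trace: List[Tuple[str, str]] = []
    context = task
    for step in steps:
        result = step.model(context)
        trace.append((step.name, result))
        context = f"{context}\n{step.name}: {result}"
    return trace

# Toy placeholder sub-models (in practice: small language models or classifiers).
steps = [
    Step("list candidate considerations", lambda ctx: "cost; evidence quality; timelines"),
    Step("evaluate each consideration", lambda ctx: "evidence quality seems most decision-relevant"),
    Step("draft a conclusion", lambda ctx: "prioritize checking evidence quality first"),
]

for name, result in run_decomposed("How should I prioritize these research leads?", steps):
    print(f"{name} -> {result}")
```

In a setup like this, the “magic” lives in many small, separately checkable calls rather than in one opaque end-to-end computation, which is what makes a priori reasoning about the parts possible at all.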
I simply agree, no need to convince me there 👍
Ought’s approach:
Instead of giving a training signal only after the entire AI produces an output,
give a signal after each sub-module produces its output.
Yes?
My worry: The sub-modules will themselves be misaligned.
Is your suggestion to limit the compute and neural memory of sub-models in order to lower the risk?
And my second worry is that the “big AI” (the collection of sub-models) will be so good that you could ask it to perform a task and it will be exceedingly effective at it, in some misaligned-to-our-values (misaligned-to-what-we-actually-meant) way.
The product you are building only gives advice, it doesn’t take actions
If this would be enough, couldn’t we make a normal AGI and only ask it for advice, without giving it the capability to take actions?
For AGI there isn’t much of a distinction between giving advice and taking actions, so this isn’t part of our argument for safety in the long run. But in the time between here and AGI it’s better to focus on supporting reasoning to help us figure out how to manage this precarious situation.
Do I understand correctly that “safety in the long run” is unrelated to what you’re currently doing in any negative way, i.e. you don’t think you’re advancing AGI-relevant capabilities (and so there is no need to try to align-or-whatever your forever-well-below-AGI system)?
Please feel free to correct me!
No, it’s that our case for alignment doesn’t rest on “the system is only giving advice” as a step. I sketched the actual case in this comment.
Sub-models will be aligned / won’t be dangerous:
1. If your answer is “sub-models only run for 15 minutes” [need to find where I read this]: if that would help align [black-box] sub-models, then couldn’t we use it to align an entire [black-box] AI? Seems to me like the sub-models might still be unaligned.
2. If your answer is “[black-box] sub-models will get human feedback or supervision”: would that be enough to align a [black-box] AI?
The things we’re aiming for that make sub-models easier to align:
(Inner alignment) Smaller models, making it less likely that there’s scheming happening that we’re not aware of; making the bottom-up interpretability problem easier
(Outer alignment) More well-specified tasks, making it easier to generate a lot of in-distribution feedback data; making it easier to do targeted red-teaming
Would you share with me some typical example tasks that you’d give a submodel and typical good responses it might give back? (as a vision, so I’ll know what you’re talking about when you’re saying things like “well specified tasks”—I’m not sure if we’re imagining the same thing there. It doesn’t need to be something that already works today)
In a research assistant setting, you could imagine the top-level task being something like “Was this a double-blind study?”, which we might factor out as:
- Were the participants blinded?
  - Was there a placebo?
    - Which paragraphs relate to placebos?
    - Does this paragraph state there was a placebo?
    - …
  - Did the participants know if they were in the placebo group?
  - …
- Were the researchers blinded?
  - …
In this example, by the time we get to the “Does this paragraph state there was a placebo?” level, a submodel is given a fairly tractable question-answering task over a given paragraph. A typical response for this example might be a confidence level and text spans pointing to the most relevant phrases.
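As a purely illustrative sketch of how such a question tree and the leaf-level responses (confidence plus supporting spans) might be represented, here is a small Python example. It is not how Elicit is implemented; the leaf “sub-model” is a keyword-matching stub so the example runs end to end, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Answer:
    confidence: float   # e.g. 0.9
    spans: List[str]    # most relevant phrases from the source text

@dataclass
class Question:
    text: str
    children: List["Question"] = field(default_factory=list)

def answer_leaf(question: str, paragraph: str) -> Answer:
    """Stand-in for a small QA sub-model run over a single paragraph.

    Hypothetical stub: it only checks for shared keywords so the example is
    runnable; a real system would call a narrow model here.
    """
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 6}
    spans = [s for s in paragraph.split(". ") if any(k in s.lower() for k in keywords)]
    return Answer(confidence=0.9 if spans else 0.2, spans=spans)

def answer(question: Question, paragraph: str) -> Answer:
    if not question.children:
        return answer_leaf(question.text, paragraph)
    child_answers = [answer(c, paragraph) for c in question.children]
    # Naive aggregation: the parent is only as confident as its weakest child.
    return Answer(confidence=min(a.confidence for a in child_answers),
                  spans=[s for a in child_answers for s in a.spans])

tree = Question("Was this a double-blind study?", [
    Question("Were the participants blinded?", [
        Question("Was there a placebo?", [
            Question("Which paragraphs relate to placebos?"),
            Question("Does this paragraph state there was a placebo?"),
        ]),
        Question("Did the participants know if they were in the placebo group?"),
    ]),
    Question("Were the researchers blinded?"),
])

paragraph = ("Participants received either the drug or a matching placebo. "
             "Group assignment was concealed from participants.")
result = answer(tree, paragraph)
print(result.confidence, result.spans)
```

In this toy run the overall confidence is dragged down by the researcher-blinding branch, which the stub cannot answer from a single paragraph; the point is only that every leaf answer is a separately inspectable (confidence, spans) pair rather than an opaque internal state.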
Thank you, this was super informative! My understanding of Ought just improved a lot
Once you’re able to answer questions like that, what do you build next?
Is “Was this a double-blind study?” an actual question that your users/customers are very interested in?
If not, could you give me some other example that is?
You’re welcome!
The goal is for Elicit to be a research assistant, leading to more and higher-quality research. Literature review is only one small part of that: we would like to add functionality like brainstorming research directions, finding critiques, identifying potential collaborators, …
Beyond that, we believe that factored cognition could scale to lots of knowledge work. Anywhere the tasks are fuzzy, open-ended, or have long feedback loops, we think Elicit (or our next product) could be a fit. Journalism, think-tanks, policy work.
It is, very much. Answering so-called strength of evidence questions accounts for big chunks of researchers’ time today.
This research prioritizes reasoning over military robots
If your answer (from here) is:
“The goal of our work is to channel this growth toward good reasoning. We want AI to be more helpful for qualitative research, long-term forecasting, planning, and decision-making than for persuasion, keeping people engaged, and military robotics.”
Then:
Open-ended reasoning could be used for persuasion, military robotics (and creating paperclips) too, no?
We’re aiming to shift the balance towards supporting high-quality reasoning. Every tool has some non-zero usefulness for non-central use cases, but it seems unlikely that it will be as useful as tools that were made for those use cases.
“Every tool has some non-zero usefulness for non-central use cases, but it seems unlikely that it will be as useful as tools that were made for those use cases.”
I agree!
“supporting high-quality reasoning”
This sounds to me like almost the most generic-problem-solving thing someone could aim for, capable of doing many things without going outside the general use case.
As a naive example, couldn’t someone use “high quality reasoning” to plan how to make military robotics? (though the examples I’m actually worried about are more like “use high quality reasoning to create paperclips”, but I’m happy to use your one)
In other words, I’m not really worried about a chess robot being used for other things [update: wait, AlphaZero seems to be more general-purpose than expected], but I wouldn’t feel as safe with something intentionally meant for “high quality reasoning”.
[again, just sharing my concern, feel free to point out all the ways I’m totally missing it!]
I agree that misuse is a concern. Unlike alignment, I think it’s relatively tractable because it’s more similar to problems people are encountering in the world right now.
To address it, we can monitor and restrict usage as needed. The same tools that Elicit provides for reasoning can also be used to reason about whether a use case constitutes misuse.
This isn’t to say that we might not need to invest a lot of resources eventually, and it’s interestingly related to alignment (“misuse” is relative to some values), but it feels a bit less open-ended.
[debugging further]
Do you think misuse is a concern—to the point that if you couldn’t monitor and restrict usage—you’d think twice about this product direction?
Or is this more “this is a small issue, and we can even monitor and restrict usage, but even if we couldn’t then we wouldn’t really mind”?
What are your views on whether speeding up technological development is, in general, a good thing?
I’m thinking of arguments like https://forum.effectivealtruism.org/posts/gB2ad4jYANYirYyzh/a-note-about-differential-technological-development, which make me wonder if we should try to slow research instead of speeding it up.
Or do you think that Elicit will not speed up AGI capabilities research in a meaningful way? (Maybe because it will count as misuse)
It’s something I’m really uncertain about personally, and it’s going to heavily influence my decisions/life, so I’m really curious about your thoughts!