3c. Other research, especially “learning to reason from humans,” looks more promising than HRAD (75%?)
I haven’t thought about this in detail, but whether the evidence in this section justifies the claim in 3c may depend, in part, on what you think the AI Safety project is trying to achieve.
On first pass, the “learning to reason from humans” project seems like it may be able to quickly and substantially reduce the chance of an AI catastrophe by introducing human guidance as a mechanism for making AI systems more conservative.
However, it doesn’t seem like a project that aims to do either of the following:
(1) Reduce the risk of an AI catastrophe to zero (or near zero)
(2) Produce an AI system that can help create an optimal world
If you think either (1) or (2) is a goal of AI Safety, then you might not be excited about the “learning to reason from humans” project.
You might think that “learning to reason from humans” doesn’t accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can’t rule out AI catastrophes with high certainty unless you can “peer inside the machine” so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.
You might think that “learning to reason from humans” doesn’t accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want “if we knew more, thought faster, were more the people we wished we were” etc. then the approval of actual humans might, at some point, cease to be helpful.
A human can spend an hour on a task, and train an AI to do that task in milliseconds.
Similarly, an aligned AI can spend an hour on a task, and train its successor to do that task in milliseconds.
So you could hope to have a sequence of nice AIs, each significantly smarter than the last, eventually reaching the limits of technology while still reasoning in a way that humans would endorse if they knew more and thought faster.
(This is the kind of approach I’ve outlined and am working on, and I think that most work along the lines of “learn from human reasoning” will make a similar move.)
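The amplify-then-distill loop sketched above can be made concrete with a toy sketch. Everything here is illustrative: the function names (`amplify`, `distill`, `human`) are placeholders of my own, and the “training” step is a lookup table standing in for actual machine learning; a real system would substitute slow human or AI deliberation and genuine model training.

```python
# Toy sketch of the iterated "slow overseer trains fast successor" loop.
# All names and stubs are illustrative, not part of any real proposal.

def amplify(overseer, task):
    """Slow, trusted reasoning: the overseer spends a long time on the task."""
    return overseer(task)

def distill(training_pairs):
    """Train a fast successor that imitates the slow overseer's answers.
    Here we simply memorize the answers as a stand-in for ML training."""
    table = dict(training_pairs)
    return lambda task: table[task]

def human(task):
    # Stand-in for an hour of careful human work on a task.
    return f"careful answer to {task!r}"

tasks = ["task-0", "task-1", "task-2"]

overseer = human
for generation in range(3):
    # Amplification: the slow overseer produces trusted answers...
    data = [(t, amplify(overseer, t)) for t in tasks]
    # Distillation: ...which train a fast successor; the successor then
    # becomes the (still human-endorsed) overseer for the next round.
    overseer = distill(data)

print(overseer("task-1"))  # a fast lookup of the slow, endorsed answer
```

The point of the sketch is only the shape of the loop: each generation’s answers are grounded in the previous generation’s slower deliberation, so speed compounds while (one hopes) endorsement is preserved at every step.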
FWIW, I don’t think (1) or (2) plays a role in why MIRI researchers work on the research they do, and I don’t think they play a role in why people at MIRI think “learning to reason from humans” isn’t likely to be sufficient. The shape of the “HRAD is more promising than act-based agents” claim is more like what Paul Christiano said here:
As far as I can tell, the MIRI view is that my work is aimed at [a] problem which is not possible, not that it is aimed at a problem which is too easy. [...] One part of this is the disagreement about whether the overall approach I’m taking could possibly work, with my position being “something like 50-50” the MIRI position being “obviously not” [...]
There is a broader disagreement about whether any “easy” approach can work, with my position being “you should try the easy approaches extensively before trying to rally the community behind a crazy hard approach” and the MIRI position apparently being something like “we have basically ruled out the easy approaches, but the argument/evidence is really complicated and subtle.”
With a clarification I made in the same thread:
I think Paul’s characterization is right, except I think Nate wouldn’t say “we’ve ruled out all the prima facie easy approaches,” but rather something like “part of the disagreement here is about which approaches are prima facie ‘easy.’” I think his model says that the proposed alternatives to MIRI’s research directions by and large look more difficult than what MIRI’s trying to do, from a naive traditional CS/Econ standpoint. E.g., I expect the average game theorist would find a utility/objective/reward-centered framework much less weird than a recursive intelligence bootstrapping framework. There are then subtle arguments for why intelligence bootstrapping might turn out to be easy, which Nate and co. are skeptical of, but hashing out the full chain of reasoning for why a daring unconventional approach just might turn out to work anyway requires some complicated extra dialoguing. Part of how this is framed depends on what problem categories get the first-pass “this looks really tricky to pull off” label.
I’m going to try to answer these questions, but there’s some danger that I could be taken as speaking for MIRI or Paul or something, which is not the case :) With that caveat:
I’m glad Rob sketched out his reasoning on why (1) and (2) don’t play a role in MIRI’s thinking. That fits with my understanding of their views.
(1) You might think that “learning to reason from humans” doesn’t accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can’t rule out AI catastrophes with high certainty unless you can “peer inside the machine” so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.
My current take on this is that whatever we do, we’re going to fall pretty far short of proof-strength “extremely high certainty”—the approaches I’m familiar with, including HRAD, are after some mix of:
a) a basic explanation of why an AI system designed a certain way should be expected to be aligned, corrigible, or some mix of these or other similar properties, and
b) theoretical and empirical understanding that makes us think that an actual implementation follows that story robustly and reliably.
HRAD makes different trade-offs than other approaches do, and it does seem to me like successfully-done HRAD would be more likely to be amenable to formal arguments that cover some parts of our confidence gap, but it doesn’t look to me like “HRAD offers proof-level certainty, other approaches offer qualitatively less”.
(2) Produce an AI system that can help create an optimal world… You might think that “learning to reason from humans” doesn’t accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want “if we knew more, thought faster, were more the people we wished we were” etc. then the approval of actual humans might, at some point, cease to be helpful.
It’s true that I’m more focused on “make sure human values keep steering the future” than on the direct goal of “optimize the world”; I think that making sure human values keep steering the future is the best leverage point for creating an optimal world.
My hope is that for some decisions, actual humans (like us) would approve of “make this decision on the basis of something CEV-like—do things we’d approve of if we knew more, thought faster, etc., where those approvals can be predicted with high confidence, don’t pose super-high risk of lock-in to a suboptimal future, converge among different people, etc.” If you and I think this is a good idea, it seems like an AI system trained on us could think this as well.
Another way of thinking about this is that the world is currently largely steered by human values, AI threatens to introduce another powerful steering force, and we’re just making sure that that power is aligned with us at each timestep. A not-great outcome is that we end up with the world humans would have made if AI were not possible in the first place, but we don’t get toward optimality very quickly; a more optimistic outcome is that the additional steering power accelerates us very significantly along the track to an optimal world, steered by human values along the way.
Thanks for linking to that conversation—I hadn’t read all of the comments on that post, and I’m glad I got linked back to it.