Daniel_Dewey comments on My current thoughts on MIRI’s “highly reliable agent design” work

Daniel_Dewey 8 Jul 2017 5:26 UTC
4 points
I’m going to try to answer these questions, but there’s some danger that I could be taken as speaking for MIRI or Paul or something, which is not the case :) With that caveat:

I’m glad Rob sketched out his reasoning on why (1) and (2) don’t play a role in MIRI’s thinking. That fits with my understanding of their views.

(1) You might think that “learning to reason from humans” doesn’t accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can’t rule out AI catastrophes with high certainty unless you can “peer inside the machine” so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.

My current take on this is that whatever we do, we’re going to fall pretty far short of proof-strength “extremely high certainty”—the approaches I’m familiar with, including HRAD, are after some mix of
- a basic explanation of why an AI system designed a certain way should be expected to be aligned, corrigible, or to have some mix of these or other similar properties
- theoretical and empirical understanding that makes us think that an actual implementation follows that story robustly / reliably
HRAD makes different trade-offs than other approaches do, and it does seem to me like successfully-done HRAD would be more likely to be amenable to formal arguments that cover some parts of our confidence gap, but it doesn’t look to me like “HRAD offers proof-level certainty, other approaches offer qualitatively less”.

(2) Produce an AI system that can help create an optimal world… You might think that “learning to reason from humans” doesn’t accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want “if we knew more, thought faster, were more the people we wished we were” etc. then the approval of actual humans might, at some point, cease to be helpful.

It’s true that I’m more focused on “make sure human values keep steering the future” than on the direct goal of “optimize the world”; I think that making sure human values keep steering the future is the best leverage point for creating an optimal world.

My hope is that for some decisions, actual humans (like us) would approve of “make this decision on the basis of something CEV-like—do things we’d approve of if we knew more, thought faster, etc., where those approvals can be predicted with high confidence, don’t pose super-high risk of lock-in to a suboptimal future, converge among different people, etc.” If you and I think this is a good idea, it seems like an AI system trained on us could think this as well.

Another way of thinking about this is that the world is currently largely steered by human values, AI threatens to introduce another powerful steering force, and we’re just making sure that that power is aligned with us at each timestep. A not-great outcome is that we end up with the world humans would have made if AI were not possible in the first place, but we don’t get toward optimality very quickly; a more optimistic outcome is that the additional steering power accelerates us very significantly along the track to an optimal world, steered by human values along the way.