Thanks for taking the time to write this, Ezra—I found it useful.
I just heard about this via a John Green video, and immediately came here to check whether it’d been discussed. Glad to see that it’s been posted—thanks for doing that! (Strong-upvoted, because this is the kind of thing I like to see on the EA forum.)
I don’t have the know-how to evaluate the 100x claim, but it’s huge if true—hopefully if it pops up on the forum like this now and then, especially as more evidence comes in from the organization’s work, we’ll eventually get the right people looking to evaluate this as an opportunity.
I think this is a good point; you may also be interested in Michelle’s post about beneficiary groups, my comment about beneficiary subgroups, and Michelle’s follow-up about finding more effective causes.
Thanks Tobias.
In a hard / unexpected takeoff scenario, it’s more plausible that we need to get everything more or less exactly right to ensure alignment, and that we have only one shot at it. This might favor HRAD because a less principled approach makes it comparatively unlikely that we get all the fundamentals right when we build the first advanced AI system.
FWIW, I’m not ready to cede the “more principled” ground to HRAD at this stage; to me, it seems like the distinction is more about which aspects of an AI system’s behavior we’re specifying manually, and which aspects we’re setting it up to learn. As far as trying to get everything right the first time, I currently favor a corrigibility kind of approach, as I described in 3c above—I’m worried that trying to solve everything formally ahead of time will actually expose us to more risk.
Thanks for these thoughts. (Your second link is broken, FYI.)
On empirical feedback: my current suspicion is that there are some problems where empirical feedback is pretty hard to get, but I actually think we could get more empirical feedback on how well HRAD can be used to diagnose and solve problems in AI systems. For example, it seems like many AI systems implicitly do some amount of logical-uncertainty-type reasoning (e.g. AlphaGo, which is really all about logical uncertainty over the result of expensive game-tree computations). Maybe HRAD could be used to understand how those systems could fail?
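To make that concrete, here’s a toy sketch of my own (not MIRI’s formalism, and not what AlphaGo literally does) of what “logical uncertainty” looks like in code: the quantity below is fully determined by a computation we could run, but before running it exhaustively we treat its value probabilistically based on a cheap sampled estimate, loosely the way a value network stands in for the result of an expensive game-tree search.

```python
# Toy illustration of logical uncertainty (my own sketch, not MIRI's formalism):
# we're "uncertain" about a perfectly determined quantity simply because we
# haven't paid the cost of computing it yet.
import math
import random
import statistics

def f(i: int) -> float:
    """Some deterministic but nontrivial per-term computation."""
    return math.sin(i) ** 2

N = 200_000  # the exact sum over all N terms is the expensive "logical fact"

# Cheap estimate: sample a small subset of terms and scale up, reporting a
# standard error that expresses our uncertainty about the exact answer.
random.seed(0)
sample = [f(random.randrange(N)) for _ in range(2_000)]
estimate = N * statistics.mean(sample)
std_err = N * statistics.stdev(sample) / math.sqrt(len(sample))
print(f"estimate: {estimate:,.0f} +/- {std_err:,.0f}")

# The expensive exact computation we were "logically uncertain" about.
exact = sum(f(i) for i in range(N))
print(f"exact:    {exact:,.0f}")
```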
I’m less convinced that the “ignored physical aspect of computation” is a very promising direction to follow, but I may not fully understand the position you’re arguing for.
My guess is that the capability is extremely likely, and the main difficulties are motivation and reliability of learning (since in other learning tasks we might be satisfied with lower reliability that gets better over time, but in learning human preferences unreliable learning could result in a lot more harm).
Thanks for this suggestion, Kaj—I think it’s an interesting comparison!
I am very bullish on the Far Future EA Fund, and donate there myself. There’s one other possible nonprofit that I’ll publicize in the future if it gets to the stage where it can use donations (I don’t want to hype this up as an uber-solution, just a nonprofit that I think could be promising).
I unfortunately don’t spend a lot of time thinking about individual donation opportunities, and the things I think are most promising often get partly funded through Open Phil (e.g. CHAI and FHI), but I think diversifying the funding source for orgs like CHAI and FHI is valuable, so I’d consider them as well.
I think there’s something to this—thanks.
To add onto Jacob and Paul’s comments: I think that while HRAD is more mature in the sense that more work has gone into solving HRAD problems and critiquing possible solutions, the gap seems much smaller to me when it comes to the justification for thinking HRAD is promising vs. the justification for thinking Paul’s approach is promising. In fact, I think the arguments for Paul’s work being promising are more solid than those for HRAD, despite Paul being essentially the only person making them; I’ve had a much harder time understanding anything more nuanced than the basic case for HRAD I gave above, and a much easier time understanding why Paul thinks his approach is promising.
My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you’re making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When I speak of foundational understanding making it easy to design the right systems, I’m trying to point at things like the usefulness of Bayesian justifications in modern ML. (I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems.)
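As a concrete instance of the kind of “Bayesian justification” I have in mind, here’s a small numerical check (my own illustration, not anything from MIRI or Open Phil): ridge-regularized least squares is exactly the MAP estimate of a linear model with Gaussian noise and a Gaussian prior on the weights, with the penalty determined by the noise-to-prior variance ratio.

```python
# Minimal check that ridge regression == MAP estimation under a Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.3, size=n)

sigma2 = 0.3 ** 2    # assumed observation-noise variance
tau2 = 1.0           # assumed prior variance on each weight
lam = sigma2 / tau2  # the ridge penalty this prior implies

# Ridge (penalized least squares) closed form: (X^T X + lam*I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: maximize log N(y | Xw, sigma2*I) + log N(w | 0, tau2*I),
# whose closed form is (X^T X / sigma2 + I / tau2)^{-1} X^T y / sigma2.
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(d) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))  # True: same estimator, two justifications
```

The identity itself isn’t the point; the point is that the probabilistic story tells you what the penalty means and when to trust it, and that’s the sort of role I’d hope a good formal account of reasoning under logical uncertainty could eventually play.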
Just planting a flag to say that I’m thinking more about this so that I can respond well.
Thanks Nate!
The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”
This is particularly helpful to know.
We worry about “unknown unknowns”, but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.
Can you give an example or two of failure modes or “categories of failure modes that are easy to foresee” that you think are addressed by some HRAD topic? I’d thought previously that thinking in terms of failure modes wasn’t a good way to understand HRAD research.
As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don’t understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.
I’m confused by this as a follow-up to the previous paragraph. This doesn’t look like an example of “focusing on categories of failure modes that are easy to foresee,” it looks like a case where you’re explicitly not using concrete failure modes to decide what to work on.
“how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?”
I feel like this fits with the “not about concrete failure modes” narrative that I believed before reading your comment, FWIW.
Thanks!
I’m going to try to answer these questions, but there’s some danger that I could be taken as speaking for MIRI or Paul or something, which is not the case :) With that caveat:
I’m glad Rob sketched out his reasoning on why (1) and (2) don’t play a role in MIRI’s thinking. That fits with my understanding of their views.
(1) You might think that “learning to reason from humans” doesn’t accomplish (1) because a) logic and mathematics seem to be the only methods we have for stating things with extremely high certainty, and b) you probably can’t rule out AI catastrophes with high certainty unless you can “peer inside the machine” so to speak. HRAD might allow you to peer inside the machine and make statements about what the machine will do with extremely high certainty.
My current take on this is that whatever we do, we’re going to fall pretty far short of proof-strength “extremely high certainty”. The approaches I’m familiar with, including HRAD, are after some mix of:
a) a basic explanation of why an AI system designed a certain way should be expected to be aligned, corrigible, or to have some mix of those or other similar properties, and
b) theoretical and empirical understanding that makes us think an actual implementation follows that story robustly / reliably.
HRAD makes different trade-offs than other approaches do, and it does seem to me like successfully-done HRAD would be more likely to be amenable to formal arguments that cover some parts of our confidence gap, but it doesn’t look to me like “HRAD offers proof-level certainty, other approaches offer qualitatively less”.
(2) Produce an AI system that can help create an optimal world… You might think that “learning to reason from humans” doesn’t accomplish (2) because it makes the AI human-limited. If we want an advanced AI to help us create the kind of world that humans would want “if we knew more, thought faster, were more the people we wished we were” etc. then the approval of actual humans might, at some point, cease to be helpful.
It’s true that I’m more focused on “make sure human values keep steering the future” than on the direct goal of “optimize the world”; I think that making sure human values keep steering the future is the best leverage point for creating an optimal world.
My hope is that for some decisions, actual humans (like us) would approve of “make this decision on the basis of something CEV-like—do things we’d approve of if we knew more, thought faster, etc., where those approvals can be predicted with high confidence, don’t pose super-high risk of lock-in to a suboptimal future, converge among different people, etc.” If you and I think this is a good idea, it seems like an AI system trained on us could think this as well.
Another way of thinking about this is that the world is currently largely steered by human values, AI threatens to introduce another powerful steering force, and we’re just making sure that that power is aligned with us at each timestep. A not-great outcome is that we end up with the world humans would have made if AI were not possible in the first place, but we don’t get toward optimality very quickly; a more optimistic outcome is that the additional steering power accelerates us very significantly along the track to an optimal world, steered by human values along the way.
Thanks for linking to that conversation—I hadn’t read all of the comments on that post, and I’m glad I got linked back to it.
Thanks!
Conditional on MIRI’s view that a hard or unexpected takeoff is likely, HRAD is more promising (though it’s still unclear).
Do you mean more promising than other technical safety research (e.g. concrete problems, Paul’s directions, MIRI’s non-HRAD research)? If so, I’d be interested in hearing why you think hard / unexpected takeoff differentially favors HRAD.
Thanks Tara! I’d like to do more writing of this kind, and I’m thinking about how to prioritize it. It’s useful to hear that you’d be excited about those topics in particular.
Thanks Kerry, Benito! Glad you found it helpful.
Welcome! :)
I think your argument totally makes sense, and you’re obviously free to use your best judgement to figure out how to do as much good as possible. However, a couple of other considerations seem important, especially for things like what a “true effective altruist” would do.
1) One factor of your impact is your ability to stick with your giving; this could give you a reason to adopt something less scary and demanding. By analogy, it might seem best for fitness to commit to intense workouts 5 days a week, strict diet changes, and no alcohol, but in practice trying to do this may result in burning out and not doing anything for your fitness, while a less-demanding plan might be easier to stick with and result in better fitness over the length of your life.
Personally, the prospect of giving up retirement doesn’t seem too demanding; I like working, and retirement is so far away that it’s hard to take seriously. However, I’d understand if others didn’t feel this way, and I wouldn’t want to push them into a commitment they won’t be able to keep.
2) Another factor of your impact is the other people you influence who may start giving, and would not have done so without your example—in fact, it doesn’t seem implausible that this could make up the majority of your impact over your life. To the extent that giving is a really significant cost for people, it’s harder to spread the idea (e.g. many more people are vegetarian than vegan [citation needed]), and asking people to give up major parts of their life story like retirement (or a wedding, or occasional luxuries, or Christmas gifts for their families, etc.) comes with real costs that could be measured in dollars (with lots of uncertainty). More broadly, the norms that we establish as a community affect the growth of the community, which directly affects total giving—if people see us as a super-hardcore group that requires great sacrifice, I just expect less money to be given.
For these reasons, I prefer to follow and encourage norms that say something like “Hey, guess what—you can help other people a huge amount without sacrificing anything huge! Your life can be just as you thought it would be, and also help other people a lot!” I actually expect these norms to have better consequences in terms of helping people than stricter norms (like “don’t retire”) do, mostly for reasons 1 and 2.
There’s still a lot of discussion on these topics, and I could imagine finding out that I’m wrong—for example, I’ve heard that there’s evidence of more demanding religions being more successful at creating a sense of community and therefore being more satisfying and attractive. However, my best guess is that “don’t retire” is too demanding.
(I looked for an article saying something like this but better to link to, but I didn’t quickly find one—if anyone knows where one is, feel free to link!)
I was going to “heart” this, but that seemed ambiguous. So I’m just commenting to say, I hear you.