So8res

Karma: 3,505

So8res 7 Feb 2023 21:52 UTC
99 points
22 ∶ 0
in reply to: Holly_Elmore’s comment on: A personal reflection on SBF
Do you think EA’s self-reflection about this is at all productive, considering most people had even less information than you?
I don’t have terribly organized thoughts about this. (And I am still not paying all that much attention—I have much more patience for picking apart my own reasoning processes looking for ways to improve them, than I have for reading other people’s raw takes :-p)
But here’s some unorganized and half-baked notes:
I appreciated various expressions of emotion. Especially when they came labeled as such.
I think there was also a bunch of other stuff going on in the undertones that I don’t have a good handle on yet, and that I’m not sure about my take on. Stuff like… various people implicitly shopping around proposals about how to readjust various EA-internal political forces, in light of the turmoil? But that’s not a great handle for it, and I’m not terribly articulate about it.
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
My impression is that various EAs respond to crises in a manner that kinda rhymes with saying “I wish I had bet 23”, or at best “I wish I had noticed this bet was negative EV”, and in particular does not rhyme with saying “my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house”.
(I think this is also true of the general population, to be clear. Perhaps even more so.)
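To make the roulette example above concrete, here is a minimal sketch of the expected-value calculation for a single-number bet. It assumes the standard 35-to-1 payout and either a 37-pocket (European) or 38-pocket (American) wheel; the comment itself does not say which wheel is meant, and the function name is purely illustrative.

```python
from fractions import Fraction

def single_number_ev(pockets: int, payout: int = 35) -> Fraction:
    """Expected net return per unit staked on one number (assumed 35:1 payout)."""
    p_win = Fraction(1, pockets)
    # Win `payout` units with probability p_win, lose the 1-unit stake otherwise.
    return p_win * payout - (1 - p_win)

for name, pockets in [("European, 37 pockets", 37), ("American, 38 pockets", 38)]:
    ev = single_number_ev(pockets)
    print(f"{name}: EV per unit staked = {ev} (about {float(ev):.4f})")
```

Note that the function never takes the chosen number as an argument: under these assumptions a bet on 32 and a bet on 23 have the same negative EV, which is why "I should have bet 23" is not a lesson that transfers to the next spin.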
I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)
(Though I haven’t gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)
My guess is that most EAs didn’t make mental missteps of any import.
And, of course, most folk on this forum aren’t rushing to self-flagellate. Lots of people who didn’t make any mistake, aren’t saying anything about their non-mistakes, as seems entirely reasonable.
I think the scrupulous might be quick to object that, like, they had some flicker of unease about EA being over-invested in crypto, that they should have expounded upon. And so surely they, too, erred.
And, sure, they’d’ve gotten more coolness points if they’d joined the ranks of people who aired that concern in advance.
And there is, I think, a healthy chain of thought from there to the hypothesis that the community needs better mechanisms for incentivizing and aggregating distributed knowledge.
(For instance: some people did air that particular concern in advance, and it didn’t do much. There’s perhaps something to be said for the power that a thousand voices would have had when ten didn’t suffice, but an easier fix than finding 990 voices is probably finding some other way to successfully heed the 10, which requires distinguishing them from the background noise—and distinguishing them as something actionable—before it’s too late, and then routing the requisite action to the people who can do something about it. etc.)
I hope that some version of this conversation is happening somewhere, and it seems vaguely plausible that there’s a variant happening behind closed doors at CEA or something.
I think that maybe a healthier form of community reflection would have gotten to a public and collaborative version of that discussion by now. Maybe we’ll still get there.
(I caveat, though, that it seems to me that many good things die from the weight of the policies they adopt in attempts to win the last war, with a particularly egregious example that springs to mind being the TSA. But that’s getting too much into the object-level weeds.)
(I also caveat that I in fact know a pair of modestly-high-net-worth EA friends who agreed, years ago, that the community was overexposed to crypto, and that at most one of them should be exposed to crypto. The timing of this thought is such that the one who took the non-crypto fork is now significantly less comparatively wealthy. This stuff is hard to get right in real life.)
(And I also caveat that I’m not advocating design-by-community-committee when it comes to community coordination mechanisms. I think that design-by-committee often fails. I also think there’s all sorts of reasons why public attempts to discuss such things can go off the rails. Trying to have smaller conversations, or in-person conversations, seems eminently reasonable to me.)
I think that another thing that’s been going on is that there are various rumors around that “EA leaders” knew something about all this in advance, and this has caused a variety of people to feel (justly) perturbed and uneasy.
Insofar as someone’s thinking is influenced by a person with status in their community, I think it’s fair to ask what they knew and when, as is relevant to the question of whether and how to trust them in the future.
And insofar as other people are operating the de-facto community coordination mechanisms, I think it’s also fair to ask what they knew and when, as is relevant to the question of how (as a community) to fix or change or add or replace some coordination mechanisms.
I don’t particularly have a sense that the public EA discourse around FTX stuff was headed in a healthy and productive direction.
It’s plausible to me that there are healthy and productive processes going on behind closed doors, among the people who operate the de-facto community coordination mechanisms.
Separately, it kinda feels to me like there’s this weird veil draped over everything, where there’s rumors that EA-leader-ish folk knew some stuff but nobody in that reference class is just, like, coming clean.
This post is, in part, an attempt to just pierce the damn veil (at least insofar as I personally can, as somebody who’s at least EA-leader-adjacent).
I can at least show some degree to which the rumors were true (I run an EA org, and Alameda did start out in the offices downstairs from ours, and I was privy to a bunch more data than others) versus false (I know of no suspicion that Sam was defrauding customers, nor have I heard any hint of any coverup).
One hope I have is that this will spark some sort of productive conversation.
For instance, my current hypothesis is that we’d do well to look for better community mechanisms for aggregating hints and acting on them. (Where I’m having trouble visualizing ways of doing it that don’t also get totally blindsided by the next crisis, when it turns out that the next war is not exactly the same as the last one. But this, again, is getting more into the object-level.)
Regardless of whether that theory is right, it’s at least easier to discuss in light of a bunch of the raw facts. Whether or not everybody was completely blindsided, vs whether we had a bunch of hints that we failed to assemble, vs whether there was a fraudulent conspiracy we tried to cover up, matters quite a bit as to how we should react!
It’s plausible to me that a big part of the reason why the discussion hasn’t yet produced Nate!legible fruit, is because it just wasn’t working with all that many details. This post is intended in part to be a contribution towards that end.
(Though I of course also entertain the hypotheses that there’s all sorts of different forces pushing the conversation off the rails (such that this post won’t help much), and the hypothesis that the conversation is happening just fine behind closed doors somewhere (such that this post isn’t all that necessary).)
(And I note, again, that insofar as this post does help the convo, Rob Bensinger gets a share of the credit. I was happy to shelve this post indefinitely, and wouldn’t have dug it out of my drafts folder if he hadn’t argued that it had a chance of rerailing the conversation.)

So8res 8 Jul 2017 21:10 UTC
28 points
0 ∶ 0
on: My current thoughts on MIRI’s “highly reliable agent design” work
Thanks for this solid summary of your views, Daniel. For others’ benefit: MIRI and Open Philanthropy Project staff are in ongoing discussion about various points in this document, among other topics. Hopefully some portion of those conversations will be made public at a later date. In the meantime, a few quick public responses to some of the points above:

2) If we fundamentally “don’t know what we’re doing” because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.

3) Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.

I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
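
For readers unfamiliar with the string example, here is a rough sketch of the difference between the two layouts. It mimics them with Python `bytes` rather than showing real C code; the 4-byte big-endian prefix and all function names are assumptions chosen for illustration, not how any particular C library stores strings.

```python
import struct

def read_null_terminated(buf: bytes) -> bytes:
    """Return the bytes before the first NUL; the reader must scan to find the end."""
    end = buf.find(b"\x00")
    if end == -1:
        # A C-style scan would keep reading past the buffer here; we raise instead.
        raise ValueError("no terminator found inside the buffer")
    return buf[:end]

def write_length_prefixed(payload: bytes) -> bytes:
    """Store a 4-byte big-endian length (an assumed format), then the payload."""
    return struct.pack(">I", len(payload)) + payload

def read_length_prefixed(buf: bytes) -> bytes:
    """Read the declared length and refuse lengths that exceed the buffer."""
    if len(buf) < 4:
        raise ValueError("buffer too short for a length prefix")
    (length,) = struct.unpack(">I", buf[:4])
    if length > len(buf) - 4:
        raise ValueError("declared length exceeds the buffer")
    return buf[4 : 4 + length]

payload = b"abc\x00def"  # data that happens to contain a NUL byte
print(read_null_terminated(payload + b"\x00"))               # stops at the embedded NUL
print(read_length_prefixed(write_length_prefixed(payload)))  # full payload survives
```

With a terminator, the end of the data is something every reader has to discover by scanning, and a missing or misplaced terminator silently changes what is read. With a length prefix, the bound is explicit and can be checked before any data is touched.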

I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”

This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding “unknown unknown” problems.

We worry about “unknown unknowns”, but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.

As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don’t understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.

A big intuition behind prioritizing HRAD is that solutions to “how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?” are likely to be particularly difficult to hack together from scratch late in development. An incomplete (empirical-side-only) understanding of what it means to optimize objectives in realistic environments seems like it will force designers to rely more on guesswork and trial-and-error in a lot of key design decisions.

I haven’t found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they’d otherwise face.

This seems reasonable to me in general. I’d say that AIXI has had limited influence in part because it’s combining several different theoretical insights that the field was already using (e.g., complexity penalties and backtracking tree search), and the synthesis doesn’t add all that much once you know about the parts. Sections 3 and 4 of MIRI’s Approach provide some clearer examples of what I have in mind by useful basic theory: Shannon, Turing, Bayes, etc.

My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you’re making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When I speak of foundational understanding making it easy to design the right systems, I’m trying to point at things like the usefulness of Bayesian justifications in modern ML. (I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems.)

So8res 20 Feb 2016 3:32 UTC
21 points
0 ∶ 0
on: Let’s conduct a survey on the quality of MIRI’s implementation
Thanks for the write-up, Rob. OpenPhil actually decided to evaluate our technical agenda last summer, and Holden put Daniel Dewey on the job. The report isn’t done yet, in part because it has proven very time-intensive to fully communicate the reasoning behind our research priorities, even to someone with as much understanding of the AI landscape as Daniel Dewey. Separately, we have plans to get an independent evaluation of our organizational efficacy started later in 2016, which I expect to be useful for our admin team as well as prospective donors.

FYI, when it comes to evaluating our research progress, I doubt that the methods you propose would get you much Bayesian evidence. Our published output will look like round pegs shoved into square holes regardless of whether we’re doing our jobs well or poorly, because we’re doing research that doesn’t fit neatly into an existing academic niche. Our objective is to make direct progress on what appear to us to be the main neglected technical obstacles to developing reliable AI systems in the long term, with a goal of shifting the direction of AI research in a big way once we hit certain key research targets; and we’re specifically targeting research that isn’t compatible with industry’s economic incentives or academia’s publish-or-perish incentives. To get information about how well we’re doing our jobs, I think the key questions to investigate are (1) whether we’ve chosen good research targets; and (2) whether we’re making good progress towards them.

We’ve been focusing our communication efforts mainly on helping people evaluate (1): I’ve been working on explaining our approach and agenda, and OpenPhil is also on the job. To investigate (2), we’d need to spend a sizable chunk of time with mathematically adept evaluators — we still haven’t hit any of our key research targets, which means that evaluating our progress requires understanding our smaller results and why we think they’re progress towards the big results. In practice, we’ve found that explaining this usually requires explaining why we think the big targets are vital, as this informs (e.g.) which shortcuts are and are not acceptable. I plan to wait until after the OpenPhil report is finished before taking on another time-intensive eval.

Fortunately, (2) will become much easier to evaluate as we achieve (or persistently fail to achieve) those key targets. This also provides us with an opportunity to test our approach and methodology. People who understand our approach and find it uncompelling often predict that some of the results we’re shooting for cannot be achieved. This means we’ll get some evidence about (1) as we learn more about (2). For example, last year I mentioned “naturalized AIXI” as an ambitious 5-year research target. If we are not able to make concrete progress towards that goal, then over the next four years, I will lose confidence in our approach and eventually change our course dramatically. Conversely, if we make discoveries that are important pieces of that puzzle, I’ll update in favor of us being onto something, especially if we find puzzle pieces that knowledgeable critics predicted we wouldn’t find. This data will hopefully start rolling in soon, now that our research team is getting up to size.

(“Concrete progress” / “important puzzle pieces” in this case are satisfactory asymptotic algorithms for any of: (1) reasoning under logical uncertainty; (2) identifying the best available decision with respect to a utility function; (3) performing induction from inside an environment; (4) identifying the referents of goals in realistic world-models; and (5) reasoning about the behavior of smarter reasoners; the last of which is hopefully a subset of 1 and 2. The linked papers give rough descriptions of what counts as ‘satisfactory’ in each case; I’ll work to make the desiderata more explicit as time goes on.)

So8res 7 Feb 2023 19:18 UTC
17 points
1 ∶ 0
in reply to: vaniver’s comment on: A personal reflection on SBF
Good point! Currently, I think the “pry more” lesson is supposed to account for a bunch of this.
Since making this update, I have in fact pried more into friends’ lives. In at least one instance I found some stuff that worried me, at which point I was naturally like “hey, this worries me; it pattern-matches to some bad situations I’ve seen; I feel wary and protective; I request an opportunity to share and/or put you in touch with people who’ve been through putatively-analogous situations (though I can also stfu if you’re sick of hearing people’s triggered takes about your life situation.)” And, as far as I can tell, that was a useful/helpful thing to have done in that situation (and didn’t involve any changes to my drama policy).
That said, that situation wasn’t one where the right move involved causing ripples in the community (e.g. by publicly airing concerns about what went down at Alameda). If we fight the last war again, my hope is that I’d be doing stuff more like “prod others to action”, or perhaps “plainly state aloud what I think” (as I’m doing now with this post).
There is something about this post that feels very “neutral tone” to me, and that makes it feel at home within my “don’t give drama the attention it needs to breathe” policy. I think my ability to have good effects by prying more & prodding more & having more backbone doesn’t require changes to my drama policy. (And perhaps the policy has shifted around somewhat, without me noticing it? Doesn’t feel like it, though.)
I am of course open to arguments that I’m failing to learn a lesson about my drama policy.

So8res 20 Aug 2015 22:48 UTC
11 points
0 ∶ 0
in reply to: Benjamin_Todd’s comment on: Peter Hurford thinks that a large proportion of people should earn to give long term
I want to push back a bit against point #1 (“Let’s divide problems into ‘funding constrained’ and ‘talent constrained’.”) In my experience recruiting for MIRI, these constraints are tightly intertwined. To hire talent, you need money (and to get money, you often need results, which requires talent).

I think the “are they funding constrained or talent constrained?” model is incorrect, and potentially harmful. In the case of MIRI, imagine we’re trying to hire a world-class researcher for $50k/year, and can’t find one. Are we talent constrained, or funding constrained? (Our actual researcher salaries are higher than this, but they weren’t last year, and they still aren’t anywhere near competitive with industry rates.)

Furthermore, there are all sorts of things I could be doing to loosen the talent bottleneck, but only if I knew the money was going to be there. I could be setting up a researcher stewardship program, having seminars run at Berkeley and Stanford, and hiring dedicated recruiting-focused researchers who know the technical work very well and spend a lot of time practicing getting people excited—but I can only do this if I know we’re going to have the money to sustain that program alongside our core research team, and if I know we’re going to have the money to make hires. If we reliably bring in only enough funding to sustain modest growth, I’m going to have a very hard time breaking the talent constraint.

And that’s ignoring the opportunity costs of being under-funded, which I think are substantial. For example, at MIRI there are numerous additional programs we could be setting up, such as a visiting professor + postdoc program, or a separate team that is dedicated to working closely with all the major industry leaders, or a dedicated team that’s taking a different research approach, or any number of other projects that I’d be able to start if I knew the funding would appear. All those things would lead to new and different job openings, letting us draw from a wider pool of talented people (rather than the hyper-narrow pool we currently draw from), and so this too would loosen the talent constraint—but again, only if the funding was there.

Right now, we have more trouble finding top-notch math talent excited about our approach to technical AI alignment problems than we have raising money, but don’t let this fool you—the talent constraint would be much, much easier to address with more money, and there are many things we aren’t doing (for lack of funding) that I think would be high impact.
What links here?
- sapphire's comment on Simultaneous Shortage and Oversupply by Jeff Kaufman (26 Jan 2019 20:46 UTC; 18 points)

So8res 11 Jun 2015 23:08 UTC
11 points
0 ∶ 0
in reply to: Buck’s comment on: I am Nate Soares, AMA!
That post mixes a bunch of different assertions together; let me try to distill a few of them out and answer them in turn:

One of Peter’s first (implicit) points is that AI alignment is a speculative cause. I tend to disagree.

Imagine it’s 1942. The Manhattan project is well under way, Leo Szilard has shown that it’s possible to get a neutron chain reaction, and physicists are hard at work figuring out how to make an atom bomb. You suggest that this might be a fine time to start working on nuclear containment, so that, once humans are done bombing the everloving breath out of each other, they can harness nuclear energy for fun and profit. In this scenario, would nuclear containment be a “speculative cause”?

There are currently thousands of person-hours and billions of dollars going towards increasing AI capabilities every year. To call AI alignment a “speculative cause” in an environment such as this one seems fairly silly to me. In what sense is it speculative to work on improving the safety of the tools that other people are currently building as fast as they can? Now, I suppose you could argue that either (a) AI will never work or (b) it will be safe by default, but both those arguments seem pretty flimsy to me.

You might argue that it’s a bit weird for people to claim that the most effective place to put charitable dollars is towards some field of scientific study. Aren’t charitable dollars supposed to go to starving children? Isn’t the NSF supposed to handle scientific funding? And I’d like to agree, but society has kinda been dropping the ball on this one.

If we had strong reason to believe that humans could build strangelets, and society were pouring billions of dollars and thousands of human-years into making strangelets, and almost no money or effort was going towards strangelet containment, and it looked like humanity was likely to create a strangelet sometime in the next hundred years, then yeah, I’d say that “strangelet safety” would be an extremely worthy cause.

How worthy? Hard to say. I agree with Peter that it’s hard to figure out how to trade off “safety of potentially-very-highly-impactful technology that is currently under furious development” against “children are dying of malaria”, but the only way I know how to trade those things off is to do my best to run the numbers, and my back-of-the-envelope calculations currently say that AI alignment is further behind than the globe is poor.

Now that the EA movement is starting to look more seriously into high-impact interventions on the frontiers of science & mathematics, we’re going to need to come up with more sophisticated ways to assess the impacts and tradeoffs. I agree it’s hard, but I don’t think throwing out everything that doesn’t visibly pay off in the extremely short term is the answer.

Alternatively, you could argue that MIRI’s approach is unlikely to work. That’s one of Peter’s explicit arguments: it’s very hard to find interventions that reliably affect the future far in advance, especially when there aren’t hard objective metrics. I have three disagreements with Peter on this point.

First, I think he picks the wrong reference class: yes, humans have a really hard time generating big social shifts on purpose. But that doesn’t necessarily mean humans have a really hard time generating math—in fact, humans have a surprisingly good track record when it comes to generating math!

Humans actually seem to be pretty good at putting theoretical foundations underneath various fields when they try, and various people have demonstrably succeeded at this task (Church & Turing did this for computing, Shannon did this for information theory, Kolmogorov did a fair bit of this for probability theory, etc.). This suggests to me that humans are much better at producing technical progress in an unexplored field than they are at generating social outcomes in a complex economic environment. (I’d be interested in any attempt to quantitatively evaluate this claim.)

Second, I agree in general that any one individual team isn’t all that likely to solve the AI alignment problem on their own. But the correct response to that isn’t “stop funding AI alignment teams”—it’s “fund more AI alignment teams”! If you’re trying to ensure that nuclear power can be harnessed for the betterment of humankind, and you assign low odds to any particular research group solving the containment problem, then the answer isn’t “don’t fund any containment groups at all,” the answer is “you’d better fund a few different containment groups, then!”

Third, I object to the whole “there’s no feedback” claim. Did Kolmogorov have tight feedback when he was developing an early formalization of probability theory? It seems to me like the answer is “yes”—figuring out what was & wasn’t a mathematical model of the properties he was trying to capture served as a very tight feedback loop (mathematical theorems tend to be unambiguous), and indeed, it was sufficiently good feedback that Kolmogorov was successful in putting formal foundations underneath probability theory. We’re trying to do something similar with various other confusing aspects of good reasoning (such as logical uncertainty), and you’re welcome to raise concerns about whether we need to understand good reasoning under logical uncertainty in order to build an aligned AI, but saying that there’s “no feedback loop” seems to just misunderstand the approach.

So8res 29 Oct 2016 19:00 UTC
10 points
0 ∶ 0
in reply to: Gregory Lewis’s comment on: MIRI Update and Fundraising Case
Under whatever constraints Open Phil provided, I’d have sent the ‘best by academic lights’ papers I had.

We originally sent Nick Beckstead what we considered our four most important 2015 results, at his request; these were (1) the incompatibility of the “Inductive Coherence” framework and the “Asymptotic Convergence in Online Learning with Unbounded Delays” framework; (2) the demonstration in “Proof-Producing Reflection for HOL” that a non-pathological form of self-referential reasoning is possible in a certain class of theorem-provers; (3) the reflective oracles result presented in “A Formal Solution to the Grain of Truth Problem,” “Reflective Variants of Solomonoff Induction and AIXI,” and “Reflective Oracles”; and (4) Vadim Kosoy’s “Optimal Predictors” work. The papers we listed under 1, 2, and 4 then got used in an external review process they probably weren’t very well-suited for.

I think this was more or less just an honest miscommunication. I told Nick in advance that I only assigned an 8% probability to external reviewers thinking the “Asymptotic Convergence…” result was “good” on its own (and only a 20% probability for “Inductive Coherence”). My impression of what happened is that Open Phil staff interpreted my pushback as saying that I thought the external reviews wouldn’t carry much Bayesian evidence (but that the internal reviews still would), where what I was trying to communicate was that I thought the papers didn’t carry very much Bayesian evidence about our technical output (and that I thought the internal reviewers would need to speak to us about technical specifics in order to understand why we thought they were important). Thus, we were surprised when their grant decision and write-up put significant weight on the internal reviews of those papers (and they were surprised that we were surprised). This is obviously really unfortunate, and another good sign that I should have committed more time and care to clearly communicating my thinking from the outset.

Regarding picking better papers for external review: We only put out 10 papers directly related to our technical agendas between Jan 2015 and Mar 2016, so the option space is pretty limited, especially given the multiple constraints Open Phil wanted to meet. Optimizing for technical impressiveness and non-obviousness as a stand-alone result, I might have instead gone with Critch’s bounded Löb paper and the grain of truth problem paper over the AC/IC results. We did submit the grain of truth problem paper to Open Phil, but they decided not to review it because it didn’t meet other criteria they were interested in.

If MIRI is unable to convince someone like Dewey, the prospects of it making the necessary collaborations or partnerships with the wider AI community look grim.

I’m less pessimistic about building collaborations and partnerships, in part because we’re already on pretty good terms with other folks in the community, and in part because I think we have different models of how technical ideas spread. Regardless, I expect that with more and better communication, we can (upon re-evaluation) raise the probability that Open Phil staff assign to the work we’re doing being important.

More generally, though, I expect this task to get easier over time as we get better at communicating about our research. There’s already a body of AI alignment research (and, perhaps, methodology) that requires the equivalent of multiple university courses to understand, but there aren’t curricula or textbooks for teaching it. If we can convince a small pool of researchers to care about the research problems we think are important, this will let us bootstrap to the point where we have more resources for communicating information that requires a lot of background and sustained scholarship, as well as more of the institutional signals that this stuff warrants a time investment.

I can maybe make the time expenditure thus far less mysterious if I mention a couple more ways I erred in trying to communicate my model of MIRI’s research agenda:
1. My early discussion with Daniel was framed around questions like “What specific failure mode do you expect to be exhibited by advanced AI systems iff their programmers don’t understand logical uncertainty?” I made the mistake of attempting to give straight/non-evasive answers to those sorts of questions and let the discussion focus on that evaluation criterion, rather than promptly saying “MIRI’s research directions mostly aren’t chosen to directly address a specific failure mode in a notional software system” and “I don’t think that’s a good heuristic for identifying research that’s likely to be relevant to long-run AI safety.”
2. I fell prey to the transparency illusion pretty hard, and that was completely my fault. Mid-way through the process, Daniel made a write-up of what he had gathered so far; this write-up revealed a large number of miscommunications and places where I thought I had transmitted a concept of mine but Daniel had come away with a very different concept. It’s clear in retrospect that we should have spent a lot more time with me having Daniel try to explain what he thought I meant, and I had all the tools to predict this in foresight; but I foolishly assumed that wouldn’t be necessary in this case.
(I plan to blog more about the details of these later.)

I think these are important mistakes that show I hadn’t sufficiently clarified several concepts in my own head, or spent enough time understanding Daniel’s position. My hope is that I can do a much better job of avoiding these sorts of failures in the next round of discussion, now that I have a better model of where Open Phil’s staff and advisors are coming from and what the review process looks like.

(I am correct in that Yuan previously worked for you, right?)

Yeah, though that was before my time. He did an unpaid internship with us in the summer of 2013, and we’ve occasionally contracted him to tutor MIRI staff. Qiaochu’s also a lot socially closer to MIRI; he attended three of our early research workshops.

Unless and until then, I remain sceptical about MIRI’s value.

I think that’s a reasonable stance to take, and that there are other possible reasonable stances here too. Some of the variables I expect EAs to vary on include “level of starting confidence in MIRI’s mathematical intuitions about complicated formal questions” and “general risk tolerance.” A relatively risk-intolerant donor is right to wait until we have clearer demonstrations of success; and a relatively risk-tolerant donor who starts without a very high confidence in MIRI’s intuitions about formal systems might be pushed under a donation threshold by learning that an important disagreement has opened up between us and Daniel Dewey (or between us and other people at Open Phil).

Also, thanks for laying out your thinking in so much detail—I suspect there are other people who had more or less the same reaction to Open Phil’s grant write-up but haven’t spoken up about it. I’d be happy to talk more about this over email, too, including answering Qs from anyone else who wants more of my thoughts on this.

So8res 13 Oct 2016 0:03 UTC
10 points
0 ∶ 0
in reply to: Marylen’s comment on: Ask MIRI Anything (AMA)
In short: there’s a big difference between building a system that follows the letter of the law (but not the spirit), and a system that follows the intent behind a large body of law. I agree that the legal system is a large corpus of data containing information about human values and how humans currently want their civilization organized. In order to use that corpus, we need to be able to design systems that reliably act as intended, and I’m not sure how the legal corpus helps with that technical problem (aside from providing lots of training data, which I agree is useful).

In colloquial terms, MIRI is more focused on questions like “if we had a big corpus of information about human values, how could we design a system to learn from that corpus how to act as intended”, and less focused on the lack of a corpus.

The reason that we have to work on corrigibility ourselves is that we need advanced learning systems to be corrigible before they’ve finished learning how to behave correctly from a large training corpus. In other words, there are lots of different training corpuses and goal systems where, if the system is fully trained and working correctly, we get corrigibility for free; the difficult part is getting the system to behave corrigibly before it’s smart enough to be doing corrigibility for the “right reasons”.

So8res 12 Oct 2016 20:25 UTC
10 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Ask MIRI Anything (AMA)
I don’t think of our strategy as having changed much in the last year. For example, in the last AMA I said that the plan was to work on some big open problems (I named 5 here: asymptotically good reasoning under logical uncertainty, identifying the best available decision with respect to a predictive world-model and utility function, performing induction from inside an environment, identifying the referents of goals in realistic world-models, and reasoning about the behavior of smarter reasoners), and that I’d be thrilled if we could make serious progress on any of these problems within 5 years. Scott Garrabrant then promptly developed logical induction, which represents serious progress on two (maybe three) of the big open problems. I consider this to be a good sign of progress, and that set of research priorities remains largely unchanged.

Jessica Taylor is now leading a new research program, and we’re splitting our research time between this agenda and our 2014 agenda. I see this as a natural consequence of us bringing on new researchers with their own perspectives on various alignment problems, rather than as a shift in organizational strategy. Eliezer, Benya, and I drafted the agent foundations agenda when we were MIRI’s only full-time researchers; Jessica, Patrick, and Critch co-wrote a new agenda with their take once they were added to the team. The new agenda reflects a number of small changes: some updates that we’ve all made in response to evidence over the last couple of years, some writing-up of problems that we’d been thinking about for some time but which hadn’t made the cut into the previous agenda, and some legitimate differences in intuition and perspective brought to the table by Jessica, Patrick, and Critch. The overall strategy is still “do research that we think others won’t do,” and the research methods and intuitions we rely on continue to have a MIRI-ish character.

Regarding success probability, I think MIRI has a decent chance of success compared to other potential AI risk interventions, but AI risk is a hard problem. I’d guess that humanity as a whole has a fairly low probability of success, with wide error bars.

Unless I’m missing context, I think the “medium probability of success” language comes from old discussions on LessWrong about how to respond to Pascal’s mugging. (See Rob’s note about Pascalian reasoning here.) In that context, I think the main dichotomy Eliezer had in mind was “tiny” probabilities (that can be practically ignored, like gambling in the powerball) and strategically relevant probabilities like 1% or 10%. See Eliezer’s post here. I’m fine with calling the latter probabilities “medium-sized” in the context of lottery-style errors, and calling them “small” in other contexts. With respect to ensuring that the first AGI designs developed by AI scientists are easy to align, I don’t think MIRI’s odds are stellar, though I do feel comfortable saying that they’re higher than 1%. Let me know if I’ve misunderstood the question you had in mind here.

So8res 12 Oct 2016 17:45 UTC
10 points
0 ∶ 0
in reply to: poppingtonic’s comment on: Ask MIRI Anything (AMA)
I endorse Tsvi’s comment above. I’ll add that it’s hard to say how close we are to closing basic gaps in understanding of things like “good reasoning”, because mathematical insight is notoriously difficult to predict. All I can say is that logical induction does seem like progress to me, and we’re taking various different approaches on the remaining problems. Also, yeah, one of those avenues is a follow-up to PPRHOL. (One experiment we’re running now is an attempt to implement a cellular automaton in HOL that implements a reflective reasoner with access to the source code of the world, where the reasoner uses HOL to reason about the world and itself. The idea is to see whether we can get the whole stack to work simultaneously, and to smoke out all the implementation difficulties that arise in practice when you try to use a language like HOL for reasoning about HOL.)

So8res 11 Jun 2015 22:12 UTC
10 points
0 ∶ 0
in reply to: interstice’s comment on: I am Nate Soares, AMA!
That’s a good question: we don’t have a practical AGI to poke at, so why do we expect that we can do work today that’s likely to be relevant many years down the line?

I’ll answer in part with an analogy: Say you went back in time and dropped by to visit Kolmogorov back when he was trying to formalize probability theory, and you asked “working without concrete feedback, how are you planning to increase the chance that your probability theory will be relevant to people trying to reason probabilistically in the future?” It seems like the best response is for him to sort of cock his head and say “well, uh, I’m still trying to formalize what I mean by ‘chance’ and ‘probability’ and so on; once we’ve got those things ironed out, then we can chat.”

Similarly, we’re still trying to formalize the theory of advanced agents: right now, if you handed me unlimited computing power, I wouldn’t know how to program it to reliably and “intelligently” pursue a known goal, even a very simple goal, such as “produce as much diamond as possible.” There are parts of the problem of designing highly reliable advanced agents that we don’t understand even in principle yet. We don’t even know how to brute force the solution yet. We’re still trying to formalize the problems :-)

Also, note that working on theory doesn’t mean you can’t get feedback: we make various mathematical models that attempt to capture part of the problem, we investigate their behavior, we see which parts of the problems they do and don’t capture, and so on. (For example: Stuart Armstrong came up with a formal definition of a utility-indifferent agent; Benja responded by identifying a way Stuart’s agent succumbs to blackmail. I think this counts as pretty concrete feedback: it doesn’t get that much more concrete than “your idea provably doesn’t work”!)

As for relevance, there are definitely paths where this sort of work wouldn’t end up being relevant (jumping straight to whole-brain emulation, jumping straight to nanotech, etc.) but I currently don’t think those scenarios are all that likely. Other cases where it turns out these problems are irrelevant include (a) we needed the theory, but didn’t complete it in time, (b) it turns out you can build a safe AGI even if you don’t understand why it’s working, not even in theory, and (c) someone else got to the theory first. I’m trying to avoid (a), (b) doesn’t seem likely enough to bet the universe on it, and I’d count (c) as a win :-)

So8res 11 Jun 2015 22:01 UTC
10 points
0 ∶ 0
in reply to: Alex_Altair’s comment on: I am Nate Soares, AMA!
Policy work / international coordination. Figuring out how to build an aligned AI is only part of the problem. You also need to ensure that an aligned AI is built, and that’s a lot harder to do during an international arms race. (A race to the finish would be pretty bad, I think.)

I’d like to see a lot more people figuring out how to ensure global stability & coordination as we enter a time period that may be fairly dangerous.

So8res 13 Oct 2016 17:43 UTC
9 points
0 ∶ 0
in reply to: John_Maxwell’s comment on: Ask MIRI Anything (AMA)
Posts or comments on personal Twitter accounts, Facebook walls, etc. should not be assumed to represent any official or consensus MIRI position, unless noted otherwise. I’ll echo Rob’s comment here that “a good safety approach should be robust to the fact that the designers don’t have all the answers”. If an AI project hinges on the research team being completely free from epistemic shortcomings and moral failings, then the project is doomed (and should change how it’s doing alignment research).

I suspect we’re on the same page about it being important to err in the direction of system designs that don’t encourage arms races or other zero-sum conflicts between parties with different object-level beliefs or preferences. See also the CEV discussion above.

So8res 12 Oct 2016 17:26 UTC
9 points
0 ∶ 0
in reply to: poppingtonic’s comment on: Ask MIRI Anything (AMA)
Good question. The main effect is that I’ve increased my confidence in the vague MIRI mathematical intuitions being good, and the MIRI methodology for approaching big vague problems actually working. This doesn’t constitute a very large strategic shift, for a few reasons. One reason is that my strategy was already predicated on the idea that our mathematical intuitions and methodology are up to the task. As I said in last year’s AMA, visible progress on problems like logical uncertainty (and four other problems) was one of the key indicators of success that I was tracking; and as I said in February, failure to achieve results of this caliber in a 5-year timeframe would have caused me to lose confidence in our approach. (As of last year, that seemed like a real possibility.) The logical induction result increases my confidence in our current course, but it doesn’t shift it much.

Another reason logical induction doesn’t affect my strategy too much is that it isn’t that big a result. It’s one step on a path, and it’s definitely mathematically exciting, and it gives answers to a bunch of longstanding philosophical problems, but it’s not a tool for aligning AI systems on the object level. We’re building towards a better understanding of “good reasoning”, and we expect this to be valuable for AI alignment, and logical induction is a step in that direction, but it’s only one step. It’s not terribly useful in isolation, and so it doesn’t call for much change in course.

So8res 12 Jun 2015 2:15 UTC
9 points
0 ∶ 0
in reply to: Wei Dai’s comment on: I am Nate Soares, AMA!
All right, I’ll come back for one more question. Thanks, Wei. Tough question. Briefly,

(1) I can’t see that many paths to victory. The only ones I can see go through either (a) aligned de-novo AGI (which needs to be at least powerful enough to safely prevent maligned systems from undergoing intelligence explosions) or (b) very large amounts of global coordination (which would be necessary to either take our time & go cautiously, or to leap all the way to WBE without someone creating a neuromorph first). Both paths look pretty hard to walk, but in short, (a) looks slightly more promising to me. (Though I strongly support any attempts to widen path (b)!)

(2) It seems to me that the default path leads almost entirely to UFAI: insofar as MIRI research makes it easier for others to create UFAI, most of that effect isn’t replacing wins with losses, it’s just making the losses happen sooner. By contrast, this sort of work seems necessary in order to keep path (a) open. I don’t see many other options. (In other words, I think it’s net positive because it creates some wins and moves some losses sooner, and that seems like a fair trade to me.)

To make that a bit more concrete, consider logical uncertainty: if we attain a good formal understanding of logically uncertain reasoning, that’s quite likely to shorten AI timelines. But I think I’d rather have a 10-year time horizon and be dealing with practical systems built upon solid foundations that come from a decade’s worth of formally understanding what good logically uncertain reasoning looks like, rather than a 20-year time horizon where we have to deal with systems built using 19 years of hacks and 1 year of patches bolted on at the end.

(In other words, the possibility of improving AI capabilities is the price you have to pay to keep path (a) open.)

A bunch of other factors also play into my considerations (including a heuristic which says “the best way to figure out which problems are the real problems is to start solving the things that appear to be the problems,” and another heuristic which says “if you see a big fire, try to put it out, and don’t spend too much time worrying about whether putting it out might actually start worse fires elsewhere”, and a bunch of others), but those are the big considerations, I think.

So8res 11 Jun 2015 22:15 UTC
9 points
0 ∶ 0
in reply to: Evan_Gaensbauer’s comment on: I am Nate Soares, AMA!
(1) number of FAIs produced ;-)

Other important metrics include:
- number of agent foundations forum posts produced
- number of papers written
- number of papers published in conferences/journals
- number of papers published in high-prestige conferences/journals (a fuzzy metric)
- number of conferences attended
- number of collaborative papers written
- number of research associates
- number of people who have attended a workshop
- number of non-MIRI-employees who have produced a technical result
- amount of progress on core technical problems (a very fuzzy metric, which is why it’s important to also track the more concrete numbers above)
- size of research team
I also of course keep my eye on “number of dollars available.”

So8res 11 Jun 2015 22:04 UTC
9 points
0 ∶ 0
in reply to: Owen Cotton-Barratt’s comment on: I am Nate Soares, AMA!
(a) how many major insights remain between us and strong AI? (b) how many of those insights will come from thinking hard, and how many will come from examining the brain? (c) how many more AI winters will there be? (d) how far ahead will the frontrunner be? (e) will there be an arms race?, to name a few.

So8res 12 Oct 2016 19:19 UTC
8 points
0 ∶ 0
in reply to: InquilineKea’s comment on: Ask MIRI Anything (AMA)
I largely endorse Jessica’s comment. I’ll add that I think the ideal MIRI researcher has their own set of big-picture views about what’s required to design aligned AI systems, and that their vision holds up well under scrutiny. (I have a number of heuristics for what makes me more or less excited about a given roadmap.)

That is, the ideal researcher isn’t just working on whatever problems catch their eye or look interesting; they’re working toward a solution of the whole alignment problem, and that vision regularly affects their research priorities.

So8res 10 Jul 2017 22:46 UTC
7 points
0 ∶ 0
in reply to: Daniel_Dewey’s comment on: My current thoughts on MIRI’s “highly reliable agent design” work
Can you give an example or two of failure modes or “categories of failure modes that are easy to foresee” that you think are addressed by some HRAD topic? I’d thought previously that thinking in terms of failure modes wasn’t a good way to understand HRAD research.

I want to steer clear of language that might make it sound like we’re saying:
- X ‘We can’t make broad-strokes predictions about likely ways that AGI could go wrong.’
- X ‘To the extent we can make such predictions, they aren’t important for informing research directions.’
- X ‘The best way to address AGI risk is just to try to advance our understanding of AGI in a general and fairly undirected way.’
The things I do want to communicate are:
- All of MIRI’s research decisions are heavily informed by a background view in which there are many important categories of predictable failure, e.g., ‘the system is steering toward edges of the solution space’, ‘the function the system is optimizing correlates with the intended function at lower capability levels but becomes uncorrelated at high capability levels’, ‘the system has incentives to obfuscate and mislead programmers to the extent it models its programmers’ beliefs and expects false programmer beliefs to result in it better-optimizing its objective function.’
- The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., ‘developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it’s likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems’.
- There usually isn’t a simple relationship between a particular open problem and a particular failure mode, but if we thought there were no way to predict in advance any of the ways AGI systems can go wrong, or if we thought a very different set of failures were likely instead, we’d have different research priorities.
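One of the failure categories listed above, a proxy that tracks the intended function at low capability but diverges at high capability, can be illustrated with a toy sketch. Everything here is a hypothetical construction for illustration (the functions `intended`, `proxy`, and `best_by_proxy`, and the quadratic shape of the intended objective); it is not drawn from any MIRI paper.

```python
def intended(x):
    # ASSUMPTION (hypothetical): the objective we actually care about.
    # It rises until x = 1 and falls afterwards.
    return x - 0.5 * x * x


def proxy(x):
    # ASSUMPTION (hypothetical): the objective the system actually optimizes.
    # It agrees in direction with `intended` only for x below 1.
    return x


def best_by_proxy(capability, steps=100):
    """Return the proxy-maximizing option among those reachable at this capability."""
    candidates = [capability * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy)


for capability in (0.5, 1.0, 2.0, 4.0):
    choice = best_by_proxy(capability)
    print(
        f"capability={capability}: proxy={proxy(choice):.3f}, "
        f"intended={intended(choice):.3f}"
    )
```

In this toy, raising capability from 0.5 to 1.0 improves both scores, while raising it further keeps improving the proxy and makes the intended score worse. Real systems need not have this shape; the sketch only shows what "correlated at low capability, uncorrelated at high capability" can look like.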

So8res 12 Jun 2015 0:00 UTC
7 points
0 ∶ 0
in reply to: Diego_Caleiro’s comment on: I am Nate Soares, AMA!
1) The things we have no idea how to do aren’t the implicit assumptions in the technical agenda, they’re the explicit subject headings: decision theory, logical uncertainty, Vingean reflection, corrigibility, etc :-)

We’ve tried to make it very clear in various papers that we’re dealing with very limited toy models that capture only a small part of the problem (see, e.g., basically all of section 6 in the corrigibility paper).

Right now, we basically have a bunch of big gaps in our knowledge, and we’re trying to make mathematical models that capture at least part of the actual problem—simplifying assumptions are the norm, not the exception. All I can easily say is that common simplifying assumptions include: you have lots of computing power, there is lots of time between actions, you know the action set, you’re trying to maximize a given utility function, etc. Assumptions tend to be listed in the paper where the model is described.

2) The FLI folks aren’t doing any research; rather, they’re administering a grant program. Most FHI folks are focused more on high-level strategic questions (What might the path to AI look like? What methods might be used to mitigate xrisk? etc.) rather than object-level AI alignment research. And remember that they look at a bunch of other X-risks as well, and that they’re also thinking about policy interventions and so on. Thus, the comparison can’t easily be made. (Eric Drexler’s been doing some thinking about the object-level FAI questions recently, but I’ll let his latest tech report fill you in on the details there. Stuart Armstrong is doing AI alignment work in the same vein as ours. Owain Evans might also be doing object-level AI alignment work, but he’s new there, and I haven’t spoken to him recently enough to know.)

Insofar as FHI folks would say we’re making assumptions, I doubt they’d be pointing to assumptions like “UDT knows the policy set” or “assume we have lots of computing power” (which are obviously simplifying assumptions on toy models), but rather assumptions like “doing research on logical uncertainty now will actually improve our odds of having a working theory of logical uncertainty before it’s needed.”

3) I think most of the FHI folks & FLI folks would agree that it’s important to have someone hacking away at the technical problems, but just to make the arguments more explicit, I think that there are a number of problems that it’s hard to even see unless you have your “try to solve FAI” goggles on. Consider: people have been working on some of these problems for decades (logical uncertainty) or even centuries (decision theory) without solving the AI-alignment-relevant parts.

We’re still very much trying to work out the initial theory of highly reliable advanced agents. This involves taking various vague philosophical problems (“what even is logical uncertainty?”) and turning them into concrete mathematical models (akin to the concrete model of probability theory attained by Kolmogorov & co).

We’re still in the preformal stage, and if we can get this theory to the formal stage, I expect we may be able to get a lot more eyes on the problem, because the ever-crawling feelers of academia seem to be much better at exploring formalized problems than they are at formalizing preformal problems.

Then of course there’s the heuristic of “it’s fine to shout ‘model uncertainty!’ and hover on the sidelines, but it wasn’t the armchair philosophers who did away with the epicycles, it was Kepler, who was up to his elbows in epicycle data.” One of the big ways that you identify the things that need working on is by trying to solve the problem yourself. By asking how to actually build an aligned superintelligence, MIRI has generated a whole host of open technical problems, and I predict that that host will be a very valuable asset now that more and more people are turning their gaze towards AI alignment.