Thanks for this solid summary of your views, Daniel. For others’ benefit: MIRI and Open Philanthropy Project staff are in ongoing discussion about various points in this document, among other topics. Hopefully some portion of those conversations will be made public at a later date. In the meantime, a few quick public responses to some of the points above:
2) If we fundamentally “don’t know what we’re doing” because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
3) Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.
I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
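As a rough illustration of why that string-format choice matters, here is a toy sketch in Python rather than real C. It treats memory as a flat run of bytes; the helper names `read_c_string` and `read_length_prefixed`, the one-byte length prefix, and the "Dana"/"secret" contents are all assumptions made up for this example. A null-terminated read has to trust that a terminator is present, while a length-prefixed read carries its own bound.

```python
def read_c_string(memory: bytes, start: int) -> bytes:
    """Read from `start` up to the first NUL byte, the way C string routines scan."""
    end = memory.index(0, start)  # raises ValueError if no NUL is found at all
    return memory[start:end]


def read_length_prefixed(memory: bytes, start: int) -> bytes:
    """Read a string stored as one length byte followed by that many data bytes."""
    length = memory[start]
    return memory[start + 1 : start + 1 + length]


# Hypothetical layout: a 4-byte name field filled completely, so there was
# no room left for the terminating NUL, followed by unrelated adjacent data.
name_field = b"Dana"
adjacent = b"secret\x00"

# Null-terminated convention: the scan runs past the field into adjacent data.
memory_c = name_field + adjacent
leaked = read_c_string(memory_c, 0)

# Length-prefixed convention: the stored length bounds the read.
memory_lp = bytes([len(name_field)]) + name_field + adjacent
safe = read_length_prefixed(memory_lp, 0)

print(leaked)  # b'Danasecret'
print(safe)    # b'Dana'
```

The point of the sketch is only that one missing byte silently changes what the first reader returns, whereas the second reader cannot over-read unless the length itself is corrupted.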
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in obvious ways that make it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”
This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding “unknown unknown” problems.
We worry about “unknown unknowns”, but I’d probably give them less emphasis here. We often focus on categories of failure modes that we think are easy to foresee. As a rule of thumb, when we prioritize a basic research problem, it’s because we expect it to help in a general way with understanding AGI systems and make it easier to address many different failure modes (both foreseen and unforeseen), rather than because of a one-to-one correspondence between particular basic research problems and particular failure modes.
As an example, the reason we work on logical uncertainty isn’t that we’re visualizing a concrete failure that we think is highly likely to occur if developers don’t understand logical uncertainty. We work on this problem because any system reasoning in a realistic way about the physical world will need to reason under both logical and empirical uncertainty, and because we expect broadly understanding how the system is reasoning about the world to be important for ensuring that the optimization processes inside the system are aligned with the intended objectives of the operators.
A big intuition behind prioritizing HRAD is that solutions to “how do we ensure the system’s cognitive work is being directed at solving the right problems, and at solving them in the desired way?” are likely to be particularly difficult to hack together from scratch late in development. An incomplete (empirical-side-only) understanding of what it means to optimize objectives in realistic environments seems like it will force designers to rely more on guesswork and trial-and-error in a lot of key design decisions.
I haven’t found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they’d otherwise face.
This seems reasonable to me in general. I’d say that AIXI has had limited influence in part because it’s combining several different theoretical insights that the field was already using (e.g., complexity penalties and backtracking tree search), and the synthesis doesn’t add all that much once you know about the parts. Sections 3 and 4 of MIRI’s Approach provide some clearer examples of what I have in mind by useful basic theory: Shannon, Turing, Bayes, etc.
My perspective on this is a combination of “basic theory is often necessary for knowing what the right formal tools to apply to a problem are, and for evaluating whether you’re making progress toward a solution” and “the applicability of Bayes, Pearl, etc. to AI suggests that AI is the kind of problem that admits of basic theory.” An example of how this relates to HRAD is that I think that Bayesian justifications are useful in ML, and that a good formal model of rationality in the face of logical uncertainty is likely to be useful in analogous ways. When I speak of foundational understanding making it easy to design the right systems, I’m trying to point at things like the usefulness of Bayesian justifications in modern ML. (I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems.)
I don’t have terribly organized thoughts about this. (And I am still not paying all that much attention—I have much more patience for picking apart my own reasoning processes looking for ways to improve them, than I have for reading other people’s raw takes :-p)
But here’s some unorganized and half-baked notes:
I appreciated various expressions of emotion. Especially when they came labeled as such.
I think there was also a bunch of other stuff going on in the undertones that I don’t have a good handle on yet, and that I’m not sure about my take on. Stuff like… various people implicitly shopping around proposals about how to readjust various EA-internal political forces, in light of the turmoil? But that’s not a great handle for it, and I’m not terribly articulate about it.
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
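The negative-EV claim in the roulette example can be checked with a little arithmetic. The sketch below assumes a standard single-number ("straight-up") bet paying 35 to 1 on a wheel with 37 pockets (European) or 38 pockets (American); the text doesn’t say which wheel is in play, and the function name `straight_up_ev` is made up for illustration.

```python
from fractions import Fraction


def straight_up_ev(pockets: int, payout: int = 35, stake: int = 1) -> Fraction:
    """Expected profit of a single-number bet: win `payout * stake` or lose `stake`."""
    p_win = Fraction(1, pockets)
    return p_win * payout * stake - (1 - p_win) * stake


european = straight_up_ev(37)  # assumed 37-pocket wheel
american = straight_up_ev(38)  # assumed 38-pocket wheel

print(european)  # -1/37
print(american)  # -1/19
```

Note that the chosen number never appears in the calculation: under these assumptions, 23 and 32 have exactly the same expected value before the spin, which is why “I should have bet 23” is not a lesson one can act on.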
My impression is that various EAs respond to crises in a manner that kinda rhymes with saying “I wish I had bet 23”, or at best “I wish I had noticed this bet was negative EV”, and in particular does not rhyme with saying “my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house”.
(I think this is also true of the general population, to be clear. Perhaps even more so.)
I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)
(Though I haven’t gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)
My guess is that most EAs didn’t make mental missteps of any import.
And, of course, most folk on this forum aren’t rushing to self-flagellate. Lots of people who didn’t make any mistake aren’t saying anything about their non-mistakes, as seems entirely reasonable.
I think the scrupulous might be quick to object that, like, they had some flicker of unease about EA being over-invested in crypto, that they should have expounded upon. And so surely they, too, erred.
And, sure, they’d’ve gotten more coolness points if they’d joined the ranks of people who aired that concern in advance.
And there is, I think, a healthy chain of thought from there to the hypothesis that the community needs better mechanisms for incentivizing and aggregating distributed knowledge.
(For instance: some people did air that particular concern in advance, and it didn’t do much. There’s perhaps something to be said for the power that a thousand voices would have had when ten didn’t suffice, but an easier fix than finding 990 voices is probably finding some other way to successfully heed the 10, which requires distinguishing them from the background noise—and distinguishing them as something actionable—before it’s too late, and then routing the requisite action to the people who can do something about it. etc.)
I hope that some version of this conversation is happening somewhere, and it seems vaguely plausible that there’s a variant happening behind closed doors at CEA or something.
I think that maybe a healthier form of community reflection would have gotten to a public and collaborative version of that discussion by now. Maybe we’ll still get there.
(I caveat, though, that it seems to me that many good things die from the weight of the policies they adopt in attempts to win the last war, with a particularly egregious example that springs to mind being the TSA. But that’s getting too much into the object-level weeds.)
(I also caveat that I in fact know a pair of modestly-high-net-worth EA friends who agreed, years ago, that the community was overexposed to crypto, and that at most one of them should be exposed to crypto. The timing of this thought is such that the one who took the non-crypto fork is now significantly less wealthy than the other. This stuff is hard to get right in real life.)
(And I also caveat that I’m not advocating design-by-community-committee when it comes to community coordination mechanisms. I think that design-by-committee often fails. I also think there’s all sorts of reasons why public attempts to discuss such things can go off the rails. Trying to have smaller conversations, or in-person conversations, seems eminently reasonable to me.)
I think that another thing that’s been going on is that there are various rumors around that “EA leaders” knew something about all this in advance, and this has caused a variety of people to feel (justly) perturbed and uneasy.
Insofar as someone’s thinking is influenced by a person with status in their community, I think it’s fair to ask what they knew and when, as is relevant to the question of whether and how to trust them in the future.
And insofar as other people are operating the de-facto community coordination mechanisms, I think it’s also fair to ask what they knew and when, as is relevant to the question of how (as a community) to fix or change or add or replace some coordination mechanisms.
I don’t particularly have a sense that the public EA discourse around FTX stuff was headed in a healthy and productive direction.
It’s plausible to me that there are healthy and productive processes going on behind closed doors, among the people who operate the de-facto community coordination mechanisms.
Separately, it kinda feels to me like there’s this weird veil draped over everything, where there are rumors that EA-leader-ish folk knew some stuff but nobody in that reference class is just, like, coming clean.
This post is, in part, an attempt to just pierce the damn veil (at least insofar as I personally can, as somebody who’s at least EA-leader-adjacent).
I can at least show some degree to which the rumors were true (I run an EA org, and Alameda did start out in the offices downstairs from ours, and I was privy to a bunch more data than others) versus false (I know of no suspicion that Sam was defrauding customers, nor have I heard any hint of any coverup).
One hope I have is that this will spark some sort of productive conversation.
For instance, my current hypothesis is that we’d do well to look for better community mechanisms for aggregating hints and acting on them. (Where I’m having trouble visualizing ways of doing it that don’t also get totally blindsided by the next crisis, when it turns out that the next war is not exactly the same as the last one. But this, again, is getting more into the object-level.)
Regardless of whether that theory is right, it’s at least easier to discuss in light of a bunch of the raw facts. Whether everybody was completely blindsided, or we had a bunch of hints that we failed to assemble, or there was a fraudulent conspiracy we tried to cover up, matters quite a bit as to how we should react!
It’s plausible to me that a big part of the reason the discussion hasn’t yet produced Nate!legible fruit is that it just wasn’t working with all that many details. This post is intended in part to help supply them.
(Though I of course also entertain the hypothesis that there are all sorts of different forces pushing the conversation off the rails (such that this post won’t help much), and the hypothesis that the conversation is happening just fine behind closed doors somewhere (such that this post isn’t all that necessary).)
(And I note, again, that insofar as this post does help the convo, Rob Bensinger gets a share of the credit. I was happy to shelve this post indefinitely, and wouldn’t have dug it out of my drafts folder if he hadn’t argued that it had a chance of rerailing the conversation.)