My current thoughts on MIRI’s “highly reliable agent design” work
Interpreting this writeup:
I lead the Open Philanthropy Project’s work on technical AI safety research. In our MIRI grant writeup last year, we said that we had strong reservations about MIRI’s research, and that we hoped to write more about MIRI’s research in the future. This writeup explains my current thinking about the subset of MIRI’s research referred to as “highly reliable agent design” in the Agent Foundations Agenda. My hope is that this writeup will help move the discussion forward, but I definitely do not consider it to be any kind of final word on highly reliable agent design. I’m posting the writeup here because I think this is the most appropriate audience, and I’m looking forward to reading the comments (though I probably won’t be able to respond to all of them).
After writing the first version of this writeup, I received comments from other Open Phil staff, technical advisors, and MIRI staff. Many comments were disagreements with arguments or credences stated here; some of these disagreements seem plausible to me, some comments disagree with one another, and I place significant weight on all of them because of my confidence in the commentators. Based on these comments, I think it’s very likely that some aspects of this writeup will turn out to have been miscalibrated or mistaken – i.e. incorrect given the available evidence, and not just cases where I assign a reasonable credence or make a reasonable argument that may turn out to be wrong – but I’m not sure which aspects these will turn out to be.
I considered spending a lot of time heavily revising this writeup to take these comments into account. However, it seems pretty likely to me that I could continue this comment/revision process for a long time, and this process offers very limited opportunities for others outside of a small set of colleagues to engage with my views and correct me where I’m wrong. I think there’s significant value in instead putting an imperfect writeup into the public record, and giving others a chance to respond in their own words to an unambiguous snapshot of my beliefs at a particular point in time.
1. What is “highly reliable agent design”?
I understand MIRI’s “highly reliable agent design” work (coined in this research agenda, “HRAD” for short) as work that aims to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way. Here’s a non-exhaustive list of research topics in this area:
Epistemology: developing a formal theory of induction that accounts for the facts that an AI system will be implemented in the physical world it is reasoning about (“naturalistic world models”) and that other intelligent agents may be simulating the AI system (“benign universal prior”).
Decision theory: developing a decision theory that behaves appropriately when an agent’s decisions are logically entangled with other parts of the environment (e.g. in the presence of other copies of the agent, other very similar systems, or other agents that can predict the agent), and that can’t be profitably threatened by other agents.
Logical uncertainty: developing a rigorous, satisfying theory of probabilistic reasoning over facts that are logical consequences of an agent’s current beliefs, but that are too expensive to reason out deductively.
Vingean reflection: developing a theory of formal reasoning that allows an agent to reason with high reliability about similar agents, including agents with considerably more computational resources, without simulating those agents.
For this work to be really satisfying, it should be possible to put these descriptions together into a full and principled description of an AI system that reasons and makes decisions in pursuit of some goal in the world, not taking into account issues of efficiency; this description might be understandable as a modified/expanded version of AIXI. Ideally this research would also yield rigorous explanations of why no other description is satisfying.
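As a reference point, the standard AIXI equation (Hutter's formalism, not MIRI's work) illustrates what a "complete" description of this kind looks like: the agent chooses actions by an expectimax over all computable environments, weighted by their simplicity:

```latex
a_t := \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[\, r_t + \cdots + r_m \,\bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Here $U$ is a universal Turing machine, $q$ ranges over environment programs, $m$ is the horizon, and $2^{-\ell(q)}$ is the simplicity prior over environments. Roughly, HRAD aims to replace pieces of this picture – the prior, the expectimax, and especially the treatment of the agent as separate from its environment – with formalisms that remain satisfying when the agent is embedded in the world it is reasoning about.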
2. What’s the basic case for HRAD?
My understanding is that MIRI (or at least Nate and Eliezer) believe that if there is not significant progress on many problems in HRAD, the probability that an advanced AI system will cause catastrophic harm is very high. (They reserve some probability for other approaches being found that could render HRAD unnecessary, but they aren’t aware of any such approaches.)
I’ve engaged in many conversations about why MIRI believes this, and have often had trouble coming away with crisply articulated reasons. So far, the basic case that I think is most compelling and most consistent with the majority of the conversations I’ve had is something like this (phrasing is mine / Holden’s):
1. Advanced AI systems are going to have a huge impact on the world, and for many plausible systems, we won’t be able to intervene after they become sufficiently capable.
2. If we fundamentally “don’t know what we’re doing” because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
3. Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.
4. Because of 1, 2, and 3, if we don’t have a satisfying description of how an AI system should reason and make decisions, we’re likely to make enough mistakes to cause a catastrophe. The right way to get to advanced AI that does the right thing instead of causing catastrophes is to deeply understand what we’re doing, starting with a satisfying description of how an AI system should reason and make decisions.
This case does not revolve around claims about specific potential failure modes, or their relationship to specific HRAD subproblems. It revolves around the value of fundamental understanding for avoiding “unknown unknown” problems.
I also find it helpful to see this case as asserting that HRAD is one kind of “basic science” approach to understanding AI. Basic science in other areas – i.e. work based on some sense of being intuitively, fundamentally confused and unsatisfied by the lack of explanation for something – seems to have an outstanding track record of uncovering important truths that would have been hard to predict in advance, including the work of Faraday/Maxwell, Einstein, Nash, and Turing. Basic science can also provide a foundation for high-reliability engineering, e.g. by giving us a language to express guarantees about how an engineered system will perform in different circumstances or by improving an engineer’s ability to design good empirical tests. Our lack of satisfying explanations for how an AI system should reason and make decisions and the importance of “knowing what we’re doing” in AI make a basic science approach appealing, and HRAD is one such approach. (I don’t think MIRI would say that there couldn’t be other kinds of basic science that could be done in AI, but they don’t know of similarly valuable-looking approaches.)
We’ve spent a lot of effort (100+ hours) trying to write down more detailed cases for HRAD work. This time included conversations with MIRI, conversation among Open Phil staff and technical advisors, and writing drafts of these arguments. These other cases didn’t feel like they captured MIRI’s views very well and were not very understandable or persuasive to me and other Open Phil staff members, so I’ve fallen back on this simpler case for now when thinking about HRAD work.
3. What do I think about HRAD?
I have several points of agreement with MIRI’s basic case:
I agree that existing formalisms like AIXI, Solomonoff induction, and causal decision theory are unsatisfying as descriptions of how an AI system should reason and make decisions, and I agree with most (maybe all) of the ways that MIRI thinks they are unsatisfying.
I agree that advanced AI is likely to have a huge impact on the world, and that for certain advanced AI systems there will be a point after which we won’t be able to intervene.
I agree that some plausible kinds of mistakes in an AI system’s design would cause catastrophic misalignment.
I agree that without some kind of description of “what an advanced AI system is doing” that makes us confident that it will be aligned, we should be very worried that it will cause a catastrophe.
The fact that MIRI researchers (who are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risks from AI) and some others in the effective altruism community are significantly more positive than I am about HRAD is an extremely important factor to me in favor of HRAD. These positive views significantly raise the minimum credence I’m willing to put on HRAD research being very helpful.
In addition to these positive factors, I have several reservations about HRAD work. In relation to the basic case, these reservations make me think that HRAD isn’t likely to be significantly helpful for getting a confidence-generating description of how an advanced AI system reasons and makes decisions.
1. It seems pretty likely that early advanced AI systems won’t be understandable in terms of HRAD’s formalisms, in which case HRAD won’t be useful as a description of how these systems should reason and make decisions.
Note: I’m not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems. It may be that our overall disagreement about HRAD is more about the feasibility of other AI alignment research options (see 3 below), or possibly about strategic questions outside the scope of this document (e.g. to what extent we should try to address potential risks from advanced AI through strategy, policy, and outreach rather than through technical research).
2. HRAD has gained fewer strong advocates among AI researchers than I’d expect it to if it were very promising—including among AI researchers whom I consider highly thoughtful about the relevant issues, and whom I’d expect to be more excited if HRAD were likely to be very helpful.
Together, these two concerns give me something like a 20% credence that if HRAD work reached a high level of maturity (and relatively little other AI alignment research were done) HRAD would significantly help AI researchers build aligned AI systems around the time it becomes possible to build any advanced AI system.
3. The above considers HRAD in a vacuum, instead of comparing it to other AI alignment research options. My understanding is that MIRI thinks it is very unlikely that other AI alignment research can make up for a lack of progress in HRAD. I disagree; HRAD looks significantly less promising to me (in terms of solving object-level alignment problems, ignoring factors like field-building value) than learning to reason and make decisions from human-generated data (described more below), and HRAD seems unlikely to be helpful on the margin if reasonable amounts of other AI alignment research are done.
This reduces my credence in HRAD being very helpful to around 10%. I think this is the decision-relevant credence.
In the next few sections, I’ll go into more detail about the factors I just described. Afterward, I’ll say what I think this implies about how much we should support HRAD research, briefly summarizing the other factors that I think are most relevant.
3a. Low credence that HRAD will be applicable (25%?)
The basic case for HRAD being helpful depends on HRAD producing a description of how an AI system should reason and make decisions that can be productively applied to advanced AI systems. In this section, I’ll describe my reasons for thinking this is not likely. (As noted above, I’m not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems; nevertheless, it’s an important factor in my current beliefs about the value of HRAD work.)
I understand HRAD work as aiming to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way, and ideally to have arguments that no other description is more satisfying. I’ll refer to this as a “complete axiomatic approach,” meaning that an end result of HRAD-style research on some aspect of reasoning would be a set of axioms that completely describe that aspect and that are chosen for their intrinsic desirability or for the desirability of the properties they entail. This property of HRAD work is the source of several of my reservations:
I haven’t found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they’d otherwise face. AIXI and Solomonoff induction are particularly strong examples of work that is very close to HRAD, but they don’t seem to have been applicable to real AI systems. While I think the most likely explanation for this lack of precedent is that complete axiomatic description is not a very promising approach, it could be that not enough effort has been spent in this direction for contingent reasons; I think that attempts at this would be very informative about HRAD’s expected usefulness, and seem like the most likely way that I’ll increase my credence in HRAD’s future applicability. (Two very accomplished machine learning researchers have told me that AIXI is a useful source of inspiration for their work; I think it’s plausible that e.g. logical uncertainty could serve a similar role, but this is a much weaker case for HRAD than the one I understand MIRI as making.) If HRAD work were likely to be applicable to advanced AI systems, it seems likely to me that some complete axiomatic descriptions (or early HRAD results) would be applicable to current AI systems, especially if advanced AI systems are similar to today’s.
From conversations with researchers and from my own familiarity with the literature, my understanding is that it would be extremely difficult to relate today’s cutting-edge AI systems to complete axiomatic descriptions. It seems to me that very few researchers think this approach is promising relative to other kinds of theory work, and that when researchers have tried to describe modern machine learning methods in this way, their work has generally not been very successful (compared to other theoretical and experimental work) in increasing researchers’ understanding of the AI systems they are developing.
It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system. HRAD results would have to be applied to actual AI systems via theoretically satisfying approximation methods, and it seems plausible that this will not be possible (or that the approximation methods will not preserve most of the desirable properties entailed by the axiomatic descriptions). I haven’t gathered evidence about this question.
It seems plausible that the conceptual framework and axioms chosen during HRAD work will be very different from the conceptual framework that would best describe how early advanced AI systems work. In theory, it may be possible to describe a recurrent neural network learning to predict future inputs as a particular approximation of Solomonoff induction, but in practice the differences in conceptual framework may be significant enough that this description would not actually be useful for understanding how neural networks work or how they might fail.
Overall, this makes me think it’s unlikely that HRAD work will apply well to advanced AI systems, especially if advanced AI is reached soon (which would make it more likely to resemble today’s machine learning methods). A large portion of my credence in HRAD being applicable to advanced AI systems comes from the possibility that advanced AI systems won’t look much like today’s. I don’t know how to gain much evidence about HRAD’s applicability in this case.
3b. HRAD has few advocates among AI researchers
HRAD has gained fewer strong advocates among AI researchers than I’d expect it to if it were very promising, despite other aspects of MIRI’s research (the alignment problem, value specification, corrigibility) being strongly supported by a few prominent researchers. Our review of five of MIRI’s HRAD papers last year provided more detailed examples of how a small number of AI researchers (seven computer science professors, one graduate student, and our technical advisors) respond to HRAD research; these reviews made it seem to us that HRAD research has little potential to decrease potential risks from advanced AI relative to other technical work with the same goal, though we noted that this conclusion was “particularly tentative, and some of our advisors thought that versions of MIRI’s research direction could have significant value if effectively pursued”.
I interpret these unfavorable reviews and lack of strong advocates as evidence that:
HRAD is less likely to be good basic science of AI; I’d expect a reasonable number of external AI researchers to recognize good basic science of AI, even if its aesthetic is fairly different from the most common aesthetics in AI research.
HRAD is less likely to be applicable to AI systems that are similar to today’s; I would expect applicability to AI systems similar to today’s to make HRAD research significantly more interesting to AI researchers, and our technical advisors agreed strongly that HRAD is especially unlikely to apply to AI systems that are similar to today’s.
I’m frankly not sure how many strong advocates among AI researchers it would take to change my mind on these points – I think a lot would depend on details of who they were and what story they told about their interest in HRAD.
I do believe that some of this lack of interest should be explained by social dynamics and communication difficulties: MIRI is not part of the academic system, and the way MIRI researchers write about their work and motivation is very different from that of many academic papers; both of these factors could cause mainstream AI researchers to be less interested in HRAD research than they would otherwise be. However, I think our review process and conversations with our technical advisors each provide some evidence that these factors aren’t sufficient to explain AI researchers’ low interest in HRAD.
Reviewers’ descriptions of the papers’ main questions, conclusions, and intended relationship to potential risks from advanced AI generally seemed thoughtful and (as far as I can tell) accurate, and in several cases (most notably Fallenstein and Kumar 2015) some reviewers thought the work was novel and impressive; if reviewers’ opinions were more determined by social and communication issues, I would expect reviews to be less accurate, less nuanced, and more broadly dismissive.
I only had enough interaction with external reviewers to be moderately confident that their opinions weren’t significantly attributable to social or communication issues. I’ve had much more extensive, in-depth interaction with our technical advisors, and I’m significantly more confident that their views are mostly determined by their technical knowledge and research taste. I think our technical advisors are among the very best-qualified outsiders to assess MIRI’s work, and that they have genuine understanding of the importance of alignment as well as being strong researchers by traditional standards. Their assessment is probably the single biggest data point for me in this section.
Outside of HRAD, some other research topics that MIRI has proposed have been the subject of much more interest from AI researchers. For example, researchers and students at CHAI have published papers on and are continuing to work on value specification and error-tolerance (particularly corrigibility), these topics have consistently seemed more promising to our technical advisors, and Stuart Russell has adopted the value alignment problem as a central theme of his work. In light of this, I am more inclined to take AI researchers’ lack of interest in HRAD as evidence about its promisingness than as evidence of severe social or communication issues.
The most convincing argument I know of for not treating other researchers’ lack of interest as significant evidence about the promisingness of HRAD research is:
1. I’m pretty sure that MIRI’s work on decision theory is a very significant step forward for philosophical decision theory. This is based mostly on conversations with a very small number of philosophers who I know to have seriously evaluated MIRI’s work, partially on an absence of good objections to their decision theory work, and a little on my own assessment of the work (which I’d discard if the first two considerations had gone the other way).
2. MIRI’s decision theory work has gained significantly fewer advocates among professional philosophers than I’d expect it to if it were very promising.
I’m strongly inclined to resolve this conflict by continuing to believe that MIRI’s decision theory work is good philosophy, and to explain 2 by appealing to social dynamics and communication difficulties. I think it’s reasonable to consider an analogous situation with HRAD and AI researchers to be plausible a priori, but the analogue of point 1 above doesn’t apply to HRAD work, and the other reasons I’ve given in this section lead me to think that this is not likely.
3c. Other research, especially “learning to reason from humans,” looks more promising than HRAD (75%?)
How promising does HRAD look compared to other AI alignment research options? The most significant factor to me is the apparent promisingness of designing advanced AI systems to reason and make decisions from human-generated data (“learning to reason from humans”); if an approach along these lines is successful, it doesn’t seem to me that much room would be left for HRAD to help on the margin. My views here are heavily based on Paul Christiano’s writing on this topic, but I’m not claiming to represent his overall approach, and in particular I’m trying to sketch out a broader set of approaches that includes Paul’s. It’s plausible to me that other kinds of alignment research could play a similar role, but I have a much less clear picture of how that would work, and finding out about significant problems with learning to reason from humans would make me both more pessimistic about technical work on AI alignment in general and more optimistic that HRAD would be helpful. The arguments in this section are pretty loose, but the basic idea seems promising enough to me to justify high credence that something in this general area will work.
“Learning to reason from humans” is different from the most common approaches in AI today, where decision-making methods are implicitly learned in the process of approximating some function – e.g. a reward-maximizing policy, an imitative policy, a Q-function or model of the world, etc. Instead, learning to reason from humans would involve directly training a system to reason in ways that match human demonstrations or are approved of by human feedback, as in Paul’s article here.
If we are able to become confident that an AI system is learning to reason in ways that meet human approval or match human demonstrations, it seems to me that we could also become confident that the AI system would be aligned overall; a very harmful decision would need to be generated by a series of human-endorsed reasoning steps (and unless human reasoning endorses a search for edge cases, edge cases won’t be sought). Human endorsement of reasoning and decision-making could not only incorporate valid instrumental reasoning (in parts of epistemology and decision theory that we know how to formalize), but also rules of thumb and sanity checks that allow humans to navigate uncertainty about which epistemology and decision theory are correct, as well as human value judgements about which decisions, actions, short-term consequences, and long-term consequences are desirable, undesirable, or of uncertain value.
Another factor that is important to me here is the potential to design systems to reason and make decisions in ways that are calibrated or conservative. The idea here is that we can become more confident that AI systems will not make catastrophic decisions if they can reliably detect when they are operating in unfamiliar domains or situations, have low confidence that humans would approve of their reasoning and decisions, have low confidence in predicted consequences, or are considering actions that could cause significant harm; in those cases, we’d like AI systems to “check in” with humans more intensively and to act more conservatively. It seems likely to me that these kinds of properties would contribute significantly to alignment and safety, and that we could pursue these properties by designing systems to learn to reason and make decisions in human-approved ways, or by directly studying statistical properties like calibration or “conservativeness”.
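To make the “conservative” idea concrete, here is a minimal toy sketch of a decision wrapper that acts on its own only when its estimated confidence in human approval is high, and otherwise checks in with a human. All names and thresholds are my own illustrative assumptions, not a proposal from MIRI or Paul:

```python
# Toy sketch (not a real system): act only when confidence that a human
# would approve is high; otherwise defer to a human overseer.

def conservative_decide(action_scores, approval_confidence,
                        defer_threshold=0.9):
    """Pick the highest-scoring action, but defer to a human whenever
    estimated approval confidence falls below the threshold.

    action_scores: dict mapping action name -> estimated value
    approval_confidence: dict mapping action name -> estimated probability
        that a human would approve of taking that action
    """
    best = max(action_scores, key=action_scores.get)
    if approval_confidence.get(best, 0.0) < defer_threshold:
        # Unfamiliar or low-confidence situation: check in instead of acting.
        return ("defer_to_human", best)
    return ("act", best)

# In a familiar situation the system acts on its own...
print(conservative_decide({"a": 1.0, "b": 0.2}, {"a": 0.95, "b": 0.9}))
# ...but when approval confidence is low, it checks in with a human first.
print(conservative_decide({"a": 1.0, "b": 0.2}, {"a": 0.5, "b": 0.9}))
```

Real versions of this idea would need calibrated confidence estimates (the hard part), but the design choice is the same: route low-confidence or high-stakes decisions to humans rather than acting.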
“Learning to reason and make decisions from human examples and feedback” and “learning to act ‘conservatively’ where ‘appropriate’” don’t seem to me to be many orders of magnitude more difficult than the kinds of learning tasks AI systems are good at today. If it were necessary for an AI system to imitate human judgement perfectly, I would be much more skeptical of this approach, but that doesn’t seem to be necessary, as Paul argues:
“You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.
So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way).
If we imagine a landscape of possible interpretations of human preferences, there is a ‘right’ interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like ‘ask the user what to do, and don’t manipulate them.’ And these behaviors will eventually get you where you want to go.
That is to say, the ‘right’ behavior is surrounded by a massive crater of ‘good enough’ behaviors, and in the long-term they all converge to the same place. We just need to land in the crater.”
Learning to reason from humans is a good fit with today’s AI research, and is broad enough that it would be very surprising to me if it were not productively applicable to early advanced AI systems.
It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built. There are significant open questions about whether and how we could make machine learning robust and theoretically well-understood enough for high confidence, but it seems to me that this will be the case for any technical pathway that relies on learning about human preferences in order to act desirably.
Finally, it seems to me that if a lack of HRAD-style understanding does leave us exposed to many important “unknown unknown” problems, there is a good chance that some of those problems will be revealed by failures or difficulties in achieving alignment in earlier AI systems, and that researchers who are actively thinking about the goal of aligning advanced AI systems will be able to notice these failings and relate them to a need for better HRAD-style understanding. This kind of process seems very likely to be applicable to learning to reason from humans, but could also apply to other approaches to AI alignment. I do not think that this process is guaranteed to reveal a need for HRAD-style understanding in the case that it is needed, and I am fairly sure that some failure modes will not appear in earlier advanced AI systems (the failure modes Bostrom calls “treacherous turns”, which only appear when an AI system has a large range of general-purpose capabilities, can reason very powerfully, etc.). It’s possible that earlier failure modes will be too rare, too late, or not clearly enough related to a need for HRAD-style research. However, if a lack of fundamental understanding does expose us to many important “unknown unknown” failure modes, it seems more likely to me that some informative failures will happen early than that all such failures will appear only after systems are advanced enough to be extremely high-impact, and that researchers motivated by alignment of advanced AI will notice if those failures could be addressed through HRAD-style understanding. (I’m uncertain about how researchers who aren’t thinking actively about alignment of advanced AI would respond, and I think one of the most valuable things we can do today is to increase the number of researchers who are thinking actively about alignment of advanced AI and are therefore more likely to respond appropriately to evidence.)
My credence for this section isn’t higher for three basic reasons:
It may be significantly harder to build an aligned AI system that’s much more powerful than a human if we use learned reasoning rules instead of formally specified ones. Very little work has been done on this topic.
It may be that some parts of HRAD – e.g. logical uncertainty or benign universal priors – will turn out to be necessary for reliability. This currently looks unlikely to me, but seems like the main way that parts of HRAD could turn out to be prerequisites for learning to reason from humans.
Unknown unknowns; my arguments in this section are pretty loose, and little work has been done on this topic.
3d. MIRI staff are thoughtful, aligned with our values, and have a good track record
As I noted above, I believe that MIRI staff are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risk from AI. The fact that some of them are much more optimistic than I am about HRAD research is a very significant factor in favor of HRAD. I think it would be incorrect to place a very low credence (e.g. 1%) on their views being closer to the truth than mine are.
I don’t think it is helpful to try to list a large amount of detail here; I’m including this as its own section in order to emphasize its importance to my reasoning. My views come from many in-person and online conversations with MIRI researchers over the past 5 years, reports of many similar conversations by other thoughtful people I trust, and a large amount of online writing about existential risk from AI spread over several sites, most notably LessWrong.com, agentfoundations.org, arbital.com, and intelligence.org.
The most straightforward thing to list is that MIRI was among the first groups to strongly articulate the case for existential risk from artificial intelligence and the need for technical and strategic research on this topic, as noted in our last writeup:
“We believe that MIRI played an important role in publicizing and sharpening the value alignment problem. This problem is described in the introduction to MIRI’s Agent Foundations technical agenda. We are aware of MIRI writing about this problem publicly and in-depth as early as 2001, at a time when we believe it received substantial attention from very few others. While MIRI was not the first to discuss potential risks from advanced artificial intelligence, we believe it was a relatively early and prominent promoter, and generally spoke at more length about specific issues such as the value alignment problem than more long-standing proponents.”
4. How much should Open Phil support HRAD work?
My 10% credence that “if HRAD reached a high level of maturity it would significantly help AI researchers build aligned AI systems” doesn’t fully answer the question of how much we should support HRAD work (with our funding and with our outreach to researchers) relative to other technical work on AI safety. It seems to me that the main additional factors are:
Field-building value: I expect that the majority of the value of our current funding in technical AI safety research will come from its effect of increasing the total number of people who are deeply knowledgeable about technical research on artificial intelligence and machine learning, while also being deeply versed in issues relevant to potential risks. HRAD work appears to be significantly less useful for this goal than other kinds of AI alignment work, since HRAD has not gained much support among AI researchers. (I do think that in order to be effective for field-building, AI safety research directions should be among the most promising we can think of today; this is not an argument for work on attractive but unpromising “AI safety” research.)
Replaceability: HRAD work seems much more likely than other AI alignment work to be neglected by AI researchers and funders. If HRAD work turns out to be significantly helpful, we could make a significant counterfactual difference by supporting it.
Shovel-readiness: My understanding is that HRAD work is currently funding-constrained (i.e. MIRI could scale up its program given more funds). This is not generally true of technical AI safety work, which in my experience has also required significant staff time.
The difference in field-building value between HRAD and the other technical AI safety work we support makes me significantly more enthusiastic about supporting other technical AI safety work than about supporting HRAD. However, HRAD’s low replaceability and my 10% credence in HRAD being useful make me excited to support at least some HRAD work.
In my view, enough HRAD work should be supported to continue building evidence about its chance of applicability to advanced AI, to give other AI researchers opportunities to encounter it and become advocates, and to generally make it reasonably likely that if it is more important than it currently appears, we can learn this fact. MIRI’s current size seems to me to be approximately right for this purpose, and as far as I know MIRI staff don’t think MIRI is too small to continue making steady progress. Given this, I am ambivalent (along the lines of our previous grant writeup) about recommending that Good Ventures funds be used to increase MIRI’s capacity for HRAD research.