My AI Alignment Research Agenda and Threat Model, right now (May 2023)
TLDR
Fairly short timelines, mildly fast takeoffs, and medium-high uncertainty → looking for abstractions to help with cognition-steering and value-loading → grasping at / reacting to related, scary, or FOMO-inducing lines of research.
Threat Model
An AI system could be built that’s far smarter than any human or small group of humans. This AI system could use its intelligence to defeat any non-motivation-directing safeguards, and gain control of the world and the future of humanity and other sentient life. Based on the orthogonality thesis, this amount of power would probably not, by default, be directed towards the best interests of humanity and other sentient life. Based on the idea of instrumental convergence, such an AI would destroy everything we value in its quest to fulfill its (dumb-by-default) original goal.
This AI may require new insights to build, or could arise by “scaling up” existing ML architectures. This AI may “self-improve” its architecture, or it could get smarter through prosaic “hack more cloud computing power” techniques. In either case, it could start from a position of low capabilities and end up as the most powerful entity on Earth.
This all could start within as few as 1.5 years from now, and will probably happen within 10 years, barring nuclear or other catastrophe.
The simplest solution to the above problem would be “don’t build superhuman AGI, at least for the near future”. However, superhuman AGI is likely to be built, on purpose or by accident, by any of a handful of groups with large amounts of computational resources and talented researchers. These groups are generally not monolithic, and contain leaders and employees who disagree (internally, with other orgs, and/or with me) about the best approach to AI alignment. (See the section “The AI Landscape” below).
Imagine if any of these groups got a box, today, that said “Input your alignment solution by USB drive, push button to get a superhuman AGI that runs on it; box expires in 1 week”. According to my threat model, humanity is unlikely to survive longer than 1 week in this scenario, despite the wildly varying (often quite good!) alignment motivations and security mindsets of these groups. In my view, this is mainly because none of these groups has an adequate, pre-prepared response to the “Two Subproblems” below.
The Two Subproblems
I forget where this advice came from, but I followed the tip to “take a day or so to think through AI alignment, for yourself, from scratch”. I was definitely biased by my previous reading on AI (especially by Yudkowsky and Wentworth), but I basically came away with the following two subproblems:
Steering Cognition
How do we direct the thought-patterns, goals, and development of an AI system? In the rocket-alignment analogy, this is the “Newtonian mechanics”/“basic physics” part.
For many, the core difficulty and most important part of AI alignment is being able to steer a mind’s cognition at all. If we get this right, we set a lower bound on the badness of AGI X-risks (while also opening the door to S-risks if we solve this subproblem and neglect the one below, but that’s not the immediate focus).
Determining/Loading Values
If we could aim a superintelligent AI system at anything, what should we aim it at, and how? In the “rocket alignment” analogy, this is basically the flight plan (or the method for creating the flight plan) to get to the moon.
At first, this seems to naturally decompose into “determine values” and “encode values into the AGI”. However, I consider this to be one subproblem, because an AGI could most likely carry out either or both of those steps itself. But before those steps comes something like “figure out what [a pointer to [the best values for an AGI]] would look like, in enough detail to point an AGI at it and expect things to go well from there.” Due to fragility of value, I don’t expect a realistic, satisfactory solution to AI alignment to involve a human (that is, a neither-augmented-nor-simulated human) writing down the full Sheet Of Human Values and then plugging it into an AGI. However, we could end up with, say, a reliable theory of / mathematical abstraction for our values, which an AGI could then “fill in the blanks” of through observation.
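As a toy numpy sketch of that “fill in the blanks through observation” hope: suppose values fit a known template U(option) = w × feature(option) with one unknown weight w, and the system infers w from the choices of a hypothetical Boltzmann-rational demonstrator. Every name and number here is an illustrative assumption of mine, not a real proposal.

```python
import numpy as np

# Toy value-loading sketch: recover a hidden "value" weight w from observed
# choices, assuming a known utility template and Boltzmann-rational behavior.
rng = np.random.default_rng(0)
true_w = 2.0                           # the hidden value parameter
features = rng.normal(size=(200, 2))   # each row: features of two options

def p_pick_first(w, f):
    """P(demonstrator picks option 0) under a softmax over utilities w*f."""
    return 1.0 / (1.0 + np.exp(-w * (f[:, 0] - f[:, 1])))

# Simulate the observed choices (0 = first option, 1 = second).
choices = (rng.random(len(features)) > p_pick_first(true_w, features)).astype(int)

# Grid posterior over w with a uniform prior; the MAP should land near true_w.
w_grid = np.linspace(-5.0, 5.0, 501)
log_post = np.array([
    np.sum(np.log(np.where(choices == 0, p, 1.0 - p) + 1e-12))
    for p in (p_pick_first(w, features) for w in w_grid)
])
w_hat = w_grid[np.argmax(log_post)]
print(f"true w = {true_w}, inferred w = {w_hat:.2f}")
```

The hard part that this toy assumes away, of course, is finding the right template: fragility-of-value says a mis-specified template fails badly, which is why the “pointer” framing above matters.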
Theory of Change
If I research the above two subproblems (and/or the items in the section “What I’m Personally Learning/Researching” below), then one or both of them will become more solved. This could mean a full end-to-end solution, a theoretical-but-proven plan, a paradigm that can be developed further, or contributions to the work of others. I am eager to help and fairly agnostic about how.
Furthermore, I think that even if my above “Threat Model” is wrong in one or more key ways, the research I want to do would still be helpful. For example, if AI takeoff speeds were slower, I would still want research similar to mine to be developed and refined quickly. If neural networks were superseded by a new AI paradigm, I would still think research similar to mine could help align the new architectures.
And, of course, if you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
What Success Could Look Like
Create a manual/framework for building a friendly AGI.
Create a manual/framework for building an AGI that can be pointed at anything.
Help with either of the above two.
Achieve some other result that prevents AGI-caused extinction of humanity.
What I’m Personally Learning/Researching
Learning
John Wentworth’s work on abstractions
MIRI’s work on agent foundations
QACI’s work on pointing-at-real-world-values
Getting a broad knowledge base of mathematics, modulo my existing knowledge and short timelines.
Finding out what other nuggets of interesting research would be helpful for my goals, from cyborgism/human-researcher-intelligence-amplification to governance/large-training-moratorium to moral uncertainty to theories-of-how-sentience-and/or-consciousness-works.
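My (possibly lossy) gloss of the natural-abstraction idea from Wentworth’s work, as a toy numpy sketch: many low-level variables influence a “faraway” variable only through a low-dimensional summary (here, their mean), so the summary predicts the distant variable about as well as the full detail does. This is my illustration, not his formalism.

```python
import numpy as np

# Toy natural-abstraction sketch: 100 low-level variables affect a distant
# variable only via their mean, so that 1-number summary is (nearly) as
# predictive as all 100 details.
rng = np.random.default_rng(1)
n_samples, n_low = 5000, 100

low_level = rng.normal(size=(n_samples, n_low))
summary = low_level.mean(axis=1)                       # candidate abstraction
distant = summary + 0.1 * rng.normal(size=n_samples)   # depends only on summary

def r2(X, y):
    """R^2 of an ordinary-least-squares fit of y on X (with intercept)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_full = r2(low_level, distant)
r2_summary = r2(summary.reshape(-1, 1), distant)
print(f"R^2 from all 100 variables: {r2_full:.3f}")
print(f"R^2 from 1-number summary:  {r2_summary:.3f}")
```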
Researching
Finding or developing mathematical structures that are actually useful in aligning smart AI systems (“What’s the type signature of an agent?”).
Contributing to abstraction theory to get it to a point where “pointing at things in the world” is doable.
Applying abstraction theory to understanding and changing the contents of artificial minds. This includes the ability to e.g. trace a “thought” through a neural network, from input to output, and understand its progression.
Applying abstraction theory to determining human values.
Contributing to, giving feedback on, extending, and applying the above areas I’m learning about, especially cutting across different organizations. (For reference: I am currently in closest touch with people at Orthogonal, Conjecture, and OpenAI. I’ve also talked briefly with various alignment researchers at the EA Global SF 2022 conference.)
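For concreteness on the “type signature of an agent” question: one folk answer (a starting point, not a settled result) is a stateful map from observations to actions, step : (state, observation) → (state, action). The names and the thermostat example below are my own illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
State = TypeVar("State")

@dataclass
class Agent(Generic[State, Obs, Act]):
    """A folk type signature for an agent: internal state plus a policy
    step : (state, observation) -> (new state, action)."""
    state: State
    step: Callable[[State, Obs], Tuple[State, Act]]

    def act(self, obs: Obs) -> Act:
        self.state, action = self.step(self.state, obs)
        return action

# Example: a thermostat "agent" whose internal state is its setpoint.
def thermostat_step(setpoint: float, temp: float) -> Tuple[float, str]:
    return setpoint, ("heat" if temp < setpoint else "off")

thermostat = Agent(state=20.0, step=thermostat_step)
print(thermostat.act(18.0))  # heat
print(thermostat.act(22.0))  # off
```

Much of the agent-foundations difficulty lives in what this signature leaves out: embeddedness, self-modification, and where the “goal” lives in the first place.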
And again: If you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
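As a very rough sketch of the “trace a thought” goal mentioned above: the real version would target trained models and a principled notion of “thought”; this toy just records every intermediate representation as an input flows through a small MLP (on a real framework one would use e.g. forward hooks instead). The weights here are random stand-ins.

```python
import numpy as np

# Toy activation-tracing sketch: run an input through a small random MLP and
# record the representation at every layer, input first.
rng = np.random.default_rng(2)
layer_sizes = [4, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]

def forward_with_trace(x):
    """Return the network output plus every intermediate activation."""
    trace = [x]
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:      # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
        trace.append(x)
    return x, trace

out, trace = forward_with_trace(rng.normal(size=4))
for i, a in enumerate(trace):
    print(f"layer {i}: shape {a.shape}")
```

The open research question is not recording activations (easy) but interpreting the recorded trajectory as a “thought” with semantics one can steer.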
My Current Constraints
These are described in more depth here. They are (in no particular order):
Funding
Mathematical intuition/ability/”talent”
Mathematical concepts (getting even more “broad technical background”)
Mathematical notation/formalism knowledge
My working memory
My mental stamina
The most important constraint right now (i.e. the only real bottleneck at this time) is funding. With enough funding, I could work full-time on AI alignment, which would include solving or mitigating the other constraints.
Note that I already have a Bachelor’s degree in computer science, a minor in mathematics, and some other AI-related background (see here).
The AI Landscape
Here is how the rest of the AI alignment/safety landscape looks, to me, as of this writing:
Table 1: An Informal Assessment of Potentially-Strategically-Relevant AI and Alignment Organizations, as of late May 2023. (Note: This table may not be up-to-date.)
Organization | One-Sentence Summary | Are they likely to cause AGI doom, including by accident?