My sense of the current general landscape of AI Safety is: various groups of people pursuing quite different research agendas, and not very many explicit and written-up arguments for why these groups think their agenda is a priority (a notable exception is Paul’s argument for working on prosaic alignment). Does this sound right? If so, why has this dynamic emerged and should we be concerned about it? If not, then I’m curious about why I developed this picture.
I think the picture is somewhat correct, and, perhaps surprisingly, we should not be too concerned about the dynamic.
My model for this is:
1) there are some hard and somewhat nebulous problems “in the world”
2) people try to formalize them using various intuitions/framings/kinds of math; also using some “very deep priors”
3) the resulting agendas look extremely different at the surface level, and create the impression you have
but actually
4) if you understand multiple agendas deeply enough, you get a sense of:
how they are sometimes “reflecting” the same underlying problem
if they are based on some “deep priors”, how deep those priors are, and how hard they can be to argue about
how much they are based on “tastes” and “intuitions” ~ one model for thinking about this is that people have something comparable to the policy network in AlphaZero: a mental black box which spits out useful predictions, but is not interpretable in language
Overall, given our current state of knowledge, I think running these multiple efforts in parallel is a better approach, with a higher chance of success, than the idea that we should invest a lot in resolving disagreements/prioritizing and that everyone should work on the “best agenda”.
This seems to go against a core EA heuristic (“compare the options, take the best”) but is actually more in line with rational allocation of resources in the face of uncertainty.
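To make the allocation point concrete, here is a minimal toy sketch. It is an illustration only, not a model anyone in this thread has proposed. It assumes (hypothetically) that exactly one of three agendas is the one that works, that we hold the made-up credences in `priors`, and that effort within a single agenda has diminishing returns of the form 1 - exp(-k * effort). The function name `expected_success` and the parameter `k` are invented for this example.

```python
import math

def expected_success(priors, allocation, k=3.0):
    """Expected probability of success under the toy model.

    priors[i]     : assumed credence that agenda i is the one that works
    allocation[i] : fraction of total research effort given to agenda i
    k             : assumed strength of diminishing returns within an agenda
    """
    assert abs(sum(allocation) - 1.0) < 1e-9, "allocation must sum to 1"
    # If agenda i is the right one, effort x on it succeeds with
    # probability 1 - exp(-k * x); weight that by the credence in agenda i.
    return sum(p * (1.0 - math.exp(-k * x)) for p, x in zip(priors, allocation))

# Hypothetical credences over three agendas (made-up numbers).
priors = [0.40, 0.35, 0.25]

allocations = {
    "all-in on best": [1.0, 0.0, 0.0],
    "even split": [1 / 3, 1 / 3, 1 / 3],
    "proportional": priors,  # effort proportional to credence
}

for name, alloc in allocations.items():
    print(f"{name:>15}: {expected_success(priors, alloc):.3f}")
```

Under these assumptions, both split allocations score higher than going all-in on the single most likely agenda. The result depends on the diminishing-returns assumption: if returns to effort were linear, putting everything on the highest-credence agenda would do at least as well.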
Thanks for the reply! Could you give examples of:
a) two agendas that seem to be “reflecting” the same underlying problem despite appearing very different superficially?
b) a “deep prior” that you think some agenda is (partially) based on, and how you would go about working out how deep it is?
Sure
a)
For example, CAIS and something like the “classical superintelligence in a box” picture disagree a lot on the surface level. However, if you look deeper, you will find many similar problems. A simple-to-explain example is the problem of manipulating the operator, which has (in my view) some “hard core” involving both math and philosophy: you want the AI to communicate with humans in a way which at the same time a) allows the human to learn from the AI if the AI knows something about the world, b) does not let the AI “overwrite” the operator’s values, and c) does not prohibit moral progress. In CAIS language this is connected to so-called manipulative services.
Or: one of the biggest hits of the past year is the mesa-optimisation paper. However, if you are familiar with prior work, you will notice that many of the proposed solutions for mesa-optimisers are similar or identical to solutions previously proposed for so-called ‘daemons’ or ‘misaligned subagents’. This is because the problems partially overlap (the mesa-optimisation framing is clearer and makes a stronger case for “this is what to expect by default”). Also, while on the surface level there is a lot of disagreement between e.g. MIRI researchers, Paul Christiano and Eric Drexler, you will find a “distillation” proposal targeted at the above-described problem in Eric’s work from 2015 and many connected ideas in Paul’s work on distillation; and while I find it harder to understand Eliezer, I think his work also reflects an understanding of the problem.
b)
For example: you can ask whether the space of intelligent systems is fundamentally continuous or not. (I call this “the continuity assumption”.) This is connected to many agendas: if the space is fundamentally discontinuous, this would cause serious problems for some forms of IDA, debate, interpretability & more.
(An example of discontinuity would be the existence of problems which are impossible to meaningfully factorize; there are many more ways in which the space could be discontinuous.)
There are powerful intuitions going both ways on this.
Not Buck, but one possibility is that people pursuing different high-level agendas have different intuitions about what’s valuable, that those kinds of disagreements are relatively difficult to resolve, and that the best way to resolve them is to gather more “object-level” data.
Maybe people have already spent a fair amount of time having in-person discussions trying to resolve their disagreements, and haven’t made progress, and this discourages them from writing up their thoughts because they think it won’t be a good use of time. However, this line of reasoning might be mistaken—it seems plausible to me that people entering the field of AI safety are relatively impartial judges of which intuitions do/don’t seem valid, and the question of where new people in the field of AI safety should focus is an important one, and having more public disagreement would improve human capital allocation.
I think your sense is correct. I think that plenty of people have short docs on why their approach is good; I think basically no-one has long docs engaging thoroughly with the criticisms of their paths (I don’t think Paul’s published arguments defending his perspective count as complete; Paul has arguments that I hear him make in person that I haven’t seen written up.)
My guess is that it’s developed because various groups decided that it was pretty unlikely that they were going to be able to convince other groups of their work, and so they decided to just go their own ways. This is exacerbated by the fact that several AI safety groups have beliefs which are based on arguments which they’re reluctant to share with each other.
(I was having a conversation with an AI safety researcher at a different org recently, and they couldn’t tell me about some things that they knew from their job, and I couldn’t tell them about things from my job. We were reflecting on the situation, and then one of us proposed the metaphor that we’re like two people who were sliding on ice next to each other and then pushed away and have now chosen our paths and can’t interact anymore to course correct.)
Should we be concerned? Idk, seems kind of concerning. I kind of agree with MIRI that it’s not clearly worth it for MIRI leadership to spend time talking to people like Paul who disagree with them a lot.
Also, sometimes fields should fracture a bit while they work on their own stuff; maybe we’ll develop our own separate ideas for the next five years, and then come talk to each other more when we have clearer ideas.
I suspect that things like the Alignment Newsletter are causing AI safety researchers to understand and engage with each other’s work more; this seems good.
FWIW, it’s not clear to me that AI alignment folks with different agendas have put less effort into (or have made less progress on) understanding the motivations for other agendas than is typical in other somewhat-analogous fields. Like, MIRI leadership and Paul have put >25 (and maybe >100, over the years?) hours into arguing about the merits of their differing agendas (in person, on the web, in GDocs comments), and my impression is that central participants in those conversations (e.g. Paul, Eliezer, Nate) can pass the others’ ideological Turing tests (ITTs) reasonably well on a fair number of sub-questions and down 1-3 levels of “depth” (depending on the sub-question). That might be more effort and better ITT performance than is typical for “research agenda motivation disagreements” in small niche fields that are comparable on some other dimensions.
“I suspect that things like the Alignment Newsletter are causing AI safety researchers to understand and engage with each other’s work more; this seems good.”
This is the goal, but it’s unclear that it’s having much of an effect. I feel like I relatively often have conversations with AI safety researchers where I mention something I highlighted in the newsletter, and the other person hasn’t heard of it, or has a very superficial / wrong understanding of it (one that I think would be corrected by reading just the summary in the newsletter).
This is very anecdotal; even if there are times when I talk to people and they do know the paper that I’m talking about because of the newsletter, I probably wouldn’t notice / learn that fact.
(In contrast, junior researchers are often more informed than I would expect, at least about the landscape, even if not the underlying reasons / arguments.)