What is your credence in doom conditional on AIs not caring for humans?
How much do they care about humans, and what counts as doom? I think these things matter.
If we’re assuming all AIs don’t care at all about humans and doom = human extinction, then I think the probability is pretty high, like 65%.
If we’re allowed to assume that some small minority of AIs cares about humans, or AIs care about humans to some degree, perhaps in the way humans care about wildlife species preservation, then I think the probability is quite a lot lower, at maybe 25%.
For precision, both of these estimates are over the next 100 years, since I have almost no idea what will happen in the very long run.
What is your response to the standard arguments that ‘just train them hard to be ethical’ won’t work? E.g. Ajeya Cotra’s writings on the training game.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Can you say more about what model you have in mind? Do you have a model?
I’m happy to meet up some time and explain in person. I’ll try to remember to DM you later about that, but if I forget, then feel free to remind me.
I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Maybe so. But I can’t really see mechanistic interpretability being solved to a sufficient degree to detect a situationally aware AI playing the training game, in time to avert doom. Not without a long pause first at least!
I’m surprised by your 25%. To me, that really doesn’t match up with [...] from your essay.
In my opinion, “X is dubious” lines up pretty well with “X is 75% likely to be false”. That said, enough people have objected to this that I think I’ll change the wording.
OK, so our credences aren’t actually that different after all. I’m actually at less than 65%, funnily enough! (But that’s for doom = extinction. I think human extinction is unlikely for reasons to do with acausal trade; there will be a small minority of AIs that care about humans, just not on Earth. I usually use a broader definition of “doom” as “About as bad as human extinction, or worse.”)
I am pretty confident that what happens in the next 100 years will straightforwardly translate to what happens in the long run. If humans are still well-cared-for in 2100, they probably also will be in 2100,000,000.
I agree that if some AIs care about humans, or if all AIs care a little bit about humans, the situation looks proportionately better. Unfortunately that’s not what I expect to happen by default on Earth.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
That’s not really an answer to my question—Ajeya’s argument is about how today’s alignment techniques (e.g. RLHF + monitoring) won’t work even if turbocharged with huge amounts of investment. It sounds like you are disagreeing, and saying that if we just spend lots of $$$ doing lots and lots of RLHF, it’ll work. Or when you say humanity will try harder, do you mean they’ll use some other technique than the ones Ajeya thinks won’t work? If so, which technique?
(Separately, I tend to think humanity will probably invest less in alignment than it does in her stories, but that’s not the crux between us I think.)