Super great post. I’ve been thinking about posting a nuance in (what I think of as) the Eliezer class of threat models but haven’t gotten around to it. (Warning: negative valence, as I will recall the moment I first underwent visceral sadness at the alignment problem.)
Rob Bensinger tweeted something like “if we stick the landing on this, I’m going to lose an unrecoverable amount of Bayes points”, and for two years now I’ve had a massively different way of thinking about the deployment of advanced systems, because I find something like a “law of mad science” very plausible.
The high level takeaway is that (in this class of threat models) we can “survive takeoff” (not that I don’t hate that framing) and accumulate lots of evidence that the doomcoin landed on heads (really feeling like we’re in the early stages of a glorious transhuman future or a more modest FALGSC), for hundreds of years. And then someone pushes a typo in a yaml file to the server, and we die.
There seems to be very little framing of a mostly Eliezer-like “flipping the doomcoin” scenario in which forecasters have so far only concerned themselves with the date of the first flip, but from then on the doomcoin is flipped on New Year’s Eve at midnight every year until it comes up tails and we die. In other words, if we are obligated to hustle the weight of the doomcoin now, before the first flip, then we are at least as obligated to apply constant vigilance forevermore, and there’s a stronger case to be made for demanding strictly increasing vigilance (pulling the weight of the doomcoin further and further in our favor every year). (This realization was my visceral sadness moment, in 2021 on Discord; before that I had been thinking about threat models as something like a fun and challenging video-game RNG.)
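To make the constant-vs.-increasing-vigilance point concrete, here’s a minimal toy sketch (Python, with made-up numbers, not a forecast): with any constant per-year doom probability, the chance of surviving centuries decays geometrically toward zero, while long-run survival requires the per-year risk to shrink fast enough that the risks sum to something finite.

```python
# Toy model of the repeated-doomcoin point (illustrative numbers only).
# Survival over n years is the product of (1 - p_i) over per-year doom
# probabilities p_i. With constant p it decays geometrically; only if the
# p_i shrink fast enough does long-run survival stay bounded away from zero.

def survival(per_year_risks):
    """Probability of surviving every flip, given per-year doom probabilities."""
    s = 1.0
    for p in per_year_risks:
        s *= 1.0 - p
    return s

years = 300
constant = [0.01] * years                               # constant vigilance: 1%/year forever
decaying = [0.01 / (t + 1) ** 2 for t in range(years)]  # increasing vigilance: risk ~ 1/t^2

print(f"constant 1%/yr over {years} yrs: {survival(constant):.3f}")  # ≈ 0.049
print(f"risk shrinking like 1/t^2:       {survival(decaying):.3f}")  # ≈ 0.984
```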
I think the Oxford folks have some literature on “existential security”, which I just don’t buy or expect at all. It seems deeply unlikely to me that there will be tricks we can pull, after the first time the doomcoin lands on heads, to keep it from flipping again. I think the “pivotal act” literature from MIRI tries to discuss this, by thinking about ways we can get some freebie years thrown in there (New Year’s Eve parties with no doomcoin flip), which is better than nothing. But this constant/increasing vigilance factor, or the repeated flips of the doomcoin, seems like a niche, informal inside view among people who’ve been hanging around longer than a couple of years.
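Continuing the same toy sketch (same caveats: illustrative numbers only), a pivotal act that buys, say, fifty flip-free years is genuinely better than nothing, but if a constant per-year risk resumes afterward, the survival probability still heads toward zero:

```python
import math

# Fifty "freebie" years with no doomcoin flip, then a constant 1%/year risk resumes.
freebie_then_constant = [0.0] * 50 + [0.01] * 250
survival = math.prod(1.0 - p for p in freebie_then_constant)
print(f"survival over 300 years with 50 freebie years: {survival:.3f}")  # ≈ 0.081
```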
Picking on Eliezer as a public intellectual for a second, insofar as my model of him is accurate (that his confidence that we die is more of an “eventually” thing, and that he has very little relation to Conjecture, who in many worlds will just take a hit to their Brier score in 2028, a hit Eliezer will be shielded from because he doesn’t commit to dates), I would have liked to see him retweet the Bensinger comment and warn us about all the ways in which we could observe wildly transformative AI not kill everyone, declare victory, and then a few hundred years later push a bad yaml file to the server and die.
(All of this modulo my feeling that “doomcoin” is an annoying and thought-destroying way of characterizing the distribution over how you expect things to go well and poorly, probably at the same time, but that’s its own jar of paperclips.)
I think that’s strongly contra Eliezer’s model, which is shaped something like “succeeding at solving the alignment problem eliminates most sources of existential risk, because aligned AGI will in fact be competent to solve for them in a robust way”. This does obviously imply something about the ability of random humans to ~~spin up unmonitored nanofactories~~ push a bad yaml file. Maybe there’ll be some much more clever solution(s) for various possible problems? /shrug
Yeah, I think “ASI implies an extreme case of lock-in” is a major tendency in the literature (especially Sequences-era), but 1. people disagree about whether “alignment” refers to something that outsmarts even this implication or not, and then they disagree about the relative tractability and plausibility of the different alignment visions, and 2. this is very much a separate set of steps that provides room for disagreement among people who broadly accept Eliezer-like threat models (the doomcoin stuff).
I don’t want to zero in on actually-existing Eliezer (at whichever time step); I’m more interested in a threat-model class or cluster around the lack of fire alarms, capabilities we can’t distinguish from magic, things of that nature.