People have pretty different background expectations about what the most relevant or worrying kind of AI misalignment/takeover/… scenario would look like. This also corresponds to different views on when they expect signs of it to become visible (such that not seeing those signs, or seeing something else, would update them). Among other issues, I think this confuses discussions of whether (e.g.) “alignment is easy” and of how we should be updating.[1]
My brain likes pictures, so I’ve found it useful to tag different views and discussions via the following diagrams (these are pretty “raw”/not-distilled):
1)
2) And a second one, roughly:
how safe systems at a given capability level appear vs. where they actually are on the path (or spectrum) to the kind of safety we care about.[2]
(This one also has some more notes on how people might relate differently to the same results/evidence.)
These are very messy sketches! I’m sharing them because I made a hacky commitment to post short things, and in case they’re useful for someone (or in case a comment helps clarify things for me, which I’d definitely appreciate). There’s some chance that I’ll clean these up and update them later.
[1] Related: “conflationary alliances” (see also a post with a version of this dynamic about “charity” on the Forum).
[2] Again, a huge part of the problem/confusion seems to be that this is a very underdetermined term; see the footnote above. It also feels sort of related to things I wrote about here.
(Although I’m guessing it’s also partly because this is a basically unedited sketch: I first drew it because a similar image had come to mind in a variety of contexts, and I wanted a version I could adapt as needed, i.e. it was meant to be flexible. If I were making a v2, I’d probably want to commit more, though.)