One thing that I think would really help me read this document would be a sense (from Joe) of "here are the parts where my mind changed the most in the course of this investigation".
Something like (note that this is totally made up): "there's a particular exploration of alignment where I had conceptualized it as kinda being about making the AI think right, but now I conceptualize it as being about not thinking wrong, which I explore in section a.b.c".
Also maybe something like a sense of which of the premises Joe changed his mind on the most – where the probabilities shifted a lot.
Hi Ben,
This does seem like a helpful kind of content to include (here I think of Luke's section on this, in the context of his work on moral patienthood). I'll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:
It now feels more salient to me just how many AI applications may be covered by systems that either aren't agentic planners/strategically aware (including e.g. interacting modular systems, especially where humans are in the loop for some parts, and/or intuitively "sphexish/brittle" non-APS systems), or by systems which are specialized/myopic/limited in capability in various ways. That is, a generalized learning agent that's superhuman (let alone better than e.g. all of human civilization) in ~all domains, with objectives as open-ended and long-term as "maximize paperclips," now seems to me a much more specific type of system, and one whose role in an automated economy—especially early on—seems more unclear. (I discuss this a bit in section 3, section 4.3.1.3, and section 4.3.2.)
Thinking about the considerations discussed in the "unusual difficulties" section generally gave me more clarity about how this problem differs from safety problems arising in the context of other technologies (I think I had previously been putting more weight on considerations like "building technology that performs function F is easier than building technology that performs function F safely and reliably," which apply more generally).
I realized how much I had been implicitly conceptualizing the "alignment problem" as "we must give these AI systems objectives that we're OK seeing pursued with ~arbitrary degrees of capability" (something akin to the "omni test"). Meeting standards in this vicinity (to the extent that they're well defined in a given case) seems like a very desirable form of robustness, and I'm sympathetic to related comments from Eliezer to the effect that "don't build systems that are searching for ways to kill you, even if you think the search will come up empty." But I found it helpful to remember that the ultimate problem is "we need to ensure that these systems don't seek power in misaligned ways on any inputs they're in fact exposed to" (e.g., what I'm calling "practical PS-alignment") -- a framing that leaves more conceptual room, at least, for options that don't "get the objectives exactly right," and/or that involve restricting a system's capabilities/time horizons, preventing it from "intelligence exploding," controlling its options/incentives, and so on (though I do think options in this vein raise their own issues, of the type that the "omni test" is meant to avoid; see 4.3.1.3, 4.3.2.3, and 4.3.3). I discuss this a bit in section 4.1.
I realized that my thinking re: “races to the bottom on safety” had been driven centrally by abstract arguments/models that could apply in principle to many industries (e.g., pharmaceuticals). It now seems to me a knottier and more empirical question how models of this kind will actually apply in a given real-world case re: AI. I discuss this a bit in section 5.3.1.
Great answer, thanks.