Some thoughts on Leopold Aschenbrenner’s Situational Awareness paper
This post steps through three gaps, or points of uncertainty, that I wanted to explore and submit for consideration after reading Leopold Aschenbrenner’s recent essay, Situational Awareness. I expect an upcoming analysis from Zvi Mowshowitz to be far more detailed (obviously), but I wanted to share some thoughts I’ve come to relatively independently, before engaging with more in-depth and informed responses.
I think Aschenbrenner is probably right about a lot of things, including the trajectory. “It is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn’t require believing in sci-fi; it just requires believing in straight lines on a graph.” (p. 8) [emphasis mine]. His piece also touches on a lot of valuable topics that I can’t afford to keep in scope here without this post becoming unwieldy. These include the conditions under which automated A(G/S)I safety researchers may be moral patients; the heightened risks and catastrophic consequences of a kinetic US-China conflict, especially considering that both the Manhattan and Apollo projects took place under similar conditions; and many others, to be sure.
Here, I try to address what feels most obviously missing or uncertain to me, and to raise some things for consideration (for which commenters may be able to point me to particular resources).
Automated A(G/S)I researchers and safety implications
I agree that a hundred million automated A(G/S)I safety researchers pursuing (super-)alignment might be better than the status quo, but that vision doesn’t clearly signal when progress on their capabilities should be deprioritised in favour of safety gains.
Aschenbrenner makes the important point that “we don’t need to automate everything—just AI research” (p. 49), which suggests that the first truly transformative models may be savant-like specialists in a handful of relevant domains, rather than human-like generalists. When and how, though, do we select the model(s) to replicate a hundred million times, rather than devoting the same resources to just a little more capability in the domains most relevant to safety? What line, however blurry and unpredictable, exists between such a model and the kind of thing we don’t want to build? Can we reliably test for other emergent properties that we might not want in certain models, and decide to sacrifice the concomitant safety gains as the cost of doing this right?
The shelf-life of “top-down” interpretability
“Top-down” interpretability, in which we try to locate information or activity in a neural net without a full understanding of how it is processed, feels like it has a shelf life no longer than Aschenbrenner’s timelines, and possibly shorter.
In Section 3c, he provides a comprehensive version of the (super-)alignment problem and steps through his reasons for cautious optimism. He is optimistic that top-down interpretability techniques could help us build something like an “AI lie detector”, where we look for the parts of a neural net that ‘light up’ when a model is lying (p. 119). My intuition, though, is that top-down interpretability is probably only useful until a system is capable enough to create its own accurate models of the minds of others, at which point it can work out that we would pursue exactly this kind of technique to detect undesirable traits. I would then expect some version of the following: a) the model is capable enough to behave, or to modify itself, in a way that masks how it ‘lights up’ when deceiving us, making itself more opaque; and/or b) the model suggests designs for future models in which this becomes less detectable, cleverly disguised as safety measures.
(Perhaps we’ll know how and where to look for signs of deception when a model suggests such designs; perhaps the model can’t disguise this before it reaches a level of capability we wouldn’t trust with safety research in the first place; perhaps there’s a new ML paradigm that solves for interpretability. This is a major uncertainty for me, born of technical ignorance, and flagged here as a perceived gap rather than a disagreement.)
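To make the “lie detector” idea concrete, here is a minimal, hypothetical sketch of a top-down deception probe: a simple classifier trained on hidden activations labelled as coming from honest or deceptive outputs. Everything below is an assumption for illustration; the activations and the “deception direction” are synthetic stand-ins rather than anything drawn from Aschenbrenner’s essay or a real model.

```python
# Hypothetical sketch of a "top-down" deception probe: train a linear
# classifier on hidden activations labelled honest vs. deceptive.
# The activations below are synthetic stand-ins; in practice they would be
# residual-stream or MLP activations extracted from an actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512     # hidden size of the (hypothetical) model
n_samples = 2000  # number of labelled activation vectors

# Pretend a single "deception direction" exists in activation space;
# deceptive samples are shifted slightly along it.
deception_direction = rng.normal(size=d_model)
deception_direction /= np.linalg.norm(deception_direction)

labels = rng.integers(0, 2, size=n_samples)  # 1 = deceptive output
activations = rng.normal(size=(n_samples, d_model))
activations += 1.0 * labels[:, None] * deception_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe only "reads off" where the model lights up; it does not
# explain the underlying computation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")
```

The shelf-life concern above is precisely that a sufficiently capable model might shape its activations so that no such separable signal remains, at which point probes of this kind stop being informative.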
The difference between taking the lead and crossing the threshold
Finally, I agree that it would be better for a democracy to take and hold a decisive lead on superintelligence (p. 127), but it is less clear to me that anybody benefits by actually reaching superintelligence first, unless there is corresponding progress on the (super-)alignment problem.
This distinction wasn’t clear on my reading of the paper, but I think it matters. If Aschenbrenner has accurately described both the urgency and the abundance of low(er)-hanging fruit to reach for along the way (improving infosec, solving robotics, etc.), then I have a clear preference about who should derive the ongoing benefits from safer and more capable A(G/S)I on the path to superintelligence, because only healthy democracies can preserve as much option space for future values as possible (p. 134).
I have a less clear preference, however (even approaching ambivalence), about who actually crosses some threshold into superintelligence first without clearer progress on (super-)alignment, because all the relevant contributors to a future that is bad for humans (e.g. value lock-in, perverse instantiation, instrumental convergence, goal preservation, etc.) seem to apply either way. I don’t know that where superintelligence is developed, or by whom, will matter if we somehow have less time than Aschenbrenner suggests, because we may never accrue any of the benefits that might have enabled us to lengthen timelines and solve the extremely hard problems (including, if possible, the alleviation of race dynamics by ensuring that a healthy democracy attains that lead at all).