Notes on some of my AI-related confusions[1]
It’s hard for me to get a sense for stuff like “how quickly are we moving towards the kind of AI that I’m really worried about?” I think this stems partly from (1) a conflation of different types of “crazy powerful AI”, and (2) the way that benchmarks and other measures of “AI progress” de-couple from actual progress towards the relevant things. Trying to represent these things graphically helps me orient/think.
First, it seems useful to distinguish the breadth or generality of state-of-the-art AI models and how able they are on some relevant capabilities. Once I separate these out, I can plot roughly where some definitions of “crazy powerful AI” apparently lie on these axes:
(I think there are too many definitions of “AGI” at this point. Many people would make that area much narrower, but possibly in different ways.)
Visualizing things this way also makes it easier for me[2] to ask: Where do various threat models kick in? Where do we get “transformative” effects? (Where does “TAI” lie?)
Another question that I keep thinking about is something like: “what are key narrow (sets of) capabilities such that the risks from models grow ~linearly as they improve on those capabilities?” Or maybe “What is the narrowest set of capabilities for which we capture basically all the relevant info by turning the axes above into something like ‘average ability on that set’ and ‘coverage of those abilities’, and then plotting how risk changes as we move the frontier?” (A toy version of this aggregation is sketched after the short list below.)
The most plausible sets of abilities like this might be something like:
Everything necessary for AI R&D[3]
Long-horizon planning and technical skills?
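As a toy illustration of the “average ability” / “coverage” aggregation mentioned above (the capability names, scores, and threshold here are invented placeholders, not estimates):

```python
# Toy sketch of collapsing per-capability scores into the two numbers above:
# "average ability on the set" and "coverage of the set".
# The capabilities, scores (0 = no ability, 1 = human-expert level), and the
# coverage threshold are all hypothetical placeholders.

scores = {
    "experiment design": 0.7,
    "ML engineering": 0.8,
    "long-horizon planning": 0.4,
    "research taste": 0.3,
}

COVERAGE_THRESHOLD = 0.5  # arbitrary "counts as roughly covered" cutoff

average_ability = sum(scores.values()) / len(scores)
coverage = sum(s >= COVERAGE_THRESHOLD for s in scores.values()) / len(scores)

print(f"average ability on the set: {average_ability:.2f}")
print(f"coverage of the set:        {coverage:.2f}")
```

The question above is then whether a risk curve over these two numbers captures most of what matters, or whether collapsing the capability profile this way throws away too much.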
If I try the former, how does risk from different AI systems change?
And we could try drawing some curves that represent our guesses about how the risk changes as we make progress on a narrow set of AI capabilities on the x-axis. This is very hard; I worry that companies focus on benchmarks in ways that make them less meaningful, so I don’t want to put performance on a specific benchmark on the x-axis. But we could try placing some fuzzier “true” milestones along the way, asking what the shape of the curve would be in reference to those, and then trying to approximate how far along we are with respect to them by using a combination of metrics and other measures. (Of course, it’s also really difficult to develop a reasonable/useful sense for how far apart those milestones are on the most appropriate measure of progress — or how close partial completion of these milestones is to full completion.)
Here’s a sketch:
Overall I’m really unsure of which milestones I should pay attention to here, and how risk changes as we might move through them.
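To make the milestone idea above slightly more concrete anyway, here’s a minimal numerical sketch. Every milestone position, metric reading, weight, and risk number in it is an invented placeholder, just to show the shape of the calculation:

```python
# Toy sketch: place fuzzy milestones on a 0-1 "progress" axis, guess a risk
# level at each one, estimate current progress as a weighted combination of
# several imperfect metrics, and read off an interpolated risk level.
# Every number below is an invented placeholder, not an actual estimate.

milestones = [  # (position on the progress axis, guessed risk on a 0-1 scale)
    (0.2, 0.05),
    (0.5, 0.20),
    (0.8, 0.70),
    (1.0, 0.95),
]

# Hypothetical proxies for "how far along we are", each mapped onto 0-1,
# with weights for how much we trust them.
metric_estimates = {
    "benchmark_suite": (0.60, 0.3),   # (estimated progress, weight)
    "time_horizon_evals": (0.45, 0.5),
    "expert_survey": (0.50, 0.2),
}

total_weight = sum(w for _, w in metric_estimates.values())
progress = sum(v * w for v, w in metric_estimates.values()) / total_weight


def interpolated_risk(p: float) -> float:
    """Linearly interpolate the guessed risk level between milestones."""
    prev_x, prev_y = 0.0, 0.0
    for x, y in milestones:
        if p <= x:
            return prev_y + (y - prev_y) * (p - prev_x) / (x - prev_x)
        prev_x, prev_y = x, y
    return milestones[-1][1]


print(f"estimated progress: {progress:.2f}")
print(f"guessed risk level: {interpolated_risk(progress):.2f}")
```

Of course, the hard part is everything this sketch assumes away: where the milestones sit, how to map metrics onto the progress axis, and what the risk curve between milestones actually looks like.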
It could make sense to pay attention to real-world impacts of (future) AI systems instead of their ~intrinsic qualities, but real-world impacts seem harder to find robust precursors to, rely on many non-AI factors, and interpreting them involves trying to untangle many different cruxes or worldviews. (Paying attention to both intrinsic qualities and real-world impacts seems useful and important, though.)
All of this also complicates how I relate to questions like “Is AI progress speeding up or slowing down?” (If I ignore all of these confusions and just try to evaluate progress intuitively / holistically, it doesn’t seem to be slowing down in relevant ways.[4])
Thoughts/suggestions/comments on any of this are very welcome (although I may not respond, at least not quickly).
Some content related to at least some of the above (non-exhaustive):
METR’s RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
AI Impacts page on HLAI, especially “Human-level” is superhuman
List of some definitions of advanced AI systems
John Wentworth distinguishing between “early transformative AI” and “superintelligence”
Holden Karnofsky’s recent piece for the Carnegie Endowment for International Peace: AI Has Been Surprising for Years
Writing on (issues with) benchmarks/evals: Kelsey Piper in Vox, Anthropic’s “challenges in evaluating AI systems”, Epoch in 2023 on how well compute predicts benchmark performance (pretty well on average, harder individually)
Recent post that I appreciated, which outlined timelines via some milestones and then outlined a picture of how the world might change in the background
Not representing “Forethought views” here! (I don’t know what Forethought folks think of all of this.)
Written/drawn very quickly.
This diagram also makes me wonder how much pushing the bottom right corner further to the right (on some relevant capabilities) would help (especially as an alternative to pushing up or diagonally), given that sub-human general models don’t seem that safety-favoring but some narrow-but-relevant superhuman capabilities could help us deal with the risks of more general, human-level+ systems.
Could a narrower set work? E.g. to what extent do we just care about ML engineering?
although I’m somewhat surprised at the lack of apparent real-world effects
Lizka might not work for the forum anymore, but I would have thought she could see that this is way too deep and good for a quick take!
Thanks for saying this!
I’m trying to get back into the habit of posting more content, and aiming for a quick take makes it easier for me to get over perfectionism or other hurdles (or avoid spending more time on this kind of thing than I endorse). But I’ll take this as a nudge to consider sharing things as posts more often. :)
Love it! I like that idea, and if it’s a lower bar psychologically to post in quick takes, that makes sense :).
I found these visualizations very helpful! I think of AGI as the top of your HLAI section: human-level in all tasks. Life 3.0 claimed that just being superhuman at AI coding would be super risky (via recursive self-improvement, RSI). But it seems to me it would need to be ~human-level at some other tasks as well, like planning and deception, to be super risky. Still, that could be relatively narrow overall.
Follow-up:
Quick list of some ways benchmarks might be (accidentally) misleading[1]
Poor “construct validity”[2]( & systems that are optimized for the metric)
The connection between what the benchmark is measuring and what it’s trying to measure (or what people think it’s measuring) is broken. In particular:
Missing critical steps
When benchmarks try to evaluate progress on some broad capability (like “engineering” or “math ability” or “planning”), they’re often testing performance on specific, meaningfully simpler tasks. Performance on those tasks might miss key aspects of true/real-world/relevant progress on that capability.
Besides the inherent difficulty of measuring the right thing, it’s important to keep in mind that systems might have been trained specifically to perform well on a given benchmark.
This is probably a bigger problem for benchmarks that have gotten more coverage.
And some benchmarks have been designed specifically to be challenging for existing leading models, which can make new/other AI systems appear to have made more progress on these capabilities (relative to the older models) than they actually have.
We’re seeing this with the “Humanity’s Last Exam” benchmark.
...and sometimes some of the apparent limitations are random or kinda fake, such that a minor improvement appears to lead to radical progress
Discontinuous metrics: (partial) progress on a given benchmark might be misleading.
The difficulty of tasks/tests in a benchmark often varies significantly (often for good reason), and reports of results might describe the benchmark in terms of its most difficult tasks rather than the ones the model actually completed.
I think this was an issue for Frontier Math, although I’m not sure how strongly to discount some of the results because of it.
-> This (along with issues like 1b above, which can lead to saturation) is part of what makes it harder to extrapolate from e.g. plots of progress on certain benchmarks. (A toy illustration of this, and of the “number of tries” point below, is sketched at the end of this list.)
Noise & poisoning of the metric: Even on the metric in question, data might have leaked into the training process, the measurement process itself can be affected by things like who’s running it, comparing the performance of models that were tested in slightly different ways can be messy, etc. Some specific issues here (I might try to add links later):
Brittleness to question phrasing
Data contamination (discussed e.g. here)
Differences in how the measurement/testing process was actually set up (including how much inference compute was used, how many tries a model was given, access to tools, etc.)
Misc
Selective reporting/measurement (AI companies want to report their successes)
Tasks that appear difficult (because they’re hard for humans) might be especially easy for AI systems (and vice versa), which might cause us to think that more (or less) progress is being made than actually is the case
E.g. protein folding is really hard for humans
This looks relevant, but I haven’t read it
Some benchmarks seem mostly irrelevant to what I care about
Systems are tested before post-training enhancements or other changes are applied
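Two toy illustrations of the points above about discontinuous metrics and about how the testing setup (e.g. number of tries) matters; all numbers here are invented:

```python
# Two toy illustrations (all numbers invented) of how measurement choices can
# distort apparent progress.

# (a) Mixed-difficulty aggregation: suppose a benchmark is 80% "easy" tasks
# and 20% "hard" tasks, and successive models mostly improve on the easy ones.
# The headline average climbs smoothly even while the hard subset sits at zero.
easy_share, hard_share = 0.8, 0.2
for easy_acc, hard_acc in [(0.3, 0.0), (0.6, 0.0), (0.9, 0.0), (0.95, 0.4)]:
    headline = easy_share * easy_acc + hard_share * hard_acc
    print(f"easy={easy_acc:.2f} hard={hard_acc:.2f} -> headline={headline:.2f}")
# Extrapolating the smooth headline trend says little about when (or whether)
# the hard subset will move.

# (b) Number of tries: with per-attempt success probability p and k independent
# attempts, the chance of at least one success is 1 - (1 - p)**k, so the same
# underlying model can look very different under pass@1 vs. pass@10 reporting.
p = 0.2
for k in (1, 5, 10):
    print(f"pass@{k}: {1 - (1 - p) ** k:.2f}")
```

Both effects are purely about how the measurement is set up and reported, not about the underlying model.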
Additions are welcome! (Also, I couldn’t quickly find a list like this earlier, but I’d be surprised if a better version than what I have above wasn’t available somewhere; I’d love recommendations.)
Open Phil’s announcement of their now-closed benchmarking RFP has some useful notes on this, particularly the section on “what makes for a strong benchmark.” I also appreciated METR’s list of desiderata here.
To be clear: I’m not trying to say anything on ways benchmarks might be useful/harmful here. And I’m really not an expert.
This paper looks relevant but I haven’t read it.