Thanks for the input!
On Scheming: I actually don't think scheming risk is the most important factor. Even removing it completely doesn't change my final conclusion. I agree that a bimodal distribution with scheming/non-scheming would be appropriate for a more sophisticated model. I just ended up lowering the weight I assign to the scheming factor (by half) to take into account that I am not sure whether scheming will/won't be an issue.
In my analysis, the ability to get good feedback signals/success criteria is the factor that moves me the most to thinking that capabilities get sped up before safety.
On Task length: You have more visibility into this, so I'm happy to defer. But I'd love to hear more about why you think tasks in capabilities research have longer task lengths. Is it because you have to run large evals or do pre-training runs? Do you think this argument applies to all areas of capabilities research?
Oh sorry, I missed the weights on the factors, and thought you were taking an unweighted average.
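(For concreteness, here is a toy sketch of the weighted vs. unweighted aggregation we're talking about. The factor names, scores, and weights below are invented purely for illustration; they are not anyone's actual estimates.)

```python
# Toy illustration only: factor names, scores, and weights are made up.
# A positive score means the factor points toward capabilities being sped up
# before safety; weights encode how much each factor should count.
factors = {
    "feedback quality / verifiability": (+0.6, 0.40),
    "task length":                      (+0.2, 0.20),
    "scheming risk":                    (-0.5, 0.20),  # weight already halved for uncertainty
    "other considerations":             (+0.1, 0.20),
}

unweighted = sum(score for score, _ in factors.values()) / len(factors)
weighted = (sum(score * weight for score, weight in factors.values())
            / sum(weight for _, weight in factors.values()))

print(f"unweighted average: {unweighted:+.2f}")  # +0.10
print(f"weighted average:   {weighted:+.2f}")    # +0.20
```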
All tasks in capabilities research are ultimately trying to optimize the capability-cost frontier, which usually benefits from measuring capability.
If you have an AI that will do well at most tasks you give it that take (say) a week, then you have the problem that the naive way of evaluating the AI (run it on some difficult tasks and see how well it does) now takes a very long time to give you useful signal. So you now have two options:
Just run the naive evaluation and put in a lot of effort to make it faster.
Find some cheap proxy for capability that is easy to evaluate, and use that to drive your progress.
This doesn't apply to training / inference efficiency work (you hold the AI, and thus its capabilities, constant, so you don't need to measure capability). And there is already a good proxy for pretraining improvements, namely perplexity. But for all the other areas, this is going to be an increasingly pressing problem that will need to be solved.
On reflection, this is probably not best captured by your "task length" criterion, but rather by the "feedback quality / verifiability" criterion.
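To make the trade-off between the two options above concrete, here is a toy sketch (every number in it is hypothetical): running the agent on full long-horizon tasks gives an unbiased estimate but costs agent-time proportional to the task horizon, while a cheap proxy gives fast signal that may be systematically off.

```python
import random

# Toy model of the evaluation trade-off; all constants below are invented.
TASK_HORIZON_HOURS = 40     # hypothetical "week-long" task = one work week
PROBE_HOURS = 0.1           # hypothetical cheap proxy probe (a few minutes)
TRUE_SKILL = 0.70           # hypothetical true success rate on long tasks
PROXY_BIAS = 0.10           # hypothetical gap between proxy and true skill

def naive_evaluation(n_tasks: int) -> tuple[float, float]:
    """Option 1: run full long-horizon tasks. Accurate, but slow."""
    successes = sum(random.random() < TRUE_SKILL for _ in range(n_tasks))
    return successes / n_tasks, n_tasks * TASK_HORIZON_HOURS

def proxy_evaluation(n_probes: int) -> tuple[float, float]:
    """Option 2: score cheap short probes assumed to track long-horizon skill.
    Fast, but the estimate inherits whatever bias the proxy has."""
    successes = sum(random.random() < TRUE_SKILL + PROXY_BIAS for _ in range(n_probes))
    return successes / n_probes, n_probes * PROBE_HOURS

if __name__ == "__main__":
    random.seed(0)
    print("naive:", naive_evaluation(50))  # ~0.70 estimate, 2000 agent-hours
    print("proxy:", proxy_evaluation(50))  # ~5 agent-hours, but biased high
```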
Is this just a statement that there is more low-hanging fruit in safety research? I.e., you can in some sense learn an equal amount from a two-minute rollout for both capabilities and safety, but capabilities researchers have already learned most of what was possible and safety researchers haven't exhausted everything yet.
Or is this a stronger claim that safety work is inherently a shorter-time-horizon thing?
It is more like this stronger claim.
I might not use "inherently" here. A core safety question is whether an AI system is behaving well because it is aligned, or because it is pursuing convergent instrumental subgoals until it can take over. The "natural" test is to run the AI until it has enough power to easily take over, at which point you observe whether it takes over, which is extremely long-horizon. But obviously this was never an option for safety anyway, and many of the proxies that we think about are more short-horizon.
For what it's worth, I think pre-training alone is probably enough to get us to about 1-3 month time horizons, based on a 7-month doubling time, but pre-training data will start to run out in the early 2030s, meaning that you will no longer (in the absence of other benchmarks) have very good general proxies for capabilities improvements.
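As a rough sanity check of that extrapolation: the 7-month doubling time is taken from above, but the starting point (a ~1-hour time horizon in 2025) is my own assumption for illustration, roughly in line with recent published estimates.

```python
import math

# Back-of-the-envelope extrapolation; the starting horizon and year are assumptions.
DOUBLING_TIME_MONTHS = 7        # doubling time cited above
START_HORIZON_HOURS = 1.0       # assumed time horizon in the reference year
START_YEAR = 2025.0             # assumed reference year
HOURS_PER_WORK_MONTH = 167      # ~2000 working hours per year / 12

def year_horizon_reached(target_hours: float) -> float:
    doublings = math.log2(target_hours / START_HORIZON_HOURS)
    return START_YEAR + doublings * DOUBLING_TIME_MONTHS / 12

print(f"~1-month horizon: {year_horizon_reached(1 * HOURS_PER_WORK_MONTH):.1f}")  # ~2029.3
print(f"~3-month horizon: {year_horizon_reached(3 * HOURS_PER_WORK_MONTH):.1f}")  # ~2030.2
# Under these assumptions, 1-3 month horizons arrive around 2029-2030, which is
# roughly when pre-training data is expected to start running out per the point above.
```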
The real issue isn't the difference between hours-long and months-long tasks, but the difference between months-long tasks and century-long tasks, which Steve Newman describes well here.