Thanks for the input!
On Scheming: I actually don't think scheming risk is the most important factor. Even removing it completely doesn't change my final conclusion. I agree that a bimodal distribution with scheming/non-scheming would be appropriate for a more sophisticated model. I just ended up lowering the weight I assign to the scheming factor (by half) to take into account that I am not sure whether scheming will/won't be an issue.
In my analysis, the ability to get good feedback signals/success criteria is the factor that moves me the most to thinking that capabilities get sped up before safety.
On Task length: You have more visibility into this, so I'm happy to defer. But I'd love to hear more about why you think tasks in capabilities research have longer task lengths. Is it because you have to run large evals or do pre-training runs? Do you think this argument applies to all areas of capabilities research?
Oh sorry, I missed the weights on the factors, and thought you were taking an unweighted average.
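(For concreteness, here is a toy sketch of the weighted vs. unweighted aggregation we're talking about. The factor names, scores, and weights below are invented purely for illustration; they are not anyone's actual estimates.)

```python
# Toy illustration only: factor names, scores, and weights are made up.
# A positive score means the factor points toward capabilities being sped up
# before safety; weights encode how much each factor should count.
factors = {
    "feedback quality / verifiability": (+0.6, 0.40),
    "task length":                      (+0.2, 0.20),
    "scheming risk":                    (-0.5, 0.20),  # weight already halved for uncertainty
    "other considerations":             (+0.1, 0.20),
}

unweighted = sum(score for score, _ in factors.values()) / len(factors)
weighted = (sum(score * weight for score, weight in factors.values())
            / sum(weight for _, weight in factors.values()))

print(f"unweighted average: {unweighted:+.2f}")  # +0.10
print(f"weighted average:   {weighted:+.2f}")    # +0.20
```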
All tasks in capabilities research are ultimately trying to optimize the capability-cost frontier, which usually benefits from measuring capability.
If you have an AI that will do well at most tasks you give it that take (say) a week, then you have the problem that the naive way of evaluating the AI (run it on some difficult tasks and see how well it does) now takes a very long time to give you useful signal. So you now have two options:
Just run the naive evaluation and put in a lot of effort to make it faster.
Find some cheap proxy for capability that is easy to evaluate, and use that to drive your progress.
This doesn't apply to training / inference efficiency work (you hold the AI, and thus its capabilities, constant, so you don't need to measure capability). And there is already a good proxy for pretraining improvements, namely perplexity. But for all the other areas, this is going to be an increasingly pressing problem that will need to be solved.
On reflection, this is probably not best captured by your "task length" criterion, but rather by the "feedback quality / verifiability" criterion.
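To make the trade-off between the two options above concrete, here is a toy sketch (every number in it is hypothetical): running the agent on full long-horizon tasks gives an unbiased estimate but costs agent-time proportional to the task horizon, while a cheap proxy gives fast signal that may be systematically off.

```python
import random

# Toy model of the evaluation trade-off; all constants below are invented.
TASK_HORIZON_HOURS = 40     # hypothetical "week-long" task = one work week
PROBE_HOURS = 0.1           # hypothetical cheap proxy probe (a few minutes)
TRUE_SKILL = 0.70           # hypothetical true success rate on long tasks
PROXY_BIAS = 0.10           # hypothetical gap between proxy and true skill

def naive_evaluation(n_tasks: int) -> tuple[float, float]:
    """Option 1: run full long-horizon tasks. Accurate, but slow."""
    successes = sum(random.random() < TRUE_SKILL for _ in range(n_tasks))
    return successes / n_tasks, n_tasks * TASK_HORIZON_HOURS

def proxy_evaluation(n_probes: int) -> tuple[float, float]:
    """Option 2: score cheap short probes assumed to track long-horizon skill.
    Fast, but the estimate inherits whatever bias the proxy has."""
    successes = sum(random.random() < TRUE_SKILL + PROXY_BIAS for _ in range(n_probes))
    return successes / n_probes, n_probes * PROBE_HOURS

if __name__ == "__main__":
    random.seed(0)
    print("naive:", naive_evaluation(50))  # ~0.70 estimate, 2000 agent-hours
    print("proxy:", proxy_evaluation(50))  # ~5 agent-hours, but biased high
```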
Is this just a statement that there is more low-hanging fruit in safety research? I.e., you can in some sense learn an equal amount from a two-minute rollout for both capabilities and safety, but capabilities researchers have already learned most of what was possible and safety researchers haven't exhausted everything yet.
Or is this a stronger claim that safety work is inherently a shorter-time-horizon thing?
It is more like this stronger claim.
I might not use "inherently" here. A core safety question is whether an AI system is behaving well because it is aligned, or because it is pursuing convergent instrumental subgoals until it can take over. The "natural" test is to run the AI until it has enough power to easily take over, at which point you observe whether it takes over, which is extremely long-horizon. But obviously this was never an option for safety anyway, and many of the proxies that we think about are more short-horizon.
For what it's worth, I think pre-training alone is probably enough to get us to about 1-3 month time horizons, based on a 7-month doubling time, but pre-training data will start to run out in the early 2030s, meaning that you will no longer (in the absence of other benchmarks) have very good general proxies for capabilities improvements.
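As a rough sanity check of that extrapolation: the 7-month doubling time is taken from above, but the starting point (a ~1-hour time horizon in 2025) is my own assumption for illustration, roughly in line with recent published estimates.

```python
import math

# Back-of-the-envelope extrapolation; the starting horizon and year are assumptions.
DOUBLING_TIME_MONTHS = 7        # doubling time cited above
START_HORIZON_HOURS = 1.0       # assumed time horizon in the reference year
START_YEAR = 2025.0             # assumed reference year
HOURS_PER_WORK_MONTH = 167      # ~2000 working hours per year / 12

def year_horizon_reached(target_hours: float) -> float:
    doublings = math.log2(target_hours / START_HORIZON_HOURS)
    return START_YEAR + doublings * DOUBLING_TIME_MONTHS / 12

print(f"~1-month horizon: {year_horizon_reached(1 * HOURS_PER_WORK_MONTH):.1f}")  # ~2029.3
print(f"~3-month horizon: {year_horizon_reached(3 * HOURS_PER_WORK_MONTH):.1f}")  # ~2030.2
# Under these assumptions, 1-3 month horizons arrive around 2029-2030, which is
# roughly when pre-training data is expected to start running out per the point above.
```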
The real issue isn't the difference between hours-long and months-long tasks, but the difference between months-long tasks and century-long tasks, which Steve Newman describes well here.