I would push back a little: the main thing is that buying-time interventions obviously have significant sign uncertainty. E.g., in your graph of the median researcher doing “buying time” vs. technical alignment, I think there should be very wide error bars at the low end of “buying time”, going significantly below 0 within the 95% confidence interval. Technical alignment is much less risky in that respect.
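A toy illustration of the kind of distributions I have in mind (all numbers here are made-up assumptions for illustration, not estimates from the post):

```python
from scipy.stats import norm

# Made-up numbers: suppose the median researcher's "buying time" impact is
# centred slightly above zero with a wide spread, while their technical
# alignment impact has a similar centre but a much tighter spread.
buying_time = norm(loc=0.5, scale=1.0)
technical = norm(loc=0.5, scale=0.3)

print(buying_time.interval(0.95))  # ~(-1.46, 2.46): the 95% CI goes well below 0
print(technical.interval(0.95))    # ~(-0.09, 1.09): barely crosses 0
print(buying_time.cdf(0))          # ~0.31 probability of net-negative impact
print(technical.cdf(0))            # ~0.05 probability of net-negative impact
```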
To clarify, you think that “buying time” might have a negative impact [on timelines/safety]?
Even if you think that, I’m pretty uncertain about the impact of technical alignment, if we’re talking about all work that is deemed ‘technical alignment.’ E.g., I’m not sure that, on the margin, I would prefer an additional alignment researcher (without knowing what they were researching or anything else about them), though I think it’s very unlikely that they would have a net-negative impact.
So I think I disagree that (a) “buying time” (excluding weird pivotal acts like trying to shut down labs) might have a net-negative impact, and thus also that (b) “buying time” has more variance than technical alignment.
Edit: I thought about it more and I disagree with my original formulation of the disagreement. I think “buying time” is more likely to be net negative than alignment research, but also that alignment research is usually not very helpful.
Rob puts it well in his comment as “social coordination”: if someone tries “buying time” interventions and fails, then, largely because of social effects, the poorly done interventions can both fail to buy time and preclude further coordination with mainstream ML, for a net-negative overall effect.
On the other hand, technical alignment does not have this risk.
I agree that technical alignment does carry the risk of accelerating timelines, though.
But if someone tries technical alignment and fails to produce results, that has no impact compared to a counterfactual where they just did web dev or something.
My reference point here is the anecdotal disdain (from Twitter and YouTube; I can DM examples if you want) that some in the ML community have for anyone they perceive to be slowing them down.
I see! Yes, I agree that more public “buying time” interventions (e.g. outreach) could be net negative. However, for the average person entering AI safety, I think there are less risky “buying time” interventions that are more useful than technical alignment.
I think this post should probably be edited to put “focus on low-risk interventions first” in bold in the first sentence, right next to the pictures, because the most careless people (possibly like me...) are the ones who will read that and not the current caveats.
You’d be well able to work out the risk on your own, though, if you were seriously considering any big outreach efforts. I think people should still have a strong prior toward action for anything that looks promising to them. :)
An addendum is then: if “buying time” interventions are conjunctive (i.e., one can cancel out the effect of the others) while technical alignment is disjunctive,
and if the distribution of people performing both kinds of intervention is mostly toward the lower end of thoughtfulness/competence (which, in my opinion, we should expect),
then technical alignment is a better recommendation for most people (a toy sketch of this is below).
In fact, this suggests that the graph in the post should be reversed (with the axis at the bottom being social competence rather than technical competence).
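A toy numerical sketch of the conjunctive/disjunctive point (the functional forms, threshold, and population size below are assumptions made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population of 1,000 contributors, with individual impacts skewed
# toward the low end of thoughtfulness/competence.
impacts = rng.normal(loc=0.1, scale=1.0, size=1_000)

# Disjunctive (technical alignment): failed attempts contribute roughly
# nothing, successes add up, so total value is ~ the sum of positive impacts.
disjunctive_total = impacts[impacts > 0].sum()

# Conjunctive (buying time): one sufficiently botched intervention can poison
# coordination and cancel the others, so the total collapses if any single
# contributor falls below some (assumed) threshold of harm.
HARM_THRESHOLD = -2.5
conjunctive_total = impacts.sum() if impacts.min() > HARM_THRESHOLD else 0.0

print(f"disjunctive: {disjunctive_total:.1f}, conjunctive: {conjunctive_total:.1f}")
```

Under these assumptions the disjunctive total stays large no matter how many individual attempts flop, while the conjunctive total collapses as soon as one draw crosses the threshold, which is the sense in which the recommendation flips for a mostly-lower-competence population.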
It’s not clear to me that the variance of “being a technical researcher” is actually lower than that of “being a social coordinator”. Historically, quite a lot of capabilities advancements have come out of efforts that were initially intended to be alignment-focused.
Edited to add: I do think it’s probably harder to have a justified inside-view model of whether one’s efforts are directionally positive or negative when attempting to “buy time”, as opposed to “doing technical research”, if one actually makes a real effort in both cases.
Would you be able to give tangible examples where alignment research has advanced capabilities? I’ve no doubt it’s happened due to alignment-focused researchers being chatty about their capabilities-related findings, but I don’t know of specific examples.
There’s obviously substantial disagreement here, but consider the most recent salient example (and arguably the entire surrounding context of OpenAI as an institution).
I’m not sure what Rob is referring to, but there are a fair few examples of orgs’ or people’s purposes slipping from alignment to capabilities, e.g. OpenAI.
I myself find it surprisingly difficult to focus on ideas that are robustly beneficial to alignment but not to capabilities.
(E.g., I have a bunch of interpretability ideas, but interpretability can only either have no impact on timelines or accelerate them.)
Do you know if any of the alignment orgs have some kind of alignment-research NDA, with a panel that allows alignment-only ideas to be made public but keeps the maybe-capabilities ideas private?
Do you mean you find it hard to avoid thinking about capabilities research or hard to avoid sharing it?
It seems reasonable to me that you’d actually want to try to advance the capabilities frontier privately, to yourself, so that you’re better able to understand the system you’re trying to align and can better predict what’s likely to be dangerous.
Thinking about it.
That’s a reasonable point. The way this would be reflected in the above graph is wider uncertainty around technical alignment at the high end of researcher ability.