Other currently neglected agendas may increase P(Alignment | Humanity creates an SFC) while not increasing P(Alignment AND Humanity creates an SFC). Those include agendas aiming to decrease P(Humanity creates an SFC | Misalignment). An example of an intervention within such an agenda is overriding instrumental goals for space colonization and replacing them with an active desire not to colonize space. This defensive preference could be removed later, conditional on achieving corrigibility.
What’s the difference between “P(Alignment | Humanity creates an SFC)” and “P(Alignment AND Humanity creates an SFC)”?
I will try to explain it more clearly. Thanks for asking.
P(Alignment AND Humanity creates an SFC) = P(Alignment | Humanity creates an SFC) × P(Humanity creates an SFC)
So the difference is that when you optimize for P(Alignment | Humanity creates an SFC), you no longer optimize for the term P(Humanity creates an SFC), which is included in the joint probability.
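Here is a minimal numeric sketch of that point (the probabilities and variable names below are made up purely for illustration, not estimates):

```python
# Toy numbers, purely illustrative.
p_sfc = 0.5               # P(Humanity creates an SFC)
p_align_given_sfc = 0.2   # P(Alignment | Humanity creates an SFC)

p_joint = p_align_given_sfc * p_sfc   # P(Alignment AND Humanity creates an SFC) = 0.1

# An intervention that only makes an SFC more likely, without making it more aligned:
p_sfc_boosted = 0.8
p_joint_boosted = p_align_given_sfc * p_sfc_boosted   # = 0.16

# The joint probability went up, but P(Alignment | Humanity creates an SFC)
# is unchanged at 0.2: the intervention optimized the conjunction, not the conditional.
```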
Can you maybe run us through 2 worked examples for bullet point 2? Like what is someone currently doing (or planning to do) that you think should be deprioritised? And presumably, there might be something that you think should be prioritised instead?
Bullet point 2 is: (ii) Deprioritizing, to some degree, AI Safety agendas that mostly increase P(Humanity creates an SFC) but do not much increase P(Alignment | Humanity creates an SFC).
Here are some speculative examples. The degree to which their priorities should be updated is open to debate; I only claim that they may need to be updated, conditional on the hypotheses being significantly correct.
AI Misuse reduction: If the PTIs are (a) to prevent extinction through misuse and chaos, (b) to prevent the loss of alignment power that would result from a more chaotic world, and (c) to provide more time for Alignment research, then it is plausible that PTI (a) would become less impactful.
Misaligned AI Control: If the PTIs are (c) as above, (d) to prevent extinction by controlling early misaligned AIs trying to take over, (e) to control misaligned early AIs to make them work on Alignment research, and (f) to create fire alarms (note: this somewhat contradicts path (b) above), then it is plausible that PTI (d) would be less impactful, since these early misaligned AIs may have a higher chance of not creating an SFC after taking over (e.g., they do not survive destroying humanity, or they do not care about space colonization).
Here is another vague, diluted effect: If an intervention, like AI control, increases P(Humanity creates an SFC | Early Misalignment), then this intervention may need to be discounted more than if it only increased P(Humanity creates an SFC). Increasing P(Humanity creates an SFC) may have no impact when the hypotheses are significantly correct, but increasing P(Humanity creates an SFC | Misalignment) is net negative, and Early Misalignment and (Late) Misalignment may be strongly correlated (see the toy sketch after these examples).
AI evaluations: The reduced impact of (a) and (d) may also reduce the overall importance of this agenda.
These updates are, at the moment, speculative.
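To make the diluting effect above concrete, here is a minimal sketch under toy assumptions: an aligned SFC is worth +1, a misaligned SFC is worth some non-positive value, and no SFC is worth 0. The function, parameters, and numbers are illustrative placeholders, not estimates from the post.

```python
# Toy expected-value model of the "diluting effect"; all numbers are placeholders.
def expected_value(p_misalignment, p_sfc_given_aligned, p_sfc_given_misaligned,
                   v_misaligned_sfc):
    # Value of an aligned SFC is +1, of no SFC is 0, of a misaligned SFC is v_misaligned_sfc <= 0.
    p_aligned = 1 - p_misalignment
    ev_aligned = p_aligned * p_sfc_given_aligned * 1.0
    ev_misaligned = p_misalignment * p_sfc_given_misaligned * v_misaligned_sfc
    return ev_aligned + ev_misaligned

baseline = expected_value(p_misalignment=0.6, p_sfc_given_aligned=0.7,
                          p_sfc_given_misaligned=0.5, v_misaligned_sfc=-0.5)

# An intervention that mostly raises P(Humanity creates an SFC | Misalignment):
with_intervention = expected_value(p_misalignment=0.6, p_sfc_given_aligned=0.7,
                                   p_sfc_given_misaligned=0.8, v_misaligned_sfc=-0.5)

print(baseline, with_intervention)  # ~0.13 vs ~0.04: raising P(SFC | Misalignment)
                                    # lowers expected value when misaligned SFCs are net negative.
```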
Well, at a technical level, the first is a conditional probability and the second is an unconditional probability of a conjunction. So the first is to be read as “the probability that alignment is achieved, conditional on humanity creating a spacefaring civilization”, whilst the second is “the probability that the following happens: alignment is solved and humanity creates a spacefaring civilization”. If you think of probability as a space, where the likelihood of an outcome = the proportion of the space it takes up, then:
-the first is the proportion of the region of probability space taken up by humanity creating a space-faring civilization in which alignment occurs.
-the second is the proportion of the whole of probability space in which both alignment occurs and humanity creates a space-faring civilization.
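For instance, with made-up numbers: if the region where humanity creates a space-faring civilization takes up 50% of probability space and alignment occurs in half of that region, then P(Alignment | Humanity creates an SFC) = 0.5, while P(Alignment AND Humanity creates an SFC) = 0.25 of the whole space.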
But yes, knowing that does not automatically bring real understanding of what’s going on. Or at least for me it doesn’t. Probably the whole idea being expressed would be better written up much more informally, focusing on a concrete story of how particular actions taken by people concerned with alignment might surprisingly be bad or suboptimal.
Thanks David, that makes sense :)