What’s the difference between “P(Alignment | Humanity creates an SFC)” and “P(Alignment AND Humanity creates an SFC)”?
I will try to explain it more clearly. Thanks for asking.
P(Alignment AND Humanity creates an SFC) = P(Alignment | Humanity creates an SFC) x P(Humanity creates an SFC)
So the difference is that when you optimize for P(Alignment | Humanity creates an SFC), you are no longer optimizing for the term P(Humanity creates an SFC), which is included in the conjunctive probability.
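To make this concrete, here is a minimal numerical sketch (the probabilities are made up for illustration only, not estimates): an intervention that only raises P(Humanity creates an SFC) increases the conjunctive probability but leaves the conditional target untouched.

```python
# Toy illustration of optimizing the conditional vs. the conjunctive target.
# All numbers are made up for illustration only.

p_sfc_before, p_sfc_after = 0.5, 0.8  # P(Humanity creates an SFC), raised by some intervention
p_align_given_sfc = 0.3               # P(Alignment | Humanity creates an SFC), unchanged

# Conjunctive target: P(Alignment AND Humanity creates an SFC)
joint_before = p_align_given_sfc * p_sfc_before  # 0.15
joint_after = p_align_given_sfc * p_sfc_after    # 0.24

print(joint_before, joint_after)  # the conjunctive probability goes up...
print(p_align_given_sfc)          # ...while the conditional target is unchanged
```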
Can you maybe run us through 2 worked examples for bullet point 2? Like what is someone currently doing (or planning to do) that you think should be deprioritised? And presumably, there might be something that you think should be prioritised instead?
Bullet point 2 is: (ii) Deprioritizing, to some degree, AI Safety agendas that mostly increase P(Humanity creates an SFC) but do not increase P(Alignment | Humanity creates an SFC) much.
Here are some speculative examples. The degree to which these priorities should be updated is open to debate; I only claim that they may need to be updated, conditional on the hypotheses being significantly correct.
AI Misuse reduction: If the paths to impact (PTIs) are (a) to prevent extinction through misuse and chaos, (b) to prevent the loss of alignment power resulting from a more chaotic world, and (c) to provide more time for Alignment research, then it is plausible that the PTI (a) would become less impactful.
Misaligned AI Control: If the PTIs are (c) as above, (d) to prevent extinction through controlling early misaligned AIs trying to take over, (e) to control early misaligned AIs to make them work on Alignment research, and (f) to create fire alarms (note: this somewhat contradicts the path (b) above), then it is plausible that the PTI (d) would be less impactful, since these early misaligned AIs may have a higher chance of not creating an SFC after taking over (e.g., they don't survive destroying humanity or don't care about space colonization).
Here is another, vaguer dilution effect: If an intervention, like AI control, increases P(Humanity creates an SFC | Early Misalignment), then this intervention may need to be discounted more than if it only increased P(Humanity creates an SFC). Changing P(Humanity creates an SFC) may have no impact when the hypotheses are significantly correct, but increasing P(Humanity creates an SFC | Misalignment) is net negative, and Early Misalignment and (Late) Misalignment may be strongly correlated (see the toy sketch after these examples).
AI evaluations: The reduced impact of (a) and (d) may also reduce the overall importance of this agenda.
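Here is a toy expected-value sketch of the dilution effect mentioned above, with made-up numbers. It is a deliberately simplified model: it only encodes the assumption (per the hypotheses) that a misaligned SFC is net negative relative to no SFC, and it treats Early Misalignment and (Late) Misalignment as coinciding for simplicity.

```python
# Toy expected-value model of the dilution effect.
# All numbers are made up for illustration; they are not estimates from the post.
# Assumptions: a misaligned SFC is net negative relative to no SFC, and
# Early Misalignment and (Late) Misalignment coincide (perfect correlation).

V_ALIGNED_SFC = 1.0      # value of an aligned SFC
V_MISALIGNED_SFC = -1.0  # value of a misaligned SFC (net negative under the hypotheses)
V_NO_SFC = 0.0           # value of no SFC

def expected_value(p_sfc_given_misaligned, p_sfc_given_aligned, p_aligned):
    """Expected value over the four (alignment, SFC) branches."""
    p_misaligned = 1 - p_aligned
    return (
        p_aligned * p_sfc_given_aligned * V_ALIGNED_SFC
        + p_misaligned * p_sfc_given_misaligned * V_MISALIGNED_SFC
        + (p_aligned * (1 - p_sfc_given_aligned)
           + p_misaligned * (1 - p_sfc_given_misaligned)) * V_NO_SFC
    )

p_aligned = 0.4  # P(Alignment), held fixed by the intervention in this sketch

# Baseline vs. an intervention (e.g., AI control) that raises
# P(Humanity creates an SFC | Early Misalignment) from 0.3 to 0.6.
baseline = expected_value(0.3, 0.8, p_aligned)        # 0.14
with_control = expected_value(0.6, 0.8, p_aligned)    # -0.04

print(baseline, with_control)  # expected value drops: the extra SFC mass lands in the negative branch
```

In this sketch, the intervention does not change P(Alignment) at all; it only shifts probability mass toward "SFC created" in the misaligned branch, which is exactly why it gets discounted under the hypotheses.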
These updates are, at the moment, speculative.