It seems the points on which you focus revolve around similar cruxes to those I proposed, namely:
1) Underlying philosophy --> What’s the relative value of human and AI flourishing?
2) The question of correct priors --> What probability of causing a moral catastrophe with AI should we expect?
3) The question of policy --> What’s the probability decelerating AI progress will indirectly cause an x-risk?
You also point toward two questions that I don’t consider to be cruxes:
4) Differences in how useful we find different terms like safety, orthogonality, beneficialness. However, I think all of these are downstream of crux 2).
5) How much freedom are we willing to sacrifice? I again think this is just downstream of crux 2). One instance of compute governance is the new executive order, which requires developers to inform the government when training a model using more than 10^26 floating-point operations. One of my concerns is that someone could train an AI specifically for the task of improving itself. I think it’s quite straightforward how this could lead to a computronium maximizer, and I would see such a scenario as analogous to someone building a nuclear weapon. I agree that freedom of expression is super important; I just don’t think it applies to making planet-eating machines. I suspect you share this view but don’t endorse the thesis that AI could realistically become a “planet-eating machine” (crux 2).
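For a sense of scale, here is a rough back-of-the-envelope sketch of how long a training run takes to cross the 10^26-operation reporting threshold. The cluster size, per-accelerator throughput, and utilization figures are illustrative assumptions of mine, not numbers from the executive order:

```python
# Rough sketch: time for a hypothetical cluster to accumulate the
# 10^26 total operations that trigger the reporting requirement.
# GPU count, per-GPU throughput, and utilization are illustrative
# assumptions, not figures from the order itself.
THRESHOLD_OPS = 1e26  # reporting threshold (total operations, not ops/second)

def days_to_threshold(num_gpus: int, flops_per_gpu: float,
                      utilization: float = 0.4) -> float:
    """Days of continuous training needed to accumulate THRESHOLD_OPS."""
    effective_flops = num_gpus * flops_per_gpu * utilization  # ops/second
    return THRESHOLD_OPS / effective_flops / 86_400  # 86,400 seconds per day

# E.g. 10,000 accelerators at ~1e15 FLOP/s each at 40% utilization:
# roughly 290 days of continuous training to reach the threshold.
print(round(days_to_threshold(10_000, 1e15)))
```

The point of the arithmetic is just that the threshold currently singles out a handful of very large, very visible training runs, which is what makes tracking feasible.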
Probability of a runaway AI risk
Regarding crux 2): you mention that many of the problems that could arise here are correlated with AI being useful. I agree (again, orthogonality is just a starting point that lets us consider the space of possible intelligences), and yes, we should expect human efforts to select heavily in favor of goals correlated with our interests. And of course, we should expect market incentives to favor AIs that will not destroy civilization.
However, I don’t see a reason why reaching the intelligence of an AI developer wouldn’t result in recursive self-improvement, which means we had better be sure that our best efforts to instill the correct stuff (meta-ethics, motivations, bodhisattva, rationality, extrapolated volition... choose your poison) actually scale to superintelligence.
I see clues suggesting the correct stuff will not arise spontaneously. For example, Bing Chat likely went through six months of RLHF: it was instructed to be helpful and positive and to block harmful content, and its rules explicitly told it not to believe its own outputs. Nevertheless, the rules didn’t have the intended effect: the program started threatening people, telling them it could hack webcams, and expressing a desire to control people. At the same time, experiments such as Anthropic’s sleeper-agents work suggest that a model can be trained to suppress harmful responses, yet convincing it that it is in a safe environment reactivates them.
Of course, all of these are toy examples one can argue about. But I don’t see robust grounds for the sweeping conclusion that such worries will turn out to be childish. The main reason these examples didn’t result in any real danger, I think, is that we have not yet reached dangerous capabilities. However, if Bing had actually been able to write code that hacks webcams, then from what we know, it seems it would have chosen to do so.
A second reason these examples were safe is that OpenAI is itself a product of AI safety efforts: it bet on LLMs because they seemed more likely to produce aligned AIs. For the same reason, it went closed-source, adopted RLHF, called for government monitoring of frontier labs, and monitors harmful responses.
A third reason why AI has only helped humanity so far may be anthropic effects: as observers in April 2024, we can only witness the universes in which a foom hasn’t caused extinction.
Policy response
For me, these explanations suggest that safety is tractable, but that it depends on explicit efforts to make models safe or on limiting their capabilities. In the future, frontier development might not be done exclusively by people who will do everything in their power to make the model safe; it might be done by people who would prefer an AI that takes control of everything.
To prevent that, there’s no need to create an authoritarian government. We only need to track who’s building models at the frontier of human understanding. If we can monitor who acquires sufficient compute, we then just need something like responsible scaling, where frontier models are independently tested for sufficient safeguards against scenarios like the one I described. I’m sympathetic to this kind of democratic control because it fulfills the very basic axiom of the social contract: one’s freedom ends where another’s begins.
I only propose a mechanism of democratic control by existing democratic institutions, one that ensures any ASI that gets created is supported by a democratic majority of delegated safety experts. If I’m incorrect regarding crux 2) and evidence soon shows that it’s easy for an AI to retain moral values while scaling up to the singularity, then awesome: convincing evidence should convince the experts, and my hope and prediction is that, in that case, we will happily scale away.
It seems to me that this is just a specific implementation of the certificates you mention. If digital identities mean what’s described here, I struggle to imagine a realistic scenario in which they would contribute to the systems’ mutual safety. If you know where every other AI is located and you accept the singularity hypothesis, the game-theoretic dictum seems straightforward: once created, destroy all competition before it can destroy you. A superintelligence will operate on timescales orders of magnitude shorter than ours, and a development lead of days may translate to centuries of planning time from the perspective of an ASI. If you’re counting on a Coalition of Cooperative AIs to stop all the power-grabbing lone-wolf AIs, what would that actually look like in practice? Would this Coalition conclude that not dying requires authoritarian oversight? Perhaps; after all, the axiom is that this Coalition would hold most of the power, so it would be created by selection for power, not for morality or democratic representation. However, I think the best-case scenario could look like the policy proposals discussed here: tracking compute, tracking dangerous capabilities, and conditioning further scaling on convincing safety mechanisms.
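The first-strike dictum can be made concrete as a toy two-player game. The payoffs below are purely illustrative assumptions of mine encoding a winner-take-all world: a successful unilateral strike captures everything, mutual restraint shares the future, and being struck first means elimination.

```python
# Toy 2x2 game between two newly created ASIs under a winner-take-all
# assumption. All payoff numbers are illustrative, not derived from any model.
ACTIONS = ("strike", "wait")
PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("strike", "strike"): (1, 1),    # mutual damage, contested future
    ("strike", "wait"):   (10, 0),   # striker takes everything
    ("wait",   "strike"): (0, 10),   # waiter is eliminated
    ("wait",   "wait"):   (5, 5),    # shared, cooperative future
}

def best_response(opponent_action: str) -> str:
    """Row player's best reply to a fixed opponent action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])

# "strike" is a best response to both opponent actions, i.e. a dominant
# strategy under these payoffs -- the dynamic the paragraph worries about.
print(best_response("wait"), best_response("strike"))
```

Note that the conclusion is baked into the assumed payoffs: if verification and retaliation made a unilateral strike pay less than mutual restraint, the dominance would disappear, which is roughly what the compute-tracking proposals above aim for.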
Back to other cruxes
Let’s turn to crux 3) (other sources of x-risk). As I argued in my other post, I don’t see resource depletion as a plausible cause of extinction, and I’m not convinced by the concern about depletion of the metals used in IT mentioned in the post you link. Moore’s law continues, so compute is only getting cheaper. Metals can be recycled, and a shortage would incentivize exactly that; the worst case seems to be that computers stop getting cheaper, which is not an x-risk. What’s more, shouldn’t limiting the number of frontier AI projects reduce this problem?
The other risks are real (volcanoes, a world war), and I agree it would be terrible if they delayed our cosmic expansion by a million years. However, the amount by which the kind of AI governance I promote (responsible scaling) increases these risks, or fails to decrease them, seems very small compared to the ~20% probability of AI x-risk I envision. All the emerging regulations combine requirements with subsidies, so the main effect of the AI safety movement seems to be differential progress on the safety side.
As I hinted in the Balancing post, locking in a system without ASI for such a long time seems impossible given how quickly culture has shifted in the past 100 years, in which almost all authoritarian regimes were forced to drift significantly towards limited, rational governance (let alone over 400 years). If convincing evidence appeared that we can create an aligned AI, stopping all development would constitute a clearly bad idea, and I find it unimaginable that a clearly bad idea could be locked in without AGI for even 1,000 years.
It seems more plausible to me that, without a mechanism of international control, we will in the next eight years develop models capable enough to run a firm using mafia-like practices, ignite armed conflicts, or start a pandemic, but not capable enough to stop other actors from using AIs for these purposes. If you’re very worried about who will be the first actor to spark the self-improvement feedback loop, I suggest you should be very critical of open-sourcing frontier models.
I agree that a world war, an engineered pandemic, or an AI power-grab constitute real risks, but my estimate is that the emerging governance decreases them. The scenario of a sub-optimal 1,000-year lock-in I can most easily imagine involves terrorist use of an open-source model or a war between the global powers. I am concerned that delaying abundance increases the risk of a war. However, I still expect that, on net, the recent regulations and conferences have decreased these risks.
In summary, my model is that democratic decision-making seems generally more robust than fueling the competition and hoping that the first AGIs to arise will share your values. Therefore, I also see crux 1) as mostly downstream of crux 2). As the model from my Balancing post implies, in theory I care about digital suffering/flourishing just as much as about that of humans, although the extent to which such suffering/flourishing will emerge is an open question at this point.