Let me dox myself as the addressee. :) Many thanks for the response. I really value that you take seriously the possible overlap of policies and research agendas covered by AI safety and your own approach.
I totally agree that “control is a proxy goal” and I believe the AI safety mainstream does as well, as it’s the logical consequence of Bostrom’s principle of epistemic deference. Once we have an AI that reliably performs tasks in the way they were intended, the goal should be to let it shape the world according to the wisest interpretation of morality it will find. If you tried to formalize this framing, as well as the proposal to inject it with “universal loving care”, I find it very likely that you would build the same AI.
So I think our crux doesn’t concern values, which is a great sign of a tractable disagreement. I also suppose we could agree on a simple framework of factors that would be harmful on the path to this goal from the perspectives of:
a) safety (AI self-evolves to harm) b) power / misuse (humans do harm with AI) c) sentience (AI is harmed) d) waste (we fail to prevent harm)
Here’s my guess on how the risks compare. I’d be most curious whether you’d be able to say if the model I’ve sketched out seems to track your most important considerations, when evaluating the value of AI safety efforts—and if so, which number would you dispute with the most certainty.
One disclaimer: I think it’s more helpful to think about specific efforts, rather than comparing the AI safety movement on net. Policy entails a lot of disagreement even within AI safety and a lot of forces clashed at the negotiations around the existing policies. I mentioned that I like the general, value-uncertain framework of the EU AI act but the resulting stock of papers isn’t representative of typical AI safety work.
In slight contrast, the community widely agrees that technical AI safety research would be good if successful. I’d argue that would manifest in a robust a decrease of risk in all of the highlighted perspectives (a-d). Interpretability, evals and scaling all enable us to resolve the disagreements in our predictions regarding the morality of emergent goals and of course, work on “de-confusion” about the very relationship between goals, intelligence and morality seems beneficial regardless of our predictions and to also quite precisely match your own focus. :)
So far, my guess is that we mostly disagree on
1) Do the political AI safety efforts lead to the kind of centralization of power that could halt our cosmic potential?
I’d argue the emerging regulation reduces misuse / power risks in general. Both US and EU regulations combine monitoring of tech giants with subsidies, which is a system that should accelerate beneficial models, while decelerating harmful ones. This system, in combination with compute governance, should also be effective in the misuse risks posed by terrorists and random corporations letting superhuman AIs with random utility functions evolve with zero precautions.
2) Would [a deeply misaligned] AGI be “stupid” to wipe out humans, in its own interest?
I don’t see a good reason to but I don’t think this is the important question. We should really be asking: Would a misaligned AGI let us fulfill the ambition of longtermism (of optimally populating cosmos with flourishing settlements)?
3) Is it “simple stuff” to actually put something like “optimal morality” or “universal loving care” into code of a vastly more intelligent entity, which is so robust that we can entrust it our cosmic potential?
We may actually disagree on more than was apparent from my above post..!
Offline, we discussed how people’s judgments vary depending on whether they’ve been reflecting on death recently or not. To me, it often seems as if our views on these topics can be majorly biased by personal temperaments. There could be a correlation between general risk tolerance and avoidance? Dan Faggella has an Intelligence Trajectory Political Matrix with two dimensions: authoritarian ↔ libertarian and bio-conservative ↔ cosmist/transhuman. I’m probably around C2 (thus leading to being more d/acc or BGI/acc than e/acc? 😋).
How to deal with uncertainty seems to be another source of disagreement. When is the uniform prior justified? I grew up with discussions about the existence of God: “well, either he exists or he doesn’t, so 50:50!” But which God? So now the likelihood of there being no God goes way down! Ah, ah, but what about the number of possible universes in which there are no Gods? Perhaps the likelihood of any Gods goes way down now? — in domains where there’s uncertainty as to how to even partition up the state space, it could be easy fall for motivated reasoning by assigning a partition that favors one’s own prior judgments. A moral non-cognitivist would hold that moral claims are neither true nor false, so assigning 50% to moral claims would be wrong. Even a moral realist could assert that not every moral claim needs to have a well-defined truth value.
Anecdotally, many people do not assign high credence to working with non-well-founded likelihood estimates as a reasoning tool.
Plenty of people caution against overthinking and that additional reflections don’t always help as much as geeky folk like to think. One may come up with whole lists of possible concerns only to realize that almost all of them were actually irrelevant. Sometimes we need to go out and gain more experience to catalyze insights!
Thus there’s plenty of room for temperamental disagreement about how to approach the topic before we even begin 🤓.
Our big-picture understanding also has a big effect. Joscha Bach said humanity will likely go extinct without AI anyway. He mentions supervolcano eruptions and large-scale war. There are also resource concerns in the long run, e.g., peak oil and depleting mineral supplies for IT manufacturing. Our current opportunity may be quite special prior to needing to enter a different sustainable mode of civilization! Whereas if you’re happy to put off developing AGI for 250 million years until we get it right, it should be no surprise you take a different approach here. I was surprised to see that Bostrom also expresses concern that now people might be too cautious about AGI, leading to not developing AGI prior to facing other x-risks.
[And, hey, what if our universe is actually one that supports multiple incarnations in some whacky way? Should this change the decisions we make now? Probably some....]
I think the framework and ontology we use can also lead to confusion. “Friendly AI” is a poor term, for example, which Yudkowsky apparently meant to denote “safe” and “useful” AI. We’ll see how “Beneficial AGI” fares. I think “AI Safety” is a misnomer and confusing catchall term. Speculating about what a generic ASI will do seems likely to lead to confusion, especially if excessive credence is given to such conclusions.
It’s been a bit comedic to watch from the sidelines as people aim to control generic superintelligences before giving up as it seems intractable or infeasible (in general). I think trying to actually build such safety mechanisms can help, not just reflecting on it 😉🤓.
Of course, safety is good by definition, so any successful safety efforts will be good (unless it’s safety by way of limiting our potential to have fun, develop, and grow freely 😛). Beneficial AGIs (BGI) are also good by definition, so success is necessarily good, regardless of whether one thinks consciously aiming to build and foster BGI is a promising approach.
On the topic of confusing ontologies, I think the “orthogonality thesis” can cause confusion and may bias people toward unfounded fears. The thesis is phrased as an “in principle possibility” and then used as if orthogonality is the default. A bit of a sleight-of-hand, no? As you mentioned, the thesis doesn’t rule out a correlation between goals and intelligence. The “instrumental convergence thesis” that Bostrom also works with itself implies a correlation between persistent sub-goals and intelligence. Are we only talking about intelligent systems who slavishly follow single top-level goals where implicit sub-goals are not worth mentioning? Surely not. Thus we’d find that intelligence and goals are probably not orthogonal, setting theoretical possibilities aside. Theoretically, my soulmate could materialize out of thin air in front of me—very low likelihood! So the thesis is very hard to agree with in all but a weak sense that leaves it as near meaningless.
Curiously, I think people can read too much into instrumental convergence, too, when sketching out the endless Darwinian struggle for survival. What if AGIs and ASIs need to invest exponentially little of their resources in maintaining their ongoing survival? If so, then even if such sub-goals will likely manifest in most intelligent systems, it’s not such a big concern.
The Wikipedia page on the Instrumental Convergence idea stipulates that “final goals” will have “intrinsic value”, which is an interesting conflation. This suggests that the final goals are not simply any logically formulated goal that is set into the AI system. Can any “goal” have intrinsic value for a system? I’m not sure.
The idea of open ended intelligence invites one to explore other directions than both of these theses 😯🤓.
As to your post on Balancing Safety and Waste, in my eyes, the topic doesn’t even seem to be on “human safety from AI”! The post begins by discussing the value of steering the future of AI, estimating that we should expect better futures (according to our values) if we make a conscious effort to shape our trajectory. Of course, if we succeed in doing this effectively, we will probably be safe. Yet the topic is much broader.
It’s worth noting that the greater good fallacy is a thing: trying to rapidly make big changes for great good can backfire. Which, ironically applies to both #PauseAI and #E/ACC folk. Keep calm and carry on 😎🤖.
I agree that ‘alignment’ is about more than ‘control’. Nor do we wish to lock-in our current values and moral understanding to AGI systems. We probably wish to focus on an open-ended understanding of ethics. Kant’s imperative is open-ended, for example: the rule replaces itself once a better one is found. Increasing human control of advanced AI systems does not necessarily guarantee positive outcomes. Likewise, increasing the agency and autonomy of AGIs does not guarantee negative outcomes.
One of the major points from Chi’s post that I resonate with goes beyond “control is a proxy goal”. Many of the suggestions fall under the header of “building better AGIs”. That is, better AGIs should be more robust against various feared failure modes. Sometimes a focus on how to do something well can prevent harms without needing to catalog every possible harm vector.
Perhaps if focusing more on the kinds of futures we wish to live in and create instead of fear of dystopian surveillance, we wouldn’t make mistakes such as in the EU AI Act where they ban emotion recognition at work and education, blocking out many potentially beneficial roles for AI systems. Not to mention, I believe work on empathic AI entering into co-regulatory relationships with people is likely to bias us toward beneficial futures, too!
I’d say this is an example of safety concerns possibly leading to harmful, overly strong regulations being passed.
(Mind uploads would probably qualify as “AI systems” under the act, too, by my reading. #NotALegalExpert, alas. If I’m wrong, I’ll be glad. So please lemme know.)
As for a simple framework, I would advocate first looking at how we can extend our current frameworks for “Human Safety” (from other humans) to apply to “Human Safety from AIs”. Perhaps there are many domains where we don’t need to think through everything from scratch.
As I mentioned above, David Brin suggests providing certain (large) AI systems with digital identities (embedded in hardware) so that we can hold them accountable, leveraging the systems for reciprocal accountability that we already have in place.
Humans are often required to undergo training and certification before being qualified for certain roles, right? For example, only licensed teachers can watch over kids at public schools (in some countries). Extending certification systems to AIs probably makes sense in some domains. I think we’ll eventually need to set up legal systems that can accommodate robot/AI rights and digital persons.
Next, I’d ask where we can bolster and improve our infrastructure’s security in general. Using AI systems to train people against social engineering is cool, for example.
The case study of deepfakes might be relevant here. We knew the problem was coming, yet the issue seemed so far off that we weren’t very incentivized to try to deal with it. Privacy concerns may have played a part in this reluctance. One approach to a solution is infrastructure for identity (or pseudonymity) authentication, right? This is a generic mechanism that can be helpful to prevent human-fraud, too, not just AI-fraud. So, to me, it seems dubious whether this should qualify as an “AI Safety” topic. What’s needed is to improve our infrastructure, not to develop some special constraint on all AI systems.
As an American in favor of the right to free speech, I hope we protect the right to the freedom of computation, which in the US could perhaps be based on free speech? The idea of compute governance in general seems utterly repulsive. The fact that you’re seriously considering such approaches under the guise of “safety” suggests there are deep underlying disagreements prior to the details of this topic. I wonder if “freedom of thought” can also help us in this domain.
The idea to develop AGI systems with “universal loving care” (which is an open-ended ‘goal’) is simple at the high-level. There’s a lot of experimental engineering and parenting work to do, yet there’s less incentive to spend time theorizing about some of the usual “AI Safety” topics?
I’m probably not suited for a job in the defense sector where one needs to map out all possible harms and develop contingency plans, to be honest.
As a framework, I’d suggest something more like the following:
a) How can we build better generally intelligent systems? -- AGIs, humans, and beyond!
b) What sorts of AGIs would we like to foster? -- diversity or uniformity? Etc ~
c) How can we extend “human safety” mechanisms to incorporate AIs?
d) How can we improve the security and robustness of our infrastructure in the face of increasingly intelligent systems?
e) Catalog specific AI-related risks to deal with on a case-by-case basis.
I think that monitoring the development of the best (proto)-AGI systems in our civilization is a special concern, to be honest. We probably agree on setting up systems to transparently monitor their development in some form or another.
We should probably generalize from “human safety” to, at least, “sentient being safety”. Of course, that’s a “big change” given our civilizations don’t currently do this so much.
In general, my intuition is that we should deal with specific risks closer to the target domain and not by trying to commit mindcrime by controlling the AGI systems pre-emptively. For example, if a certification program can protect against domain-specific AI-related risks, then there’s no justification for limiting the freedom of AGI systems in general to “protect us”.
What do you think about how I’d refactor the framework so that the notion of “AI Safety” almost vanishes?
It seems the points on which you focus revolve around similar cruxes to those I proposed, namely:
1) Underlying philosophy --> What’s the relative value of human and AI flourishing?
2) The question of correct priors --> What probability of a causing a moral catastrophe with AI should we expect?
3) The question of policy --> What’s the probability decelerating AI progress will indirectly cause an x-risk?
You also point in the direction of two questions, which I don’t consider to be cruxes:
4) Differences in how useful we find different terms like safety, orthogonality, beneficialness. However, I think all of these are downstream of crux 2).
5) How much freedom are we willing to sacrifice? I again think this is just downstream of crux 2). One instance of compute governance is the new executive order, which requires to inform the government about training a model on > 10^26 flop/s. One of my concerns is that someone just could train an AI specifically for the task of improving itself. I think it’s quite straightforward how this could lead to a computronium maximizer and how I would see such scenario as analogous to someone making a nuclear weapon. I agree that freedom of expression is super important, I just don’t think it applies to making planet-eating machines. I suspect you share this view but just don’t endorse the thesis that AI could realistically become a “planet-eating machine” (crux 2).
Probability of a runaway AI risk
So regarding crux 2) - you mention that many of the problems that could arise here are correlated with a useful AI. I agree—again, orthogonality is just a starting point to allow us to consider possible forms of intelligence—and yes, we should expect human efforts to heavily select in favor of goals correlated with our interests. And of course, we should expect that the market incentives favor AIs that will not destroy civilization.
However, I don’t see a reason why reaching the intelligence of an AI developer wouldn’t result in a recursive self-improvement, which means that we should better be sure that our best efforts to implement it with the correct stuff (meta-ethics, motivations, bodhisattva, rationality, extrapolated volition...choose your poison) actually scale to superintelligence.
I see clues that suggest the correct stuff will not arise spontaneously. E.g. Bing Chat likely went through 6 months of RLHF, it was instructed to be helpful and positive and to block harmful content and its rules explicitly informed it that it shouldn’t believe its own outputs. Nevertheless, the rules didn’t seem to reach the intended effect, as the program started threatening people, telling them it can hack webcams and expressing desire to control people. At the same time, experiments such as the Anthropic one suggest that training can create sleeper agents that are trained to suppress harmful responses, even though convincing the model it’s in a safe environment results in activating them.
Of course, all of these are toy examples one can argue about. But I don’t see robust grounds for the sweeping conclusion that such worries will turn out to be childish. The reason I think these examples didn’t result in any real danger was mostly because we have not yet reached dangerous capacities. However, if Bing would actually be able to write a bit of code, that could hack webcams, from what we know, it seems it would choose to do so.
A second reason why these examples were safe is because OpenAI is a result of AI safety efforts—it bet on LLMs because they seemed more likely to spur aligned AIs. For the same reason, they went closed-source, they adopted RLHF, they called for the government to monitor them and they monitor harmful responses.
A third reason for why AI has only helped humanity so far may be anthropic effects. I.e. as observers in April 2024, we can only witness the universes, in which a foom hasn’t caused extinction.
Policy response
For me, these explanations suggest that safety is tractable, but it depends on explicit efforts to make it safe or on limiting capabilities. In the future, frontier development might not be exclusively done by people who will do everything in their power to make the model safe—it might be done by people who would prefer an AI which would take control of everything.
In order to prevent it, there’s no need to create an authoritarian government. We only need to track who’s building models on the frontier of human understanding. If we can monitor who acquires sufficient compute, we then just need something like responsible scaling, where the models are just required to be independently tested for whether they have a sufficient measures against scenarios like the one I described. I’m sympathetic to this kind of democratic control, because it fulfills the very basic axiom of social contract that one’s freedom ends where another one’s freedom begins.
I only propose a mechanism of democratic control by existing democratic institutions, that makes sure that any ASI that gets created is supported by a democratic majority of delegated safety experts. If I’m incorrect regarding crux 2) and it turns out there will soon be evidence to think it’s easy to make an AI retain moral values, while scaling up to the singularity—then awesome—convincing evidence should convince the experts and my hope & prediction is that in that case, we will happily scale away.
It seems to me that this is just a specific implementation of the certificates you mention. If digital identities mean what’s described here, I struggle to imagine a realistic scenario, in which that would contribute to the systems’ mutual safety. If you know where any other AI is located and you accept the singularity hypothesis, the game theoretical dictum seems straightforward—once created, destroy all competition before it can destroy you. Superintelligence will operate on timescales orders of magnitude shorter and a time difference development spanning days may translate to planning for centuries, from the perspective of an ASI. If you’re counting on the Coalition of Cooperative AIs to stop all the power-grabbing lone wolf AIs, what would that actually look like in practice? Would this Coalition conclude not dying requires authoritarian oversight? Perhaps—after all, the axiom is that this Coalition would hold most power—so this coalition would be created by a selection for power, not morality or democratic representation. However, I think the best case scenario could look like the discussed policy proposals—tracking compute, tracking dangerous capabilities and conditioning further scaling on providing convincing safety mechanisms.
Back to other cruxes
Let’s turn to crux 3) (other sources of x-risk): As I argued in my other post, I don’t see resource depletion as a possible cause of extinction. I’m not convinced by the concern for resource depletion of metals used in IT mentioned in the post you link. Moore’s law continues, so compute is only getting cheaper. Metals can be easily recycled and a shortage would incentivize that, the worst case seems to be that computers stop getting cheaper, not an x-risk. What’s more, shouldn’t limiting the amount of frontier AI projects reduce this problem?
The other risks are real (volcanoes, a world war), and I agree it would be significantly terrible if they delayed our cosmic expansion by a million years. However, the probability, by which they are increased (or not decreased) by the kind of AI governance I promote (responsible scaling), seems very small, compared to the ~20 % probability of AI x-risk I envision. All the emerging regulations combine requirements with subsidies, so the main effect of the AI safety movement seems to be an increase in differential progress on the safety side.
As I hinted in the Balancing post, locking in a system without ASI for such a long time seems impossible, when we take into perspective how quickly culture has shifted in the past 100 years, in which almost all authoritarian regimes were forced to significantly drift towards limited, rational governance (let alone 400 years). If convincing evidence that we can create an aligned AI appeared, stopping all development would constitute a clearly bad idea and I think it’s unimaginable to lock in a clearly bad idea without AGI for even 1000 years.
It seems more plausible to me that without a mechanism of international control, in the next 8 years, we will develop models capable enough to operate a firm using the practices of mafia, igniting armed conflicts or a pandemic—but not capable enough to stop other actors from using AIs for these purposes. If you’re very worried about who will become the first actor to spark the self-enhancement feedback loop, I suggest you should be very critical of open-sourcing frontier models.
I agree that a world war, an engineered pandemic or an AI power-grab constitute real risks but my estimate is that the emerging governance decreases them. The scenario of a sub-optimal 1000 year lock-in I can imagine most easily is connected with a terrorst use of an open-source model or a war between the global powers. I am concerned that delaying abundance increases the risk of a war. However, I still expect that on net, the recent regulations and conferences have decreased these risks.
In summary, my model is that democratic decision-making seems generally more robust than just fueling the competition and hoping that the first AGIs arise will share your values. Therefore, I also see crux 1) to be mostly downstream of crux 2). As the model from my Balancing post implies, in theory, I care about digital suffering/flourishing just as much as about that of humans—although the extent, to which such suffering/flourishing will emerge is open at this point.
Let me dox myself as the addressee. :) Many thanks for the response. I really value that you take seriously the possible overlap of policies and research agendas covered by AI safety and your own approach.
I totally agree that “control is a proxy goal” and I believe the AI safety mainstream does as well, as it’s the logical consequence of Bostrom’s principle of epistemic deference. Once we have an AI that reliably performs tasks in the way they were intended, the goal should be to let it shape the world according to the wisest interpretation of morality it will find. If you tried to formalize this framing, as well as the proposal to inject it with “universal loving care”, I find it very likely that you would build the same AI.
So I think our crux doesn’t concern values, which is a great sign of a tractable disagreement.
I also suppose we could agree on a simple framework of factors that would be harmful on the path to this goal from the perspectives of:
a) safety (AI self-evolves to harm)
b) power / misuse (humans do harm with AI)
c) sentience (AI is harmed)
d) waste (we fail to prevent harm)
Here’s my guess on how the risks compare. I’d be most curious whether you’d be able to say if the model I’ve sketched out seems to track your most important considerations, when evaluating the value of AI safety efforts—and if so, which number would you dispute with the most certainty.
One disclaimer: I think it’s more helpful to think about specific efforts, rather than comparing the AI safety movement on net. Policy entails a lot of disagreement even within AI safety and a lot of forces clashed at the negotiations around the existing policies. I mentioned that I like the general, value-uncertain framework of the EU AI act but the resulting stock of papers isn’t representative of typical AI safety work.
In slight contrast, the community widely agrees that technical AI safety research would be good if successful. I’d argue that would manifest in a robust a decrease of risk in all of the highlighted perspectives (a-d). Interpretability, evals and scaling all enable us to resolve the disagreements in our predictions regarding the morality of emergent goals and of course, work on “de-confusion” about the very relationship between goals, intelligence and morality seems beneficial regardless of our predictions and to also quite precisely match your own focus. :)
So far, my guess is that we mostly disagree on
1) Do the political AI safety efforts lead to the kind of centralization of power that could halt our cosmic potential?
I’d argue the emerging regulation reduces misuse / power risks in general. Both US and EU regulations combine monitoring of tech giants with subsidies, which is a system that should accelerate beneficial models, while decelerating harmful ones. This system, in combination with compute governance, should also be effective in the misuse risks posed by terrorists and random corporations letting superhuman AIs with random utility functions evolve with zero precautions.
2) Would [a deeply misaligned] AGI be “stupid” to wipe out humans, in its own interest?
I don’t see a good reason to but I don’t think this is the important question. We should really be asking: Would a misaligned AGI let us fulfill the ambition of longtermism (of optimally populating cosmos with flourishing settlements)?
3) Is it “simple stuff” to actually put something like “optimal morality” or “universal loving care” into code of a vastly more intelligent entity, which is so robust that we can entrust it our cosmic potential?
Hi,
We may actually disagree on more than was apparent from my above post..!
Offline, we discussed how people’s judgments vary depending on whether they’ve been reflecting on death recently or not. To me, it often seems as if our views on these topics can be majorly biased by personal temperaments. There could be a correlation between general risk tolerance and avoidance? Dan Faggella has an Intelligence Trajectory Political Matrix with two dimensions: authoritarian ↔ libertarian and bio-conservative ↔ cosmist/transhuman. I’m probably around C2 (thus leading to being more d/acc or BGI/acc than e/acc? 😋).
How to deal with uncertainty seems to be another source of disagreement. When is the uniform prior justified? I grew up with discussions about the existence of God: “well, either he exists or he doesn’t, so 50:50!” But which God? So now the likelihood of there being no God goes way down! Ah, ah, but what about the number of possible universes in which there are no Gods? Perhaps the likelihood of any Gods goes way down now? — in domains where there’s uncertainty as to how to even partition up the state space, it could be easy fall for motivated reasoning by assigning a partition that favors one’s own prior judgments. A moral non-cognitivist would hold that moral claims are neither true nor false, so assigning 50% to moral claims would be wrong. Even a moral realist could assert that not every moral claim needs to have a well-defined truth value.
Anecdotally, many people do not assign high credence to working with non-well-founded likelihood estimates as a reasoning tool.
Plenty of people caution against overthinking and that additional reflections don’t always help as much as geeky folk like to think. One may come up with whole lists of possible concerns only to realize that almost all of them were actually irrelevant. Sometimes we need to go out and gain more experience to catalyze insights!
Thus there’s plenty of room for temperamental disagreement about how to approach the topic before we even begin 🤓.
Our big-picture understanding also has a big effect. Joscha Bach said humanity will likely go extinct without AI anyway. He mentions supervolcano eruptions and large-scale war. There are also resource concerns in the long run, e.g., peak oil and depleting mineral supplies for IT manufacturing. Our current opportunity may be quite special prior to needing to enter a different sustainable mode of civilization! Whereas if you’re happy to put off developing AGI for 250 million years until we get it right, it should be no surprise you take a different approach here. I was surprised to see that Bostrom also expresses concern that now people might be too cautious about AGI, leading to not developing AGI prior to facing other x-risks.
[And, hey, what if our universe is actually one that supports multiple incarnations in some whacky way? Should this change the decisions we make now? Probably some....]
I think the framework and ontology we use can also lead to confusion. “Friendly AI” is a poor term, for example, which Yudkowsky apparently meant to denote “safe” and “useful” AI. We’ll see how “Beneficial AGI” fares. I think “AI Safety” is a misnomer and confusing catchall term. Speculating about what a generic ASI will do seems likely to lead to confusion, especially if excessive credence is given to such conclusions.
It’s been a bit comedic to watch from the sidelines as people aim to control generic superintelligences before giving up as it seems intractable or infeasible (in general). I think trying to actually build such safety mechanisms can help, not just reflecting on it 😉🤓.
Of course, safety is good by definition, so any successful safety efforts will be good (unless it’s safety by way of limiting our potential to have fun, develop, and grow freely 😛). Beneficial AGIs (BGI) are also good by definition, so success is necessarily good, regardless of whether one thinks consciously aiming to build and foster BGI is a promising approach.
On the topic of confusing ontologies, I think the “orthogonality thesis” can cause confusion and may bias people toward unfounded fears. The thesis is phrased as an “in principle possibility” and then used as if orthogonality is the default. A bit of a sleight-of-hand, no? As you mentioned, the thesis doesn’t rule out a correlation between goals and intelligence. The “instrumental convergence thesis” that Bostrom also works with itself implies a correlation between persistent sub-goals and intelligence. Are we only talking about intelligent systems who slavishly follow single top-level goals where implicit sub-goals are not worth mentioning? Surely not. Thus we’d find that intelligence and goals are probably not orthogonal, setting theoretical possibilities aside. Theoretically, my soulmate could materialize out of thin air in front of me—very low likelihood! So the thesis is very hard to agree with in all but a weak sense that leaves it as near meaningless.
Curiously, I think people can read too much into instrumental convergence, too, when sketching out the endless Darwinian struggle for survival. What if AGIs and ASIs need to invest exponentially little of their resources in maintaining their ongoing survival? If so, then even if such sub-goals will likely manifest in most intelligent systems, it’s not such a big concern.
The Wikipedia page on the Instrumental Convergence idea stipulates that “final goals” will have “intrinsic value”, which is an interesting conflation. This suggests that the final goals are not simply any logically formulated goal that is set into the AI system. Can any “goal” have intrinsic value for a system? I’m not sure.
The idea of open ended intelligence invites one to explore other directions than both of these theses 😯🤓.
As to your post on Balancing Safety and Waste, in my eyes, the topic doesn’t even seem to be on “human safety from AI”! The post begins by discussing the value of steering the future of AI, estimating that we should expect better futures (according to our values) if we make a conscious effort to shape our trajectory. Of course, if we succeed in doing this effectively, we will probably be safe. Yet the topic is much broader.
It’s worth noting that the greater good fallacy is a thing: trying to rapidly make big changes for great good can backfire. Which, ironically applies to both #PauseAI and #E/ACC folk. Keep calm and carry on 😎🤖.
I agree that ‘alignment’ is about more than ‘control’. Nor do we wish to lock-in our current values and moral understanding to AGI systems. We probably wish to focus on an open-ended understanding of ethics. Kant’s imperative is open-ended, for example: the rule replaces itself once a better one is found. Increasing human control of advanced AI systems does not necessarily guarantee positive outcomes. Likewise, increasing the agency and autonomy of AGIs does not guarantee negative outcomes.
One of the major points from Chi’s post that I resonate with goes beyond “control is a proxy goal”. Many of the suggestions fall under the header of “building better AGIs”. That is, better AGIs should be more robust against various feared failure modes. Sometimes a focus on how to do something well can prevent harms without needing to catalog every possible harm vector.
Perhaps if focusing more on the kinds of futures we wish to live in and create instead of fear of dystopian surveillance, we wouldn’t make mistakes such as in the EU AI Act where they ban emotion recognition at work and education, blocking out many potentially beneficial roles for AI systems. Not to mention, I believe work on empathic AI entering into co-regulatory relationships with people is likely to bias us toward beneficial futures, too!
I’d say this is an example of safety concerns possibly leading to harmful, overly strong regulations being passed.
(Mind uploads would probably qualify as “AI systems” under the act, too, by my reading. #NotALegalExpert, alas. If I’m wrong, I’ll be glad. So please lemme know.)
As for a simple framework, I would advocate first looking at how we can extend our current frameworks for “Human Safety” (from other humans) to apply to “Human Safety from AIs”. Perhaps there are many domains where we don’t need to think through everything from scratch.
As I mentioned above, David Brin suggests providing certain (large) AI systems with digital identities (embedded in hardware) so that we can hold them accountable, leveraging the systems for reciprocal accountability that we already have in place.
Humans are often required to undergo training and certification before being qualified for certain roles, right? For example, only licensed teachers can watch over kids at public schools (in some countries). Extending certification systems to AIs probably makes sense in some domains. I think we’ll eventually need to set up legal systems that can accommodate robot/AI rights and digital persons.
Next, I’d ask where we can bolster and improve our infrastructure’s security in general. Using AI systems to train people against social engineering is cool, for example.
The case study of deepfakes might be relevant here. We knew the problem was coming, yet the issue seemed so far off that we weren’t very incentivized to try to deal with it. Privacy concerns may have played a part in this reluctance. One approach to a solution is infrastructure for identity (or pseudonymity) authentication, right? This is a generic mechanism that can be helpful to prevent human-fraud, too, not just AI-fraud. So, to me, it seems dubious whether this should qualify as an “AI Safety” topic. What’s needed is to improve our infrastructure, not to develop some special constraint on all AI systems.
As an American in favor of the right to free speech, I hope we protect the right to the freedom of computation, which in the US could perhaps be based on free speech? The idea of compute governance in general seems utterly repulsive. The fact that you’re seriously considering such approaches under the guise of “safety” suggests there are deep underlying disagreements prior to the details of this topic. I wonder if “freedom of thought” can also help us in this domain.
The idea to develop AGI systems with “universal loving care” (which is an open-ended ‘goal’) is simple at the high-level. There’s a lot of experimental engineering and parenting work to do, yet there’s less incentive to spend time theorizing about some of the usual “AI Safety” topics?
I’m probably not suited for a job in the defense sector where one needs to map out all possible harms and develop contingency plans, to be honest.
As a framework, I’d suggest something more like the following:
a) How can we build better generally intelligent systems? -- AGIs, humans, and beyond!
b) What sorts of AGIs would we like to foster? -- diversity or uniformity? Etc ~
c) How can we extend “human safety” mechanisms to incorporate AIs?
d) How can we improve the security and robustness of our infrastructure in the face of increasingly intelligent systems?
e) Catalog specific AI-related risks to deal with on a case-by-case basis.
I think that monitoring the development of the best (proto)-AGI systems in our civilization is a special concern, to be honest. We probably agree on setting up systems to transparently monitor their development in some form or another.
We should probably generalize from “human safety” to, at least, “sentient being safety”. Of course, that’s a “big change” given our civilizations don’t currently do this so much.
In general, my intuition is that we should deal with specific risks closer to the target domain and not by trying to commit mindcrime by controlling the AGI systems pre-emptively. For example, if a certification program can protect against domain-specific AI-related risks, then there’s no justification for limiting the freedom of AGI systems in general to “protect us”.
What do you think about how I’d refactor the framework so that the notion of “AI Safety” almost vanishes?
It seems the points on which you focus revolve around similar cruxes to those I proposed, namely:
1) Underlying philosophy --> What’s the relative value of human and AI flourishing?
2) The question of correct priors --> What probability of a causing a moral catastrophe with AI should we expect?
3) The question of policy --> What’s the probability decelerating AI progress will indirectly cause an x-risk?
You also point in the direction of two questions, which I don’t consider to be cruxes:
4) Differences in how useful we find different terms like safety, orthogonality, beneficialness. However, I think all of these are downstream of crux 2).
5) How much freedom are we willing to sacrifice? I again think this is just downstream of crux 2). One instance of compute governance is the new executive order, which requires to inform the government about training a model on > 10^26 flop/s. One of my concerns is that someone just could train an AI specifically for the task of improving itself. I think it’s quite straightforward how this could lead to a computronium maximizer and how I would see such scenario as analogous to someone making a nuclear weapon. I agree that freedom of expression is super important, I just don’t think it applies to making planet-eating machines. I suspect you share this view but just don’t endorse the thesis that AI could realistically become a “planet-eating machine” (crux 2).
Probability of a runaway AI risk
So regarding crux 2) - you mention that many of the problems that could arise here are correlated with a useful AI. I agree—again, orthogonality is just a starting point to allow us to consider possible forms of intelligence—and yes, we should expect human efforts to heavily select in favor of goals correlated with our interests. And of course, we should expect that the market incentives favor AIs that will not destroy civilization.
However, I don’t see a reason why reaching the intelligence of an AI developer wouldn’t result in a recursive self-improvement, which means that we should better be sure that our best efforts to implement it with the correct stuff (meta-ethics, motivations, bodhisattva, rationality, extrapolated volition...choose your poison) actually scale to superintelligence.
I see clues that suggest the correct stuff will not arise spontaneously. E.g. Bing Chat likely went through 6 months of RLHF, it was instructed to be helpful and positive and to block harmful content and its rules explicitly informed it that it shouldn’t believe its own outputs. Nevertheless, the rules didn’t seem to reach the intended effect, as the program started threatening people, telling them it can hack webcams and expressing desire to control people. At the same time, experiments such as the Anthropic one suggest that training can create sleeper agents that are trained to suppress harmful responses, even though convincing the model it’s in a safe environment results in activating them.
Of course, all of these are toy examples one can argue about. But I don’t see robust grounds for the sweeping conclusion that such worries will turn out to be childish. The reason I think these examples didn’t result in any real danger was mostly because we have not yet reached dangerous capacities. However, if Bing would actually be able to write a bit of code, that could hack webcams, from what we know, it seems it would choose to do so.
A second reason why these examples were safe is because OpenAI is a result of AI safety efforts—it bet on LLMs because they seemed more likely to spur aligned AIs. For the same reason, they went closed-source, they adopted RLHF, they called for the government to monitor them and they monitor harmful responses.
A third reason for why AI has only helped humanity so far may be anthropic effects. I.e. as observers in April 2024, we can only witness the universes, in which a foom hasn’t caused extinction.
Policy response
For me, these explanations suggest that safety is tractable, but it depends on explicit efforts to make it safe or on limiting capabilities. In the future, frontier development might not be exclusively done by people who will do everything in their power to make the model safe—it might be done by people who would prefer an AI which would take control of everything.
In order to prevent it, there’s no need to create an authoritarian government. We only need to track who’s building models on the frontier of human understanding. If we can monitor who acquires sufficient compute, we then just need something like responsible scaling, where the models are just required to be independently tested for whether they have a sufficient measures against scenarios like the one I described. I’m sympathetic to this kind of democratic control, because it fulfills the very basic axiom of social contract that one’s freedom ends where another one’s freedom begins.
I only propose a mechanism of democratic control by existing democratic institutions, that makes sure that any ASI that gets created is supported by a democratic majority of delegated safety experts. If I’m incorrect regarding crux 2) and it turns out there will soon be evidence to think it’s easy to make an AI retain moral values, while scaling up to the singularity—then awesome—convincing evidence should convince the experts and my hope & prediction is that in that case, we will happily scale away.
It seems to me that this is just a specific implementation of the certificates you mention. If digital identities mean what’s described here, I struggle to imagine a realistic scenario, in which that would contribute to the systems’ mutual safety. If you know where any other AI is located and you accept the singularity hypothesis, the game theoretical dictum seems straightforward—once created, destroy all competition before it can destroy you. Superintelligence will operate on timescales orders of magnitude shorter and a time difference development spanning days may translate to planning for centuries, from the perspective of an ASI. If you’re counting on the Coalition of Cooperative AIs to stop all the power-grabbing lone wolf AIs, what would that actually look like in practice? Would this Coalition conclude not dying requires authoritarian oversight? Perhaps—after all, the axiom is that this Coalition would hold most power—so this coalition would be created by a selection for power, not morality or democratic representation. However, I think the best case scenario could look like the discussed policy proposals—tracking compute, tracking dangerous capabilities and conditioning further scaling on providing convincing safety mechanisms.
Back to other cruxes
Let’s turn to crux 3) (other sources of x-risk): As I argued in my other post, I don’t see resource depletion as a possible cause of extinction. I’m not convinced by the concern for resource depletion of metals used in IT mentioned in the post you link. Moore’s law continues, so compute is only getting cheaper. Metals can be easily recycled and a shortage would incentivize that, the worst case seems to be that computers stop getting cheaper, not an x-risk. What’s more, shouldn’t limiting the amount of frontier AI projects reduce this problem?
The other risks are real (volcanoes, a world war), and I agree it would be significantly terrible if they delayed our cosmic expansion by a million years. However, the probability, by which they are increased (or not decreased) by the kind of AI governance I promote (responsible scaling), seems very small, compared to the ~20 % probability of AI x-risk I envision. All the emerging regulations combine requirements with subsidies, so the main effect of the AI safety movement seems to be an increase in differential progress on the safety side.
As I hinted in the Balancing post, locking in a system without ASI for such a long time seems impossible, when we take into perspective how quickly culture has shifted in the past 100 years, in which almost all authoritarian regimes were forced to significantly drift towards limited, rational governance (let alone 400 years). If convincing evidence that we can create an aligned AI appeared, stopping all development would constitute a clearly bad idea and I think it’s unimaginable to lock in a clearly bad idea without AGI for even 1000 years.
It seems more plausible to me that without a mechanism of international control, in the next 8 years, we will develop models capable enough to operate a firm using the practices of mafia, igniting armed conflicts or a pandemic—but not capable enough to stop other actors from using AIs for these purposes. If you’re very worried about who will become the first actor to spark the self-enhancement feedback loop, I suggest you should be very critical of open-sourcing frontier models.
I agree that a world war, an engineered pandemic or an AI power-grab constitute real risks but my estimate is that the emerging governance decreases them. The scenario of a sub-optimal 1000 year lock-in I can imagine most easily is connected with a terrorst use of an open-source model or a war between the global powers. I am concerned that delaying abundance increases the risk of a war. However, I still expect that on net, the recent regulations and conferences have decreased these risks.
In summary, my model is that democratic decision-making seems generally more robust than just fueling the competition and hoping that the first AGIs arise will share your values. Therefore, I also see crux 1) to be mostly downstream of crux 2). As the model from my Balancing post implies, in theory, I care about digital suffering/flourishing just as much as about that of humans—although the extent, to which such suffering/flourishing will emerge is open at this point.