I think 1 and 3 seem like arguments that reduce the desirability of these roles but it’s hard to see how they can make them net-negative.
Arguments 4 and to some extent 2 give a real case that could in principle make something net-negative but I’m sceptical that the effect scales that far. In particular if this were right, I think it would effectively say that it would be better if AI labs invested less rather than more in safety. I can’t rule out that that’s correct, but it seems like a pretty galaxy-brained take and I would want some robust arguments before I took it seriously, and I don’t think these are close to meeting that threshold for me personally.
Further, I think that there are a bunch of arguments for the value of safety work within labs (e.g. access to sota models; building institutional capacity and learning; cultural outreach) which seem to me to be significant and you’re not engaging with.
I think 1 and 3 seem like arguments that reduce the desirability of these roles but it’s hard to see how they can make them net-negative.
Yes, specifically by claim 1, positive value can only asymptotically approach 0 (ignoring opportunity costs).
For small specialised models (designed for specific uses in a specific context of use for a specific user group), we see in practice that safety R&D can make a big difference.
For ‘AGI’, I would argue that the system cannot be controlled sufficiently to stay safe.
Unscoped everything-for-everyone models (otherwise called ‘general purpose AI’) sit somewhere in between.
I think progress on generalisable safety R&D is practically intractable at the model sizes, and the rates of scaling in uses, that AI corporations are currently competing at.
How the model weights function during computation is too variable, depending on changes in the input (distribution), and the contexts the models output into are also too varied: there are too many possible paths through which the propagated effects could cause failures in irregular, locality-dependent ways, given the complexity of the nested societies and ecosystems we humans depend on to live and to live well.
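To make the distribution-shift part of this concrete, here is a minimal toy sketch (my own construction, not something from this exchange, with made-up data): a model whose parameters are fixed can behave predictably on inputs like those it was fitted to, and very differently once the input distribution moves.

```python
# Toy illustration (hypothetical example, not from the discussion above):
# the same fixed parameters that work well in-distribution can fail badly
# once the inputs shift, even though nothing about the model changed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Fit on inputs drawn from [0, 3].
x_train = rng.uniform(0.0, 3.0, size=(2000, 1))
y_train = np.sin(x_train).ravel()
model = DecisionTreeRegressor(max_depth=6, random_state=0).fit(x_train, y_train)

# Same underlying function, but the input distribution has shifted to [3, 6].
x_in = rng.uniform(0.0, 3.0, size=(2000, 1))
x_out = rng.uniform(3.0, 6.0, size=(2000, 1))

mse_in = mean_squared_error(np.sin(x_in).ravel(), model.predict(x_in))
mse_out = mean_squared_error(np.sin(x_out).ravel(), model.predict(x_out))
print(f"error on in-distribution inputs: {mse_in:.4f}")   # small
print(f"error after the input shift:     {mse_out:.4f}")  # much larger
```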
Arguments 4 and to some extent 2 give a real case that could in principle make something net-negative but I’m sceptical that the effect scales that far.
Some relevant aspects are missing from what you shared so far.
In particular, we need to consider that any one AGI lab is (as of now) beholden to the rest of society to continue operating.
This is clearly true in the limit. Imagine some freak mass catastrophe caused by OpenAI: staff would leave, consumers would stop buying, and regulators would shut the place down.
But it is also true in practice. From the outside, these AGI labs may look like institutional pillars of strength. But from the inside, management is constantly jostling to source enough investment and/or profitable productisation avenues to cover high staff salaries and compute costs. This is why I think DeepMind allowed themselves to be acquired by Google in the first place: they ran a $649 million loss in 2019, and simply could not maintain that burn rate without a larger tech corporation covering their losses for them.
In practice, AGI labs are constantly finding ways to make themselves look serious about safety, and finding ways to address the safety issues customers are noticing. Not just because some employees there are paying attention to those harms and taking care to avoid them, but also because they’re dealing with newly introduced AI products that already have lots of controversies associated with them (in these rough categories: data laundering, worker exploitation, design errors and misuses, resource-intensive and polluting hardware).
If we think about this in simplified dimensions:
There is an actual safety dimension, defined by how the (potential) effects of this technology impact us humans and the world contexts we depend on to live.
There is a perceived safety dimension, defined by how safe we perceive the system to be, based in small part on our own direct experience and reasoning, and in large part on what we hear or read from others around us.
Outside stakeholders would need to perceive the system to be unsafe in order to restrict further scaling and/or uses (which IMO is much more effective than trying to make scaled open-ended systems comprehensively safe after the fact), where ‘the system’ can include the institutional hierarchies and infrastructure through which an AI model is developed and deployed.
Corporations have a knack for finding ways to hide product harms, while influencing people to not notice or to dismiss those harms. See the cases of Big Tobacco, Big Pharma, and Big Oil.
Corporations that manage to do that can make a profit from selling products without getting shut down. This is what capitalism – open market transactions and private profit reinvestment – in part selects for. This is what the Big Tech companies that win out over time manage to do.
(it feels like I’m repeating stuff obvious to you, but it bears repeating to set the context)
In particular if this were right, I think it would effectively say that it would be better if AI labs invested less rather than more in safety.
Are you stating an intuition that it would be surprising if AGI labs investing less in improving actual safety turned out to be overall less harmful?
I am saying, with claim 4, that there is another dimension: perceived safety. The more that an AI corporation is able to make the system be, or at least look, *locally* safe to users and other stakeholders (even if it is globally much more unsafe), the more the rest of society will permit and support the corporation’s continued scaling. And the more that the AI corporation can promote that they are responsibly scaling toward some future aligned system that is *globally* safe, the more that nerdy researchers and other stakeholders open to that kind of messaging can treat that as a sign of virtue and give the corporation a pass there too.
And unfortunately, by claim 1, actual safety is intractable when scaling such open-ended (and increasingly automated) systems. This is why, in established safety-critical industries – e.g. for medical devices, cars, planes, industrial plants, even kitchen devices – there are best practices for narrowly scoping the design of the machines to specific uses and contexts of use.
So actual safety is intractable for such open-ended systems, but AI corporations can and do disproportionately support research and research communication that increases perceived safety.
But improving actual safety by restricting corporate AI scaling is tractable (if you reduce the system’s degrees of freedom of interaction, you reduce the possible ways things can go wrong). Unfortunately, fewer people will move to restrict corporate-AI scaling if the corporate activities are perceived to be safe.
By researching safety at AGI labs, researchers are therefore predominantly increasing perceived safety, and as a result closing off realistic opportunities for improving actual safety.
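As a tiny illustration of the degrees-of-freedom point above (my own sketch with made-up numbers, not anything from this exchange): the space of joint configurations a system can reach grows exponentially with the number of interacting components, so narrowing scope shrinks the space of possible failure pathways dramatically.

```python
# Hypothetical back-of-envelope sketch (not from the discussion above):
# more interacting components with more possible states means exponentially
# more joint configurations, and hence more ways for things to go wrong.
def joint_configurations(components: int, states_per_component: int) -> int:
    """Count the joint states of `components` parts, each with k possible states."""
    return states_per_component ** components

for n in (3, 10, 30):
    print(f"{n:>2} components with 4 states each -> "
          f"{joint_configurations(n, 4):,} joint configurations")
# 3 -> 64;  10 -> 1,048,576;  30 -> more than 10**18
```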
Thanks. I’m now understanding your central argument to be:
Improving the quality/quantity of output from safety teams within AI labs has a (much) bigger impact on perceived safety of the lab than it does on actual safety of the lab. This is therefore the dominant term in the impact of the team’s work. Right now it’s negative.
Is that a fair summary?
If so, I think:
Conditional on the premise, the conclusion appears to make sense
It still feels kinda galaxy-brained, which may make me want to retain some scepticism
However, I feel way less confident than you in the premise, for a number of reasons:
I’m more optimistic e.g. that control turns out to be useful, or that there are hacky alignment techniques which work long enough to get through to the automation of crucial safety research
I think that there are various non-research pathways for such people to (in expectation) increase the safety of the lab they’re working at
It’s unclear to me what the sign is of quality-of-safety-team-work on perceived-safety to the relevant outsiders (investors/regulators?)
e.g. I think that one class of work people in labs could do is capabilities monitoring, and I think that if this were done to a good standard it could in fact help to reduce perceived-safety to outsiders in a timely fashion
I guess I’m quite sceptical that signals like “well the safety team at this org doesn’t really have any top-tier researchers and is generally a bit badly thought of” will be meaningfully legible to the relevant outsiders, so I don’t really think that reducing the quality of their work will have too much impact on perceived safety
Thanks, I appreciate the paraphrase. Yes, that is a great summary.
I’m more optimistic e.g. that control turns out to be useful, or that there are hacky alignment techniques which work long enough to get through to the automation of crucial safety research
I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.
As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?
I think that there are various non-research pathways for such people to (in expectation) increase the safety of the lab they’re working at
I’m curious about the pathways you have in mind. I may have missed something here.
e.g. I think that one class of work people in labs could do is capabilities monitoring, and I think that if this were done to a good standard it could in fact help to reduce perceived-safety to outsiders in a timely fashion
I’m skeptical that that would work in this corporate context.
“Capabilities” are just too useful economically and can creep up on you. And that is putting aside whether we can even measure comprehensively enough for “dangerous capabilities”.
In the meantime, it’s great marketing to clients, to the media, and to national interests: you are working on AI systems that could become so capable that you even have an entire team devoted to capabilities monitoring.
I guess I’m quite sceptical that signals like “well the safety team at this org doesn’t really have any top-tier researchers and is generally a bit badly thought of” will be meaningfully legible to the relevant outsiders, so I don’t really think that reducing the quality of their work will have too much impact on perceived safety
This is interesting. And a fair argument. Will think about this.
I’m curious about the pathways you have in mind. I may have missed something here.
I think it’s basically things flowing in some form through “the people working on the powerful technology spend time with people seriously concerned with large-scale risks”. From a very zoomed out perspective it just seems obvious that we should be more optimistic about worlds where that’s happening compared to worlds where it’s not (which doesn’t mean that necessarily remains true when we zoom in, but it sure affects my priors).
If I try to tell more concrete stories they include things of the form “the safety-concerned people have better situational awareness and may make better plans later”, and also “when systems start showing troubling indicators, culturally that’s taken much more seriously”. (Ok, I’m not going super concrete in my stories here, but that’s because I don’t want to anchor things on a particular narrow pathway.)
I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.
As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?
Of course I’d prefer to have something more robust. But I don’t think the lack of that means it’s necessarily useless.
I don’t think control is likely to scale to arbitrarily powerful systems. But it may not need to. I think the next phase of the problem is like “keep things safe for long enough that we can get important work out of AI systems”, where the important work has to be enough that it can be leveraged to something which sets us up well for the following phases.
I don’t think control is likely to scale to arbitrarily powerful systems. But it may not need to… which sets us up well for the following phases.
Under the concept of ‘control’, I am including the capacity of the AI system to control its own components’ effects.
I am talking about the fundamental workings of control, i.e. control theory and cybernetics: general enough that the results are applicable to any of the following phases as well.
Anders Sandberg has been digging lately into fundamental controllability limits. Could be interesting to talk with Anders.
Improving the quality/quantity of output from safety teams within AI labs has a (much) bigger impact on perceived safety of the lab than it does on actual safety of the lab. This is therefore the dominant term in the impact of the team’s work. Right now it’s negative.
I would agree that this is a good summary. And if perception of safety is higher than actual safety, it will lead to underinvestment in future safety, which increases the probability of failure of the system.
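One way to see the mechanism (a toy sketch with made-up numbers and assumed functional forms, not anything from this exchange): if safety investment is chosen to close the *perceived* gap to some target rather than the actual gap, then an inflated perception directly translates into less investment and a higher residual failure probability.

```python
# Hypothetical toy model (my own numbers, purely illustrative):
# investment responds to perceived safety, failure depends on actual safety.

def failure_probability(actual_safety: float) -> float:
    """Illustrative only: failure risk falls as actual safety rises."""
    return max(0.0, 1.0 - actual_safety)

def chosen_investment(perceived_safety: float, target: float = 0.9) -> float:
    """Invest only enough to close the *perceived* gap to the target."""
    return max(0.0, target - perceived_safety)

actual = 0.5  # assumed true level of safety
for perceived in (0.5, 0.7, 0.9):  # increasingly inflated perception
    invest = chosen_investment(perceived)
    # Generously assume investment converts one-for-one into actual safety.
    p_fail = failure_probability(actual + invest)
    print(f"perceived={perceived:.1f}  investment={invest:.2f}  p(failure)={p_fail:.2f}")
# As perception is inflated above the actual 0.5, investment shrinks
# and the resulting failure probability rises.
```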
Further, I think that there are a bunch of arguments for the value of safety work within labs (e.g. access to sota models; building institutional capacity and learning; cultural outreach) which seem to me to be significant and you’re not engaging with.
Let’s dig into the arguments you mentioned then.
Access to SOTA models
Given that safety research is intractable where open-ended and increasingly automated systems are scaled anywhere near current rates, I don’t really see the value proposition here.
I guess if researchers noticed a bunch of bad design practices and violations of the law in inspecting the SOTA models, they could leak information about that to the public?
Building institutional capacity and learning
Inside a corporation competing against other corporations, where the more power-hungry individuals tend to find their way to the top, the institutional capacity-building and learning you will see will be directed towards extracting more profit and power.
I think this argument, considered within its proper institutional context, actually cuts against your current conclusion.
Cultural outreach
This reminds me of the cultural exchanges between US and Soviet scientists during the Cold War. Are you thinking of something like that?
Having said that, I notice that the current situation is different, in the sense that AI Safety researchers are not one side racing to scale the proliferation of dangerous machines in tandem with the other side (the AGI labs).
To the extent, though, that AI Safety researchers can come to share collectively important insights with colleagues at AGI labs – such as why and how to stop scaling dangerous machine technology – this cuts against my conclusion.
Looking from the outside, I haven’t seen that yet. Early AGI safety thinkers (e.g. Yudkowsky, Tegmark) and later funders (e.g. Tallinn, Karnofsky) instead supported AGI labs in starting up, even if they did not mean to.
But I’m open (and hoping!) to change my mind. It would be great if safety researchers at AGI labs started connecting to collaborate effectively on restricting harmful scaling.
I’m going off the brief descriptions you gave. Does that cover the arguments as you meant them? What did I miss?