I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn’s recent “If your model is going to sell, it has to be safe” piece. Let me try to unpack this:
On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.
My biggest reservations can be boiled down into two points:
1. I don’t think that commercial incentives will be enough to motivate people to solve the hardest parts of alignment. Commercial incentives will drive people to make sure their system appears to do what users want, which is very different from having systems that actually do what users want, or that robustly do what users want even as they become more powerful. Or to put it another way: near-term commercial incentives don’t seem sufficient to direct appropriate amounts of attention to things like situational awareness or deceptive alignment. I think commercial incentives will be sufficient to reduce the odds of Bingchat fiascos, but I don’t think they’ll motivate the kind of alignment research that’s trying to handle deception, sharp left turns, or even the most ambitious types of scalable oversight work.
2. The research that is directly incentivized by commercial interests is the least likely to be neglected; I expect the most neglected research to be research that doesn’t have any direct commercial benefit. AGI labs will likely invest substantial resources in preventing future Bingchat scenarios and other instances of egregious deployment harms. The problem is that I expect many of these approaches (e.g., getting really good at RLHFing your model so that it no longer displays undesirable behaviors) will not generalize to more powerful systems. I think you (and many others) agree with this, but the important point is that economic incentives will favor RLHF-style work over work on problems that are not as directly commercially incentivized.
As a result, even though I agree with many of your subclaims, I’m still left thinking that the message I want to spread is not “hey, in order to win the race or sell your product, you need to solve alignment,”
but rather something more like: “hey, there are some safety problems you’ll need to figure out to sell/deploy your product. Cool that you’re interested in that stuff. There are other safety problems—often ones that are more speculative—that the market is not incentivizing companies to solve. On the margin, I want more attention paid to those problems. And if we just focus on solving the problems that are required for profit/deployment, we will likely fool ourselves into thinking that our systems are safe when they merely appear to be safe, and we may underinvest in understanding/detecting/solving some of the problems that seem most concerning from an x-risk perspective.”
> There are other safety problems—often ones that are more speculative—that the market is not incentivizing companies to solve.
My personal response would be as follows:
1. As Leopold presents it, the key pressure that keeps labs in check here is societal constraints on deployment, not their perceived ability to make money. The hope is that society’s response has the following properties:
   - thoughtful, prominent experts are attuned to these risks and demand rigorous responses
   - policymakers are attuned to (thoughtful) expert opinion
   - policy levers exist that provide policymakers with oversight/leverage over labs
2. If labs are sufficiently thoughtful, they’ll notice that deploying dangerous models is in fact bad for them! Can’t make profit if you’re dead. *taps forehead knowingly*
   - but in practice I agree that lots of people are motivated by the tastiness of progress, pro-progress vibes, etc., and will not notice the skulls.
Counterpoints to 1:
- Good regulation of deployment is hard (though not impossible in my view).
   - reasonable policy responses are difficult to steer towards
   - attempts at raising awareness of AI risk could lead to policymakers getting too excited about the promise of AI while ignoring the risks
   - experts will differ; policymakers might not listen to the right experts
- Good regulation of development is much harder, and will eventually be necessary.
   - This is the really tricky one IMO. I think it requires pretty far-reaching regulations that would be difficult to get passed today and would probably misfire a lot. But it doesn’t seem impossible, and I know people are working on laying groundwork for this in various ways (e.g., pushing for labs to incorporate evals in their development process).