This post summarizes a discussion between Thomas Larsen and John Wentworth. They discuss efforts to buy time, outreach to AGI labs, models of institutional change, and other ideas relating to a post that Thomas, Olivia Jimenez, and I released: Instead of technical research, more people should focus on buying time.
I moderated the discussion and prepared a transcript with some of the key parts of the conversation.
On the value of specific outreach to lab decision-makers vs. general outreach to lab employees
John:
When thinking about “social” strategies in general, a key question/heuristic is “which specific people do you want to take which specific actions?”
For instance, imagine I have a start-up building software for hospitals, and I’m thinking about marketing/sales. Then an example of what not to do (under this heuristic) would be “Buy Google ads, banner ads in medical industry magazines, etc”. That strategy would not identify the specific people who I want to take a specific action (i.e. make the decision to purchase my software for a hospital).
An example of what to do instead would be “Make a list of hospitals, then go to the website of each hospital one-by-one and look at their staff to figure out who would probably make the decision to purchase/not purchase my software. Then, go market to those people specifically—cold call/email them, track them down at conferences, set up a folding chair on the sidewalk outside their house and talk to them when they get home, etc.”
Applying that to e.g. efforts to regulate AI development, the heuristic would say to not try to start some giant political movement. Instead, identify which specific people you need to write a law (e.g. judges or congressional staff), which specific people will decide whether it goes into effect (e.g. judges or swing votes in congress), which specific people will implement/enforce it (e.g. specific bureaucrats, who may be sufficient on their own if we don’t need to create new law).
Applying it to e.g. efforts to convince AI labs to stop publishing advances or shift their research projects, the heuristic would say to not just go convincing random ML researchers. Instead, identify which specific people at the major labs are de-facto responsible for the decisions you’re interested in (like e.g. decisions about whether to publish some advancement), and then go talk to those people specifically. Also, make sure to walk them through the reasoning enough that they can see why the decisions you want them to make are right; a vague acknowledgement that AI X-risk is a thing doesn’t cut it.
Thomas:
It seems like we’re on a similar page here.
I do think that on current margins, if typical OpenAI employees become more concerned about x-risk, this will likely have positive follow-through effects on general OpenAI epistemics. And I expect this will likely lead to improved decisions.
Perhaps you disagree with that?
John: Making random OpenAI (or Deepmind, or Anthropic, or …) employees more concerned about X-risk is plausibly net-positive value in expectation; I’m unsure about that. But more importantly, it is not plausibly high value in expectation.
When I read your post on time-buying, the main picture I end up with is a bunch of people running around randomly spreading the gospel of AI X-risk. More generally, that seems-to-me to be the sort of thing most people jump to when they think about “time-buying”.
In my mind, 80% of the value is in identifying which specific people we want to make which specific decisions, and then getting in contact with those specific people. And I usually don’t see people thinking about that very much, when they talk about “time-buying” interventions.
Thomas:
Fully agree with this [the last two paragraphs].
Could you explain your model here of how outreach to typical employees becomes net negative?
The path of: [low level OpenAI employees think better about x-risk → improved general OpenAI reasoning around x-risk → improved decisions] seems high EV to me.
My impression is that there are about a hundred employees (unlike, e.g., Google, with >100k employees). Also, I think that there are only a few degrees of separation between OpenAI leadership and typical engineers. My guess is that there are lots of conversations about AGI going on at OpenAI, and that good takes will propagate.
John:
I do think the “only ~100 employees” part is a good argument to just go talk to all of them. That’s a tractable number.
[But] “how does it become net negative” is the wrong question. My main claim isn’t that pitching random low-level OpenAI employees is net-negative (though I do have some uncertainty there), my main claim is that it’s relatively low-value (per unit effort). It is much higher-impact, per unit effort, to identify which particular people need to make which particular decisions.
Like, imagine that you knew which particular people needed to make which particular decisions. And then you were like “is pitching random low-level employees an efficient way to get these particular people to make these particular decisions?”. Seems like the answer will almost certainly be “no”.
Thomas:
I strongly agree with the idea that “it is much higher-impact, per unit effort, to identify which particular people make which particular decisions, and then go to those people with the desired advice”. However, I think that getting the key decision makers to make good decisions is so tremendously important that even the lossy version of doing it indirectly is still pretty good.
The epsilon fallacy
John:
So, there’s another general heuristic relevant here, which I call avoiding the “epsilon fallacy”. There’s a few different flavors of the epsilon fallacy:
The sign(epsilon) fallacy: this tiny change was an improvement, so it’s good!
The integral(epsilon) fallacy: another 100 tiny changes like that, and we’ll have a major improvement!
The infinity*epsilon fallacy: this thing is really expensive, so this tiny change will save lots of money!
My prototypical example of this is the novice programmer who spends a week improving the overall runtime of a slow piece of code by 0.3%, and then makes one of the arguments above to justify the effort. Instead, they should have started by profiling their code to find the slow part; by focusing on that, they could have sped it up by a factor of 2 with much less effort.
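To make the profiling analogy concrete, here is a minimal sketch of “profile first, then optimize the hot spot” in Python. The functions and numbers below are hypothetical placeholders, not anything from the discussion; the point is just that the profiler report makes the dominant cost obvious before any optimization effort is spent.

```python
# Minimal sketch of "profile first, then optimize the hot spot".
# The functions below are hypothetical placeholders for a real pipeline.
import cProfile
import pstats


def parse_records(n=3000):
    # Cheap step: build some toy values.
    return [i * 3 % 101 for i in range(n)]


def score_records(values):
    # Expensive step: a quadratic-time comparison dominates total runtime.
    return sum(1 for a in values for b in values if a > b)


def pipeline():
    return score_records(parse_records())


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    pipeline()
    profiler.disable()

    # Show the handful of functions that dominate cumulative time; that's
    # where a factor-of-2 win lives, not in micro-optimizing the 0.3% elsewhere.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Run as-is, the report attributes nearly all cumulative time to score_records, which is the signal that optimization effort belongs there rather than anywhere else.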
More generally, the epsilon fallacy is “a mistake” insofar as it ignores opportunity cost—we could have done something much higher impact instead. And usually, the key to “doing something much higher impact” is something analogous to profiling our code—i.e. identify which specific pieces are high-leverage. I.e. which specific people need to make which specific decisions.
Thomas:
I like this take and think that I’m probably making mistakes of this form.
Identifying specific decisions and decision-makers
John:
Additionally, one likely outcome of thinking about which particular people need to make which particular decisions is that we realize the key decisions/decision-makers aren’t what we originally thought. Maybe we were barking up the wrong tree all along. In worlds where that’s the case, asking “which people need to make which decisions” is reasonably likely to make us realize it…
It’s very easy to mis-identify the most important decision makers. It’s very common for people to think as though the nominal leadership of a company controls all the key decisions, when on reflection that is obviously not the case.
Yet there is usually a relatively-small subset of employees who de-facto control the key decisions.
Thomas:
Seems right to me.
Specific decision-making things I think are likely to matter are:
Making high-level architecture/scaling choices
Deciding whether and how much to coordinate/communicate with other AGI labs at crunch time
Making internal deployment decisions
Making external deployment decisions
Specific decisions, concrete stories, & organizational adequacy
Thomas:
I think another place we have different models is that I don’t have a specific decision that I want OpenAI to make right now (well, I do, but I don’t think they’d become closedAI). The goal that I have is more like improving the quality of AGI/x-risk discourse around the entire company, so that when things get crazy, the company as a whole makes better decisions.
I think that the craziness might be mostly internal and that I (being external) won’t have access or know what the right decisions are.
John:
So, I’m kinda split on “we won’t have access or know what the right decisions are, so we should generally improve decision-making in the org”.
At face value, that’s a reasonable argument. It’s extremely difficult to improve internal decision making in an org of more than ~10 people even if you’re running the org, and doing it from outside is even harder, but it’s at least an objective which would clearly be valuable if it were tractable.
On the other hand… I’m going to give an uncharitable summary here, with the understanding that I don’t necessarily think you guys in particular are making this mistake, but there sure are a lot of alarm bells and red flags going off.
Person A: hey, we should go help people think better about X-risk, so that they can make better decisions and thereby reduce X-risk. That seems like a decent way to reduce X-risk.
Person B: ok, but what specific people do we want to make what specific decisions better?
Person A: … no idea. Hopefully the people we help will think of something!
Man, that sure is setting off a lot of red flags that Person A just has no idea what to do, and the people they recruit will also have no idea what to do, or won’t be in a position to do anything. Person A wasn’t actually able to identify which people or decisions would actually be high-leverage, after all; so why do we believe we’re hitting the right people in the first place?
Akash:
Thomas, can you tell an imaginary story where Person X needs to make decision Y, and it was useful to have average employee Z care about AI x-risk?
(Getting specific will of course make the particular story wrong/unlikely, but I think it’d be helpful to get a few somewhat concrete scenarios in mind).
Thomas:
Story 1:
OpenAI and DeepMind both are racing to AGI. An ordinary low-level OpenAI employee implements evals during training, and during one training run, that employee realizes that the intelligence of an AI system is rapidly becoming vastly superhuman (in the same way that AlphaZero became vastly superhuman at chess in < 1 day). That employee stops training and brings the graph of intelligence vs time to the leadership.
OpenAI leadership immediately informs DeepMind that the current scale is very close to the threshold of extremely dangerous superintelligence, and both companies stop scaling. The two companies pause and use their ~human-level AIs to solve alignment before the next actor builds AGI (my guess is they have like a 6 month lead time if they cooperate with each other), and then deploy an aligned superintelligence. [End story]
The main specific thing I would want them to do right now is stop publishing capabilities.
Depending heavily on how the future goes, there are a bunch of really critical decisions that need to be made, and I could list a bunch of scenarios with different decisions that seem reasonable. My inability to directly say which policy should be adopted in the future is because I don’t know what scenario will come up (and presumably it will be a scenario that I haven’t imagined).
Another reason I think it’s important to get lots of people at the AGI company to think well about AGI x-risk is for operational adequacy reasons—it seems like there are many failure modes and having everyone be alert and care about the end result mitigates some of them.
John:
At this point I’ve updated my-model-of-your-model to something like “we want to turn OpenAI or Deepmind or Anthropic or … into a generally-competent organization which will sanely think about and act to mitigate X-risk, in aggregate”. And I do agree that getting the bulk of the employees at all levels up to speed is probably a necessary component to achieve that goal.
Thomas:
I endorse this. (Also I think that the buying time post didn’t say this).
John’s models of institutions
John:
That goal [of making organizations generally-competent] seems… difficult, to put it mildly. Organizations of more than ~100 people are approximately-never sane. Even for organizations of more than ~10, a significant degree of insanity is a strong default. That’s all still true even if the employees are individually quite competent.
And inducing good thinking, agency, etc, in that many people individually is a tough ask to begin with, especially given that we’re not applying any selection pressure. (I.e. OpenAI has already chosen their employees based on things other than sanity/agency/etc.)
So, one natural next criterion to look at is: which sub-components of the goal enable incremental progress?
For instance, “which specific people make which specific high-leverage decisions” still seems like the right heuristic to capture big chunks of value early, while building toward a generally-competent organization.
Akash:
John, I’d be curious to learn more about your models of how institutions work & any examples/stories of how institutions have changed in positive ways (e.g., become more sane/competent).
John:
One fairly useful unifying principle is that, as companies grow bigger, incentive structures and game-theoretic equilibria dominate over what the individual people want. Furthermore, the relevant incentives/equilibria are largely exogenous—they’re outside the control of management, and don’t vary much across companies.
Prototypical example: legibility vs value trade-off, i.e. working on things which are legibly valuable to the org vs just directly optimizing for value. At small scale, everybody has enough context that legible value is pretty well correlated with value. Very tight feedback loops can also make the two match pretty well in some contexts—financial trading firms are a central example.
But the default is that, as companies scale, it just isn’t possible for everyone to have lots of context on everybody else’s work, and feedback loops aren’t super-tight. So, legible-value work diverges from highest-value work. And then things like promotions and raises track legible value, so the company naturally selects for people who optimize for legibility.
… and that’s how we get e.g. Google software engineers optimizing on a day-to-day basis for projects which will check boxes in the promo process (e.g. shiny new features), rather than e.g. fixing bugs which are bad but in a hard-to-evaluate way.
That sort of legibility/value trade-off is the “lowest-level” form of institutional insanity, in some sense, and the hardest to avoid. Further problems grow on top of it with time and even more so with scale. That’s where the kind of stuff in Zvi’s Immoral Mazes sequence comes up: insofar as middle management can determine what information is visible (and therefore legible), they can manipulate the resulting incentives, and there’s selection pressure for people who manipulate incentives in their favor—i.e. hide “bad” information and reveal “good” information—whether intentionally or by accident.
And eventually those practices ossify into a hard-to-change company culture. Partly since people who were favored by one set of incentives/rules will be resistant to changing those incentives/rules. And also partly because people naturally mimic those around them, which generally gives culture some momentum.
Responding to Akash’s prompt: I think institutions becoming more competent is usually not a thing which happens, largely because of the “ossification”. Instead, the most common form of improvement is that a new institution comes along and replaces the old, or some competent sub-institution grows to eclipse the parent.
That said, I don’t think it’s impossible to revive an ossified institution, more that it requires moving levers which people in general do not understand very well. Like, if we had the fabled Textbook From The Future on institutions and management, it would probably be pretty doable.
Connecting models of institutions to AI organizations
John: Alright, tying institution models back to OpenAI/Deepmind/etc and general time-buying interventions: overhauling an org’s culture and making it much more competent is very rare for orgs past ~10-100 people, despite lots of economic incentive to do that for many orgs in the world. That said, it’s probably possible in principle, and e.g. a ~100 person org is small enough that we’re not talking about a full worst-case scenario. Insofar as the organization is structured as a lot of mostly-independent small teams, that also makes the problem easier (and favors talking more to lower-level employees).
Thomas: I have the intuition that even OpenAI being bigger wouldn’t rule out the vision of operational adequacy that I have in mind. My (uninformed) opinion is that totally overhauling culture isn’t necessary to cause people to make reasonable operational adequacy decisions.
Everyone could be optimizing for legible achievements, but if there is still common knowledge that leaking weights or training too big of a model kills everyone, then there is an aligned incentive for each individual to make the relevant x-risk reducing decisions.
(very low resilience)
John: Here’s a central example of the kind of failure mode which I think is hard to avoid: some team in charge of evals releases a suite of evals. Now there’s a bunch of incentive/selection pressure for everyone else to build products which game those evals. Even if each individual employee is vaguely aware that x-risk is a concern, they’re going to get a bunch of rewards of various kinds (publicity, maybe pay/promotion) if their product passes the evals and can be released to more people.
Thomas: Do you think it is possible to create a culture that is resilient to [the central example of failure above]?
John: If I were trying to avoid that kind of problem… I mean, the classic solution is “keep the org under 10 people”, and then everyone can have enough context to notice when the evals are being gamed. Once that level of context is infeasible, there’s various patches we can apply—e.g. design robust evals, keep them secret, provide some kind of reward for not releasing things which fail evals, actively provide rewards for new eval failures, etc. But ultimately all of those are pretty leaky.
Thomas: It seems like this problem gets even worse when you are trying to use evals to coordinate multiple labs. Then there is no hope of people having context to know when things get gamed.
Capabilities evals seem more robust to this (e.g. is this model capable enough to become deceptively aligned?), but there is the same pressure against them.
John: You could pop up a level and say “well, if everyone knows eval-gaming is an issue, at least lots of people will be on the lookout for it and that will at least directionally help”. And yeah, it’s not a solution, but it will directionally help.
On the other hand, something like e.g. not publishing new capabilities advances doesn’t run into this issue much. It’s pretty obvious when someone has publicized a new capabilities advance; that doesn’t take lots of extra context.
So there are specific objectives (like not publishing capabilities advances) which are potentially more tractable. But if we want to turn OpenAI or some other lab into a generally-competent org which will competently pursue alignment, then we probably need to handle things like eval-gaming.
The Elon Musk Problem
John:
For the audience: one of the first “successes” of convincing a high-profile person of the importance of AI X-risk was Elon Musk. Who promptly decided that obviously the problem was insufficiently broad access to AGI, and therefore the solution was to fund a cutting-edge AI research lab whose main schtick would be to open-source their models. Thus “OpenAI”.
There’s lots of debate about OpenAI’s impact to date, but it’s at least not an uncommon view that Musk’s founding of OpenAI has plausibly done more to reduce humanity’s chances of survival than all of the alignment community’s other efforts to “spread the word” have done, combined, to increase them.
Point is: convincing people that AI X-risk is a big deal is not obviously a good idea. It is at least plausibly very net harmful, since it often makes people go do stupid net-harmful things about AI. These days, one of the big concerns is that raising attention will result in the US government racing to build AGI in order to “beat China” to it.
Thomas:
I agree that ‘convincing people AI x-risk is a big deal’ is a double-edged sword. It is definitely not sufficient.
I am more excited about interventions that focus on AGI labs. They are already barreling towards AGI, and it seems like them slowing down or coordinating with each other could be really useful.
I think that one of the big ways that the world is destroyed is race dynamics, and so the major AGI groups collapsing into one project seems really good for the world. It seems that even though this brings the date of ‘humanity is able to build AGI’ closer, it makes the date ‘humanity actually builds AGI’ further away.
We’re grateful to Olivia Jimenez for reviewing the transcript and offering suggestions.