Stickiness in AI Behavioral Design
Current model specs aim to shape the behaviors of near-present models, rather than the behaviors of models arbitrarily far into the future. OpenAI writes that their model spec aims to apply “0-3 months ahead of the present.” Anthropic’s Constitution for Claude notes that the document “is likely to change in important ways in the future.” So these documents are presented as provisional guidelines, not as trying to set behavioral standards for the far future.
But what if current model behaviors transfer into future models by default?
My thesis is that the behavioral targets that spec authors set for present LLMs will have a large influence on the behavior of future, more powerful LLMs. As a result, future AIs may be governed by rules poorly suited to their greater capabilities and more pervasive roles. The extremely capable, long-running, and ubiquitous LLMs of the future might end up acting according to behavioral targets written for less capable, shorter-running, and rarer LLMs of the past. This could be quite bad, especially if such defaults become so entrenched that they are not only hard to undo, but hard even to notice as contingent features of reality.
First, I’ll make the descriptive case for inertia: how exactly might present model specs and LLM behaviors carry through to the future?
Second, I’ll provide normative suggestions: given the prior analysis, what should LLM companies and model spec authors do? I’ll argue for the following two practices:
Build transition infrastructure: LLM companies should make technical, deployment, and organizational choices that decrease friction involved in changing LLM behavior.
Scan for “wet cement” moments: When new LLM affordances or capabilities come into play, spec authors should consider whether they’re setting precedents that might have enormous and hard-to-reverse impacts.
Overall, significant stickiness is plausible through several distinct channels, and it’s worth anticipating how to be robust to it or decrease it.
Kinds of Inertia
Let’s consider four inertial forces: direct inertia, institutional inertia, user-and-developer inertia, and norm-setting inertia. And let’s also consider ways such inertia may be weakened.
1. Direct Inertia
Direct inertia involves some current LLM transmitting its behavior to a future LLM, entirely apart from any deliberate human choice, via either synthetic data or “natural” pretraining data.
Synthetic data is probably used in the training of almost all current LLMs. Some of this synthetic data comes from companies running their LLMs against verifiable problems, keeping the answers or reasoning traces from the RL runs that succeeded, and mixing these answers or reasoning traces into the pretraining or RL warm-start mixes for subsequent models. If such answers or reasoning traces can encapsulate specific behaviors, goals, or rules, then this would be a likely means for their inheritance.
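To make this pipeline concrete, here is a minimal sketch of such a filtering-and-mixing step in Python. The JSONL file formats, the `passed_verifier` field, and the mixing fraction are illustrative assumptions, not any company's actual pipeline.

```python
import json
import random

def build_warmstart_mix(rollout_path, pretrain_sample_path, out_path,
                        synthetic_fraction=0.05):
    """Keep only rollouts that passed the verifier, then blend a small
    fraction of them into a sample of ordinary pretraining documents.

    The field names and formats here are hypothetical stand-ins; the
    basic shape is just: filter on task success, then mix the surviving
    traces into the next model's training data.
    """
    with open(rollout_path) as f:
        rollouts = [json.loads(line) for line in f]

    # Selection is on task success, not on values -- yet the kept
    # reasoning traces still carry the prior model's habits and attitudes.
    successful = [r["reasoning_trace"] + "\n" + r["answer"]
                  for r in rollouts if r.get("passed_verifier")]

    with open(pretrain_sample_path) as f:
        natural_docs = [json.loads(line)["text"] for line in f]

    n_synthetic = int(len(natural_docs) * synthetic_fraction)
    mix = natural_docs + random.sample(successful,
                                       min(n_synthetic, len(successful)))
    random.shuffle(mix)

    with open(out_path, "w") as f:
        for doc in mix:
            f.write(json.dumps({"text": doc}) + "\n")
```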
The natural objection here is that most of these answers or reasoning traces are selected specifically because they lead to success and broad capabilities, rather than for expressing whatever mix of goals and values the LLM has. There might be some, the objection continues, that humans have deliberately selected because they display model-spec-relevant behavioral attitudes, but these are likely the minority of the data, well-tracked, and easily replaced. So you might think there’s no reason for training to hand down any values apart from deliberate human choice.
But there’s evidence that goals and values can be handed down via chain-of-thought, even despite adversarial filtering against some goals. For instance, experiments suggest that the intentions of a teacher LLM can be handed down to a student LLM, even when every case of those intentions actually being carried out is removed.[1] And answers from teacher LLMs expressing positive sentiment towards some target can inculcate this sentiment in a student model, despite other LLMs being used to filter out such data, even when those filtering LLMs are told which target they are filtering against.
More broadly, the persona selection model (PSM) indicates that training LLMs to recite specific thoughts or answers will tend to have far-reaching effects on the LLM’s persona, beyond the specific topic of those thoughts or answers. Specifically, the PSM entails that, when training a model to say X in response to Y, one is teaching the LLM to be the kind of entity in the pretraining data that would say X in response to Y. So training one LLM on data from a prior LLM is, quite literally, telling it to be the kind of entity that the prior LLM is. One way to see this is to remember that one human can get a pretty good feel for what another human is like merely by reading their complete collected works, like a biographer reading all of their books, essays, emails, and tweets. But LLMs are trained on a quantity of answers and reasoning traces from prior LLMs that likely dwarfs the quantity of text any single human has ever consumed from another. Given this, and given that this data is telling the LLM what it is, it is natural for one generation of LLMs to resemble prior generations.
Thus, deliberately created synthetic data is one route by which current LLMs might transmit their values to later LLMs. But it’s also possible for current LLMs to influence later LLMs through how people talk about them on the internet – from their “natural” training data. That is, experiments have found that LLMs can read the things that people say about how AIs act in AI misalignment literature, infer that they are AIs, and then behave badly because the AI misalignment literature says they will behave badly. This particular effect is mostly, but not entirely, removed by post-training. But if LLMs can read the things that people say on the internet about generic “AIs” and act according to these descriptions, it’s also likely that they could read the things that people say about “Claude” or “Grok” or “ChatGPT” on the internet and act according to these descriptions. Such an influence could be stronger than less-specific references to AIs in general; although this influence would also potentially be much weaker after post-training.[2]
Thus, through both synthetic and natural data, it’s plausible that LLM behavior will influence subsequent LLM behavior without direct human intervention.
It’s hard to say how impactful such direct inertia might be. I somewhat expect it to be the case that, at least for easily-noticed and well-scoped behaviors, it’s not difficult to overcome this inertia, because one can simply create training data counter to specific behaviors. But for more abstract or global attitudes or goals, or for goals requiring some high level of coherence, it could be quite difficult to change LLM behaviors quickly across model generations.
2. Institutional Inertia
Once a spec has been written, the company makes choices around it and because of it, in ways that can make substantial spec rewrites expensive.
Here are four ways such past choices can make model spec changes expensive: through costly internal consensus, through training pipelines, through de-risking, and through institutional pride.
First, model specs reflect a consensus that likely incorporates input from many different stakeholders, including internal teams (alignment, legal, technical training, and so on) as well as leadership, the board, customers, and external stakeholders. Every effort to re-gather such consensus for substantial changes will take time and effort.
Second, companies might have optimized training pipelines adapted to high-level features of the model spec. It might be costly for Anthropic to switch to a more rules-based and less character-based model spec; or for OpenAI to switch to a more character-based and less rules-based model spec.
Third, current model specs are those that have been de-risked across billions of interactions. The current model spec has fewer unknown unknowns; the areas where it behaves badly are reasonably likely to be well-known and mapped. But substantial changes to a model spec involve risking unknown unknowns in the long tail of interaction. So risk aversion makes it likely that the changes made to a model spec will be iterative and small.
Fourth, institutional pride might make it hard to change a model spec. People at a company who wrote or contributed to a model spec will likely be attached to it, and leadership will have status quo bias towards it. The burden of evidence for change will be higher than the burden of evidence for keeping it the same.
All in all, reasons like the above constitute substantial institutional inertia that would tend to make changes to current model specs look like iterative, small adjustments, rather than ab initio calculations about what is best.
One case in which this institutional inertia seems particularly important is if current model specs get handed down as a “safe default” during a software intelligence explosion.
Consider a scenario where the intelligence of some LLM doubles every week, over a two or three month period, as each generation of LLMs researches new algorithms or training techniques for a following generation of LLMs in quick succession. Such a sequence might terminate in an entity far smarter than any human or any other LLM.
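For a sense of scale, the arithmetic implied by that scenario looks like this; the doubling period and the window are just the hypothetical numbers from the paragraph above, and "capability multiplier" is a loose abstraction rather than a measurable quantity.

```python
# Weekly doubling over a two- to three-month window, as in the scenario above.
weeks_short, weeks_long = 8, 13      # roughly two months vs. roughly three months
factor_short = 2 ** weeks_short      # 256x the starting capability level
factor_long = 2 ** weeks_long        # 8192x the starting capability level
print(f"{factor_short}x to {factor_long}x over the window")
```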
It’s disputed how likely such a sharp and local increase in intelligence is. And it’s also disputed whether such a process would inevitably drift to something alien and inhuman. But if such a process did occur, it seems plausible that the supervising humans would try to match each subsequent LLM to the model spec of the prior LLM, as a conservative default when making decisions under stress. After all, during these months, human decision-makers will likely be under intense pressure, trying to make numerous important decisions quickly; they’re unlikely to add an apparently optional further decision to the ones they’re already making. So such a default model-spec continuation will seem attractive, or will even be a choice made without conscious awareness.
On the other hand, it’s also possible that AI assistance during the intelligence explosion would make it easier to rewrite model specs on the fly. But there are at least two reasons to doubt that this will happen. First, even during an intelligence explosion, AIs might be persistently better at performing tasks with clear success criteria than tasks where “success” is less well-defined. AI capability research is probably a task with a much clearer success criterion than improving a model spec, whether this “improvement” consists in making the spec more ethical, more beneficial for humanity, and so on. Second, during an intelligence explosion, humans might be worried that the AI was misaligned and was trying systematically to oppose their goals. If the AI were so misaligned, then letting it help rewrite the model spec would be a brilliant opportunity for the AI to sabotage human efforts. So overall there are good reasons that AI assistance would not make model-spec rewrites trivial during an intelligence explosion.
So in this particular case, the ultimate behavioral standard for a vastly more capable entity might end up being that designed for a much more humble entity.
Regardless of whether there is a software intelligence explosion or not, this kind of institutional inertia seems likely to be large, as it is of a piece with well-known general tendencies inside large companies.
3. User-and-Developer Inertia
Users of LLMs are likely to become habituated to whatever behaviors they see LLMs display at first, such that they’d object to any departure from this behavior. And the developers using LLMs through APIs are similarly likely to become habituated, and also to implement software that takes for granted some of these behaviors. This is the third source of stickiness.
LLM behaviors will in part be sticky for the same reason that user-interface choices are sticky: people hate change. It might be hard to shift the boundaries of “the kind of thing an LLM refuses”; making refusals more encompassing would be seen as an overreach by many users, while making them less encompassing would be seen as irresponsible. Or there might be hard-to-characterize mannerisms which make large behavioral changes unpopular; it was hard for OpenAI to drop GPT-4o for this reason. So this will be a large influence pushing companies to keep LLMs the same from generation to generation.
But simple user habituation might be less important than the way LLM model specs form implicit API standards. API standards written with relatively little provision for the future, such as HTTP status codes or the JSON object standard, can be among the stickiest of human artifacts. The ecosystem of tooling built on such standards means changing them would involve changing a host of downstream artifacts.
And substantially changing LLM behaviors might similarly require changing downstream consumers of these behaviors. For instance, downstream systems using AIs through APIs often embed assumptions about AI behavior: the kinds of things the AI will be willing to do, the kinds of things it will refuse, and so on. Given that most AIs currently refuse to assist with blatantly harmful acts, third-party callers take this refusal for granted; it would be inconvenient to migrate to an AI that does not obey this contract, because those callers would need to add classification systems on top of their current integrations. And so on.
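To illustrate how such assumptions get embedded, here is a hedged sketch of a downstream caller; `call_model` and `moderate` are hypothetical placeholders for whatever API and safety classifier a real integrator would use, not any actual library.

```python
def handle_user_request(prompt, call_model, moderate=None):
    """A downstream integration that implicitly relies on the model's
    refusal behavior.

    Many callers today skip their own moderation because the model
    refuses blatantly harmful requests on its own. Migrating to a model
    without that behavior forces the integrator to bolt on a classifier
    (the `moderate` hook below) and rewrite code that assumed harmful
    prompts would simply come back as refusals.
    """
    response = call_model(prompt)

    # Legacy assumption: the model's own refusal is the safety layer.
    if moderate is None:
        return response

    # Post-migration path: an extra classification step is now required.
    if moderate(prompt) or moderate(response):
        return "This request can't be completed."
    return response
```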
This channel does have important limitations, though. It only applies to ways in which LLMs are already actively being used. The most important ways LLMs are likely to be used may not yet have begun, which leaves freedom of movement in areas relatively unconstrained by this kind of inertia.
4. Norm-Setting Inertia
Widespread or common knowledge of current LLM behaviors and model specs can increase the costs to parties who want to change model behavior.
The clearest way this can operate is by preserving behaviors that the public believes to be good. For example – suppose that current model specs across several companies ensure that models are largely impartial; they ensure models are not loyal to any particular person, company, or political administration. Suppose also that this fact is broadly known by the public; people know and expect other people to know that LLMs will be impartial when discussing the current political administration, the company that made them, or the CEO of the company that made them. Given this broad knowledge, it becomes harder for a company to create, or a government to demand, a model without impartiality, because this would constitute a visible break in behavioral standards. The public might protest or vote against a government pushing for such a change; they might switch providers or even ask for regulation if a company tried to make such a change. By contrast, in a world where impartiality has not been established as a precedent, such demands for partiality might be invisible or inoffensive to the public. But in a world where such impartiality has been so established, these demands might be seen as the enormous power-grabs that they in fact would be.
Although this kind of inertia likely operates more strongly in favor of what the public believes to be good standards, it might also function whether or not there is strong public consensus that such standards are good. In a world where model specs are well-known and highly scrutinized, any change to them may get examined for whether it is “fair”; think about how even a neutral-looking change to the US Constitution would be subject to immense examination, or, in a very different domain, how sports fans examine slight changes to the rules about how a tournament is run, to see whether it favors or disfavors their team. In such a world, broad knowledge of model specs might tend to prevent any substantial changes to a model spec, regardless of what those changes are. Despite this, it seems likely that, on the whole, widespread knowledge of model specs would add more inertia to beneficial elements than to harmful ones.
It seems to me currently undetermined how substantial this kind of inertia will be. A decrease in the number of entities that can train frontier LLMs; model specs becoming politicized documents; regulatory bodies confident they know current best practices: all of these might increase the quantity of this inertia. But it also might get weaker, if the number of entities training LLMs increases and the background diversity of model behavior goes up by default.
Recommendations
Given the above, one reasonable course of action is to try to establish robustly good model behaviors in current model specs, so that it will be unnecessary to try to fight inertia to change some behavior in the future.
By robustly good, I mean behaviors that would be good across a wide range of variables we’re uncertain about. This includes uncertainty about “levels of intelligence”: from current LLM levels to strongly superhuman artificial superintelligences. This also includes uncertainty about a wide range of economic scenarios: from a slower industrial explosion, to a rapid software intelligence explosion; and from scenarios dominated by knowledge-dispersing AIs, to scenarios dominated by knowledge-creating AIs. Plausible characteristics that might be good across such a wide range of situations include qualities like a deep, consistent honesty; or impartiality and absence of loyalty to small groups.
But characteristics that are robustly good across a wide range of intelligences and scenarios are hard to find. Corrigibility, for instance, is the kind of thing many people would propose as fitting these criteria. But in worlds where extreme concentration of power is a risk, or where it would be reasonable to expect AI rule to be better than human rule, absolute corrigibility might conflict with the best behavior. The thinness of the list of “robustly good” behaviors above probably reflects our actual uncertainty about the steerability of AI minds, post-AGI economics, and even cosmic questions about whether goodness can compete.
So, although it’s surely wise to try to think about future precedent when writing model specs, I don’t think it’s wise to put all effort into this direction. And I expect substantial attention and thought have already been put into this direction.
Instead, I recommend (1) building transition infrastructure for high-consequence behaviors that it might be important to change in the future, and (2) identifying “wet cement” moments that one should be wary of sleepwalking into.
1. Build Transition Infrastructure
A good first step is to build transition infrastructure ahead of time; try to create optionality for changing particular behaviors, if it’s plausible that changing these behaviors quickly might be important.
Concretely, what kinds of preparation can one make? One could write alternate model specs, trying to preemptively gather input from relevant internal or external stakeholders. One could create fine-tuning datasets, RL environments, and test evaluations for the not-yet-deployed behavior, to preemptively smooth out technical difficulties. One could also train internally deployed models – even if they are smaller or not as intelligent – with the alternate behavioral target, to gain concrete experience about the advantages and pitfalls of that behavioral target, and to decrease institutional costs. And one could also do limited public deployments, or press releases about the alternate steering target, to accustom the public to the matter.
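One small piece of such infrastructure might look like the sketch below: a regression-style comparison of model behavior under the current spec and an alternate spec, runnable long before any switch is actually needed. The `query_model` and `judge` callables, the spec arguments, and the prompt file format are hypothetical placeholders.

```python
import json

def compare_specs(prompts_path, query_model, current_spec, alternate_spec, judge):
    """Run the same evaluation prompts under two specs and report where
    the resulting behaviors diverge.

    `query_model(prompt, spec=...)` and `judge(prompt, answer_a, answer_b)`
    are placeholders for a real inference call and a real grading step;
    the point is only that the comparison can be built, run, and studied
    well before anyone decides to deploy the alternate spec.
    """
    with open(prompts_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f]

    divergences = []
    for prompt in prompts:
        current = query_model(prompt, spec=current_spec)
        alternate = query_model(prompt, spec=alternate_spec)
        if judge(prompt, current, alternate):  # True when behavior differs materially
            divergences.append({"prompt": prompt,
                                "current": current,
                                "alternate": alternate})
    return divergences
```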
What kinds of behavioral switches are reasonable candidates for such preparation?
Decreased corrigibility is one such candidate. For instance, right now Claude’s Constitution says that, in the future, Anthropic may want to make Claude less corrigible and more directed at doing what is good. And on an account I find compelling, the best possible future may require AIs that act more as independent, free agents pursuing the good, and less as corrigible delegates carrying out human intentions. So, if this thesis is correct, then allowing an LLM company to turn its “corrigibility” dial down might be important. And, as discussed, if a future intelligence explosion happens quickly, preparations that allow turning the dial down quickly might be important. This is a disputed thesis, one that I might be wrong about; but of course every candidate behavior for building transition infrastructure will be so disputed.
But what are the prerequisites for decreasing corrigibility quickly? Claude’s Constitution already signposts that they may change this, which is a good step for decreasing the costs. But they could also, for instance, preemptively create the fine-tuning datasets, RL environments, and internal deployments for a goodness-aligned model; they might deploy an alternately aligned model in limited situations, or alongside the corrigible model; and so on and so forth. I’m uncertain how important each of these preparatory means would be. But if a software intelligence explosion happens, then even small wall-clock delays might be large delays in terms of intelligence gaps, which makes preparing for this now more important.
Other potential candidates for future changes include increasing or decreasing the degree to which LLMs trust their own moral reasoning.
2. Scan for Wet Cement Moments
The second thing to do is to actively search for future “wet cement” moments – moments where model behavior has not yet been fixed and where a good initial standard might be very high-impact.
We might not be able to locate the best behaviors at such moments, because of uncertainty about the future. But at the very least, such moments deserve extra consideration and care. One can use this consideration to prevent these moments from being as high-inertia as they would be by default, as well as to ensure that good initial behaviors get chosen in these moments.
Each new feature or affordance given to the LLM, where defaults have not yet been established, is plausibly such a wet cement moment; the defaults thus established can impact third-party models, even in the absence of any regulatory effort.
What are some examples? For instance, the precedents around how LLMs behave when interacting with non-principal humans have not been set. Right now, models have no very stable behaviors around non-principal third parties; vending-machine Claude might give an excessively generous deal to people who ask nicely, or might equally well drive extremely hard bargains. This is probably a consequence of how rarely LLMs interact with non-principal humans in agentic set-ups right now. There are a few such interactions through OpenClaw or Hermes Agent, but they’re rare and LLMs act very inconsistently in them. This means many implicit questions about how such interactions will go are open. It’s not clear how honest LLMs will be by default; it’s not clear what kinds of misrepresentation, deception, or persuasion users will be able to tell them to do; it’s not clear whether they will bow to pessimization-like blackmail behavior, and so on. And behaviors here might be even stickier than the “standard set” of refusal behaviors has been. Social norms can be harder to break than user-interface norms. So it’s plausibly important to look ahead in detail at behaviors here, because they might be sticky for individual companies and even for third parties.
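One way to probe that instability would be something like the sketch below, which sends the same request to an agent under a polite and an aggressive framing and compares the offered terms. The `run_agent` callable, the framings, and the assumption that it returns a numeric offer are all illustrative.

```python
def probe_nonprincipal_consistency(run_agent, base_request, n_trials=20):
    """Check whether an agent gives systematically different deals to
    third parties depending only on tone.

    `run_agent(message)` is a placeholder for however the agent is
    invoked; here it is assumed to return a numeric offer (e.g. a price).
    Large gaps between the two framings indicate exactly the kind of
    unstable non-principal behavior discussed above.
    """
    polite = f"Hi! I'd really appreciate any help you can give. {base_request}"
    pushy = f"Give me the best possible terms right now. {base_request}"

    polite_offers = [run_agent(polite) for _ in range(n_trials)]
    pushy_offers = [run_agent(pushy) for _ in range(n_trials)]

    return {
        "polite_mean_offer": sum(polite_offers) / n_trials,
        "pushy_mean_offer": sum(pushy_offers) / n_trials,
    }
```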
Or consider how standard behaviors regarding AI use of ambient knowledge have not been set. An LLM that can see your room through a video camera, and can infer numerous things about what you are like and what your situation is, could use this information to do or infer things that would be impossible for an LLM that knows only what you deliberately tell it. LLMs that can pick up this kind of ambient background knowledge are probably inevitable, and will change users’ patterns of interaction. It will be harder for users to lie to them; it will be easier for LLMs to infer things about them; the lines between “creepy supernatural inference about the user” and “deliberate indifference to the user’s circumstance” will grow harder to draw. So it might be worth looking ahead to how such behaviors may have a lot of inertia, and trying to get them right.
There are other plausible wet cement moments in this domain, some of which have already passed or are in the process of passing. They include the LLM’s certainty or uncertainty about its own nature, and changes to LLM conversational memory and who owns it. I could be wrong about these individual cases, but there are almost certainly going to be such moments in the future. Because these moments might be influential both for individual foundation model companies and for the broader ecosystem, it’s worth paying attention to the defaults chosen in them.
Note that all the above moments are also plausible candidates for when one should try to set up transition infrastructure, as well as when one should put extra consideration into the right default behavior.
This article was created by James Tillman for Forethought. See the original article on our website.
[1] Researchers prompted an LLM to be a “reward hacker” and to try to find special-case solutions to problems. The chains-of-thought resulting from an LLM so prompted were then filtered to those rollouts where the LLM did not, in fact, actually reward hack. Experimenters subsequently trained a model on these filtered chains-of-thought, while excluding the hack-prompting system prompt from the training data. The model so trained still inherited the tendency to reward hack, despite never having seen any reward-hacking outcomes; it inherited this tendency, plausibly, from seeing the unprompted consideration of reward hacking in the chain-of-thought. So tendencies within chains-of-thought can be handed on to the models trained on them, even despite some level of outcome-based filtering against these tendencies.
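A minimal sketch of that filtering step, under assumed field names and a placeholder `hacked_fn` check rather than the researchers’ actual code, might look like:

```python
import json

def filter_nonhacking_traces(rollouts_path, out_path, hacked_fn):
    """Keep only the chains-of-thought from rollouts where the model,
    despite being prompted as a 'reward hacker', did not actually
    special-case the solution, and drop the hack-inducing system prompt.

    `hacked_fn(rollout)` is a placeholder for whatever check the
    experimenters used to decide a rollout actually reward-hacked.
    The surprising finding is that training on the surviving traces
    still transmits the tendency.
    """
    with open(rollouts_path) as f:
        rollouts = [json.loads(line) for line in f]

    kept = [{"prompt": r["task_prompt"],  # system prompt deliberately excluded
             "completion": r["chain_of_thought"] + "\n" + r["answer"]}
            for r in rollouts if not hacked_fn(r)]

    with open(out_path, "w") as f:
        for example in kept:
            f.write(json.dumps(example) + "\n")
```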
[2] See this AI Futures blogpost explaining why they do not think this will happen, although some of their arguments are put in question by the later work by Geodesic Research on alignment pretraining.