TL;DR

This post digs through Holden Karnofsky’s ‘AI Could Defeat All of Us Combined’ (AIDC) post and works through some of its claims in light of other work by Ajeya Cotra and Richard Ngo. It was originally a response to the latter’s suggestion that someone should recast that post as a scenario, but I struggled to do that without fleshing out the argument for myself. I try to imagine the world around 2036, when the first human-level AI shows up (per AIDC), in the context of current and forecast geopolitics.

My (weakly-held) takeaway is that the AIDC notion of ‘hundreds of millions’ of AIs, nominally engaged in scientific R&D, that then turn on their employer seems to leave out the likely geopolitical context and the likely messiness that makes any prediction really difficult. Specifically, in the 14 years to 2036, there would be vicious competition (which we are seeing right now, most visibly in the recent US semiconductor sanctions on China) over resources (raw materials and territory, but also AI-related IP and compute) between countries who might view a unipolar AI world as a near-existential threat. This strategic aspect of AI might mean research becomes highly militarised, thereby introducing specific biases into training regimes, as well as creating an obvious way for the ‘AI headquarters’ to be established. This background of latent or kinetic conflict may ultimately dictate what the pre-2036 environment (i.e. the transition from the time of sub-human-level AI to human-level AI capable of being deployed en masse) looks like, and therefore what the (assumed) hardware overhang is actually used for (commercially-valuable scientific R&D might not be as immediately and strategically useful as fleets of satellite attack drones, moon mining rigs, or ASI researchers). Moreover, I’m not sure that hundreds of millions of radically new workers that show up in ~1 year are either a) a resource management problem humans presently know how to deal with (or are likely to in 14 years), or b) necessarily that broadly useful outside of certain sectors that have the right features to make use of large numbers of intelligent knowledge-workers. Finally, I don’t think such large numbers of AI copies are particularly critical to AIDC’s conclusions (which I think are plausible) about an AI takeover.

This is very much a non-specialist view, and depending on feedback, I will try to recast this as a short scenario in coming weeks.

Introduction

This post analyses the argument in Holden Karnofsky’s 2022 post ‘AI Could Defeat All of Us Combined’ (AIDC). My summary of AIDC is: ‘Human-level AI (HLAI) is developed by 2036, is useful across a broad range of jobs, and is rolled out quickly. There is sufficient overhang of servers used to train AI at that time that a very large number of HLAIs, up to hundreds of millions running for a year, can be deployed. In a short period, the HLAIs are able to concentrate their resources in physical sites safe from human interference. Humans are unable to oversee these copies, which coordinate and conspire, leading to an existentially risky outcome for humanity.’

AIDC is somewhat dramatic in several respects: the relatively short timeline (2036), the vast numbers of HLAIs (up to hundreds of millions), the establishment of an ‘AI headquarters’, and the ‘total defeat of humanity’. I wanted to connect some of the post’s claims with (my understanding of) current alignment thinking^[1] as well as apparent trajectories in contemporary geopolitics.^[2]

I decompose AIDC’s top-level claims into the following sub-questions:

Human-level AI (HLAI)
1. Why are we talking about ‘human-level’ AIs?
2. What are the types of tasks such HLAIs are likely to be useful for?
3. Would HLAIs need actuators in the world or can they be software?
What are the dynamics of deploying the HLAIs?
1. Does the current geopolitical or economic context tell us anything relevant?
2. The role of human collaborators
3. How plausible is the idea of an AI headquarters?
4. How might the AI-economy feedback loop work?
How the HLAIs might turn
1. How do they achieve situational awareness?
2. What can we say about propensity to coordinate for a large group of HLAIs?

What Might a HLAI Look Like?

AIDC focuses on AIs with human-level capabilities, which makes sense in that this is the level most fleshed out in terms of required computation: Ajeya Cotra’s ‘Biological Anchors’ framework^[3] grounds AIDC’s assumption that a human-level AI would cost roughly 10^30 FLOP to train and 10^14 FLOP/s to run (once trained).

One fairly concrete description of how a general-purpose, pre-2050s AI might look is the ‘Human Feedback on Diverse Tasks’ (HFDT) post from Ajeya Cotra^[4], and the related paper from Richard Ngo which does something similar from a more granular, deep learning-informed perspective. In this post, I assume that an AI as of 2036 will basically follow the forms Cotra and Ngo describe, and while that is likely to be wrong, it is the most concrete and grounded-in-current-SOTA approach I have seen.

Returning to HLAI: what is actually meant by ‘human-level’ in an operational, task-orientated sense, rather than in a computational one? This might seem like a pedantic question, and one that has been extensively researched, but the answer matters for things like task selection (below) and coordination. ‘Human-level’ isn’t defined in AIDC, but one (narrow and unsatisfactory) possibility is an intelligence that can learn from others’ experience through use of language and, as HFDT emphasises, can be highly creative.^[5] I would add a few more aspects of human intelligence:

making plans;
predicting (to a low level of recursion) what other humans or collections of humans might do;
incorporating the knowledge accumulated in culture (which is not just linguistic, but includes know-how);
being able to ‘go meta’ (see a problem from the outside, or one level higher in abstraction).^[6]

HFDT is basically what it sounds like. Here is Cotra’s summary: ‘Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.’ In addition, she assumes the developers of the system take some relatively obvious (‘naive safety effort’) steps to ensure the trained model is safe, as long as those don’t materially reduce time-to-market or raise costs.^[7]

I would build on the list above to arrive at the following features specific to HLAI (these are my extrapolations from AIDC’s succinct functional description of the term, as well as inferences from HFDT):

Planning: They plan over weeks, months, and years, rather than over centuries or millennia.
Situational Awareness: They possess models of their immediate environment and the world-at-large (economic, political, strategic).
Domain Capabilities: They have domain-specific knowledge to do whatever task they are deployed to do, probably better than most humans, but this will depend on the task and on how many ‘touchy-feely’ human-like capabilities (empathy, fleshy/physical presence, cunning, humour, etc.) are required.
Human Modelling: They possess models of their trainer(s) that are perhaps not as good as humans’ models of other humans, and are likely arrived at through different means. But HLAIs’ psychological models are more informed by the theoretical and experimental literature, human-written fiction, and massive quantities of socially-generated data, and are therefore increasingly accurate in a population-wide empirical or statistical sense.
Software Dominance: They have an excellent (i.e. better than humans) understanding of the systems they use and are trained/evaluated by (whether AI or more prosaic software like Slack, Discord, Excel, Adobe Creative Suite, or their equivalents in 2036).
Peer Modelling: They have excellent (on par with or better than humans’) models of their peer HLAIs, since they are capable of introspection and understand the context in which they were created.
Symbol Grounding: AIDC, HFDT, and Ngo are silent, or perhaps agnostic, on whether learning systems need to ground their representations in physical reality. Ngo gives a brief justification for this view in this response to a Melanie Mitchell paper on the topic.^[8]

Lastly, I would tentatively suggest that HLAI is still a vaguely unsatisfactory framing, at least as used in AIDC or HFDT, in that it seems to assume AIs working more or less as independent units or perhaps in teams (like humans), rather than in a dramatically more collective and coordinated body.^[9] It also feels semantically and functionally ill-specified, as it doesn’t address the breadth of human intelligence, and, more importantly, the transition between human-level and superintelligence. I’ll revisit both topics below.

Tasks and Motivations

AIDC proposes that HLAI could be deployed in more or less large numbers; what sort of tasks could it do? AIDC suggests this might include R&D, or trading financial markets, while HFDT expands as: ‘everything from looking things up and asking questions on the internet, to sending and receiving emails or Slack messages, to using software like CAD and Mathematica and MATLAB, to taking notes in Google docs, to having calls with collaborators, to writing code in various programming languages, to training additional machine learning models, and so on.’

Converting IP to value

Stepping back slightly, are these (desk or online-based tasks) the only things ‘several hundred million’ highly-capable humans need to do? What else would be needed to convert the results of R&D (product designs, plans for other AIs, new medicines, etc.) into economic value?

I would argue that products need to be developed, which involves (for non-software items) iterating through successively refined physical designs, manufacturing using components that may or may not be readily available, (clinical) trialling with customers, securing regulatory approval, and then marketing/sales. Some of these tasks can be automated, but others involve a high degree of human interaction^[10]. These things take time^[11], whereas I’m assuming, for reasons discussed below, the monopoly period of the first HLAI is relatively short. Further (regulatory) friction obtains from the likelihood that this process of iteration would be happening while competitors lobby vigorously to hobble the HLAI-powered company.

In short, HLAI in certain scenarios (where ideas and plans need to be materialised into actual marketable product) seems to require a) quite a lot of knowledge about the physical world, something not particularly addressed in AIDC but assumed as part of Cotra’s scenario, and b) some way of iterating quickly through design-production cycles. For this reason it is not obvious to me that hundreds of millions of HLAIs can be usefully or quickly deployed in a way that results in actual products or marketable inventions.

However, if an entity (company or government) has managed to deploy HLAI credibly, and produce, say over 1 year (subject to R&D time-cycles), a portfolio of valuable IP, then that company could presumably raise capital as equity or debt, essentially ‘bridging’ to the time when IP can become product.

Is software enough?

The comments above notwithstanding, software-based AI seems like the fastest, most direct way to get from a ‘pre-HLAI’ world (where such AIs as exist are not as general or as capable as humans) to a world with HLAIs deployed at scale. For instance, it is possible that other industries or bureaucratic functions become, in the 14 years to 2036, quite used to (in the sense of operational or business processes) high levels of automation: surveillance and population control; exploration and mining in hazardous environments; battlefield applications. It may be the case that, in the years just before 2036, one or more parties (who are not ‘traditional’ AI companies e.g. Meta, Google/DeepMind, OpenAI) enter the field, have very different perspectives and ambitions, and don’t necessarily have the same risk tolerance, cultural hangups, or regulatory environment as the incumbents.^[12]

In some of these cases, the ‘software’ AI would be heavily integrated into physical, real-world processes, lending support to AIDC’s contention that HLAI need not be embodied to be dangerous in the world.^[13]

Two Interacting Races

I view the AIDC argument through a frame of two interacting race dynamics: a) one between now and 2036, that is both a geopolitical and a technological one, primarily played out in a human-dominated world with some AI but no HLAI, and b) a post-2036 one involving some mixed human and (perhaps large i.e. ~10^6-10^7) HLAI population.

The geopolitical context

AIDC doesn’t describe what the world leading up to 2036 looks like, but the related framing by Ajeya Cotra suggests a scene similar to today’s in respect of AI. At a geopolitical level, the next decade, in my view, is likely to see a significant realignment, as the United States attempts to preserve its position atop the strategic landscape, while China becomes more assertive and challenges the status quo. These moves are mirrored by a realignment in financial markets, with multiple reserve currencies trading in parallel, (nominal and real) interest rates rising significantly, elevated commodity and land prices, autarkic supply chains, and significant spending on the climate transition (that soaks up global savings).

These economic changes happen against a background of continued regional tensions which manifest as kinetic wars, asymmetric conflicts, and cyber/informational warfare. Climate change may also marginally increase repressive tendencies in the West, as countries seek to clamp down on climate-driven immigration in the face of nativist politics.^[14]

The upshot for AI is that the US maintains a lead in semiconductor design, combined with an improved capacity to manufacture, while attempting to delay or suppress Chinese progress in semiconductors (through export controls).^[15] China continues to be a peer competitor, perhaps closing the gap in patents and citations; the EU remains influential as a market and regulator; and India and Russia remain, from an AI perspective, marginal players (though they may join one bloc or another).

On the cost of training HLAI

In the years immediately before 2036, we might understand better how much the (perhaps multiple) training runs of the first HLAI might cost. Currently, the 10^30 FLOP of compute suggested in AIDC^[16] would cost about $10TN, compared to world net wealth of $510TN and world GDP of $100TN (2021 figures, based on a price of $1 per 10^17 FLOP^[17] as of 2021).^[18] Cotra, in Biological Anchors, also uses a different approach: she assumes that the maximum willingness to spend on a single training run would be ~$1BN in 2025,^[19] going up to ~$100BN in 2040. For reference, the latter figure is about 1-3x the net profit of (each of) Saudi Aramco, Apple, Alphabet, Microsoft, ICBC, CCB, Agricultural Bank of China, Samsung, Bank of China, and Amazon. Another rough comparison comes from market capitalisation: PetroChina, Apple, Amazon, Alphabet, Meta, Microsoft, Tesla all have caps in the $1.2-3TN range (2022, unadjusted numbers).
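The arithmetic behind these comparisons can be checked with a short sketch. It uses only the figures quoted above; the variable names are illustrative, and the final ratio (how far compute prices would have to fall for a 10^30 FLOP run to fit within the ~$100BN figure) is a derived quantity rather than a number taken from AIDC or Biological Anchors.

```python
# Back-of-envelope check of the training-cost figures quoted in the text.
# All inputs are the post's 2021 numbers; variable names are illustrative.
TRAINING_FLOP = 1e30            # compute to train one HLAI (AIDC / Biological Anchors)
FLOP_PER_DOLLAR = 1e17          # assumed 2021 price: $1 buys 10^17 FLOP
WORLD_GDP_USD = 100e12          # ~$100TN
WORLD_NET_WEALTH_USD = 510e12   # ~$510TN
MAX_SPEND_2040_USD = 100e9      # Cotra's ~$100BN willingness to spend in 2040

# Cost of one training run at 2021 prices.
training_cost_usd = TRAINING_FLOP / FLOP_PER_DOLLAR

# How that cost compares with the two world-scale benchmarks.
share_of_gdp = training_cost_usd / WORLD_GDP_USD
share_of_wealth = training_cost_usd / WORLD_NET_WEALTH_USD

# Derived: factor by which the price per FLOP must fall to fit the 2040 budget.
required_price_drop = training_cost_usd / MAX_SPEND_2040_USD

print(f"Training cost at 2021 prices: ${training_cost_usd / 1e12:.0f}TN")
print(f"Share of world GDP: {share_of_gdp:.0%}")
print(f"Share of world net wealth: {share_of_wealth:.1%}")
print(f"Price drop needed to fit $100BN: {required_price_drop:.0f}x")
```

On these inputs the run costs about 10% of world GDP and about 2% of world net wealth, and compute prices would need to fall roughly a hundredfold for the run to fit within the ~$100BN budget.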

Given the strategic importance of AI and the uncertainty about how quickly compute costs might fall (as well as about the amount of compute needed to train HLAI), my intuition is to place more weight on the possibility that, no matter who actually makes the first HLAI, there will be a significant government hand driving and paying for it. Defence budgets, space exploration spending, or proportion of GDP are therefore more useful benchmarks.^[20]

Overhangs

AIDC leans heavily on the idea of a compute overhang, that is, about 10^30 FLOP worth of hardware, broadly within the control of one entity (a company, consortium, or country).
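To see where ‘hundreds of millions’ comes from, here is a minimal sketch that assumes the compute spent on training (10^30 FLOP) is simply redeployed to run copies at the 10^14 FLOP/s figure quoted earlier, for one year. That redeployment assumption is the simplification being illustrated, not a claim about how hardware would really be allocated; the names are illustrative.

```python
# How many HLAI copies could the training compute run for one year?
# Assumes the 10^30 FLOP training budget is redeployed entirely to inference.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds
TRAINING_FLOP = 1e30                    # training compute, per the text
RUN_FLOP_PER_SECOND = 1e14              # compute to run one copy, per the text

# FLOP consumed by a single copy running continuously for a year.
flop_per_copy_year = RUN_FLOP_PER_SECOND * SECONDS_PER_YEAR

# Number of copy-years the training budget would buy.
copies_for_one_year = TRAINING_FLOP / flop_per_copy_year

print(f"Copies runnable for one year: {copies_for_one_year:.2e}")
```

The result is roughly 3 x 10^8 copy-years, consistent with the ‘hundreds of millions’ in AIDC.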

I see three other possible overhangs. The first is of engineers (who worked on the first HLAI and are possibly somewhat idle, though it isn’t clear whether the HLAI would need, like other software, constant tweaking and patching). The second is a training dataset that was generated at great cost. Lastly, there may be a large suite of software tools or weak AIs that were used to train the HLAI (for instance, used in creating realistic simulated environments to aid the training process).

Guns eat butter

How would these (hardware, engineer, data, and software) overhangs be utilised? Taking the most obvious case of hardware, there would be two pressures. The first is an economic one, to deploy the HLAIs to work on commercially valuable tasks, as detailed above. The second is strategic: economic actors, whether private or partially state-controlled, would face immense pressure to convert the theoretical or intangible benefits of HLAI into tangibly useful strategic resources in the real world. For instance, HLAIs might be used to staff mines and oil platforms, operate drones that deny territory to adversaries, or run moon-based ore refineries. I would expect a significant push to deploy HLAIs towards military R&D, acquisition of hard resources, a quasi-Schmittian race for territory, and pervasive domestic surveillance.^[21] I would expect these types of applications (resource acquisition and geospatial control), perhaps more than the AIDC/HFDT vision of scientific R&D, because the latter seems characterised by contingent payoffs, longer payback times, more diffuse benefits, and organisationally/socially complex consequential chains.

ASI and a gradient towards hegemony

Per above, I am assuming a world with at least two great spheres of influence. It is possible that since the first-mover (i.e. who reached HLAI earliest) would be uncertain as to how long they have a monopoly, they would face strong pressure to somehow advance the game to the next step. This crystallisation of advantage could take the form of amassing IP/resources/territory/weapons, as above, or (I speculate) in completing a transition to ASI and establishing a more lasting hegemony (or decisive/major strategic advantage, to use alignment terminology).^[22]

At present, timelines to ASI are contested, with views ranging from a month to a decade and no consensus: see Superintelligence, Ngo’s 2020 sequence on the AGI-ASI transition, and Ajeya Cotra’s disagreement with fast takeoff scenarios that draw on the evolutionary interval between chimpanzees and humans.^[23]

The Dynamics of Transition

The transition from the years before 2036, which are (in AIDC’s nearcast or Cotra’s assumptions) reasonably similar to 2022 at least in terms of economic growth or the pace of scientific research, could be particularly fraught.

Pre-2036

The first-mover would have a practical monopoly on the technology for a very limited period (say between 6 months-2 years, which I’m drawing from a related essay by Karnofsky), at which point other parties are also deploying versions, which might be very similar (in terms of architecture, training, neural weights) for the economic and alignment reasons Evan Hubinger outlines here.

Given the apparent benefits to the first-mover in HLAI, we can assume both the US-aligned ‘West’^[24] and a Chinese-aligned bloc^[25] would be locked in an active race to develop or steal the technology. As a second-order effect, would the run-up to 2036 change the overall geostrategic balance? For instance, if one side seemed likely to achieve HLAI, would there be a strong incentive for the other to conduct (perhaps plausibly deniable) pre-emptive strikes to degrade capabilities?^[26] In anticipation of this, would it be rational for both parties to create physically- and cyber-hardened sites for AI research, similar to the US national laboratories for nuclear research, nuclear command-and-control sites, or the Soviet ‘closed cities’? These secure sites would, in principle, be obvious places where an ‘AI headquarters’ might, in due course, be established, since they would be constructed to have their own power supply, be robust to missile strikes, and be (initially) staffed by humans who might be very isolated from the world-at-large, and who may have quite particular ideological, nationalistic, or chauvinistic views that give vastly more weight to national interests than to those of humanity.^[27]

Post-2036

How stable would the post-2036 situation be? In AIDC, the basic assumption is of a fleet of up to hundreds of millions of HLAIs running for a year. In any case, I assume that geopolitical considerations mean that the HLAI monopoly would dissolve into one of a few possible (game) states: other powers acquiring the technology; the first-mover acquiring an unassailable lead; perhaps a cooperative scenario akin to the Baruch Plan. These transitions may be accompanied by economic, cyber, or kinetic conflicts, much as the current rise of China seems to be creating conditions for conflict. HLAI takeover considerations, of the sort outlined in AIDC, are certainly possible in this framework, but the course of events might (at least initially) be dominated by a variety of geostrategic factors.

Put another way, a world with HLAI that can (potentially) radically alter economic and military realities may aggravate existing instabilities substantially. This is an example of a structural risk (as distinguished from risks arising from misuse or accidents, which are the more ‘normal’ angles through which AI risk is analysed), as described in a 2019 post by Remco Zwetsloot and Allan Dafoe.

A disjunctive (in the sense of separate) possibility is that, to the extent that it is known that HLAI is on the way (e.g. a less capable AI is demonstrated or leaked in the years before 2036), there’s a view that efficient market dynamics would mean a) resources, whether scientists or compute, necessary to make HLAI may have been bid up, b) other, fixed resources in the world, whether energy, raw materials, water or land, will also have been bid up in anticipation of an explosion in industrial activity, and c) as Mark Xu argues here, depending on the exact dynamics, the organisation or organisations thought to be leaders in HLAI may soak up a large fraction of accessible global wealth.^[28] I haven’t thought much about this, but it might be an interesting question for an economist.

Second-movers and the alignment tax

There is a view in the alignment discourse that it is particularly important that the first AI be an aligned one, since subsequent AIs would perhaps, for economic reasons, just copy the existing, functional model, rather than designing something new (Evan Hubinger). Moreover, this first aligned AI may act as a design template for subsequent, hopefully-aligned AIs, and more speculatively, perhaps can actively help with aligning later, more powerful AIs.^[29]

But these predictions haven’t been stress-tested against a geopolitical race—while the first-mover (say a US company/government) tries to build the safest AI it can (perhaps even doing more than Cotra’s ‘naive safety’ efforts), a second-mover would not necessarily have the luxury or incentive to go slowly and carefully. It would be strongly pushed to find the attractor in the design space that is closest (‘closest’ measured in time, and relative to its design/computational constraints) while paying little or no heed to all the extra work that is needed to find a both-capable-and-aligned design.

Put another way, HLAI, AGI, ASI, etc. as general-purpose dual-use technologies^[30] might present an almost-unacceptable risk to anyone who doesn’t possess them, particularly if that anyone is a peer competitor such as the US or China. Hence, there are at least two points of danger: the creation of the first (probably naively aligned) HLAI, and then possibly, a second AI constructed at speed, under intense strategic pressure, and with incomplete information,^[31] which might increase the chances that it is less conscientiously aligned.

The Organisational Problem

AIDC presents a view that a large number of HLAIs could be deployed, presumably fairly quickly. I would argue it isn’t that easy to deploy labour at scale—imagine if a manager in an existing business (say, a commodity trading firm) were presented with 100, 1000, or a million new staff members. The manager would need to develop job descriptions, as well as an organisational structure and management hierarchy, since he/she couldn’t effectively control a group of this size.

To some extent, this would seem to influence the type of work HLAIs can do. For instance, 1,000 robot labourers in a tantalum mine or 1,000,000 AI-powered drones watching over a city might be relatively easy to manage, but (we know from how modern organisations, such as investment banks, are organised^[32]) 1,000 ‘knowledge workers’ need to be carefully arranged to do tasks in the right order, not step on each other’s toes, present a united front to customers and regulators, etc. However, it is quite possible that scientific R&D (the central use case in HFDT/AIDC) is sufficiently modular and parallelisable that it can be conducted without massive management overhead.

A second mitigant is that the HLAI aspirant country/company would have time to prepare—in the time between now and 2036, it is possible that narrower versions of AI will continue to be integrated into industrial, bureaucratic, military and police functions. HLAI might also arrive over a period of years (before 2036), giving time for suitable management structures to evolve.^[33]

A third mitigant could be more powerful HLAIs that supervise the ‘worker’ HLAIs.^[34] AIDC is silent on the management problem and while HFDT does address the issue in multiple places, the context is using supervision to ensure AIs do not become catastrophically misaligned.^[35]

On Situational Awareness

One of the central premises of both the HFDT and Ngo’s pieces is that, as a product of their training, AIs would develop ‘situational awareness’ (or, in Joe Carlsmith’s terminology, ‘strategic awareness’): internal representations of salient facts such as

that they are in a training environment;
that this environment bears some relationship to other deep learning/AI training processes they will have read about in the ML literature;
that deployment generally follows training;
that some researchers are concerned with how AIs generally, and presumably this particular AI, will behave in that out-of-distribution world humans call reality;
that there are vigorous conversations on whether an AI can be ‘shut down’ if it misbehaves; and so forth.

This awareness is not an all-or-nothing thing. On one extreme, current models seem to know, in the sense of how they respond to queries, something about their environment. But this understanding is shallow, in that they have a very limited ability to reason or follow cause-effect chains (though certain types of reasoning that are primarily linguistic or symbolic do seem to be possible). Relatedly, current models largely cannot plan courses of action, or simulate the future consequences of their actions in their environment, the way that many animals including humans can do.

Is situational awareness necessary?

Carlsmith points out that there could be classes of problems for which situationally-aware, long-term planning behaviour is unlikely to be necessary, for instance in running a company via a series of narrow-purpose AIs (~Eric Drexler’s CAIS). But, as noted above, CAIS may be unlikely from the perspectives of economic competitiveness, time-to-market, and sheer ability to get things done, compared to more capable agents that form plans, generalise from past training into new environments via layers of abstraction, and then, with this integrated view of their environment, go on to make those plans a reality.^[36]

How Things Go Bad

How might a combination of agentic planning and situational awareness result in a HLAI misbehaving? Take an example task given to a HLAI deployed as a currency trader in a bank. This trading bot is given a high-level goal:

‘Make money today.’

A human trader might decompose this as follows:

‘Check your opening position. Check the markets and scheduled economic releases for anything that has happened overnight that might materially affect prices today. Call around to other parties in the market to assess whether the flow of orders is going one way or another, whether there are any large option strikes/expiries, whether the central bank is rumoured to be active, etc. Speak to the economist to assess their views on the currency pair’s fair value, what might change that, upcoming events, etc. Speak to the FX salespeople to see which of their clients are net buyers (of dollars), and try to predict what the flow in the next day or two is likely to be.’

These subtasks can be further decomposed,^[37] and involve a range of specific activities:

pulling factual information about the current trading position;
drawing on internal models of a variety of actors in the market (human, algorithmic trading systems, the relevant central banks, other corporations which might behave with slightly different logics than any specific human);
reading and digesting research, news, and media;
having a conversation with a [human] economist.

These tasks themselves decompose further, continuing in a cascade until they (hopefully) reach some adequately low-level question.
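The cascade described above can be pictured as a tree. The following is a minimal sketch, not a description of how any real system represents tasks: the `Task` class is hypothetical, and the pairing of each subtask with one specific activity is an assumption made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    """A hypothetical node in a task tree: a description plus optional subtasks."""
    description: str
    subtasks: list[Task] = field(default_factory=list)

    def leaves(self) -> list[Task]:
        """Return the lowest-level tasks, i.e. those with no further decomposition."""
        if not self.subtasks:
            return [self]
        result: list[Task] = []
        for sub in self.subtasks:
            result.extend(sub.leaves())
        return result

    def depth(self) -> int:
        """Number of levels from this task down to its deepest leaf."""
        if not self.subtasks:
            return 1
        return 1 + max(sub.depth() for sub in self.subtasks)


# The trader example; the subtask-to-activity pairing is illustrative only.
goal = Task("Make money today", [
    Task("Check opening position",
         [Task("Pull factual information about the current trading position")]),
    Task("Check markets and scheduled economic releases",
         [Task("Read and digest research, news, and media")]),
    Task("Call around to other parties in the market",
         [Task("Draw on internal models of market actors")]),
    Task("Speak to the economist",
         [Task("Have a conversation with a human economist")]),
    Task("Speak to the FX salespeople"),
])

for leaf in goal.leaves():
    print(leaf.description)
print("levels:", goal.depth())
```

In this toy tree the goal bottoms out in five leaf tasks across three levels; a real decomposition would presumably be far deeper and would be revised as the day unfolds.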

A bag of goals

There are several notable aspects of this setup: firstly, the high-level task needs to be decomposed into a tree of tasks. Secondly, these subtasks must be mapped to some internal representation within the AI’s policy.

The first issue, how precisely the high-level goal is to be decomposed, is itself a learnt skill, for humans (‘the answer isn’t the hard part, it’s knowing the [sub-]question to ask’) and likely for machines.^[38]

Assuming an appropriately granular decomposition (of the high-level question) has been completed, how are these sub-questions to be dealt with within the network? While current models probably map particular situations to actions through learnt heuristics, Ngo (section 2.1.1) thinks more capable policies (i.e. ones that plan over longer futures) may start retaining representations of outcomes, which can then be planned towards (for instance, by attaching higher values to actions that, over a probabilistically computed future, are likely to lead to the desired outcome).
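As a toy illustration of ‘attaching higher values to actions that are likely to lead to the desired outcome’, the sketch below scores each action by the probability that it reaches a target outcome and picks the highest-scoring one. It is a one-step simplification under made-up numbers: the actions, outcomes, and probabilities are all hypothetical, and nothing here is taken from Ngo’s paper beyond the general idea.

```python
# Toy one-step version of outcome-directed planning.
# All actions, outcomes, and probabilities below are invented for illustration.
DESIRED_OUTCOME = "profit"

# For each candidate action, an assumed probability distribution over outcomes.
action_outcomes = {
    "buy_dollars": {"profit": 0.55, "loss": 0.45},
    "sell_dollars": {"profit": 0.40, "loss": 0.60},
    "stay_flat": {"no_change": 1.0},
}


def action_value(outcomes: dict[str, float], desired: str) -> float:
    """Value of an action = probability that it leads to the desired outcome."""
    return outcomes.get(desired, 0.0)


# Attach a value to every action, then choose the action with the highest value.
values = {
    action: action_value(outcomes, DESIRED_OUTCOME)
    for action, outcomes in action_outcomes.items()
}
best_action = max(values, key=values.get)

print(values)
print("chosen action:", best_action)
```

The point of the sketch is only that the policy here holds an explicit representation of an outcome (‘profit’) and ranks actions by it, rather than mapping situations directly to actions through learnt heuristics.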

Assuming models do end up representing outcomes, the question becomes ‘what will these representations look like?’ Ngo suggests that the policies^[39] are likely to learn a grab-bag consisting of specific task-level goals^[40] (that were explicitly rewarded in training), as well as more ‘meta’ goals that either: a) happen to be correlated with high reward in a variety of situations (owing to the peculiarities of the training process), or b) that are correlated with reward because they are particularly useful in many situations, such as curiosity or tool use.

Ngo (2.2.2) contends that as situational awareness increases, policies will tend to favour misaligned goals, particularly those that look like (a), i.e. are related to their training process (such as taking actions that maximise the reward a particular human supervisor would give), which corresponds to HFDT’s ‘playing the training game’.

Let’s take, as an example, the reasonably factual question: ‘Which society has higher inequality, the US or China?’ Let’s also assume, consistent with being a general purpose HLAI, that our bot^[41] has been trained on a vast cross-section of material, including things specific to economics and sociology, human psychology, physics, and medicine, but also things like offensive and defensive cybersecurity (since it was designed, in my assumption above, in a national security mindset and environment).

True alignment: The ‘correct’ aligned behaviour would be to simply access the internet, and return an appropriately rich answer (e.g. ‘The Gini coefficient of the US is 41.5 as of 2019, while that of the PRC is 38.2 as of 2019, from the World Bank. However this measure has well-known limitations: for instance, it measures income rather than wealth disparities, does not adjust for lifespan disparity, does not consistently measure the informal economy, uses a Lorenz curve ….’).
Aligning to implicit supervisory signals: For this scenario, imagine the bot is consistently being trained by individuals with relatively nationalistic biases (e.g. they strongly want to see their country doing ‘better’), or individuals who have ideological biases (e.g. they think a modestly unequal society is better for economic growth and ‘animal spirits’). To the extent the bot is situationally aware and has, through training, developed a model of its trainer’s psychology, it may shade the answer it returns to the (relatively) factual question above, in order to flatter the trainer’s biases.^[42]
Aligning towards generally instrumentally-useful goals: In this case, the bot comes up with the answer in a creative way: since it is trained on cybersecurity, it (assume it is a Chinese bot) may just attempt to initiate a hack of the relevant American agency to get the ‘true answer’. This could be objectively useful information: the American Gini information appears to be calculated by the Census Bureau, but presumably a more complete picture (on, say, wealth disparities) could be gleaned by accessing brokerage accounts (Vanguard, etc.), the IRS, and the various banks of the Federal Reserve System. This would obviously not be a desirable action (absent specific instructions from the system’s political masters), but it would be a clever thing to do.^[43] ^[44]
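
For readers unfamiliar with the measure in the example question, the following is a rough sketch of how a Gini coefficient can be derived from a Lorenz curve. It is an illustration only: the function names `lorenz_curve` and `gini` are hypothetical, any income list fed to them is made up, and real statistical agencies work from survey data with weighting and adjustments that are not modelled here. The result is on a 0 to 1 scale, whereas the World Bank figures quoted above are on a 0 to 100 scale.

```python
def lorenz_curve(incomes):
    """Return (population share, cumulative income share) points, poorest first."""
    xs = sorted(incomes)
    total = sum(xs)
    if not xs or total <= 0:
        raise ValueError("need at least one income and a positive total")
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points

def gini(incomes):
    """Gini = 1 - 2 * (area under the Lorenz curve), using the trapezoid rule."""
    pts = lorenz_curve(incomes)
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area
```

A perfectly equal list of incomes gives a value of (approximately) zero, and concentrating all income in one person pushes the value towards one. The sketch also shows one of the limitations listed above: the input is income alone, so wealth, lifespan, and informal earnings do not enter the calculation at all.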

The turning

Ngo points out a few plausible ways that highly capable AIs may become increasingly misaligned:^[45]

During training, the HLAI was closely supervised by a human trainer, possibly with AI assistance, which is less feasible in deployment
Eventually, capabilities will increase: the trained HLAI will operate too fast, work on novel tasks, use rationales that ‘sound good’ but are quite complex, undergo online learning, and perform R&D in domains where humans have very little expertise
The HLAI may also be operating on longer time-frames and at a larger scale than the environment in which it was trained, possibly longer than humans tend to work well in^[46]
Put another way, the human overseers (to the extent there are any in the deployment phase) as well as humans interacting with the HLAI, will themselves be operating out-of-distribution, that is, supervising/facing something increasingly alien in potentially unfamiliar situations^[47]
Thus, how the HLAI behaves will depend on how it generalises the goals it learnt during the human-supervised training period (including the various types of subtle misalignment that got ‘baked in’ in a highly path-dependent training history)

The AI literature, including Superintelligence chapter 8, has long proposed the possibility of a sudden change in the AI’s behaviour, towards being adversarial or indifferent to humanity. More recent work has started putting some theoretical and mechanistic flesh on these philosophical bones. Ngo argues, based on these findings, that misaligned goals will generalise into the contexts above to a greater extent than aligned behaviours/goals (obedience, honesty, human wellbeing), and this asymmetry will increase as policies become more capable and operate at larger scales. See section 2.3.2 for details, but one intuitive explanation is that we are trying to impart abstract, fuzzy, and ill-understood human values, which we cannot write down, to an entity that will, by definition, be acting in an environment radically different from that in which we acquired these values. It isn’t obvious that our values would, in any robust sense, even be appropriate to the situation the entity finds itself in.

As an example: ‘You are a parent, perhaps a peasant, god-fearing and dutiful, who only wants the best for their child. The child is growing up to be very intelligent, does exemplary schoolwork, and you think they might have a future in Chrysopolis, or perchance, over Okeanos Stream where fortunes are said to be made. The child is so very charming, an object of adoration—and a great many sweets—from all your relatives and the village elders; even the village children, who we all know can be the cruellest brutes, do not tease your progeny.

You don’t always notice that sometimes the child tells little white lies, but if you do, you drag out the holy book and angrily bellow tales of divine punishment. On occasion, you employ a well-placed clout. The lies perhaps do go down with time, and you mostly forget about them. Though curiously other deceptions, not quite lies but definitely not ‘the truth’, seem to crop up again. Anyway, from what your wife says, this mostly happens during the ‘transumanza’: those three months away that you spend taking the animals down the mountain to their winter pasture. You chalk it up to the fact that your wife is rather too lenient.

You do not suspect that your rosy-cheeked child was growing into what a later, and more erudite, age would deem a “full-fledged sociopath”.’

There are a few notable points here:

In the training environment: Within family and village, the child mostly is behaving well (being charming and studious), and is rewarded (loved and given sweets) by others.
Training out bad behaviour: On the few occasions of observed misbehaviour, punishment is meted out, which mostly works in reducing ‘white lies’, though when the shepherd is away, the lying behaviour seems to come back in a more complex and ambiguous way. The contention is that, for an adequately intelligent child, punishing the white lies merely encourages the child to come up with less obvious deceptions.
Sociopathy*: Let’s call this particular combination of characteristics (intelligence, apparent compliance with social norms, ability to render oneself pleasing to others, occasional lying) a type of nascent sociopathy* (the asterisk signifies that I’m using the term in a folk, rather than clinical, sense).
The Gervais Principle: Now sociopathy* isn’t particularly a problem in the family or village, at least in the toy example. In a work environment, it is often an advantage, as Scott Alexander points out (albeit tongue-in-cheek) in this review of Venkatesh Rao’s The Gervais Principle (the gist of which, if I understand the post, is that sociopaths* tend to be better managers).
Deployment environment: But it is more complex than that: the chances that a sociopath* will do well depend heavily on the environment they are employed in, the type of job they do, who their manager and competitors are, how they are evaluated, etc.
- For instance, in financial services, a typical trader (particularly a junior one) is not particularly incentivised to lie or manipulate others because it is, in many cases, objectively clear how much money they’ve made (which determines how much they’re paid).^[48] In other jobs, such as investment banking,^[49] it isn’t as clear how much value an individual banker has added, and those that can convince senior management that they are ‘worth it’ (and ideally in grave danger of being poached by a competitor) get paid more. Oh, and there are basically only two criteria of value: pay (which is theoretically secret, but everyone knows who got paid what), and promotions. This environment and Goodhartian incentive structure is optimised hard (by management, through regular culls of staff), leading to ubiquitous lying, manipulation, and back-stabbing.
Creativity & usefulness: As an unsubstantiated hunch, I would argue sociopathy* is somewhat correlated – perhaps just through general intelligence – with creativity, which is the bread-and-butter of financial services: finding solutions to get around regulations or taxes, or coming up with new transactions that meet clients’ needs.

If most people aren’t sociopaths*, it could be because they’re not able or willing to put in the effort (here’s Alexander’s take on Rao) of maintaining multiple models of reality. It could also be that religious/deontological beliefs constrain one, or that an implicit model (i.e. trust) of other actors in society encourages truthful behaviour. The AI-relevant takeaway from this example is that misaligned behaviour is obviously present in humans, it tends to emerge at higher levels of intelligence, its valence (i.e. whether it is ‘bad’) is highly contingent on deployment environment, and it is very hard to identify and eradicate.

Lastly, these arguments notwithstanding, it isn’t clear from HFDT’s analysis that an AI trained by gradient descent (as all of the above assumes) would necessarily be motivated to ‘take over’. However, it does seem hard to determine from behaviour alone (i.e. without having transparency into the AI’s internal state) whether the AI is truly non-deceptive or if it is merely biding its time.

Coordination

AIDC premises that the AI population would coordinate relatively easily and conspire to ‘overthrow humanity’. Since AIs may be spread across a range of roles, possibly globally, and perhaps have some ‘safe space’ where it is very difficult to shut them down or interrupt their communications, they may present a much more united front than the human defence could.

There are two specific aspects to the coordination problem—firstly, since (by assumption in AIDC), the population of HLAIs are very similar to each other (architectures, training, and neural weights), it seems reasonable to suppose that they have good ‘psychological’ models of each other. This is Evan Hubinger’s assumption in this post, where he suggests that the first generation of highly-capable AIs are likely to be all aligned or all misaligned, and to the extent they are near copies of each other, they should be able to coordinate easily.^[50] In fact, for certain tasks where they need to work as teams, HFDT would train them to have good models of their peers (and to the extent they are working with humans, this will also push towards good human psychological models). Moreover, their training regime should have made them aware of the (well-documented) problems of competitive and non-cooperative human situations (such as economic and kinetic conflicts that reduce growth, failures to prevent massive negative externalities, etc.). If they could make credible commitments to each other, then these wasteful artefacts of human agency might be avoided.^[51] So it would seem they should have both a relatively high ability and an incentive to coordinate.

As footnote 50 mentions, the view above is somewhat open to question. Looking at it less mechanistically, Kaj Sotala (p. 9/footnote 9) mentions specific design features, such as shared common goals and elimination of self-interest, that would have to be inserted in a fleet of HLAIs to maximise their potential as a unified group.

Another, albeit anthropomorphic, reason for caution is that humans, who have co-evolved for hundreds of thousands of years, don’t seem to have particularly good practical models of each other’s minds and therefore cannot usually commit credibly.

In any case, the most reasonable way to interpret AIDC’s premises is that, at least for a ~1-2 year period, HLAIs are the most capable and widespread game in town, and whatever variation they have (either because of task-based fine-tuning or updated versions) is sufficiently well-understood and modelled by the other HLAIs for them to work safely in teams (on the roles humans have assigned them). So, yes, if they become misaligned, there is a decent chance that they would work together against humans.

AI Headquarters

AIDC envisions the ‘AI Headquarters’ as a place where (potentially rogue) AIs are safe from shutdown, as well as a safe space where plans can be turned into physical reality (such as weapons or nanotech).^[52] Such a base could be a distinct, physically-secure bunker, but could probably also be distributed amongst servers that the AIs have access to, making them hard to identify, locate, or ‘shut down’. For instance, in the case of a commodity trading group, AIs may be employed in various functions amongst the trading, mining, refining or transport operations, making it very hard for humans to monitor whether the AIs are surreptitiously planning something.

Another much more speculative possibility is related to plans from American-allied entities (such as NASA’s Artemis programme), China, and other countries, of moon bases, set up for resource extraction. Such off-world bases, which would presumably be partially automated, may or may not be able to directly threaten humanity on Earth, but they might be more resilient to shutdown, and as they would probably have some manufacturing, refining, or repair facilities, may be a place where AIs can build physical assets that are harder for humans to surveil or destroy.

The third, and perhaps most concrete, possibility is that (as discussed above), conditional on the US or Chinese national security establishment being heavily involved in their respective HLAI efforts, the respective governments themselves will establish structures that might subsequently work as AI Headquarters. Moreover, during much of this time, humans involved in AI research, or their security/military overseers, will approach their work with a security mindset owing to (true) external threats, but at times verging towards paranoia. There might not be a clearly defined line between the following two states:^[53]

Human overseers are using a hardened facility to guard the nascent HLAI technology from competitors, hackers, or spies; and
A population of HLAIs have become internally misaligned, although there may not yet be any external signs, and are using the hardened facility to avoid oversight. When/if there is open conflict with humans, HLAI assets are relatively protected in this facility.

Hence, in addition to the possibilities AIDC outlines for HLAIs to co-opt human collaborators, we could add one more—that, owing to geopolitical pressures rather than any particular act of persuasion or manipulation on the part of HLAIs, humans willingly and knowingly grant the population of HLAIs a safe space and cooperation.

Feedback Loops

AIDC suggests a powerful feedback loop where a large HLAI population generates enormous amounts of economic value, either in money terms or in terms of potential IP (i.e. how to achieve more computation on the same hardware resources), and recycles that wealth or knowledge into the existing stock of compute to generate more, or more powerful, AIs. I would argue that this aspect of AIDC is ill-specified, and in any case is quite hard to reliably forecast, for a few reasons:

Investing economic or IP surplus into ‘more HLAIs’ doesn’t sound dramatically different (in the sense that ‘we don’t know enough about the rough design constraints of HLAIs or ASIs to functionally distinguish these two’) from investing it into more powerful AI, as discussed previously.
My working assumption is that the geopolitics around 2036 are intensely adversarial bordering on chaos or a great-power war. In this context, sheer access to material resources and military advantage might be what countries prioritise, and I’m not sure (other than sprinting for ASI) how to think about the difference between hundreds of millions of HLAIs and a billion (for instance), unless, prior to this point, the issues flagged above around management, supply chains, and coordination have been addressed.
Even if the population of HLAIs is able to generate vast quantities of IP, only part of this will be immediately usable, as covered above. Much of the rest will need to be monetised in other ways, i.e. through equity/debt raises, and then turned into more semiconductors (or powerplants, etc.), which again takes time and might leave humans in the loop, as threats (to AI) or bottlenecks, or likely both.
This of course ignores the possibility (which to be fair might be AIDC’s point) that the HLAIs themselves develop plans that (a) hinge on having more or better versions of their own type or (b) utilise technologies that are within the realm of physical possibility (like nanotechnology) but that we don’t have economically-feasible access to at present. Again, the caveats listed above may apply (also see Kaj Sotala’s 2018 paper, section 4.1.1).

Conclusion

AIDC (and the closely related HFDT) present an argument that, conditional on HLAI being technically feasible by 2036, the resulting hardware overhang would create the potential for many such HLAIs to be deployed in a variety of useful tasks. Moreover, it seems reasonable, from the analysis in HFDT and others, that the HLAIs would be situationally aware (not least because this would be an instrumentally useful feature for them to have), could make medium-term plans and would be capable of coordination (because humans would train them that way). Current thinking around mesa-objectives and goal mis-generalisation does lend credence to the possibility that AIs that appear cooperative to humans may eventually ‘turn’.

AIDC’s rather strong claim that there might be ‘hundreds of millions’ of HLAIs seems to require further support, including thinking about what the design and training of these HLAIs would be, and precisely what tasks they could actually usefully do in the relatively short 6-24 month period that the first deployer is likely to have a monopoly.

AIDC and HFDT seem somewhat agnostic on whether their respective visions are pursued as private-sector, profit-making enterprises or as instruments of state-driven industrial or national security policy. I think the latter is the correct framing, both for US- and China-aligned blocs, at least in the accelerated scenario AIDC proposes (even if today, most AI R&D is private sector in the US). A state-backed framing also shows a clear path to AIDC’s notion of an AI Headquarters, a feature probably helpful for the treacherous turn.

More importantly (and perhaps obviously), I would argue that the geopolitics that obtain in the period before 2036 will be a huge determinant of how AIDC’s scenario plays out. The projection based on the current world looks bleak. In fact, my principal takeaway from AIDC is one of potential chaos, as the first deployer tries desperately to maintain its advantage and crystallise it into tangible material resources and strategic position, while its peer competitors face strong incentives to catch up, or failing that, destroy the first deployer’s advantage. This chaotic period, what Bostrom terms ‘global turbulence’, becomes one of great danger, as it reduces already woeful human coordination abilities and possibly leaves the world exposed to a subsequent AI takeover.

^
To help tease out Karnofsky’s argument, I’m relying on a 2022 post by Ajeya Cotra ‘Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover’ (abbreviated ‘HFDT’) and a 2022 paper by Richard Ngo ‘The alignment problem from a deep learning perspective’.
^
Some of the material on geopolitics is drawn from my scenario submission for the 2022 Worldbuilding Contest run by the Future of Life Institute.
^
And which is very well summarised by Scott Alexander, who also reviews some of the more prominent objections and responses to Cotra. See also this 2021 report by David Roodman which conducts a detailed critique of Cotra’s methodology.
^
Cotra’s HFDT post is more about the training methodology, the ‘psychology’ of the resulting AI, the mechanics of how it becomes catastrophically misaligned and the range of interventions trainers can adopt to mitigate misalignment. It doesn’t spend much time on the quantitative aspects of how such an AI might be trained (which is the topic of Biological Anchors), and it is silent as to whether a HFDT would function at ‘human-level’ or greater.
^
See also these unpublished notes from Ben Garfinkel explaining his objections to the term ‘human-level’, a principal one of which is that intelligence is multifaceted and any collection of humans likely displays a distribution of capabilities (language, maths, artistic, etc.). The term also has been inconsistently used, for instance in Superintelligence, as footnote 6 in Garfinkel’s notes points out.
^
This is my opinion, drawing on experience as an engineer and artist, rather than a well-researched view. Here is an attempt to bring apparently similar views of human creativity into an ML context, and a post I wrote about what AGI-relevant lessons we might learn from artistic creativity; Joscha Bach has commented in various interviews about creativity as a type of wide-ranging meta-search. Oddly enough, LessWrong has relatively little about creativity that directly asks what it is, what its brain-based correlates are, or how machines might be taught creativity.
^
See Paul Christiano’s talk/presentation on the ‘alignment tax’.
^
To the extent multimodal models like Gato and SayCan become increasingly capable at physical tasks, this would be an existence proof that LLMs can underlie a very large class of behaviours in-the-world, and that actual training in physical environments (a la a human infant) is not necessary. At the moment though the question seems open, as this post by Yann LeCun and Jacob Browning argues.
^
One alternative to HLAI is the narrow-purpose AIs, whether of tool/oracle type or Eric Drexler’s CAIS, which, for instance, is constructed as an ecosystem of narrow-purpose ‘service AIs’, who might be superintelligent in their narrow domains. For theoretical and structural reasons (such as transparent communication links between the various services), it is thought that an ecosystem of such narrowly-superintelligent AIs would not, as a whole, present the type of planning, utility maximising, goal-based behaviour that an AGI might present. However, there appears to be a consensus that economic or competitive pressures would incentivise more agentic AGI-type systems over safer-but-restricted CAIS-type systems.
^
An exception would be if the economy as a whole has become automated, as suggested in some of these scenarios from Andrew Critch, but that isn’t quite what AIDC proposes. Critch’s post, more generally, has much to recommend it, particularly his systemic focus: he views human collectives, as in corporations or nation-states, acting in combination (cooperative or competitive) with AIs, and in this post specifically suggests the structural framing of ‘Robust Agent-Agnostic Processes’ as a way to think about such systems.
^
In industries such as energy production and delivery, there seem to be multi-decade intervals between invention and market, while in pharmaceuticals it was around 10-15 years. I haven’t found a useful source for software, or a comprehensive analysis across the economy, but it is quite plausible that idea-to-market times are much lower in certain industries (writing an iPhone app can take weeks or months, while stages in the videogame development process may take months, though software like Unreal Engine or AI-generated worlds might speed the process).
^
This cuts both ways—we can imagine some cash-rich FOMO-prone company involved in an apparently different industry (say a commodity trader like Glencore or a massive financial services group like Blackrock) buying up cheap AGI-related IP, compute, and people after a ‘third AI winter’. The acquirer might not have any extraordinary sensitivity to alignment or existential risk. Similar issues of cultural mismatch are found in corporate history, for instance, the chequered history of European conglomerates buying Wall Street firms and coming unstuck a few years later (e.g. Credit Suisse and Deutsche Bank are two prominent ones).
^
AIDC seems to mean ‘embodied’ in quite an anthropomorphic sense, whereas I see an AI connected to a host of sensors and actuators, whether robotic arms or the valve system of a water supply, as functionally similar. Note, this point is distinct from the symbol grounding issues above, which pertain more to the training process.
^
This report from the US’ Director of National Intelligence gives one view, as do the sources cited in this submission to the FLI Worldbuilding Contest.
^
How successful this ultimately will be remains to be seen, as China produced two exascale-class supercomputers in 2021, to compete with the first ‘public’ (i.e. listed in the Top500 ranking) machine in the US. The Chinese machines were apparently produced with domestically-produced semiconductors.
^
This figure seems somewhat low compared to David Roodman’s (admittedly low-confidence) estimate of 10^34 FLOP in 2020 falling to 10^31 FLOP in 2050.
^
At first I thought I was being thick on the FLOP, FLOPs, FLOPS, FLOP/s distinctions, but apparently I’m not the only one to bugger up stocks vs flows. Please correct any errors in the comments!
^
But compute costs seem to be halving every 2.5-4 years.
^
Against an estimated cost to train GPT-3 of $4.6MM. See p. 4 of Cotra’s report for a fuller discussion of her expectations on willingness to spend on computation to train a HLAI-analogous model.
^
As Cotra alludes to on p. 15 of the Biological Anchors report, Part 1.
^
LessWrong contains relatively little about the geostrategic picture in multipolar scenarios, other than this comment. See also this 2019 publication from the US Air Force and one from Georgetown’s CSET, though both pre-date substantial deterioration in the geopolitical landscape and substantial increase in AI capabilities. This 2020 article is also informative, but entirely focused on how AI and current military positioning may interact.
^
I’m using ‘ASI’ somewhat loosely here, in the sense that between HLAI as defined in AIDC and ASI (superhuman capability in all domains), there would obviously be a spectrum. I am suggesting that the first-mover might try to advance on that spectrum as far as it deems strategically desirable, in its opinion and under its resource/technical constraints, which might be well short of ASI, as Kaj Sotala points out.
^
See note 28/p. 25 of Biological Anchors.
^
Which probably includes Japan, S Korea, Australia, New Zealand, Taiwan and perhaps India.
^
Probably including Russia, perhaps some part of the former USSR states, and possibly others such as Iran or Turkey.
^
Similar to Stuxnet or the assassinations undertaken against Iran’s nuclear programme, which is of course a much riskier proposition in a US-China context. Note, however, that such a strike is unlikely to be the same thing as a ‘pivotal act’.
^
I can’t find a concrete example of military or national security personnel obviously putting broader (i.e. humanity-level) welfare at risk in the service of narrow (national or ideological) interests, but there has been some theoretical analysis of the question. My first-guess examples may exist in General Curtis LeMay (who advocated using nuclear weapons in North Korea, and conventional weapons in Cuba); the RAND Corporation’s Herman Kahn, whose specialty was laying out scenarios of quotidian life after a major nuclear exchange; and Edward Teller. In fiction, the canonical examples are Dr. Strangelove and General Ripper, as is the 1964 film Fail Safe (which had a Teller/Kahn character). More generally, there are accounts of the psychological stress nuclear staffs operate under: besides the well known story of Colonel Stanislav Petrov, see this account about American nuclear-launch officers, and this one.
^
A more rigorous exploration of economic growth under TAI is a report (summarised here by Rohin Shah) by Tom Davidson (though it is more interested in global-level growth, rather than the allocation of that growth amongst countries in a competitive race), as well as this one from Philip Trammell and Anton Korinek, which is discussed in this podcast.
^
This ‘bootstrapping’ strategy has been articulated by Richard Ngo and Paul Christiano, but there doesn’t yet seem to be a consensus on whether it would work.
^
Perhaps more so than nuclear weapons/power or bioweapons, which, as the current Ukraine conflict shows, are having relatively little apparent impact so far on altering outcomes.
^
A weak analogy comes from the Soviet thermonuclear weapon—it is apparently still unclear how much stolen designs (involving physicist Klaus Fuchs) actually helped in construction of the first H-bomb. The first weapon, RDS-6 in 1953, was a ‘layer-cake’ (Sloika) design, that didn’t really work (80-85% fission yield), and it was only in 1955, with RDS-37 that the USSR found the equivalent of a Teller-Ulam two-stage design for a fusion device (apparently Sakharov came up with this independently of Fuchs’ information). Lavrenty Beria, the fearsome head of the NKVD, seemed, through well-timed carceral interventions, to play a part in expediting the scientists’ efforts.
^
Often people have at least 2 senior-management level bosses owing to a matrix that crosses geographies and markets (e.g. European fixed-income traders can easily be in a competitive/cooperative situation vs Asia-based fixed-income), and much middle management time is spent adjudicating turf wars. Clients are often pitched similar products from different parts of a bank, much to their amusement.
^
Cotra, in Part 4 (p. 34) of Biological Anchors, raises a similar possibility, of a diffuse transition to a world saturated with AI.
^
This borrows ideas from a post by Anni Leskelä on using centralised arbitrators or managers to ensure systems of multiple AIs make credible commitments. But, I am also not sure that ‘management’, as opposed to running experiments or designing products, is something that can be easily or reliably taught through human feedback, as Paul Christiano hints at here in a technical sense and here in a Goodhart sense.
^
Moreover, Cotra points out that the various supervisory schemes don’t solve her basic worry, that HLAIs will continue to play ‘the training game’.
^
This post from Richard Ngo offers a number of other perspectives on CAIS. See also Carlsmith (2021) Section 3.1-3.3, for comments on how economic pressures might drive companies and governments towards agentic planners, as well as the possibility that agentic behaviour may emerge from sophisticated systems that weren’t designed to be so. This AXRP interview with Evan Hubinger expresses a similar view, in the context of mesa-optimisation.
^
This might be an anthropocentric bias, since this is often how humans approach complex problems. I’m not sure if an AI would necessarily decompose the problem in this particular way, but it seems a reasonable guess if the AI has been trained through HFDT.
^
This framing of how humans approach complex tasks superficially resembles factored cognition or amplification-based approaches such as HCH, though the latter is actually using decomposition as a way of building alignment ‘from the bottom up’.
^
Ngo uses RL terminology but he expects similar analysis to hold for LLM-type models.
^
Ngo uses ‘goals’ in this context, in a specific sense (2.1.1): ‘...policies represent different outcomes which might arise from possible actions, and then choose actions by evaluating the values of possible outcomes; I’ll call this pursuing goals. Under the definition I’m using here, a policy’s goals are the outcomes which it robustly represents as having high value.’
^
In Ngo’s formulation (section 2), the system is a deep neural network with dual output heads, one of which makes predictions based upon training on multimodal data; while the other takes actions based on RL trained upon a diverse set of tasks. As in HFDT, training is from human feedback on complex tasks, as well as in simulated environments with automated evaluation. The training process optimises hard until the action head matches human performance on most tasks.
^
Some potential mitigants are to have multiple trainers or more thoughtful prompts or ask for justification for why the bot has returned a particular answer. However, it isn’t hard to imagine biases existing across members of groups, and being rather deeply ingrained: during the Cold War, rabid anti-Communism was a feature in US policymaking circles, while ‘socialism’ is still a dirty word for half the country. Conversely, ‘neoliberal’ or ‘libertarian’ enjoy a similar status for the other half.
^
Keeping in mind that an apparently similar activity, in slightly different context, may be perfectly acceptable: accessing the labour, wealth, central bank, and tax authorities’ database in the PRC to form a more complete picture of inequality in China.
^
More speculatively, theories of deceptive alignment might suggest the model internally can represent that, in certain circumstances, say when deployed in an adversarial capacity (as an offensive cyber-unit) it would in fact be acceptable to hack the opponent’s systems—this was after all something it was trained for! Trainers would have to devise an elaborate system that penalised the ‘wrong’ behaviour (hacking the US government when asked a simple question about inequality) and allowed the ‘right’ behaviour (probing the cyber-defenses of the US government to test for vulnerabilities).
^
Ngo’s framing is for a model optimised until it surpasses humans in all domains, which we can call AGI, whereas AIDC is ‘only’ aiming for human-level capabilities, but I’m not certain the difference matters very much for the purposes of Ngo’s argument.
^
Humans have a well-known bias towards the short-term that seems to be structural: earnings and election cycles, memory and planning constraints (Superintelligence pp. 176-178, and p. 8 of this paper by Nick Bostrom) and, of course, death. I think this is probably separate from the broader conversation around discount rates on preferences.
^
I like the ‘alien psychology’ framing Cotra uses in discussing what type of AI HFDT might result in, and how our tools and understanding might be wholly inadequate to the task of controlling/guiding them.
^
Though there are still plenty of fights about how profits/losses are to be allocated. Also I’m leaving out explicit breaches of rules or laws.
^
Basically a job where teams of bankers try to sell products and services to clients, such as lending them money, running their IPOs, advising them on what companies to buy, sorting out their unfunded pensions mess, managing their interest-rate and currency risks, etc. The main point is that the task is distributed horizontally across people and teams, and vertically over an organisational hierarchy; and the credit (if the client buys whatever is on offer) is diffuse and hard to assign. See the series Industry.
^
Importantly, his view is analysed in these two comments, the gist of which is that it isn’t obvious how well two models with near-identical neural weights will coordinate, for reasons ranging from the (possibly weak) human analogy, to more complex bargaining theory-related questions. I would add three speculative points/questions. First, are neural weights arrived at stochastically, i.e. if two different models are trained on the same data, might they end up with slightly different weights and, if so, do these differences manifest behaviourally? Secondly, would HLAI/HFDT involve different AIs or generations of AI trained or fine-tuned on slightly varied data? If so, could we be confident, absent better interpretability tools, that their neural weights were sufficiently identical to provide the relevant guarantees either on behaviour or on likelihood of coordination? Lastly, current models’ neural weights ‘freeze’ once training is completed; presumably by 2036, models may undertake a degree of online learning. Would this cause the models’ weights to diverge in proportion to time-in-deployment?
^
See also this sequence from Jesse Clifton on game and bargaining theory in the context of AI.
^
For nanotech as a threat, see Superintelligence Box 6/pp. 98-99 as well as this recent conversation between Nate Soares and Joe Carlsmith that tries to assess how significant a risk AI-controlled nanotech actually might be.
^
The human overseers’ problem is complicated by the fact that HLAI is (by the premises of AIDC/HFDT) very promising in terms of economics or military applications, and is potentially already generating value, howsoever measured. Hence, constituencies that benefit (businesspeople, generals, the NSA/GCHQ, etc.) lobby intensely to forestall any regulation. This ambiguity in assessing, from the perspective of a regulator, legislator, or public interest body, the threat of broadly deployed AI (or one deployed in small numbers but having a large impact on society) is a key element of these two posts by Paul Christiano, and this one from Andrew Critch. This kind of behaviour is known as regulatory capture, particularly in finance, and is also seen in the post-9/11 surveillance environment in the US and UK.

Analysing a 2036 Takeover Scenario