Intrinsic limitations of GPT-4 and other large language models, and why I’m not (very) worried about GPT-n

Introduction

Over my years of involvement in Effective Altruism, I have written several forum posts critical of various aspects of the movement's focus on AI safety, with a particular focus on fast takeoff scenarios. It is no secret that the recent release of ChatGPT, GPT-4, and other highly capable large language models (LLMs) has sparked both tremendous public interest and significant concern in the EA community. In this article, I share some of my thoughts on why I am not as concerned about the alignment of LLMs as many others in the EA community are. In particular, while there are legitimate concerns about the safety and reliability of LLMs, I do not think it is likely that such systems will soon reach human levels of intelligence or capability across a broad range of tasks. Rather, I argue that such systems have intrinsic limitations which cannot be overcome within the existing development paradigm, and that growth in capabilities driven by increasing the number of parameters and the size of the training data will continue for only a few more years before running its course. I also argue that the adoption of such systems will be slow, occurring over years to decades rather than months to years (as some have argued), and thus their impacts will be gradual and evolutionary rather than sudden and revolutionary. As such, I do not agree with those who argue that AI alignment work should focus on existing LLMs and be driven by short timelines (on the order of years).

Before going further, I should add that this piece is intended to be reasonably accessible to those not highly familiar with LLMs or with AI alignment more generally, and is therefore introductory rather than highly technical (though it is a little technical in places). I am sharing my thoughts in the hope of fostering more discussion about how we should think about AI alignment, with a focus on the impacts and trajectories of LLMs. I also want to make my predictions public for the purpose of personal accountability. Having set out my purpose, I begin with an overview of the current LLM paradigm.

Limits of the existing paradigm

Current large language models are based on the transformer architecture. These are very large neural networks trained on huge corpora of text, most of it drawn from the internet. The models are typically trained to predict the next word (or token) in a sequence, and during training they learn complex statistical associations between words in natural language. More recently, OpenAI has extended this framework with a technique called Reinforcement Learning from Human Feedback (RLHF). This involves presenting queries and their corresponding LLM outputs to human raters, who judge the quality of the responses. These ratings are then used to fine-tune the language model, altering its outputs to better match human preferences. This technique has enabled language models to produce output that is more useful to humans, and has improved their performance as chatbots.
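To make the basic idea of next-word prediction concrete, here is a deliberately tiny sketch: a bigram counter over an invented toy corpus, not a transformer. It is not how GPT-style models are implemented, but the training objective is the same in spirit: learn statistical associations from text and use them to predict the most likely continuation.

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction from learned statistics.
# A real LLM uses a transformer over tokens rather than bigram counts,
# but the objective is analogous: maximise the probability of the next
# word given the preceding context.

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count how often each word follows each preceding word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most likely next word."""
    if word not in counts:
        return "<unknown>"
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat' here (ties broken by insertion order)
print(predict_next("sat"))   # 'on' -- the most frequent continuation
```

Nothing in this procedure refers to what the words mean; the model only ever sees which strings tend to follow which other strings.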

The OpenAI team have also made other additions and modifications to their newest model (GPT-4) to improve its capabilities as a chatbot, though few public details are available about this. Judging by the number of contributors to the GPT-4 paper (which lists 93 'core contributors' and hundreds of other contributors) relative to the GPT-3 paper (which lists only 31 authors), it appears that OpenAI has devoted a great deal of effort to adjusting, augmenting, and modifying the model in various ways. We know that systems have been put in place to filter out queries likely to lead to harmful or offensive results. There is also evidence that GPT-4 has a limited ability to check for faulty assumptions in the queries or instructions it is given, though it is unclear how this has been achieved. Nonetheless, it appears that extensive development work has been done beyond the initial stage of training the transformer on a large text corpus.

In my view, the fact that such extensive augmentations and modifications are necessary is an indication of the underlying weaknesses and limitations of the transformer architecture. These models learn complex associations between words, but they do not form the structured, flexible, multimodal representations of word meaning that humans do. As such, they do not truly 'understand' language in the sense that humans do. For many applications this does not matter, but in other cases it can manifest in extremely bizarre behaviour, including models accepting absurd premises, making faulty inferences, making contradictory statements, and failing to incorporate information they have been given.

A related issue is the known tendency of LLMs to 'hallucinate': making up facts, information, or non-existent libraries of computer code in their responses. I dislike the term hallucination because it implies there is some fundamental distinction between veridical knowledge that the LLM has correctly learned and hallucinations which it simply makes up. In fact there is no such distinction, because LLMs do not form memories of events or facts in the way humans do. All they are capable of is storing complex statistical associations in their billions of learned parameters. When the model produces some string of words as output, that string is equally the product of its internal learned parameters regardless of whether humans would evaluate it as true or false. Furthermore, an LLM has no notion of truth or falsity; it simply learns word associations. (Here I am setting aside the possibility that GPT-4 may be augmented with capabilities beyond its basic transformer architecture, since there is no public information about this, and at any rate the underlying architecture is still a transformer model.) As such, the problem of 'hallucinations' is not some teething issue or minor annoyance, but is intrinsic to the architecture and training method of LLMs. Of course, various proposals exist for mitigating this limitation, such as augmenting LLMs with curated datasets of encyclopaedic facts or common-sense knowledge. While promising, such proposals are not new and face many problems of their own. Though they may succeed in the long run, I do not believe there is any simple or easily implemented solution to the problem of 'hallucinations' in LLMs.
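For readers unfamiliar with what such augmentation proposals look like in practice, here is a hedged sketch of the general pattern: retrieve a relevant entry from a curated fact store and only answer when supporting evidence is found. The fact store, the lookup function, and the names used here are illustrative placeholders, not any particular system's API.

```python
from typing import Optional

# Hypothetical sketch of grounding model output in a curated fact store.
# In a real pipeline the retrieved context would be prepended to the prompt
# passed to the language model; here that step is only indicated in comments.

FACT_STORE = {
    "capital of australia": "Canberra",
    "boiling point of water at sea level": "100 degrees Celsius",
}

def retrieve(query: str) -> Optional[str]:
    """Naive keyword lookup standing in for a real retrieval system."""
    for key, value in FACT_STORE.items():
        if key in query.lower():
            return f"{key}: {value}"
    return None

def answer(query: str) -> str:
    context = retrieve(query)
    if context is None:
        # Without retrieved evidence, a system might decline to answer rather
        # than let the model generate an unsupported completion.
        return "No supporting fact found; answer withheld."
    # e.g. generate(context + "\n" + query) in a real retrieval-augmented setup
    return f"Based on the stored fact ({context}), ..."

print(answer("What is the capital of Australia?"))
```

Even this simple pattern raises the hard problems alluded to above: deciding what belongs in the fact store, keeping it current, and ensuring the model actually defers to the retrieved evidence.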

LLMs are also known to be susceptible to what are called adversarial attacks. While recently the term has been used to refer to breaking past the restrictions placed on ChatGPT and GPT-4 by their developers, I am using 'adversarial attack' in its older sense, referring to research which crafts training data or prompts designed to highlight weaknesses or limitations of LLMs. Numerous analyses in this tradition have found that LLMs often rely on superficial heuristics and spurious correlations in their training data, resulting in poor performance on cases carefully selected to highlight these shortcomings. Other adversarial attacks have found that LLMs can assign high probabilities to answers with nonsensically scrambled word order, are easily distracted by irrelevant content, and in some cases are not even sensitive to whether their prompts make sense at all. There is some debate about precisely how to interpret such results. For instance, LLMs are often inconsistent, performing a task well when asked in one way but failing miserably when the wording or context is changed slightly. Furthermore, humans can also show sensitivity to phrasing and context. However, well-conducted adversarial research focuses on cases where humans would clearly recognise the input as nonsense while the model does not, or conversely on cases where humans would not distinguish between the standard and adversarial inputs at all, yet the LLM performs drastically differently on them. There are enough examples of this across a range of tasks and different models, including very large models like GPT-3, that I think it is reasonable to conclude that LLMs do not 'understand' the textual input they process in anything like the way a human does.
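As a concrete (and hypothetical) illustration of this style of probing, the sketch below takes a prompt, generates minimally perturbed variants of it, and compares the model's answers across variants. The `query_model` callable is a placeholder for whichever model interface is under test; it is not a real library call.

```python
import random

# Sketch of adversarial probing: a human would treat the scrambled variant as
# nonsense and would ignore the irrelevant distractor, so large differences in
# the model's behaviour across variants are diagnostic of shallow heuristics.

def scramble(prompt: str, seed: int = 0) -> str:
    """Randomly reorder the words while keeping the vocabulary identical."""
    words = prompt.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def add_distractor(prompt: str) -> str:
    """Prepend an irrelevant sentence that should not change the answer."""
    return "Note that kangaroos are marsupials. " + prompt

def probe(prompt: str, query_model) -> dict:
    variants = {
        "original": prompt,
        "scrambled": scramble(prompt),
        "distracted": add_distractor(prompt),
    }
    return {name: query_model(text) for name, text in variants.items()}

if __name__ == "__main__":
    # Dummy stand-in for a real model API, just to show the probe running.
    dummy = lambda text: f"echo: {text[:30]}..."
    for name, reply in probe("What is 17 plus 25?", dummy).items():
        print(name, "->", reply)
```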

Another core limitation of LLMs, and the focus of extensive research, is their difficulty exhibiting compositionality. This refers to the ability to combine known elements in novel ways by following abstract rules. Many cognitive scientists have argued that compositionality is a critical component of the human ability to understand novel sentences containing combinations of words and ideas never previously encountered. Prior to the release of GPT-4, the best transformer models still struggled with many compositional tasks, often succeeding only when augmented with symbolic components (which are difficult to scale to real-world tasks) or when given special task-specific training. At the time of writing, I am not aware of GPT-4 having been subjected to these types of tests. Although I anticipate it would outperform most existing models, given that it shares the same transformer architecture I doubt it will completely solve the problem of compositionality. The underlying limitation is that transformer-based language models do not learn explicit symbolic representations, and hence struggle to generalise in accordance with systematic rules.
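The benchmarks used in this research tend to follow a common pattern, which the toy example below sketches loosely in the spirit of tests such as SCAN: train on most combinations of primitives and modifiers, hold one combination out, and check whether the system can produce it by applying the rule. The vocabulary and split here are invented for illustration; they are not the actual benchmark.

```python
# Minimal sketch of a compositional generalisation split. A human who knows
# what "jump" means and what "thrice" means can interpret "jump thrice"
# without ever having seen that combination; the question is whether a
# trained model can do the same.

PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP"}
MODIFIERS = {"twice": 2, "thrice": 3}

def interpret(command: str) -> str:
    """Ground-truth compositional rule: '<verb> <modifier>' -> repeated action."""
    verb, modifier = command.split()
    return " ".join([PRIMITIVES[verb]] * MODIFIERS[modifier])

# Training set: every verb-modifier combination except the held-out one.
held_out = "jump thrice"
train = [(c, interpret(c))
         for v in PRIMITIVES for m in MODIFIERS
         if (c := f"{v} {m}") != held_out]

print(train)
print(held_out, "->", interpret(held_out))  # expected: JUMP JUMP JUMP
```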

There have also been efforts to circumvent some of these limitations and use LLMs for a wider range of tasks by developing them into partially autonomous agents. The approach is to chain together a series of instructions, allowing the model to step through subcomponents of a task and reason its way to the desired conclusion. One such project, Auto-GPT, augments GPT with the ability to read from and write to external memory, and gives it access to various external software packages through their APIs. It is too early to say what will become of such projects, though early investigations report some promising results alongside plenty of difficulties. In particular, the model often gets stuck in loops, fails to correctly incorporate contextual knowledge to constrain solutions to the problem, and has no ability to generalise results to similar future problems. Such difficulties illustrate that LLMs are not designed to be general-purpose agents, and hence lack many cognitive faculties such as planning, learning, decision making, and symbolic reasoning. Furthermore, it is exceedingly unlikely that simply 'plugging in' various components to an LLM in an ad hoc manner will produce an agent capable of performing competently in a diverse range of environments. The way the components are connected and interact is crucial to the overall capabilities of the system. The structure of the different cognitive components of an agent is called a cognitive architecture, and there have been decades of research into this topic in both cognitive psychology and computer science. As such, I think it is naïve to believe that such research will be rendered irrelevant or obsolete by the simple expedient of augmenting LLMs with a few additional components. Instead, I expect that LLMs will form one component of many that will need to be incorporated into a truly general-purpose intelligent system, one which will likely take decades of further research to develop.
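For readers unfamiliar with this pattern, here is a very rough sketch of the kind of loop such projects implement: the model is repeatedly prompted with the goal, its own recent notes, and a menu of tools, and its reply is parsed into the next action. Everything here, including the `call_llm` callable, the tool names, and the stopping rule, is a hypothetical placeholder rather than Auto-GPT's actual interface.

```python
# Sketch of an LLM-driven agent loop with external memory and tools.
# The step cap reflects a failure mode noted above: these loops often
# get stuck and never terminate on their own.

memory = []  # stand-in for external memory (a file, vector store, etc.)
TOOLS = {
    "search": lambda q: f"search results for {q!r}",
    "write_file": lambda text: f"wrote {len(text)} characters",
}

def step(goal: str, call_llm) -> str:
    prompt = (f"Goal: {goal}\n"
              f"Notes so far: {memory[-3:]}\n"
              f"Available tools: {list(TOOLS)}\n"
              "Reply with 'tool: argument' or 'DONE: answer'.")
    return call_llm(prompt)

def run(goal: str, call_llm, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        reply = step(goal, call_llm)
        if reply.startswith("DONE:"):
            return reply
        tool, _, arg = reply.partition(":")
        result = TOOLS.get(tool.strip(), lambda a: "unknown tool")(arg.strip())
        memory.append(f"{reply} -> {result}")
    return "gave up: step limit reached"

if __name__ == "__main__":
    # Scripted replies standing in for a real model, to show the loop running.
    scripted = iter(["search: transformer limitations", "DONE: summary written"])
    print(run("Summarise critiques of LLMs", lambda prompt: next(scripted)))
```

Note that all of the planning, memory management, and tool selection logic lives in this scaffolding, not in the LLM itself, which is precisely the point about cognitive architectures made above.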

Cost of training of language models

Recent improvements in LLMs have primarily come from dramatic increases in both the number of model parameters and the size of the training datasets. This has led to a rapid increase in training costs, largely due to the electricity usage and the rental or opportunity cost of the required hardware. Some illustrative figures for the growth in the size and training cost of LLMs are shown in the table below. Sources are given for the numbers for GPT-2, -3, and -4, while the estimates for the hypothetically named 'GPT-5' and 'GPT-6' are extrapolated using rough estimates from the think tank Epoch. The point of this table is not to make definitive predictions, but to illustrate that the development costs of LLMs are already within reach only of large corporations and governments, and that costs will only continue to escalate in the coming years.

Model | Year | Params | Training cost | Source
GPT-2 | 2019 | 1.5 billion | $100,000 | Wiki, based on BERT cost x10 for model size
GPT-3 | 2020 | 175 billion | $10 million | Blog post and forum post, maybe at high end
GPT-4 | 2023 | 2 trillion | $100 million | Sam Altman quote, anonymous estimate of 1 trillion which I rounded for consistency
GPT-5 | 2025 | 20 trillion | $1 billion | Epoch estimate of 1 OOM every 2 years
GPT-6 | 2027 | 200 trillion | $10 billion | As above

Assuming current growth rates continue, within about five years further increasing model size will become infeasible even for the biggest governments and tech firms, as training costs will reach tens of billions of dollars. For comparison, the US military spends about $60 billion on R&D, while Apple spends about $30 billion, and Microsoft about $25 billion. The general thrust of my argument and numbers is further supported by a separate analysis in this EA forum post.
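The arithmetic behind the hypothetical rows and the five-year claim is simple, and a minimal sketch is given below. It assumes (as the Epoch-based extrapolation does) roughly one order of magnitude increase in training cost every two years, starting from GPT-4's estimated $100 million in 2023; the numbers are illustrative only.

```python
# Back-of-the-envelope extrapolation of LLM training costs,
# assuming ~1 order of magnitude increase every 2 years from
# an estimated $100 million for GPT-4 in 2023.

cost, year = 1e8, 2023
while cost < 3e10:          # stop once cost exceeds the largest R&D budgets (~$30-60B)
    year += 2
    cost *= 10
    print(f"{year}: ~${cost:,.0f}")
# 2025: ~$1,000,000,000
# 2027: ~$10,000,000,000
# 2029: ~$100,000,000,000  -- beyond even the US military's R&D budget
```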

Separate from the issue of training cost, there is also the question of the availability of training data. Existing models require enormous training datasets, with the size increasing exponentially from one iteration to the next. For example, GPT-3 was trained on a primary corpus of roughly 300 billion tokens, largely derived from the internet. Based on historical trends, Epoch estimates that high-quality language data will be exhausted by 2024 or 2025, and low-quality data by 2032. I expect this too will restrict the rate of performance improvement of LLMs.

I am not arguing here that the development of LLMs will cease within five years, or that further improvements are impossible. There has already been extensive work on ways to achieve high levels of performance using much smaller versions of an existing model. There is also the continual improvement in hardware capability described by Moore's Law, and various improvements that can be made to training algorithms and server overhead to increase efficiency. Yet none of this affects the key thrust of my argument, because the past five years have seen massive improvements in the capability of LLMs due to increasing model size on top of all of these other methods. A few years from now, continued growth in model size will not be economically feasible, so any improvements will have to come from those other methods alone. The result will almost certainly be a significant slowdown in the rate of improvement in LLM performance, at least for models operating within the existing transformer paradigm. Similar views have been expressed by other researchers, including Ben Goertzel, Gary Marcus, and Sam Altman. In light of these considerations, along with the intrinsic limitations discussed in the previous section, I do not think it is plausible that LLMs will reach or exceed human performance across a wide range of tasks in the near future, or that they will overcome all of the limitations discussed above simply through increased size.

Likely future trajectory

Currently we are in the early stages of large language models, analogous to the personal computer in 1980 or the internet in 1995. In the coming years I expect large tech companies to continue improving their own large language models and attempting to find profitable uses for them. This is a critical phase, in which there will be much experimentation and many failed attempts as companies compete to find the best ways to deploy the technology. It will take considerable time and effort to turn LLMs into viable products, and even longer to adapt them to various specialty applications and for the technology to become widely adopted. Many companies and organisations will seek ways to use LLMs to augment their existing internal processes and procedures, which will also take a great deal of time and trial and error. Contrary to what some have implied, no new technology can simply be 'plugged in' to existing processes without substantial change or adaptation. Just as automobiles, computers, and the internet took decades to have major economic and social impacts, so too I expect LLMs will take decades to have major economic and social impacts. Other technologies, such as nuclear fusion, reusable launch vehicles, and commercial supersonic flight, have still not achieved their promised impact.

One of the major limitations of existing LLMs is their unreliability. No important processes can currently be entrusted to LLMs, because we have very little understanding of how they work, limited knowledge of the limits of their capabilities, and a poor understanding of how and when they fail. They are able to perform impressive feats, but then fail in unexpected and surprising ways. This unpredictability and unreliability makes it very difficult to use LLMs for many business or government tasks. Of course humans regularly make mistakes, but human capabilities and fallibilities are far better understood than those of LLMs, and existing political, economic, and governance systems have been developed over many decades to manage human mistakes and imperfections. I expect it will similarly take many years to build systems that effectively work around the limitations of LLMs and achieve sufficient reliability for widespread deployment.

It is also valuable to take a historical perspective, as the field of artificial intelligence has seen numerous episodes of excessive hype and inflated expectations. In the late 1950s and early 1960s there was a wave of enthusiasm about the promise of logic-based systems and automated reasoning, which were thought capable of overtaking humans in many tasks within a matter of years. The failure of many of these predictions led to the first AI winter in the 1970s. The 1980s saw a resurgence of interest in AI, this time based on new approaches such as expert systems and the backpropagation algorithm, and initiatives such as Japan's Fifth Generation computer project. Underperformance of these systems and techniques led to another AI winter in the 1990s and early 2000s. The most recent resurgence of interest in AI has largely been driven by breakthroughs in machine learning and the availability of much larger sources of training data. Progress over the past 15 years has been rapid and impressive, but even so there have been numerous instances of inflated expectations and failed promises. IBM's Watson system, which won Jeopardy! in 2011, was heralded by IBM as a critical breakthrough in AI research, but the company subsequently spent years attempting to adapt the system for medical diagnosis with little success. Self-driving cars developed by Google attracted substantial publicity in 2012 with their ability to drive autonomously on public roads with minimal human intervention, but a decade later considerable challenges remain in handling the small portion of journeys where humans still need to take over. While such comparisons can never be definitive, I believe these historical precedents should temper our expectations about the rate of progress of the latest set of techniques in artificial intelligence research.

Conclusions

In this article I have argued that large language models have intrinsic limitations which are unlikely to be resolved without fundamentally new paradigms. I have also argued that the increasing cost of training large models and the limited stock of quality training data mean that growth of LLMs at present rates cannot continue for more than a few years. Furthermore, historical parallels indicate that it will take years for LLMs to become widely adopted and integrated into existing economic and social processes. Overall, in my view there is little reason to believe that LLMs will exceed human capabilities across a wide range of tasks within a few years, or displace large fractions of the workforce. These outcomes may occur in thirty or fifty years' time, but almost certainly not within the next five or ten years, and not solely through the continued development of LLMs. For these reasons I do not believe the EA movement should focus too heavily or too exclusively on LLMs or similar models as candidates for an AGI precursor, or place too much weight on short time horizons. We should pursue a diverse range of strategies for mitigating AI risk, and devote significant resources towards longer time horizons.