A recent survey of AI alignment researchers found that the most common response to the statement “Current alignment research is on track to solve alignment before we get to AGI” was “Somewhat disagree”. The same survey found that most AI alignment researchers also support pausing or slowing down AI progress.
Slowing down AI progress might be net-positive if you take ideas like longtermism seriously but it seems challenging to do given the strong economic incentives to increase AI capabilities. Maybe government policies to limit AI progress will eventually enter the Overton window when AI reaches a certain level of dangerous capability.
This is a cool project! Thanks for making it. Hopefully it makes the book more accessible.
Update: the UK government has announced £8.5 million in funding for systemic AI safety research.
Thanks for writing this! It’s interesting to see how MATS has evolved over time. I like all the quantitative metrics in the post as well.
I wrote a blog post in 2022 (1.5 years ago) estimating that there were about 400 people working on technical AI safety and AI governance.
In the same post, I also created a mathematical model which said that the number of technical AI safety researchers was increasing by 28% per year.
Using this model for all AI safety researchers, we can estimate that there are now roughly 400 × 1.28^1.5 ≈ 580 people working on AI safety.
I personally suspect that the number of people working on AI safety in academia has grown faster than the number of people in new EA orgs so the number could be much higher than this.
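Here's a minimal sketch of that extrapolation, using the ~400-person baseline and 28% annual growth rate from the 2022 post:

```python
# Minimal sketch: project the 2022 estimate of ~400 AI safety researchers
# forward 1.5 years at 28% annual growth.
baseline = 400        # estimated researchers in 2022
growth_rate = 0.28    # estimated annual growth rate
years_elapsed = 1.5

estimate = baseline * (1 + growth_rate) ** years_elapsed
print(f"Estimated researchers now: ~{estimate:.0f}")  # ~580
```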
One argument for continued technological progress is that our current civilization is not particularly stable or sustainable. One of the lessons from history is that seemingly stable empires such as the Roman or Chinese empires eventually collapse after a few hundred years. Unless further technological progress brings our civilization to a stable and sustainable state, I think it will eventually collapse because of climate change, nuclear war, resource exhaustion, political extremism, or some other cause.
Thanks for the writeup. I like how it’s honest and covers all aspects of your experience. I think a key takeaway is that there is no obvious fixed plan or recipe for working on AI safety and instead, you just have to try things and learn as you go along. Without these kinds of accounts, I think there’s a risk of survivorship bias and positive selection effects where you see a nice paper or post published and you don’t get to see experiments that have failed and other stuff that has gone wrong.
I’m sad to hear that AISC is lacking in funding and somewhat surprised given that it’s one of the most visible and well-known AI safety programs. Have you tried applying for grant money from Open Philanthropy since it’s the largest AI safety grant-maker?
“In brief, the book [Superintelligence] mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence”
Superintelligence describes exploiting hard-coded goals as one failure mode, which we would probably now call specification gaming. But the book is quite comprehensive: it describes other failure modes as well, and I think it's still relevant.
For example, the book describes what we would now call deceptive alignment:
“A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later”
And reward tampering:
“The proposal fails when the AI achieves a decisive strategic advantage at which point the action which maximizes reward is no longer one that pleases the trainer but one that involves seizing control of the reward mechanism.”
And reward hacking:
“The perverse instantiation—manipulating facial nerves—realizes the final goal to a greater degree than the methods we would normally use.”
I don’t think incorrigibility due to the ‘goal-content integrity’ instrumental goal has been observed in current ML systems yet but it could happen given the robust theoretical argument behind it:
“If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alterations of its final goals.”
Some information not included in the original post:
In April 2023, the UK government announced £100m in initial funding for a new AI Safety Taskforce.
In June 2023, UKRI awarded £31m to the University of Southampton to create a new responsible and trustworthy AI consortium named Responsible AI UK.
I think work on near-term issues like unemployment, bias, fairness, and misinformation is highly valuable, and the book The Alignment Problem does a good job of describing a variety of these kinds of risks. However, since these issues are generally more visible and near-term, I expect them to be relatively less neglected than long-term risks such as existential risk. The other factor is importance or impact. I believe the importance of existential risk greatly outweighs that of other possible effects of AI, though this view is partially conditional on believing in longtermism and weighting the value of the long-term trajectory of humanity highly.
I do think AI ethics is really important. One kind of research I find interesting is work on what Nick Bostrom calls the value loading problem: the question of what kind of philosophical framework future AIs should follow. This seems like a crucial problem that will need to be solved eventually, though my guess is that most AI ethics research is more focused on nearer-term problems.
Gavin Leech wrote an EA Forum post I recommend, The academic contribution to AI safety seems large, where he argues that academia's contribution to AI safety is large even after applying a strong discount factor, because academia does a lot of research on AI safety-adjacent topics such as transparency, bias, and robustness.
I have included some sections on academia in this post though I’ve mostly focused on EA funds because I’m more confident that they are doing work that is highly important and neglected.
Good question. I haven’t done much research on this but a paper named Understanding AI alignment research: A Systematic Analysis found that the rate of new Alignment Forum and arXiv preprints grew from less than 20 per year in 2017 to over 400 per year in 2022. However, the number of Alignment Forum posts has grown much faster than the number of arXiv preprints.
The Superalignment team currently has about 20 people according to Jan Leike. Previously I think the scalable alignment team was much smaller and probably only 5-10 people.
At OpenAI, I’m pretty sure there are far more people working on near-term problems than on long-term risks, though the Superalignment team now has over 20 people from what I’ve heard.
Thanks for the post. It was an interesting read.
According to The Case for Strong Longtermism, 10^36 people could ultimately inhabit the Milky Way. Under this assumption, one micro-doom (a one-in-a-million reduction in existential risk) is equal to 10^-6 × 10^36 = 10^30 expected lives.
If a 50th-percentile AI safety researcher reduces x-risk by 31 micro-dooms, they could save about 10^31 expected lives during their career, or about 10^29 expected lives per year of research. If the value of their research is spread out evenly across their entire career, then each second of AI safety research could be worth about 10^22 expected future lives, which is a very high number.
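Here's a rough sketch of that back-of-the-envelope arithmetic. The 40-year career length is my assumption, and the figures above are rounded to the nearest order of magnitude:

```python
# Rough sketch of the expected-value arithmetic above. The 40-year career
# length is an assumption; the other figures come from the comment.
future_lives = 1e36                          # potential inhabitants of the Milky Way
lives_per_micro_doom = 1e-6 * future_lives   # 1e30 expected lives per micro-doom
career_lives = 31 * lives_per_micro_doom     # ~3e31 expected lives over a career
career_years = 40
seconds_per_year = 365 * 24 * 60 * 60

print(f"per career: {career_lives:.1e}")                                    # ~3.1e31
print(f"per year:   {career_lives / career_years:.1e}")                     # ~7.8e29
print(f"per second: {career_lives / career_years / seconds_per_year:.1e}")  # ~2.5e22
```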
These numbers sound impressive but I see several limitations of these kinds of naive calculations. I’ll use the three-part framework from What We Owe the Future to explain them:
Significance: the value of research tends to follow a long-tailed curve where most papers get very few citations and a few get an enormous number. Therefore, most research probably has low value.
Contingency: the value of some research is decreased if it would have been created anyway at some later point in time.
Longevity: it’s hard to produce research that has a lasting impact on a field or the long-term trajectory of humanity. Most research probably has a sharp drop off in impact after it is published.
After taking these factors into account, I think the value of any given AI safety research is probably much lower than naive calculations suggest. Therefore, I think grant evaluators should take into account their intuitions on what kinds of research are most valuable rather than relying on expected value calculations.
Thanks for pointing this out. I didn’t know there was a way to calculate the exponential moving average (EMA) using NumPy.
Previously I was using alpha = 0.33 for weighting the current value. When that value is plugged into the formula alpha = 2 / (N + 1), it means I was averaging over the past 5 years.
I’ve now decided to average over the past 4 years so the new alpha value is 0.4.
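For reference, here's a minimal sketch of a recursive EMA with the span-based smoothing factor; the yearly values below are made up purely for illustration:

```python
import numpy as np

def ema(values, n_years):
    """Recursive exponential moving average with alpha = 2 / (N + 1)."""
    alpha = 2 / (n_years + 1)            # n_years = 4 gives alpha = 0.4
    smoothed = np.empty(len(values))
    smoothed[0] = values[0]
    for i in range(1, len(values)):
        smoothed[i] = alpha * values[i] + (1 - alpha) * smoothed[i - 1]
    return smoothed

print(ema([10, 12, 15, 19, 24], n_years=4))  # smooths the made-up yearly counts
```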
I recommend this web page for a narrative on what’s happening in our world in the 21st century. It covers many themes such as the rise of the internet, the financial crisis, COVID-19, global warming, AI, and demographic decline.
Thanks for the post. Until now, I used to learn about what LTFF funds by manually reading through its grants database. It’s helpful to know what the funding bar looks like and how it would change with additional funding.
I think increased transparency is helpful because it’s valuable for people to have some idea of how likely their applications are to be funded if they’re thinking of making major life decisions (e.g. relocating) based on them. More transparency is also valuable for funders who want to know how their money would be used.
I’ve never heard this idea proposed before so it seems novel and interesting.
As you say in the post, the AI risk movement could gain much more awareness by associating itself with the climate risk advocacy movement, which is much larger. Compute is arguably the main driver of AI progress, compute is correlated with energy usage, and energy use generally increases carbon emissions, so limiting carbon emissions from AI is an indirect way of limiting the compute dedicated to AI and slowing down the AI capabilities race.
This approach seems viable in the near future until innovations in energy technology (e.g. nuclear fusion) weaken the link between energy production and CO2 emissions, or algorithmic progress reduces the need for massive amounts of compute for AI.
The question is whether this indirect approach would be more effective than or at least complementary to a more direct approach that advocates explicit compute limits and communicates risks from misaligned AI.