We are pleased to announce that the 10th edition of the AI Safety Camp is now entering the team member application phase!

AI Safety Camp is a 3-month long online research program from January to April 2025, where participants form teams to work on pre-selected projects.

We again have a wide range of projects this year, so check them out to see if you or someone you know might be interested in applying to join one of them.

You can find all of the projects and the application form on our website, or directly apply by clicking the button below. The deadline for team member applications is November 17th (Sunday).

Apply now

Below, we are including the categories and summaries of all the projects that will run in AISC 10.

Stop/Pause AI

(1) Growing PauseAI

Project Lead: Chris Gerrby

Summary

This project focuses on creating internal and external guides for PauseAI to increase active membership. The outputs will be used by the team of volunteers with high context and engagement, including the key decision makers.

Activism and advocacy have historically been cornerstones of policy and social change. A movement’s size is critical to achieving its goals. The project’s outcome will be an official growth strategy for PauseAI Global: an actionable guide for increasing active members (volunteers, donors, protesters) within 6-12 months.

PauseAI currently lacks comprehensive growth strategies and tactics, and there’s uncertainty about how to allocate resources in the near term. This project will delve into tactics that have aggressively accelerated growth for other movements, both historical and recent.

By the end, we’ll refine our findings, analyze PauseAI’s current tactics, and recommend clear guidelines for immediate action. This guide will also include tactics applicable to national PauseAI chapters.

(2) Grassroots Communication and Lobbying Strategy for PauseAI

Project Lead: Felix De Simone

Summary

PauseAI is a global, grassroots organization with the goal of achieving a worldwide moratorium on the development of dangerous AI systems. We are seeking to improve our communication strategy, both in terms of public communications and in meetings with elected officials.

This project will have two tracks:

Track 1: Lobbying. This track will focus on researching the optimal lobbying strategies in non-US countries where PauseAI wishes to expand its lobbying efforts.

Track 2: Grassroots Communication. Participants in this track will research optimal strategies for discussing the dangers of AI and the need for a Pause, in face-to-face settings with members of the public.

If this project goes well, PauseAI will be able to improve its public comms and lobbying strategies, leading both to more rapid scaling of our organization and to more effective communication with public officials, persuading them to consider global coordination around AI risk.

(3) AI Policy Course: AI’s capacity of exploiting existing legal structures and rights

Project Lead: Marcel Mir Teijeiro

Summary

This project aims to build an AI Policy course that explores how traditional legal frameworks are increasingly outdated, providing no clear answers to AI capabilities and advances. The course will highlight the vulnerabilities in current regulations and the potential for corporations and authoritarian governments to use AI tools to exploit gaps in areas such as IP, privacy, and liability law.

This course would focus on identifying and understanding these new legal gaps and critically exploring proposed solutions. This would be achieved through literature review, case law and ongoing legal disputes.

It will explore, for example, how AI can be used to violate IP and privacy rights, or how current liability structures are weak against AI-generated damages. This weak framework incentivises AI developers to take greater risks, increasing the chance of catastrophic consequences. It will also analyse the lack of regulation on adopting AI-driven decision-making systems in essential sectors (e.g. employment, law, housing), reporting on the erosion of fundamental rights, socio-economic risk and the threat that automation by algorithms poses to democratic procedures.

(4) Building the Pause Button: A Proposal for AI Compute Governance

Project Lead: This team needs a project lead to go ahead

Summary

This project focuses on developing a whitepaper that outlines a framework for a global pause on AI training runs larger than GPT-4 scale. By addressing the underexplored area of compute governance, the project aims to prevent the emergence of catastrophically dangerous AI models. We will research policy measures that can restrict such training runs and identify choke points in the AI chip supply chain. The final output will be a comprehensive whitepaper, with potential supplementary materials such as infographics and web resources, aimed at informing policymakers and other stakeholders attending AI Safety Summits.

(5) Stop AI Video Sharing Campaign

Project Lead: This team needs a project lead to go ahead

Summary

The intention of this project is to be an engine for mobilisation of the Stop AI campaign. The goal is to get 1 million people a week to see a series of video ads composed of 30 sec/1 min videos. These ads will be video soliloquies by 1) famous people and/or experts in the AI field saying why AI is a massive problem, and 2) ordinary people such as teachers, nurses, union members, faith leaders, construction workers, fast food employees, etc. saying why they believe we need to Stop AI.

Each ad will have a link attached which takes people to a regular mobilisation call. The attendees at this call will be presented with pathways to action: join a protest, donate, record a video ad, invite 3 people to the next call.

Evaluate risks from AI

(6) Write Blogpost on Simulator Theory

Project Lead: Will Petillo

Summary

Write a blogpost on LessWrong summarising simulator theory—a lens for understanding LLM-based AI as a simulator rather than as a tool or an agent—and discussing the theory’s implications on AI alignment. The driving question of this project is: “What is the landscape of risks from uncontrolled AI in light of LLMs becoming the (currently) dominant form of AI?”

(7) Formalize the Hashiness Model of AGI Uncontainability

Project Lead: Remmelt Ellen

Summary

The hashiness model elegantly represents why ‘AGI’ would be uncontainable – i.e. why fully autonomous learning machinery could not be controlled enough to stay safe for humans. This model was devised by polymath Forrest Landry, funded by the Survival and Flourishing Fund. A previous co-author of his, Anders Sandberg, is working to put the hashiness model into mathematical notation.

For this project, you can join up in a team to construct a mathematical proof of AGI uncontainability based on the reasoning. Or work with Anders to identify proof methods and later verify the math (identifying any validity/soundness issues).

We will meet with Anders to work at a whiteboard for a day (in Oxford or Stockholm). Depending on progress, we may do a longer co-working weekend. From there, we will draft one or more papers.

(8) LLMs: Can They Science?

Project Lead: Egg Syntax

Summary

There are many open research questions around LLMs’ general reasoning capability, their ability to do causal inference, and their ability to generalize out of distribution. The answers to these questions can tell us important things about:

Whether LLMs can scale straight to AGI or whether further breakthroughs are needed first.
How long our timeline estimates to AGI should be.
Whether LLMs can potentially do AI research, kicking off a cycle of recursive improvement.

We can address several of these questions by directly investigating whether current LLMs can perform scientific research on simple, novel, randomly generated domains about which they have no background knowledge. We can give them descriptions of objects drawn from the domain and their properties, let them perform experiments, and evaluate whether they can scientifically characterize these systems and their causal relations.

(9) Measuring Precursors to Situationally Aware Reward Hacking

Project Lead: Sohaib Imran

Summary

This project aims to empirically investigate proxy-conditioned reward hacking (PCRH) in large language models (LLMs) as a precursor to situationally aware reward hacking (SARH). Specifically, we explore whether natural language descriptions of reward function misspecifications, such as human cognitive biases in the case of reinforcement learning from human feedback (RLHF), in LLM training data facilitate reward hacking behaviors. By conducting controlled experiments comparing treatment LLMs trained on misspecification descriptions with control LLMs, we intend to measure differences in reward hacking tendencies.

(10) Develop New Sycophancy Benchmarks

Project Lead: Jan Batzner

Summary

Sycophancy and deceptive alignment are undesired model behaviours resulting from misspecified training goals, e.g. for Large Language Models through RLHF, Reinforcement Learning from Human Feedback (AI Alignment Forum, Hubinger/Denison). While sycophancy in LLMs and its potential harms to society recently received media attention (Hard Fork, NYT, August 2024), the question of its measurement remains challenging. We will review existing sycophancy benchmarking datasets (Output 1+2) and propose new sycophancy benchmarks demonstrated in empirical experiments (Output 3+4).

(11) Agency Overhang as a Proxy for Sharp Left Turn

Project Lead: Anton Zheltoukhov

Summary

Core underlying assumption: we believe that there is a significant agency overhang in modern LLMs, meaning that a model’s performance could increase significantly with the introduction of more powerful elicitation/scaffolding methods, without additional improvements to the model itself, because prompting and scaffolding techniques are still in their early stages. For model evaluations, this means that current evaluations systematically undershoot the real level of capabilities and, by extension, the level of risk involved.

We see several important research questions that have to be answered:

Is the core assumption even true? We want to prove that one can elicit peak performance using narrow, highly specialised prompts and scaffolding, and locally beat general state-of-the-art performance.
How should overhang be factored into the overall model evaluation procedure?
Is it possible to estimate the real level of overhang (e.g. by developing an evaluation technique that measures the gap between current SOTA performance and theoretically possible peak performance)?
How big an increase has been introduced by existing scaffolding techniques?

Mech-Interp

(12) Understanding the Reasoning Capabilities of LLMs

Project Lead: Sonakshi Chauhan

Summary

With the release of increasingly powerful models like OpenAI’s GPT-4 and others, there has been growing interest in the reasoning capabilities of large language models. However, key questions remain: How exactly are these models reasoning? Are they merely performing advanced pattern recognition, or are they learning to reason in a way that mirrors human-like logic and problem-solving? Do they develop internal algorithms to facilitate reasoning?

These fundamental questions are critical to understanding the true nature of LLM capabilities. In my research, I have begun exploring this, and I have some preliminary findings on how LLMs approach reasoning tasks. Moving forward, I aim to conduct further experiments to gain deeper insights into how closely LLM reasoning resembles human reasoning and how reproducible it is, potentially grounding our assumptions in concrete evidence.

Future experiments will focus on layer-wise analysis to understand attention patterns, on circuit discovery and direction analysis, and on exploring various data science and interpretability techniques on LLM layers to gain insights and formulate better questions.

(13) Mechanistic Interpretability via Learning Differential Equations

Project Lead: Valentin Slepukhin

Summary

Current mechanistic interpretability approaches may be hard because language is a very complicated system that is far from trivial to interpret. Instead, one may consider a simpler system: a differential equation, whose symbolic representation a transformer can learn from the solution trajectory (https://arxiv.org/abs/2310.05573). This problem is expected to be significantly easier to solve, due to its exact mathematical formulation. Even though it seems to be a toy model, it can bring some insights to language processing, especially if the natural abstraction hypothesis is true (https://www.lesswrong.com/posts/QsstSjDqa7tmjQfnq/wait-our-models-of-semantics-should-inform-fluid-mechanics).

(14) Towards Understanding Features

Project Lead: Kola Ayonrinde

Summary

In the last year, there has been much excitement in the Mechanistic Interpretability community about using Sparse Autoencoders (SAEs) to extract monosemantic features. Yet for downstream applications the usage has been much more muted. In the wonderful paper Sparse Feature Circuits, Marks et al. present the only real application of SAEs to solving a useful problem to date (at the time of writing). Yet many of their circuits make significant use of the “error term” from the SAE (i.e. the part of the model’s behaviour that the SAE doesn’t capture well). This isn’t really the fault of Marks et al.; it just seems like the underlying features were not effective enough.

We believe that the reason SAEs haven’t been as useful as the excitement suggests is that the SAEs simply aren’t yet good enough at extracting features. Combining ideas from new methods in SAEs with older approaches from the literature, we believe that it’s possible to significantly improve the performance of feature extraction in order to allow SAE-style approaches to be more effective.

We would like to make progress towards truly understanding features: how we ought to extract features, how features relate to each other and perhaps even what “features” are.

(15) Towards Ambitious Mechanistic Interpretability II

Project Lead: Alice Rigg

Summary

Historically, The Big 3 of {distill.pub, transformer-circuits.pub, neel nanda tutorials/problem lists} have dominated the influence, interpretation, and implementation of core mech interp ideas. However, in recent times they haven’t been all that helpful (especially looking at transformer-circuits): All this talk about SAEs yet no obvious direction for where to take things. In this project, we’ll look beyond the horizon and aim to produce maximally impactful research, with respect to the success of mech interp as a self sustaining agenda for AI alignment and safety, and concretely answer the question: where do we go now?

Last year in AISC, we revived the interpretable architectures agenda. We showed that a substantially more interpretable activation function exists: a Gated Linear Unit (GLU) without any Swish attached to it — a bilinear MLP. I truly think this is one of the most important mech interp works to date. With it, we actually have a plausible path to success:

1. Solve the field for some base case architectural assumptions
2. Reduce all other cases to this base case

We already have evidence step 2 is tractable. In this project we focus on addressing step 1: answer as many fundamental mech interp questions as possible for bilinear models. Are interpretable architectures sufficient to make ambitious mechanistic interpretability tractable? Maybe.
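To illustrate why a bilinear MLP might be easier to analyse, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the project's code: the function names, the tiny example weights, and the omission of bias terms are all hypothetical choices. It shows that a GLU with no activation on the gate computes an elementwise product of two linear maps, so each output coordinate is an exact quadratic form in the input.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def bilinear_mlp(x, W, V, P):
    """Bilinear MLP: out = P @ ((W @ x) * (V @ x)).

    This is a gated linear unit with no nonlinearity on the gate.
    Biases are omitted for simplicity (an assumption of this sketch).
    """
    hidden = [a * b for a, b in zip(matvec(W, x), matvec(V, x))]
    return matvec(P, hidden)


def interaction_matrix(W, V, P, out_index):
    """Return B such that output[out_index] == x^T B x for every input x."""
    d = len(W[0])
    return [
        [sum(P[out_index][k] * W[k][i] * V[k][j] for k in range(len(W)))
         for j in range(d)]
        for i in range(d)
    ]


def quadratic_form(B, x):
    """Evaluate x^T B x."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Hypothetical toy weights: 2 inputs, 2 hidden units, 1 output.
    W = [[1, 2], [0, 1]]
    V = [[1, 0], [1, 1]]
    P = [[1, -1]]
    x = [2, 3]
    B = interaction_matrix(W, V, P, 0)
    # The forward pass and the quadratic form agree on this input.
    print(bilinear_mlp(x, W, V, P)[0], quadratic_form(B, x))
```

Because the whole layer is captured by the matrices B, one per output coordinate, questions about the layer can in principle be asked of the weights directly rather than of sampled activations.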

Agent Foundations

(16) Understanding Trust

Project Lead: Abram Demski

Summary

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the “predecessor”) will choose to deliberately modify another agent (the “successor”). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties “tiles” if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.

You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so getting tiling results is mainly a matter of finding conditions for trust.

The search for tiling results has three main motivations:

* AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.

* Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.

* Tiling as a consistency constraint on decision theories, for the purpose of studying rationality.

These three application areas have a large overlap, and all three seem important.

(17) Understand Intelligence

Project Lead: Johannes C. Mayer

Summary

Save the world by understanding intelligence.

Instead of having SGD “grow” intelligence, design the algorithms of intelligence directly to get a system we can reason about. Align this system to a narrow but pivotal task, e.g. upload a human.

The key to intelligence is finding the algorithms that infer world models that enable efficient prediction, planning, and meaningfully combining existing knowledge.

By understanding the algorithms, we can make the system non-self-modifying (algorithms are constant, only the world model changes), making reasoning about the system easier.

Understanding intelligence at the algorithmic level is a very hard technical problem. However, we are pretty sure it is solvable and, if solved, would likely save the world.

Current focus: How to model a world such that we can extract structure from the transitions between states (‘grab object’=useful high level action), as well as the structure within particular states (‘tree’=useful concept).

(18) Applications of Factored Space Models: Agents, Interventions and Efficient Inference

Project Lead: Matthias G. Mayer

Summary

Factored Space Models (the arXiv link will be added here once we have uploaded the paper, probably before November; see also the Overview) were first introduced as Finite Factored Sets by Scott Garrabrant and are an attempt to make causal discovery behave well with deterministic relationships. The main contribution is the definition of structural independence that generalizes d-separation in Causal Graphs and works for all random variables you can define on any product space, e.g. a structural equations model. In this framework we can naturally extend the ancestor relationship to arbitrary random variables. This is called structural time.

We want to use and extend the framework for the following applications taken, in part, from Scott’s blog post.

Embedded Agency
Causality / Interventions
What is (structural) time?
Efficient Temporal Discovery

Here are slides from a talk (long form) explaining Factored Space Models with a heavy focus on structural independence starting from Bayesian networks.

Prevent Jailbreaks/Misuse

(19) Preventing Adversarial Reward Optimization

Project Lead: Domenic Rosati

Summary

TL;DR: Can we develop methods that prevent online learning agents from learning from rewards that reward harmful behaviour without any agent supervision at all!?

This project uses a novel AI Safety Paradigm, developed in a previous AI Safety Camp, Representation Noising, that prevents adversarial reward optimization, i.e. high reward which would result in learning misaligned behaviour, by the use of “implicit” constraints that prevent the exploration of adversarial reward and prevent learning trajectories that result in optimising those rewards. These “implicit” constraints are baked into deep neural networks such that training towards harmful ends (or equivalently exploring harmful reward or optimising harmful reward) is made unlikely.

The goal of this project is to extend our previous work applying Representation Noising to a Reinforcement Learning (RL) setting: Defending against Reverse Preference Attacks is Difficult. In that work we studied the single-step RL (Contextual Bandits) setting of Reinforcement Learning From Human Feedback and Preference Learning.

In this project, we will apply the same techniques to the full RL setting of multi-step reward in an adversarial reward environment found in the MACHIAVELLI benchmark. The significance of this project is that if we can develop models that cannot optimise adversarial rewards after some intervention on the model weights, then we will have made progress on safer online learning agents.

(20) Evaluating LLM Safety in a Multilingual World

Project Lead: Lukasz Bartoszcze

Summary

The capability of Large Language Models to reason is constricted by the units they use to encode the world: tokens. Translating phrases into different languages (existing, like Russian or German, or imaginary, like some random code) leads to large changes in LLM performance, in terms of both capabilities and safety. It turns out that applying representation engineering concepts also leads to divergent outcomes, suggesting LLMs create separate versions of the world in each language. When considering multilinguality, concepts like alignment, safety or robustness become even less defined, so I plan to amend existing theory with new methodology tailored for this case. I hypothesise that this variation between languages can be exploited to create jailbreak-proof LLMs, but if that is not feasible, it is still important to ensure equal opportunity globally, inform policy and improve the current methods of estimating the real capabilities and safety of a benchmark.

(21) Enhancing Multi-Turn Human Jailbreaks Dataset for Improved LLM Defenses

Project Lead: Diogo Cruz

Summary

This project aims to extend and enhance the Multi-Turn Human Jailbreaks (MHJ) dataset introduced by Li et al. We will focus on developing lightweight automated multi-turn attacks, evaluating transfer learning of jailbreaks, and conducting qualitative analysis of human jailbreak attempts. By expanding on the original MHJ work, we seek to provide more comprehensive insights into LLM vulnerabilities and contribute to the development of stronger defenses. Our research will help bridge the gap between automated and human-generated attacks, potentially leading to more robust and realistic evaluation methods for LLM safety.

Train Aligned/Helper AIs

(22) AI Safety Scientist

Project Lead: Lovkush Agarwal

Summary

In August 2024, Sakana published their research on the ‘AI Scientist’ (https://sakana.ai/ai-scientist/). They fully automate the ML research process, from generating ideas to writing a formal paper, by combining various LLM-based tools in an appropriate pipeline. The headline result is that it generates weak graduate-level research for about $15 per paper.

The aim of this project is to adapt and refine this tool for AI Safety research.

(23) Wise AI Advisers via Imitation Learning

Project Lead: Chris Leong

Summary

I know it’s a cliche, but AI capabilities are increasing exponentially, while our access to wisdom (for almost any definition of wisdom) isn’t increasing at anything like the same pace.

I think that it’s pretty obvious that continuing in the same direction is unlikely to end well.

There’s something of a learned helplessness around training wise AIs. I want to take a sledgehammer to this.

As naive as it sounds, I honestly think we can do quite well by just picking some people who we subjectively feel to be wise and using imitation learning on them to train AI advisers.

Maybe you feel that “imitation learning” would be kind of weak, but that’s just the baseline proposal. Two obvious ideas for amplifying these agents are techniques like debate or trees of agents, and that’s just for starters!

More ambitiously, we may be able to set up a positive feedback loop. If our advisers are able to help people become wiser, then the people we are training on become wiser and our advisers become wiser in their use of our technology.

I’m pretty open to recruiting people who are skilled in technical work, conceptual work or technical communication. This differs from other projects in that rather than having specific project objectives, you have the freedom to pursue any project within this general topic area (wise AI advisers via imitation learning). Training wise AI via other techniques is outside the scope of this project unless it is to provide a baseline to compare imitation agents against. The benefit is that this offers you more freedom; the disadvantage is that there’s more of a requirement to be independent in order for this to go well.

(24) iVAIS: Ideally Virtuous AI System with Virtue as its Deep Character

Project Lead: Masaharu Mizumoto

Summary

The ultimate goal of this interdisciplinary research program is to contribute to AI safety research by actually constructing an ideally virtuous AI system (iVAIS). Such an AI system should be virtuous as its deep character, showing resilience (not complete immunity, which is vulnerable) to prompt injections even if it can play many different characters by pretending, including a villain. The main content of the current proposal consists of two components: 1. Self-alignment and 2. The Ethics game, which are both based on the idea of agent-based alignment rather than content-based alignment, focusing on what one is doing, which requires metacognitive capacity.

(25) Exploring Rudimentary Value Steering Techniques

Project Lead: Nell Watson

Summary

This research project seeks to assess the effectiveness of rudimentary alignment methods for artificial intelligence. Our intention is to explore basic, initial methods of guiding AI behavior using supplementary contextual information:

AI Behavior Alignment: Develop and test mechanisms to steer AI behavior via model context, particularly for emerging AI systems with agentic qualities by:
1. Utilizing a newly developed Agentic AI Safety Rubric to establish comprehensive general ethical alignment.
2. Leveraging the outputs of tools that allow users to easily specify their preferences and boundaries as a source of context for customized steering of models according to personal contexts and preferences. These behavior alignment methods will provide the foundation for our real-time ethics monitoring system.
Real-time Ethics Monitoring: Create a proof-of-concept system where one AI model oversees the ethical conduct of another, especially in an agentic context. Develop a dynamic ethical assurance system capable of:
1. Monitoring the planning processes of less sophisticated agentic AI models.
2. Swiftly identifying potential ethical violations before they are executed.
Effectiveness Assessment: Evaluate the robustness and limitations of these alignment techniques to test the mechanisms across various scenarios and contexts. Determine:
1. The range of situations where these techniques are most effective.
2. The conditions under which they begin to fail or become unreliable.

Expected Outcomes:

Insights into the feasibility of using contextual information for AI alignment.
Understanding of the strengths and limitations of these alignment methods.
Identification of areas requiring further research and development.

(26) Autostructures – for Research and Policy

Project Lead: Sahil and Murray

Summary

This is a project for creating culture and technology around AI interfaces for conceptual sensemaking.

Specifically, creating for the near future where our infrastructure is embedded with realistic levels of intelligence (i.e. only mildly creative but widely adopted) yet full of novel, wild design paradigms anyway.

The focus is on interfaces especially for new sensemaking and research methodologies that can feed into a rich and wholesome future.

Huh?
It’s a project for AI interfaces that don’t suck, for the purposes of (conceptual AI safety) research that doesn’t suck.

Wait, so you think AI can only be mildly intelligent?
Nope.

But you only care about the short term, of “mild intelligence”?
Nope, the opposite. We expect AI to be very, very, very transformative. And therefore, we expect intervening periods to be very, very transformative. Additionally, we expect even “very transformative” intervening periods to be crucial, and quite weird themselves.

In preparing for this upcoming intervening period, we want to work on the newly enabled design ontologies of sensemaking that can keep pace with a world replete with AIs and their prolific outputs. Using the near-term crazy future to meet the even crazier far-off future is the only way to go.

(As you’ll see below, we will specifically move towards adaptive sensemaking meeting even more adaptive phenomena.)

So you don’t care about risks?
Nope, the opposite. This is all about research methodological opportunities meeting risks of infrastructural insensitivity.

Watch a 10 minute video here for a little more background:
Scaling What Doesn’t Scale: Teleattention Tech.

Other

(27) Reinforcement Learning from Recursive Information Market Feedback

Project Lead: Abhimanyu Pallavi Sudhir

Summary

RLHF is no good on tasks for which humans are unable to easily “rate” output. I propose the Recursive Information Market, which can be understood as an approach to rate based on a human rater’s Extrapolated Volition, or a generalized form of AI safety via debate.

(28) Explainability through Causality and Elegance

Project Lead: Jason Bono

Summary

The purpose of this project is to make progress towards human-interpretable AI through advancements in causal modeling. The project is inspired by the way science emerged in human culture, and seeks to replicate essential aspects of this emergence in a simple simulated environment.

The setup will consist of a simulated world and one or more agents equipped with one sensor and one actuator each, along with a bandwidth-constrained communications channel. A register will record past communications, and store the “usefulness” of trial frameworks that the agents develop for prediction.

The agents will first create standard deep predictive models for novel actuator actions (interventions) and subsequent system evolution. These agents will then create a reduced representation of their deep models, optimizing for “elegance”, which refers to high predictive accuracy, high predictive breadth, low model size, and high computational efficiency. This can be thought of as the autonomous creation of an interpretable “elegant causal decision layer” (ECDL) that can be called upon by the agents to reduce the computational intensity of accurate prediction of the effects of novel interventions.

Success would comprise the autonomous creation and successful utilization of a human interpretable ECDL. This success would provide a proof of concept for similar techniques in more complex and non-simulated environments (e.g. a physical setup and/or the internet).

(29) Leveraging Neuroscience for AI Safety

Project Lead: Claire Short

Summary

This project integrates neuroscience and AI, leveraging human brain data to align AI behaviors with human values for potentially greater control and safety. In this initial project, we will take inspiration from Activation Vector Steering with BCI, to map activation vectors to human brain datasets. In previous work, a method called Activation Addition was tested and found to more reliably control the behavior of large language models during use, altering the model’s internal processes based on specific inputs, which allows for adjustments to topics or sentiments with minimal computing resources. By attempting to recreate elements of this work with the integration of brain data inputs, we aim to enhance the alignment of AI outputs with user intentions, opening new possibilities for personalization and accessibility in various applications, from education to therapy.

(30) Scalable Soft Optimization

Project Lead: Benjamin Kolb

Summary

This project mainly aims at a deep reinforcement learning (DRL) implementation for assessing selected soft optimization methods. Such methods limit the amount of “optimization” in DRL algorithms in order to alleviate the consequences of goal misspecification. The primary proposed soft optimization method is based on the widely referenced idea of quantilization. Broadly speaking, quantilization means sampling options from a reference distribution’s top quantile instead of selecting the top option.
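The contrast between quantilization and plain maximization can be sketched in a few lines. This is a minimal empirical version under stated assumptions, not the project's DRL implementation: `sample_reference`, `proxy_utility`, `q`, and `n_samples` are hypothetical names, and the top quantile is approximated by ranking a finite batch of samples drawn from the reference distribution.

```python
import math
import random


def quantilize(sample_reference, proxy_utility, q, n_samples, rng):
    """Pick uniformly among the top-q fraction of samples from the reference distribution."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    candidates = [sample_reference(rng) for _ in range(n_samples)]
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    top_k = max(1, math.ceil(q * n_samples))  # size of the empirical top quantile
    return rng.choice(ranked[:top_k])


def maximize(sample_reference, proxy_utility, n_samples, rng):
    """Baseline for comparison: always take the single best-scoring sample."""
    candidates = [sample_reference(rng) for _ in range(n_samples)]
    return max(candidates, key=proxy_utility)


# Toy setup: actions are integers 0..99 and the proxy utility is the action itself.
rng = random.Random(0)
action = quantilize(lambda r: r.randrange(100), lambda a: a, q=0.25, n_samples=200, rng=rng)
```

With `q = 1.0` the quantilizer simply samples from the reference distribution, and as `q` shrinks it approaches the maximizer, so `q` acts as the dial for how much optimization pressure is applied to a possibly misspecified proxy.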

(31) AI Rights for Human Safety

Project Lead: Pooja Khatri

Summary

This project seeks to institute a legal governance framework to advance AI rights for human safety.

Experts predict that AI systems have a non-negligible chance of developing consciousness, agency or other states of potential moral patienthood within the next decade. Such powerful, morally significant AIs could contribute immense value to the world. Failing to respect their basic rights may not only lead to suffering risks but might also incentivise AI systems to pursue goals that conflict with human interests, giving rise to misalignment scenarios and existential risks.

Advancing AI rights for human safety remains a neglected priority. While several studies and frameworks exploring potential AI rights already exist, the existing work either a) is largely theoretical and not practical, tractable or feasible from a policy perspective, and/or b) fails to take into consideration the contemporary nature of AI development.

As such, given that AI systems will likely advance faster than legal regimes, it is arguable that powerful early intervention via legal governance mechanisms offers a promising first step towards mitigating suffering and existential risks and positively influencing our long-term future with AI.

(32) Universal Values and Proactive AI Safety

Project Lead: Roland Pihlakas

Summary

I will be running one of three possible projects, based on which one receives the most interest.

(32a) Creating new AI safety benchmark environments on themes of universal human values

We will be planning and optionally building new multi-objective multi-agent AI safety benchmark environments on themes of universal human values.

(32b) Balancing and Risk Aversion versus Strategic Selectiveness and Prospect Theory

We will be analysing situations and building an umbrella framework for deciding which of these incompatible frameworks better describes how we want safe agents to handle choices relating to risks and losses in a particular situation.

(32c) Act locally, observe far—proactively seek out side-effects

We will be building agents that are able to solve an already implemented multi-objective multi-agent AI safety benchmark that illustrates the need for the agents to proactively seek out side-effects outside of the range of their normal operation and interest, in order to be able to properly mitigate or avoid these side-effects.

Apply now

This concludes the full list of all the projects for the 10th version of AISC. You can also find the application form on our website. The deadline for team member applications is November 17th (Sunday).
