My lab’s small AI safety agenda

(last minor update: 2024-04-18)

My lab has started devoting some resources to AI safety work. As a transparency measure, and as a way of reaching out, I describe our approach here.

Overall Approach

I select small theoretical and practical work packages that...

  • seem manageable in view of our very limited resources,

  • match our mixed background in applied machine learning, game theory, agent-based modeling, complex networks science, dynamical systems theory, social choice theory, mechanism design, environmental economics, behavioral social science, pure mathematics, and applied statistics, and

  • appear under-explored or neglected but promising or even necessary, according to our subjective assessment based on our reading of the literature and exchanges with people from applied machine learning, computational linguistics, AI ethics, and, most importantly, AI alignment research (you?).

Initial Reasoning

I believe that the following are likely to hold:

  • We don’t want the world to develop into a very low-welfare state.

  • Powerful AI agents that optimize for an objective not almost perfectly aligned with welfare can produce very low-welfare states.

  • AI systems can become powerful either because others explicitly give them power or because they are capable enough to acquire it themselves.

  • Highly capable AI agents will emerge soon enough.

  • It is impossible to specify and formalize sufficiently well what “welfare” actually means (welfare theorists have tried for centuries and still disagree; laypeople disagree even more).

My puzzling conclusions from this are:

  • We can’t make sure that AI agents optimize for an objective that is almost perfectly aligned with welfare.

  • It is not yet clear that we can (or even want to) prevent AI systems from getting powerful.

  • Hence we must try to prevent any powerful AI agent from optimizing for any objective whatsoever.

  • Doing so requires designing non-optimizing agents. This appears to be a necessary (though not sufficient) condition for AI safety, and one that is currently under-researched.

Further Reasoning

I also believe the following is likely to hold:

  • Even non-optimizing agents with limited cognitive capacities (like Elon Musk) can cause a lot of harm if they are powerful and misaligned.

From this I conclude:

  • We must also make sure no agent (whether AI or human, optimizing or not, intelligent or not) can acquire too much power.

Those of you who are Asimov fans like me might like the following...

Six Laws of Non-Optimizing

  1. Never attempt to optimize* your behavior with respect to any metric. (In particular: don’t attempt to become as powerful as possible.)

  2. Constrained by 1, don’t cause suffering or do other harm.

  3. Constrained by 1-2, prevent other agents from violating 1 or 2.

  4. Constrained by 1-3, do what the stakeholders in your behavior would collectively decide you should do.

  5. Constrained by 1-4, cooperate with other agents.

  6. Constrained by 1-5, protect and improve yourself.

Rather than trying to formalize this or even define the terms precisely, I just use them to roughly guide my work.

*When saying “optimize” I mean it in the strict mathematical sense: aiming to find an exact or approximate, local or global maximum or minimum of some given function. When I mean mere improvements w.r.t. some metric, I just say “improve” rather than “optimize”.
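To make the distinction concrete, here is a purely illustrative Python toy (the names, numbers, and the particular acceptance rule are just placeholders for this example, not any of our actual algorithms), contrasting an optimizing choice with a merely “improving”, good-enough choice:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)   # toy value estimates for 5 possible actions

def optimizing_choice(q):
    """'Optimize' in the strict sense: return the exact maximizer of q."""
    return int(np.argmax(q))

def improving_choice(q, aspiration):
    """Merely 'improve': accept any action whose value meets the aspiration,
    chosen at random instead of maximizing."""
    acceptable = np.flatnonzero(q >= aspiration)
    if acceptable.size == 0:
        # nothing is good enough; a real scheme might lower the aspiration here
        return int(rng.integers(q.size))
    return int(rng.choice(acceptable))

print(optimizing_choice(q))                 # always the single best action
print(improving_choice(q, aspiration=0.0))  # any 'good enough' action
```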

Agenda

We are currently (and slowly) pursuing two parallel approaches: the first relates to laws 1, 3, and 5 above, the second to law 4.

Non-Optimizing Agents

  • Explore several novel variants of aspiration-based policies and related learning algorithms for POMDPs, produce corresponding non-optimizing versions of classical to state-of-the-art tabular, ANN-based, and probabilistic-programming-based RL algorithms, and test and evaluate them in benchmark and safety-relevant environments from the literature, as well as in tailor-made environments for testing particular hypotheses. This might or might not be seen as a contribution to Agent Foundations research. (Currently underway as part of AI Safety Camp and SPAR; see the project website and Will Petillo’s interview with me. A toy sketch of the basic idea follows after this list.)

  • Test them in near-term relevant application areas such as autonomous vehicles, via state-of-the-art complex simulation environments. (Planned with a partner from autonomous-vehicle research)

  • Using our game-theoretical and agent-based modeling expertise, study them in multi-agent environments both theoretically and numerically.

  • Design evolutionarily stable non-optimizing strategies for non-optimizing agents that cooperate with others to punish violations of law 1 in paradigmatic evolutionary games.

  • Use our expertise in adaptive complex networks and dynamical systems theory to study dynamical properties of mixed populations of optimizing and non-optimizing agents: attractors, basins of attraction, their stability and resilience, critical states, bifurcations and tipping behavior, etc.
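To give a flavor of the first item in this list, here is a minimal toy sketch of one possible aspiration-based policy in the simplest conceivable setting, a stationary bandit. The aspiration level, the “closest-to-aspiration” rule, and all names are placeholders for illustration, not the specific policy variants we are developing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 5-armed bandit: each arm has an unknown mean reward.
true_means = np.array([0.0, 0.3, 0.5, 0.8, 1.0])
n_actions = true_means.size

q = np.zeros(n_actions)       # running value estimates
counts = np.zeros(n_actions)
aspiration = 0.6              # a "good enough" target, deliberately below the maximum

def aspiration_policy(q, aspiration, eps=0.1):
    """Pick the action whose estimated value is closest to the aspiration
    (rather than the value-maximizing one); explore with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmin(np.abs(q - aspiration)))

for t in range(5000):
    a = aspiration_policy(q, aspiration)
    r = true_means[a] + rng.normal(scale=0.1)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]        # incremental mean estimate

print("value estimates:", np.round(q, 2))
print("mean true reward obtained:", np.round(counts @ true_means / counts.sum(), 2))
```

The point is only that such an agent deliberately settles for “good enough” instead of pushing toward an extreme; the actual POMDP variants mentioned above are considerably more involved.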

Collective Choice Aspects

  • Analyse existing schemes for Reinforcement Learning from Human Feedback (RLHF) from a Social Choice Theory perspective to study their implicit preference aggregation mechanisms and their effects on inclusiveness, fairness, and diversity of agent behavior.

  • Reinforcement Learning from Collective Human Feedback (RLCHF): Plug suitable collective choice mechanisms from Social Choice Theory into existing RLHF schemes to make agents obey law 4. (Currently underway; a toy sketch follows after this list.)

  • Design collective AI governance mechanisms that focus on inclusion, fairness, and diversity.

  • Eventually merge the latter with the hypothetical approach to long-term high-stakes decision making described in this post.

  • Co-organize the emerging Social Choice for AI Ethics and Safety (SC4AI) community.
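As a toy illustration of the RLCHF item above: the sketch below aggregates several annotators’ rankings of candidate responses with one classical social choice rule (Borda count, chosen here purely for illustration; which mechanism is actually suitable is exactly the open design question). The resulting collective ranking could then supply the preference labels for an otherwise standard RLHF reward model.

```python
from collections import defaultdict

# Toy example: three annotators rank four candidate model responses (best first).
# In a plain RLHF pipeline, pairwise labels from individual annotators are pooled;
# here a social-choice rule (Borda count, as one illustrative option) first
# aggregates full rankings into a single collective ranking.
rankings = [
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "C", "A", "D"],
]

def borda_count(rankings):
    """Aggregate rankings with the Borda rule: a response ranked k-th from the
    bottom scores k points; points are summed over annotators."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)

collective_ranking = borda_count(rankings)
print(collective_ranking)   # ['B', 'A', 'C', 'D'] for the toy data above

# Pairwise preference labels derived from this collective ranking could then be
# used to fit a reward model as in standard RLHF.
```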

Call for Collaboration and Exchange

Given almost non-existent funding, we currently rely on voluntary work by a few interns and students writing their theses, so I would be extremely grateful for additional collaborators and people who are willing to discuss our approach.

Thanks

I profited a lot from a few conversations with, amongst others, Yonatan Cale, Simon Dima, Anca Dragan, Clément Dumas, Thomas Finn, Simon Fischer, Scott Garrabrant, Jacob Hilton, Vladimir Ivanov, Bob Jacobs, Jan Hendrik Kirchner, Benjamin Kolb, Vanessa Kosoy, Nathan Lambert, Linda Linsefors, Adrian Lison, David Manheim, Marcus Ogren, Joss Oliver, Will Petillo, Stuart Russell, Phine Schikhof, Hailey Schoelkopf (in alphabetical order). This is not meant to claim their endorsement of anything I wrote here, of course.