Projects I would like to see (possibly at AI Safety Camp)

I recently discussed with my AISC co-organiser Remmelt some possible project ideas I would be excited to see at the upcoming AISC, and I thought these would be valuable to share more widely.

Thanks to Remmelt for helpful suggestions and comments.

What is AI Safety Camp?

AISC in its current form is primarily a structure to help people find collaborators. If you join as a research lead, we give your project visibility and help you recruit a team. If you join as a regular participant, we match you up with a project you can help with.

I want to see more good projects happening. I know there is a lot of unused talent wanting to help with AI safety. If you want to run one of these projects, it doesn’t matter to me if you do it as part of AISC or independently, or as part of some other program. The purpose of this post is to highlight these projects as valuable things to do, and to let you know AISC can support you, if you think what we offer is helpful.

Project ideas

These are not my after-long-consideration top picks of the most important things to do, just some things I think would be net positive if someone did them. I typically don’t spend much cognitive effort on absolute rankings anyway, since I think personal fit is more important for ranking your personal options.

I don’t claim originality for anything here. It’s possible there is existing work on one or several of these topics that I’m not aware of. Please share links in the comments if you know of such work.

Is substrate-needs convergence inevitable for any autonomous system, or is it preventable with sufficient error correction techniques?

This can be done as an adversarial collaboration (see below) but doesn’t have to be.

The risk from substrate-needs convergence can be summarised as follows:

  1. If AI is complex enough to self-sufficiently maintain its components, natural selection will sneak in.

  2. This would select for components that cause environmental conditions needed for artificial self-replication.

  3. An AGI will necessarily be complex enough.

Therefore natural selection will push the system towards self-replication, and therefore it is not possible for an AGI to be stably aligned with any other goal. Note that this line of reasoning does not necessitate that the AI will come to represent self-replication as its goal (although that is a possible outcome), only that natural selection will push it towards this behaviour.

I’m simplifying and skipping over a lot of steps! I don’t think there currently is a great writeup of the full argument, but if you’re interested you can read more here or watch this talk by Remmelt or reach out to me or Remmelt. Remmelt has a deeper understanding of the arguments for substrate-needs convergence than me, but my communication style might be better suited for some people.

I think substrate-needs convergence is pointing at a real risk. I don’t know yet if the argument (which I summarised above) proves that building an AGI that stays aligned is impossible, or if it points to one more challenge to be overcome. Figuring out which of these is the case seems very important.

I’ve talked to a few people about this problem, and identified what I think is the main crux: how well can error correction mechanisms be executed?

When Forrest Landry and Anders Sandberg discussed substrate-needs convergence, they ended up with a similar crux, but unfortunately did not have time to address it. Here’s a recording of their discussion; however, Landry’s mic breaks about 20 minutes in, which makes it hard to hear him from that point onward.

Any alignment-relevant adversarial collaboration

What is an adversarial collaboration?
See this SSC post for an explanation: Call for Adversarial Collaborations | Slate Star Codex

Possible topics:

  • For and against some alignment plan. Maybe yours?

  • Is alignment of superhuman systems possible or not?

I expect this type of project to be most interesting if both sides already have strong reasons for believing the side they are advocating for. My intuition says that different frames will favour different conclusions, and that you will miss one or more important frames if either or both of you start from a weak conviction. The most interesting conversation will come from taking a solid argument for A and another solid argument for not-A, and finding a way for these perspectives to meet.

I think AISC can help find good matches. The way I suggest doing this is that one person (the AISC Research Lead) lays out their position in their project proposal, which we then post for everyone to see. When we open up for team member applications, anyone who disagrees can apply to join that project. You could have more than one person defending and attacking the position in question, and you can also add a moderator to the team if that seems useful.

However, if the AISC structure seems overkill or just not the right fit for what you want to do, there are other options too. For example, you’re invited to post ideas in the comments of this post.

Haven’t there already been several AI Safety debates?
Yes, and those have been interesting. But doing an adversarial collaboration as part of AISC is a longer time commitment than most of these debates, which will allow you to go deeper. I’m sure there have also been long conversations in the past that continued back and forth over months, and I’m sure many of those have been useful too. Let’s have more!

What capability thresholds along what dimensions should we never cross?

This is a project for people who think that alignment is not possible, or at least not tractable in the next couple of decades. I’d be extra interested to see someone work on this from the perspective of risk due to substrate-needs convergence, or at least taking that risk into account, since it is underexplored.

If alignment is not possible and we have to settle for less than god-like AI, then where do we draw the boundary for safe AI capabilities? What capability thresholds along what dimensions should we never cross?

Karl suggested something similar here: Where are the red lines for AI?

A taxonomy of: What end-goal are “we” aiming for?

In what I think of as “classical alignment research”, the end goal is a single aligned superintelligent AI, which will solve all our future problems, including defending us against any future harmful AIs. But the field of AI safety has broadened a lot since then. For example, there is much more effort going into coordination now. But is the purpose of regulation and other coordination just to slow down AI so we have time to solve alignment, so we can build our benevolent god later on? Or are we aiming for a world where humans stay in control? I expect different people and different projects to have different end-goals in mind. However, this isn’t talked about much, so I don’t know.

It is likely that some of the disagreement around alignment is based on different agendas aiming for different things. I think it would be good for the AI safety community to have an open discussion about this. However, the first step should not be to argue about who is right or wrong, but simply to map out what end-goals different people and groups have in mind.

In fact, I don’t think consensus on what the end-goal should be is necessarily something we want at this point. We don’t know yet what is possible. It’s probably good for humanity to keep our options open, which means different people preparing the path for different options. I like the fact that different agendas are aiming at different things. But I think the discourse and understanding could be improved by more common knowledge about who is aiming for what.

Crossposted from LessWrong