Guardrails vs Goal-directedness in AI Alignment

Many EAs see AI Alignment research efforts as an important route to mitigating x-risk from AGI. However, others are concerned that alignment research overall increases x-risk by accelerating AGI timelines. I think Michael Nielsen explains this well:

Practical alignment work makes today’s AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist.

I think we can distinguish between two broad subtypes of alignment efforts: 1) “implementing guardrails” and 2) “improving goal-directedness”.

I would categorise approaches such as running AI models on custom chips, emergency shutdown mechanisms, red teaming, risk assessments, dangerous capability evaluations, and safety incident reporting as “implementing guardrails”. This can be thought of as getting AI systems not to do the worst thing possible.
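
To make the distinction concrete, here is a minimal sketch of what a guardrails layer does. The names (GuardedModel, emergency_shutdown, blocked_patterns) are illustrative, not from any real library: the point is that this kind of wrapper can only block outputs, log incidents, or halt the system, and cannot make the underlying model any better at pursuing a user’s goal.

```python
# Minimal "implementing guardrails" sketch. All names here are hypothetical
# illustrations, not a real safety library. The wrapper can refuse, log,
# or shut down -- it cannot steer the model towards a user's goal.

class GuardedModel:
    def __init__(self, model, blocked_patterns, incident_log):
        self.model = model                        # any callable: prompt -> response
        self.blocked_patterns = blocked_patterns  # crude stand-in for capability evals
        self.incident_log = incident_log          # append-only safety incident reports
        self.halted = False                       # emergency shutdown flag

    def emergency_shutdown(self):
        """Flip the kill switch; every later call is refused."""
        self.halted = True

    def __call__(self, prompt):
        if self.halted:
            return "[refused: system shut down]"
        response = self.model(prompt)
        for pattern in self.blocked_patterns:
            if pattern in response.lower():
                self.incident_log.append({"prompt": prompt, "pattern": pattern})
                return "[refused: guardrail triggered]"
        return response


# Toy usage: the guardrail can stop bad outputs and halt the system, but a
# goal-agnostic model wrapped this way is no more useful than it was before.
if __name__ == "__main__":
    echo_model = lambda prompt: f"model output for: {prompt}"
    guarded = GuardedModel(echo_model, blocked_patterns=["bioweapon"], incident_log=[])
    print(guarded("write a poem"))   # passes through unchanged
    guarded.emergency_shutdown()
    print(guarded("write a poem"))   # refused after shutdown
```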

I would categorise approaches such as RLHF and reward shaping as “improving goal-directedness”. This could also be thought of as getting AI systems to do the best thing possible.
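
By contrast, here is a minimal sketch of reward shaping, one “improving goal-directedness” technique. The goal_score function is an assumed stand-in for a learned reward model (as in RLHF), and the 0.5 weight is arbitrary; the point is that the training signal actively pushes the model towards the behaviour the user wants, which is exactly what makes the system more useful and commercially attractive.

```python
# Minimal reward shaping sketch (a "goal-directedness" technique). The
# training signal is modified to push the policy *towards* the user's
# intended behaviour, not merely away from the worst outputs.
# goal_score is an assumed stand-in for a learned reward model.

def shaped_reward(task_reward: float, response: str, goal_score, weight: float = 0.5) -> float:
    """Combine the base task reward with a shaping bonus that scores how
    well the response serves the user's goal."""
    return task_reward + weight * goal_score(response)


# Toy usage: responses that better match the (stand-in) goal get more reward.
if __name__ == "__main__":
    goal_score = lambda r: 1.0 if "summary" in r.lower() else 0.0
    print(shaped_reward(0.0, "Here is a summary of the paper...", goal_score))  # 0.5
    print(shaped_reward(0.0, "I refuse to answer.", goal_score))                # 0.0
```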

I think “implementing guardrails” has much weaker acceleration effects than “improving goal-directedness”. An AI system that can be shut down and does not show dangerous capabilities is still not very useful if it can’t be directed towards the user’s specific goal.

So I think people who are worried that AI Alignment efforts might be net-negative because of these acceleration effects should consider prioritising “implementing guardrails” approaches to AI Alignment.