Guardrails vs Goal-directedness in AI Alignment
Many EAs see AI Alignment research efforts as an important route to mitigating x-risk from AGI. However, others are concerned that alignment research overall increases x-risk by accelerating AGI timelines. I think Michael Nielsen explains this well:
“Practical alignment work makes today’s AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist.”
I think we can distinguish between two broad subtypes of alignment efforts: 1) “implementing guardrails” and 2) “improving goal-directedness”.
I would categorise approaches such as running AI models on custom chips, emergency shutdown mechanisms, red teaming, risk assessments, dangerous capability evaluations and safety incident reporting as “implementing guardrails”. This can be thought of as getting AI systems to not do the worst thing possible.
I would categorise approaches such as RLHF and reward shaping as “improving goal-directedness”. This could also be thought of as getting AI systems to do the best thing possible.
I think “implementing guardrails” has much weaker acceleration effects than “improving goal-directedness”. An AI system that can be shut down and does not show dangerous capabilities is still not very useful if it cannot be directed towards the specific goals of its user.
So I think people who are worried that AI Alignment efforts might be net-negative because of acceleration effects should consider prioritising “implementing guardrails” approaches to AI Alignment.
Both approaches are important components of a comprehensive AI safety strategy. With that said, I think that improving goal-directedness (as you’ve defined it here) is likely to yield more fruitful long-term results for AI safety because:
1. A sufficiently advanced AGI (what is often labelled ASI, i.e. above human level) could outsmart any guardrails implemented by humans, given enough time and compute.
2. Guardrails seem (as you mentioned) to be an approach dedicated specifically to stopping an unaligned AI from causing damage. They do not actually get us closer to an aligned AI. If our goal is alignment, why should the primary focus be on an activity that doesn’t get us any closer to aligning an AI?
Thanks for your comment!
On your first point: I think a sufficiently intelligent ASI is just as likely to outsmart human goal-directedness efforts as it is to outsmart guardrails.
I think your second point is a good one.
There are many people who actively want to create an aligned ASI as soon as possible to reap its benefits, for whom my suggestion is not useful.
But there are others who primarily want to prevent the creation of a misaligned ASI, and are willing to forgo the creation of an ASI if necessary.
There are also others who want to create an aligned ASI, but are willing to delay this considerably to improve the chances that the ASI is aligned.
I think my suggestion is mainly useful for these second and third groups.