An important source of capabilities / safety overlap, via Ben Garfinkel:
Let’s say you’re trying to develop a robotic system that can clean a house as well as a human house-cleaner can… Basically, you’ll find that if you try to do this today, it’s really hard to do that. A lot of traditional techniques that people use to train these sorts of systems involve reinforcement learning with essentially a hand-specified reward function…
One issue you’ll find is that the robot is probably doing totally horrible things because you care about a lot of other stuff besides just minimizing dust. If you just do this, the robot won’t care about, let’s say throwing out valuable objects that happened to be dusty. It won’t care about, let’s say, ripping apart a couch cushion to find dust on the inside… You’ll probably find any simple line of code you write isn’t going to capture all the nuances. Probably the system will end up doing stuff that you’re not happy with.
This is essentially an alignment problem. This is a problem of giving the system the right goals. You don’t really have a way using the standard techniques of making the system even really act like it’s trying to do the thing that you want it to be doing. There are some techniques that are being worked on actually by people in the AI safety and the AI alignment community to try and basically figure out a way of getting the system to do what you want it to be doing without needing to hand-specify this reward function...
These are all things that are being developed by basically the AI safety community. I think the interesting thing about them is that it seems like until we actually develop these techniques, probably we’re not in a position to develop anything that even really looks like it’s trying to clean a house, or anything that anyone would ever really want to deploy in the real world. It seems like there’s this interesting sense in which we have the sort of system we’d like to create, but until we can work out the sorts of techniques that people in the alignment community are working on, we can’t give it anything even approaching the right goals. And if we can’t give it anything approaching the right goals, we probably aren’t going to go out and, let’s say, deploy systems in the world that just mess up people’s houses in order to minimize dust.
I think this is interesting, in the sense in which the processes to give things the right goals bottleneck the process of creating systems that we would regard as highly capable and that we want to put out there.
He sees this as positive: it implies massive economic incentives to do some alignment, and a block on capabilities until it’s done. But it could be a liability as well, if the alignment of weak systems is correspondingly weak, and if mid-term safety work feeds into a capabilities feedback loop with greater amplification.
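To make the reward-misspecification point in the quote concrete, here is a minimal sketch in Python. It is not from Garfinkel and not a real system; the environment, state fields, and penalty weights are all hypothetical, chosen only to show why a hand-specified "minimize dust" reward licenses the behaviours he describes.

```python
from dataclasses import dataclass


@dataclass
class State:
    dust: float            # total dust remaining in the house
    valuables_intact: int   # valuable objects still in the house
    cushions_intact: int    # couch cushions not ripped open


def naive_reward(prev: State, curr: State) -> float:
    """Hand-specified reward: pay the agent only for removing dust.

    Nothing here penalizes throwing out dusty valuables or ripping open
    cushions, so a competent optimizer is free to do both if it removes
    dust faster.
    """
    return prev.dust - curr.dust


def patched_reward(prev: State, curr: State) -> float:
    """Closer to what we actually care about, crudely approximated.

    Even this patched version only covers the failure modes we thought
    of in advance; the quote's point is that every hand-written line of
    code like this misses further nuances.
    """
    dust_removed = prev.dust - curr.dust
    valuables_lost = prev.valuables_intact - curr.valuables_intact
    cushions_ripped = prev.cushions_intact - curr.cushions_intact
    return dust_removed - 10.0 * valuables_lost - 5.0 * cushions_ripped


if __name__ == "__main__":
    before = State(dust=1.0, valuables_intact=3, cushions_intact=4)
    # The "rip the cushion, bin the dusty vase" trajectory:
    after = State(dust=0.0, valuables_intact=2, cushions_intact=3)
    print(naive_reward(before, after))    # 1.0  -- looks optimal to the agent
    print(patched_reward(before, after))  # -14.0 -- clearly bad to us
```

The gap between the two functions is the (hypothetical) version of the overlap he points to: until you can specify or learn something like the second objective, the system built on the first isn't one anyone would want to deploy.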