What is “capabilities”? What is “safety”? People often talk about the alignment tax: the capability, time, and cost a developer gives up by building an aligned/safe system instead of an unaligned/unsafe one. But why should we consider an unaligned/unsafe system “capable” at all? If someone developed a commercial airplane that went faster than anything else on the market but exploded on 1% of flights, no one would call it a capable airplane.
This idea isn’t new, and it overlaps with safety culture and safety engineering. But alongside recent criticism of the terms “safety” and “alignment”, I’m starting to think that the term “capabilities” is also unhelpful: it captures different things for different people.
I don’t think the airplane analogy makes sense, because airplanes are not intelligent enough to be characterized as having their own preferences or goals. If there were a new dog breed that was stronger and faster than all previous breeds, but also more likely to attack its owner, it would be perfectly straightforward to describe the dog as “more capable” (but also more dangerous).
I think people would say that the dog was stronger and faster than all previous dog breeds, not that it was “more capable”. It’s in fact significantly less capable at not attacking its owner, which is an important dog capability. I just think the language of “capability” is somewhat idiosyncratic to AI research and industry, and I’m arguing that it’s not particularly useful or clarifying language.
More to my point (though probably orthogonal to your point), I don’t think many people would buy this dog, because most people care more about not getting attacked than they do about speed and strength.
As a side note, I don’t see why preferences and goals change any of this. I’m constantly hearing AI (safety) researchers talk about “capabilities research” on today’s AI systems, but I don’t think most of them think those systems have their own preferences and goals. At least not in the sense that a dog has preferences or goals. I just think it’s a word that AI [safety?] researchers use, and I think it’s unclear and unhelpful language.
#taboocapabilities
I think game-playing AI is pretty well characterized as having the goal of winning the game, and as being more or less capable of achieving that goal with different amounts of training. Maybe I’m just too used to this language, but it seems very intuitive to me. Do you have any examples of people being confused by it?