A caveat is that some essential subareas of safety may be neglected. This is not a problem when subareas substitute for each other: e.g. debate substitutes for amplification, so it’s okay if one of them is neglected. But there is a problem when subareas complement each other: e.g. alignment complements robustness, so we probably need to solve both. See also When causes multiply.
It’s okay for a subarea to be neglected as long as there’s a substitute for it. But so far it seems that some areas are necessary components of AI safety (perhaps both inner and outer alignment are).
This makes sense. I don’t mean to imply that we don’t need direct work.
AI strategy people have thought a lot about the capabilities-to-safety ratio, but it’d be interesting to think about the ratio between the complementary parts of safety you mention. Ben Garfinkel notes that, e.g., reward engineering work (by alignment researchers) is dual-use; it’s not hard to imagine scenarios where lots of progress in reward engineering without corresponding progress in inner alignment could hurt us.
Important question, and nicely researched!