This is a really great write-up, thanks for doing this so conscientiously and thoroughly. It’s good to hear that Surge is mostly meeting researchers’ needs.
Re whether higher-quality human data is just patching current alignment problems: the way I think about it is that there’s a minimum level of quality you need in order to set up various enhanced human feedback schemes. You need people to actually read and follow the instructions; if they don’t do this reliably, you really won’t be able to set up something like amplification or other schemes that need your humans to interact with models in non-trivial ways. It seems good to get human data quality to the point where it’s easy for alignment researchers to implement different schemes that involve complex interactions (like the humans using an adversarial example finder tool or looking at the output of an interpretability tool). This is different from the case where, e.g., we have an alignment problem because MTurkers mark common misconceptions as truthful, whereas more educated workers correctly mark them as false. Swapping in more educated workers fixes that case, but I don’t think of it as a scalable sort of improvement.