Thank you for writing this post; I learnt a lot from it (including about things I didn't expect to, such as waste sites and failure modes in cryonics advocacy. Excellent stuff!).
Question for anyone to chip in on:
I'm wondering: if the "conditional pause system" the post advocates were made universal, would that imply that the alignment community needs to scale up drastically (in terms of the number of researchers) in order to do work similar to what ARC Evals is doing?
After all, someone would actually need to check whether systems at a given capability level are safe, and as the post argues, you would not want AGI labs to do it for themselves. However, if all the current top labs started throwing their cutting-edge models at ARC Evals, I imagine they would be quite overwhelmed. (And the demand for evals would only increase over time.)
I could see this being less of an issue if the evaluations only need to happen for the models that are really the most capable at a given point in time, but my worry would be that as capabilities increase, even if we test the top models rigorously, the second-tier models could still end up doing harm.
(I guess it would also depend on whether you can "reuse" some of the insights gained from evaluating the top models of a given period on the second-tier models of that same period, but I certainly don't know enough about this topic to say whether that would be feasible.)
Yes, in the longer term you would need to scale up the evaluation work that is currently going on. It doesn’t have to be done by the alignment community; there are lots of capable ML researchers and engineers who can do this (and I expect at least some would be interested in it).
> I could see this being less of an issue if the evaluations only need to happen for the models that are really the most capable at a given point in time
Yes, I think that’s what you would do.
> my worry would be that as capabilities increase, even if we test the top models rigorously, the second-tier models could still end up doing harm.
The proposal would be that if the top models are risky, then you pause. So, if you haven't paused, that means your tests concluded that the top models aren't risky. In that case, why would you expect the second-tier models to be risky?
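For concreteness, here is a minimal sketch of that decision rule in Python. Everything in it (the `Model` class, `conditional_pause`, the capability scores, the `is_risky` predicate) is a hypothetical illustration rather than any real evals API; the point is just that only the frontier models need to be evaluated, and a pause triggers if any of them fail, which is also what covers the strictly less capable second tier.

```python
# Sketch of a conditional-pause decision rule (illustrative names only).
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Model:
    name: str
    capability_score: float  # aggregate score used only to rank models


def conditional_pause(
    models: Iterable[Model],
    is_risky: Callable[[Model], bool],
    top_k: int = 3,
) -> bool:
    """Return True if development should pause.

    Only the top-k most capable models are evaluated directly; if none of
    them show dangerous capabilities, the less capable second-tier models
    are covered by the same verdict.
    """
    ranked = sorted(models, key=lambda m: m.capability_score, reverse=True)
    frontier = ranked[:top_k]
    return any(is_risky(m) for m in frontier)


# Usage: plug in whatever dangerous-capability evaluation an external
# evaluator would actually run in place of this toy predicate.
models = [
    Model("lab-A-frontier", 0.92),
    Model("lab-B-frontier", 0.90),
    Model("lab-A-previous-gen", 0.75),
]
if conditional_pause(models, is_risky=lambda m: m.capability_score > 0.95):
    print("Pause: a frontier model failed its evals.")
else:
    print("Continue: frontier models passed; second-tier models are strictly weaker.")
```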