How would this address the criticisms raised of METR’s methodology?
How would this not? It uses neither the same tasks nor the same human baseliner panel as the HCAST dataset.