The blog post by the Australian AI safety organization says, “We apply METR’s time-horizon methodology…” How would this address the criticisms raised of METR’s methodology?
At a glance, the FutureTech pre-print makes some interesting choices, e.g., task quality is only scored up to above-average and above-average gets a perfect score, and acknowledges some of the limitations with their methodology, e.g., all tasks used for this experiment must contain all relevant information in the LLM prompt. (Is that realistic for most work tasks?) I wonder if this pre-print will be submitted for publication in a journal? FutureTech seems to be one of those weird MIT hybrids between an academic research group and a management consultancy. I’m not sure if they’ve ever published a peer-reviewed paper.
[Edit on 2026-05-14 at 18:56 UTC: After reading Peter Slattery’s comment below, I spent a few more minutes looking into it, and I’m still not sure what FutureTech is or what kind of stuff they publish. If someone knows and can explain it, that would be helpful. I could spend more time and get to the bottom of it, but I don’t want to spend more time on it right now.
Please also note the EA Forum team has limited my ability to reply to comments, so I can’t reply further. But if you want to continue the discussion, I’m reachable here.]
Someone could take the time to do a deep dive into the FutureTech pre-print and write a review, but I wonder if that’s a good use of anyone’s time? Is there a reason to think this group publishes high-quality research that is worth getting into?
If someone thinks it’s worthwhile, and they also think the pre-print is unlikely to be submitted for peer review, one option would be to ask the EA organization called The Unjournal to commission a review by an external expert.
Are you sure you are thinking of the correct organization when you say:
FutureTech seems to be one of those weird MIT hybrids between an academic research group and a management consultancy. I’m not sure if they’ve ever published a peer-reviewed paper.
I say that because the lab has many publications, including in top peer-reviewed journals like Science. For more context, here is the publications page and here is the bio for Neil Thompson, the head of the lab:
Dr. Thompson’s work has over 3000 citations with an h-index of 21 across his publication portfolio, including such well known and renowned papers as Expertise, The Computational Limits of Deep Learning, and There’s plenty of room at the Top: What will drive computer performance after Moore’s law? Dr. Thompson has been invited to present his work and recommendations to Congressional Staffers (House and Senate), the US Federal Reserve, the Pentagon, National Security Staff, the Department of Commerce, the Department of Energy, Brookings Institute, and most recently presented at a World Summit on the same program as the Prime Minister of India and Former Prime Ministers of England and Australia. With experience in 80+ countries, Dr. Thompson’s research and impact is on a global scale.
Oh, and the preprint will almost certainly be submitted for peer review, but it might take 1-2 years before it is published.
Okay, if we suspect peer review will eventually happen but the process will be very slow, then it might still be worthwhile to commission an external review, whether through The Unjournal or some other route. I once actually did this with my own money just because I was really, desperately curious about a pre-print published by a company that would never be submitted for peer review. I think it ended up costing me $400-500, something like that.
Whether it’s worth the time, effort, and money depends on how much people actually care about this pre-print and think it’s important. Does anyone actually, sincerely think whether we’re on the cusp of apocalypse/utopia hangs on whether this pre-print is correct or not? How much is this particular pre-print actually a crux for anyone?
If it is actually a crux on which people’s expectations around AGI within the next decade hang, then it’s probably worth paying the $500 or $1,000 or whatever it costs to do a review. But if it isn’t on anyone’s top 10 list or even top 20 list of most important pieces of evidence for near-term AGI, then I guess… it probably doesn’t matter whether the pre-print’s findings are true or false.
The argument from an AI safety perspective about why it would be a cost-effective use of funds is straightforward. First, knowing whether the pre-print’s findings stand up under scrutiny is important insofar as the informational content of the pre-print is important for understanding AI. Second, there is currently very little high-quality evidence, and especially very little academic-calibre evidence, to present to skeptics who want to be convinced that an existentially consequential AGI is on the horizon. What could convince them? Well, potentially scientific evidence of this sort. And if your hopes or plans for AI safety depend on, or would be greatly helped by, the ability to bring skeptics on board, well, then it’s worth a relatively small investment to marshal evidence to convince skeptics.
Another potential candidate for external review is the Remote Labor Index pre-print. But the same caveat applies.
How would this not? It doesn’t use the same tasks nor does it use the same human baseliner panel as the HCAST dataset.