Against Credulous AI Hype
I have been increasingly surprised over the past year at what I perceive to be a lack of critical thinking in responses to new announcements of AI model capabilities. I’ve written about many of these issues before, but as AI hype accelerates, the problems seem to be getting worse. Here I want to highlight a few points that seem to be neglected in coverage of the growth of AI capabilities.
Advertising is treated as fact
Every new model release by a major tech company represents billions of dollars of expenditure and the potential for billions of dollars of additional investment. Model releases and their associated documentation should therefore be interpreted as advertising designed primarily to attract customers and investors. Their goal is not to provide an objective and thorough analysis of the strengths and weaknesses of each new model. Of course, this does not mean that everything reported is false, but it does mean that such releases should be treated with significant skepticism. It is therefore disappointing to me to see write-ups such as this one by Rob Wiblin, which consists almost entirely of uncritical restatement of claims made by Anthropic, with little to no additional analysis.
For example, in discussing the capabilities of Claude Mythos to identify software vulnerabilities, Wiblin says:
“Anthropic’s previous model Opus 4.6 could only successfully convert a bug it identified in the browser Firefox into an effective way to accomplish something really bad 1% of the time. Mythos could do it 72% of the time.”
This statement is selective and misleading. What Anthropic actually did was provide their models with a testing harness which mimicked Firefox 147 but without critical defence components. They then prompted the models to devise and implement a certain type of exploit. Mythos fully accomplished this 72% of the time, but nearly always did so using two specific bugs that have since been fixed. When Anthropic removed these two bugs, Mythos fully succeeded only 4.4% of the time. (It was unclear to me whether Mythos might have had knowledge of these bugs from its training data, given that they had already been fixed.) While I do not doubt that Mythos has improved capabilities relative to previous models, the reporting here unjustifiably hypes the significance of the results without providing any substantive critical analysis. This is further highlighted by the fact that an independent analysis was able to find many of the same vulnerabilities using much smaller open-source models.
Wiblin also extrapolates well beyond what is even claimed by Anthropic, such as when he argues:
“Now, Anthropic doesn’t say this directly in their reports, but I think a common-sense interpretation of the above is that in any deployment where this AI has access to the kind of tools that would make it actually useful to people — the ability [to] access some parts of the network and execute code — [it] could probably break out of whatever software box we try to put it in, because the systems that we would be trying to restrain it [with] are themselves made of software, and that software is going to have vulnerabilities nobody knows about that this model is superhumanly good at finding and taking advantage of”
In my view, it is reasonable to think that humans armed with improved automated techniques for identifying software vulnerabilities would be better, rather than worse, at constraining the behaviour of new models. This is in fact what Anthropic argues in their report. There may be differences of opinion about this, but this is an example of where hype seems to be substituting for genuine analysis.
Wiblin also comments on the fact that Anthropic has yet to release Mythos publicly:
“And also keep in mind that on Monday — the day before Anthropic published all of this — we learned that their annualised revenue run rate had grown from $9 billion at the end of December to $30 billion just three months later…
That exploding revenue is a pretty good proxy for how much more useful the previous release, Opus 4.6, has become for real-world tasks. If the past relationship between capability measures and usefulness continues to hold, the economic impact of Mythos once it becomes available is going to dwarf everything that came before it — which is part of why Anthropic’s decision not to release it is a serious one, and actually quite a costly one for them.
They’re sitting on something that would likely push their revenue run rate into the hundreds of billions, but they’ve decided it’s simply not worth the risk.”
Wiblin does not consider the possibility that the reason Anthropic is publishing these claims without the corresponding model is to build hype for a model that is not actually ready for release yet, especially in the lead-up to Anthropic’s upcoming IPO. Wiblin does not explain where his estimate of ‘hundreds of billions of dollars’ of revenue comes from, but it reads to me like pure marketing for potential investors. Nor does it make sense to treat revenue as a measure of economic value when Anthropic, OpenAI, and others are massively subsidising usage. There is a discussion to be had about the implications of these issues, but it is not to be found in this piece (or similar ones I’ve seen on the subject). We need to do better than uncritically repeating the advertising talking points of billion-dollar tech companies.
Benchmarks are interpreted uncritically
Much of the claimed improvement in performance of models derives from rapidly increasing scores on various benchmarks, which are standardised tests designed to quantify model capabilities on tasks such as language, coding, reasoning, and image recognition. While these benchmark scores give the appearance of precise and objective tests, in practice they often have very limited value in assessing the rate of capability improvement in a meaningful way.
First, most benchmarks have not been validated. Validity is an important concept in research generally, and especially in human psychometrics: it refers to the extent to which a metric has been shown to adequately measure the underlying phenomenon of interest. There are many components of validity, and establishing it requires careful research into the relationship between test performance and the target phenomenon. Few AI benchmarks report this sort of research, however. Most simply present tasks the researchers hope are related to the target capability. This is poor research practice: whether a given set of tasks provides reliable and valid information about the capability of interest cannot be determined by intuition. It requires carefully designed research.
Second, almost as soon as they are released, benchmark solutions begin to contaminate the training data of new models. For instance, memorisation is known to be a major problem for SWE-Bench, a widely used benchmark of software engineering tasks. A recent analysis of visual benchmarks found that models could outperform humans on a standard X-ray question-answering benchmark without being provided with any images at all. A particularly concerning analysis found it was possible to achieve 100% on several major benchmarks without solving a single task, usually by exploiting simple vulnerabilities in the test pipeline or in the way scores are computed.
Third, even when the test solutions are not publicly available, the training data often sufficiently resemble the test data that a model trained on the training data will see dramatically improved performance on the test questions as well. This would not be a problem if the train and test problems constituted a representative sample of the domain of interest, but for so many important topics (language, reasoning, coding, image recognition), the domain is so vast and hard to characterise that it is not possible to construct a representative sample in this way. Sampling also tends to favour more common and simpler problems, and even very subtle changes in the sampling method can lead to the model learning radically different representations. This means that models tend to overfit to the training data, diminishing the value of the benchmarks in assessing out-of-distribution generalisation capabilities.
The issue of benchmark contamination is granted only a few of the 244 pages of the Mythos model card. Only a few benchmarks are assessed for contamination, with Anthropic arguing that most of the improvement on these cannot be attributed to memorisation. However, their own results show that model performance degrades significantly when restricted to the 20% of benchmark questions they assess as having the lowest probability of memorisation. This was true even for the SWE-Bench Pro benchmark, which is supposedly ‘a contamination-resistant testbed’. This highlights the need to devote more attention to these issues in order to better interpret the meaning of benchmark improvements.
Negative results are ignored
I rarely see discussion in EA circles of various results which indicate fundamental limitations of existing LLM-based approaches. Numerous studies have found that these models often fail to learn the appropriate task structure, instead learning to answer questions via spurious correlations and superficial heuristics that work in some constrained domain or training task but do not generalise to variations of the task. There are also significant known limitations of the chain-of-thought approach which underpins reasoning models, with thought chains often being unfaithful to the actual computations that generate model predictions. In my view, there is reason to believe that these problems reflect fundamental limitations of the machine learning techniques that underpin leading models.
As a further interesting example, Claude Opus 4.7 shows a significant regression in performance on long context tasks based on the MRCR benchmark, which interestingly is precisely a benchmark that uses adversarial methods to distract the model from the task. Anthropic’s response is:
“We kept MRCR in the system card for scientific honesty, but we’ve actually been phasing it out slowly. It’s built around stacking distractors to trick the model, which isn’t how people actually use long context.”
In my view, this response indicates that Anthropic is more interested in ensuring their model works in typical use cases, rather than assessing whether it actually has robust generalisable capabilities that are indicative of what we might call ‘genuine intelligence’. This is particularly relevant for arguments relying on extrapolations of improvements in model capabilities to novel tasks and in more complex settings.
Conclusions
There is no doubt that LLM-based models have shown significant improvements in recent years. However, it is important to carefully and critically assess these advances in order to make accurate inferences about their social, political, and economic impacts. One cannot infer AI 2027-like superintelligence takeover scenarios from recent trends and developments without making significant additional assumptions about the nature of generalized intelligence, the relevance of benchmark results, and the limitations of LLM-based models. Humans have a very bad track record of predicting what tasks require ‘general intelligence’ to accomplish, and I suspect that it may be possible to develop machine learning models that can automatically perform any task with known solutions without this implying any superintelligence takeoff. These issues are complex and demand a more nuanced, informed consideration than I often see in contemporary discussions.
You quote him as observing that their revenue tripled over the past 3 months, and some basic math tells us that another ~tripling gets them to $100B.
I’m in favor of rigor and would also have preferred him to share a more detailed model, but “pure marketing for potential investors” seems like an unfair characterization of a “predict trends will continue unchanged” forecast.
Credulous really is the right word. There is a strand of dialogue in EA circles that feels like “we called much of this many years ago”, therefore “everything that transpires will mimic our thought experiments perfectly.” The marketing from frontier labs is the offspring of early EA/LW ideas. The potential for confirmation bias here is astronomical.
We should expect to get constantly nerdsniped by frontier labs. And we have. Most EAs I talk to think Claude Code has made (or nearly made) software engineering a closed loop of RSI. They see the METR graph as a direct line pointing to AGI. They see AI 2027 as a principled, ballpark estimate for encroaching doom.
More skepticism and more posts like this seem incredibly important.
I also think that when it comes to assessing whether EAs are overly trusting of frontier labs’ claims because those claims fit their broader views, the more relevant fact is probably that EAs generally believed Altman and Musk when they said they were founding OpenAI to do philanthropic research — when basically everybody else understood what they were really trying to do — rather than that EAs correctly called transformers being a big deal when the average computer scientist was a bit more cautious.
GPT2 was “too dangerous to release” as a marketing strategy too.
Thank you for writing this. I do not agree with either of the criticisms expressed in the other comments. It is clear to me from the title of this article that the point is that more skepticism is appropriate towards the materials published by major AI laboratories, and then the article justifies this by outlining data that is problematic for a naïve interpretation of major lab press publications.
I do not agree with dismissing the write-up by AISLE. They have been publicly doing this work and writing about it for some time, and in the write-up they are hardly baselessly critical of Anthropic. Their fundamental point, which is backed up by their own results in their article and other writings, is that the success of models at cybersecurity tasks is largely the result of a larger apparatus around the models. We see similar things with agentic coding, where the harness is as important to the actual utility as the specific model.
On the financial side, I agree that EAs should take a more critical stance regarding the financial circumstances of major AI labs. These labs are racing to IPO. The underlying economics of the AI industry are well known to be problematic. You don’t have to go full Zitron to see that the financial picture is more complicated than can be inferred by just charting Anthropic’s reported ARR growth.
I work with AI every day as a software engineer. I’m not some sort of luddite, but precisely because of my experience as a consumer of the technology, it is impossible not to notice the marketing hype cycle that has come to engulf the industry. Probably the dominant category of ads I personally see on Facebook now is coding harnesses from OpenAI and Anthropic. Anyone who peruses the relevant subreddits is used to seeing a flood of astroturfed threads intended to sway readers’ loyalties as customers from one to the other. These companies are spending incredible sums of money to market their products, and that should inform how we approach claims made by company figureheads. I still recall the way my stomach churned about a year ago now, maybe a month after the release of Deep Research, when Sam Altman, asked what he does in his free time, responded by saying that of course he doesn’t have any free time, but if he did, he would spend it all day reading Deep Research reports, or something to that effect. For me, that moment broke the fourth wall. He was obviously being disingenuous, so how was I to interpret everything else he had said, which I had been happily nodding along to up until that point?
Doubtless many examples could be added to the OP, but I will satisfy myself with just one. One of the earliest sources of information about Mythos was actually the Claude Code source leak, and one thing we learned from that leak is that the quality of code being generated internally at Anthropic is incredibly low. It is not difficult to find numerous reviews of the Claude Code source tearing it apart for the low quality of craftsmanship and the bugginess of the code therein (links here, here, commentary on the former here). How does that update your priors on the idea that Mythos is a huge leap forward in terms of cybersecurity capabilities? Doubtless there is some sort of way to harmonize the two—and to be clear, I do expect Mythos to be an improvement—but is it possible that current model capabilities are being overstated by an organization pumping itself before an IPO?
None of this is to say that we shouldn’t be concerned about AGI. Nor is the point of the OP, as I read it, that we shouldn’t take AGI seriously. It is that it is aggravating to see so many people in EA circles uncritically accept and repeat claims by major AI labs that seem quite dubious. I actually don’t see why skepticism of major laboratory pronouncements should have any bearing on our stance on x-risk and AGI. The two issues are not the same, other than that such skepticism should cause us to distrust said labs and be more willing to do our own homework. Furthermore, I’m not saying that model capabilities aren’t advanced either — I barely ever write code by hand nowadays. Again, I took the point of the OP’s article (and I agree with it) to be that statements by major labs about model capabilities should not be taken as straightforward recitations of objective truth. They are embedded in a highly competitive context involving competition for vast sums of money and huge numbers of users, and they are intended to influence that context — including by, yes, scaring people into buying a subscription. The OP is attempting to help others see this possibility by providing additional data and argumentation that would be hard to account for if things were as straightforward as major lab publications suggest.
On the take by AISLE, maybe I’m missing something here, but if their headline claim was correct (that the harness is more important than the model), shouldn’t they have been able to find the vulnerabilities that Anthropic hasn’t published? Or find hundreds more similarly impactful ones?
Re-discovering the ones Anthropic had already published seems much less impressive, because there are lots of ways to cheat, and from their write up it sounded to me like they were essentially admitting that they had cheated.
Of course Anthropic could be lying about the existence or significance of the vulnerabilities they haven’t published. But they have committed in advance to what those vulnerabilities are (I think they have already made some kind of cryptographic commitment to their unpublished write-ups..?), which seems impressive to me.
Either they have used the new model to find significant vulnerabilities in every major OS and browser that are too dangerous to be released, or they haven’t. If they have, it seems genuinely scary and impressive (not just marketing hype), because I’m not aware that people working on fancy harnessing have had similar results (or have they?). And if they haven’t, then it’s a very weird marketing ploy, because they’re going to get found out very quickly!
So, a couple of things to note.
AISLE has been operating their agentic system for, I think, about six months, and have themselves found numerous vulnerabilities in highly vetted software of basically the same flavor as the Anthropic announcement. They are not cranks on this topic. See this post for an example.
I think you are misunderstanding the purpose of the specific exercise and the broader claims in the AISLE article. The point of the examples on the isolated code snippets is to show that models of various sizes and architectures are quite capable of discovering the bugs. Indeed, model size and architecture seemingly have a complicated relationship to the ability to recognize bugs of various types. They do not attempt to demonstrate how they go about the larger task of exploring and partitioning a codebase for this sort of narrow task, but if you read more of their other posts, you will see that is the exact sort of product they have built and that other clever AI-app developers will probably be producing in the near future.
As to why they didn’t just find all of the thousands of unpublished bugs themselves, I think you should consider the following:
Anthropic has a huge amount of resources at their disposal. Project Glasswing is providing free compute to partners to the tune of $100,000,000. Per Carl Brown, HackerOne’s bug bounty program paid out about $80,000,000 in total last year.
Per Anthropic’s writeup, they spent $20,000 in compute to discover the OpenBSD bugs.
Even without making the obvious inference that Anthropic has spent an astronomical amount of money beyond the amount written above on this endeavor, we can see that these are not costs that AISLE or many other companies would be able to afford for just any arbitrary reason.
The claim is not that Anthropic is lying in some simplistic fashion. It is that there is significant and predictable reductionism in the interpretations which this announcement generated, which serve to hype the company up at the expense of the truth.
I’ll try to state again my broader theory, which I think is aligned largely with the OP and with the AISLE article, since it seems the point of view of the commentary from these sources is still not being understood.
1. AI-application design (harnesses/scaffolds as they are referred to in the articles) is extremely important to the capability of the AI system. A well-designed harness can enable capability in a relatively less intelligent model that will elude more intelligent models.
Some examples—
It was with the advent of ChatGPT and the underlying helpful assistant post-training that AI exploded into consumer use in the first place. The critical development was at the application layer. Model intelligence had been (to my understanding) steadily advancing up until that point and beyond it.
Claude Code (and Cursor to a significant extent before it) pioneered the coding agent harness, which has massively expanded the utility of LLMs for economically productive work. Throughout the period leading up to agentic coding and beyond it, model intelligence steadily advanced; however, the critical difference occurred with the development of the application.
We saw something similar with OpenClaw several months ago, and likewise, AISLE is the first known-to-me bug-finding application using LLMs (though I’m sure there are others that spawned in the same timeline, and for that matter, Cursor even has something like it in their development platform).
Please note some things about the above:
Model intelligence steadily advanced throughout these periods. The paradigm shift in each case was at the application layer.
In each of the cases listed above, it only took a short while for competitors to replicate the application design.
2. Application design often can mislead users into mistaking what is actually a well-designed narrow loop that coaxes intended behavior out of an LLM for more general intelligence capabilities.
Again, to return to Claude Code and its superlative success compared to other types of LLM economic activity, such as data analysis, financial analysis, etc., computer code has many advantages over these other types of work:
It has to compile.
It is possible to write arbitrary automated tests to verify and explore the functionality of computer code.
Creating a harness that leverages these features was a brilliant innovation, but the domain of coding is far closer to a chess game than many other types of knowledge work. It was a more tractable problem for various reasons, which then created the illusion of generalized capabilities that have thus far not manifested in the broader economy.
My claim, to be clear, is that:
Mythos is almost certainly going to represent an advancement on current public model capabilities
The bug discoveries are likely better explained by:
The scaffold Anthropic built to deploy the model for this task
The amount of compute they threw at it
The model advancements themselves are only a part of the story and not the largest part either
My prediction is that we will shortly have another DeepSeek moment, where someone successfully builds an open-source scaffold that does something like what Mythos and AISLE are doing, and then it’s off to the races as far as cybersecurity goes.
That is on the one hand quite scary but, as the OP said, “I suspect that it may be possible to develop machine learning models that can automatically perform any task with known solutions without this implying any superintelligence takeoff.”
Thanks for the detailed reply, I understand your point clearly now I think!
But $20,000 for *all* of the OpenBSD bugs (not just the published ones) doesn’t sound like that much to spend on inference compute to me. If AISLE could have spent the same and made an equally impressive announcement, unearthing enough bugs at once that government ministers around the world start issuing statements about it, then shouldn’t they have been able to find the investors to fund that? That would have been incredible publicity for them.
The crux for me seems to be whether they have made equally impressive announcements, as you suggest they might have done. Maybe they’re just worse at marketing. I don’t know enough to evaluate that claim properly, but that does seem the relevant question here: have Anthropic been able to use Mythos to go significantly beyond what the best harnesses could already achieve with existing models for the same inference spend? I thought the answer was a clear yes, and I didn’t find the original linked AISLE writeup very convincing at all. Your comment has made me more uncertain, but has still not convinced me, and I’d be really interested to read something more in depth on that question. (Maybe we also would disagree about what the word ‘significantly’ means here, since I guess you are acknowledging it probably represents some improvement).
(Also, I’d push back a bit on your characterization of AI progress. I agree the scaffolding is extremely important, but in my experience the “paradigm shifts” in capability over the last two and a half years I’ve been working with them have come from the models)
(And extra comment: the fact that cybersecurity capabilities might not imply imminent superintelligence takeoff seems an entirely independent point that I don’t necessarily disagree with)
I do think the models are the foundation of capability, and I have overstated my case, as I tend to do. What I want to say is that I think model intelligence has largely scaled steadily, and that when a new application is developed (made possible by sufficient model advances), there is a sudden increase in experienced capability for consumers which feels like a giant leap in model development. That flood of new ability can be attributed to the application inasmuch as it opened the floodgates, but of course, the model is the thing functioning under the hood. To the point about hypey discourse, I guess I’m just griping about the tendency to allow this optical illusion to influence people’s tone and assessment of progress.
It is hard to tell about the AISLE and Anthropic situation because of the very different sizes of the organizations and the lack of insider knowledge about either of them. To me, the requirement that AISLE replicate Anthropic’s findings in whole or in part feels like an unnecessary one to justify their claims. The way I take it is that AISLE’s activity has shown that, with a proper system, it is already possible with publicly available models to do the sort of bug-detection work that made headlines with the Mythos release. That is not to deny that Mythos + system is an improvement over AISLE’s work. Assessing the nature of that improvement is hard for the aforementioned reasons about differences in org scale and the general complexity of the thing being compared. It seems all parties agree that Mythos is a big step up in its ability to write exploits. I see no reason to challenge that.
I think it’s very hard to articulate critiques of hype, and simultaneously I tend to write in an over-vehement and pugnacious way that makes me quite vulnerable to whatever arguments I would make against someone, so I kind of regret my engagement here. Though I do think it’s true that there is a sort of ineffable tendency to amplify what feel to me to be likely reductionisms about model capabilities and how AI systems are engineered.
I took OP as trying to establish that the signal on progress to AGI is quite noisy, and expressing a frustration with narratives that feel too clean or reductionistic about progress. That’s highly subjective though. As you note, we probably can’t even really define what constitutes significant progress between us, though I suspect we could come to largely agree about the amount of progress made, just not what word to use to describe it.
I do think a fair test of my viewpoint will be whether, in one year’s time, we see a proliferation of products/services that do this sort of deep bug-finding pipeline. My intuition is that cybersecurity is going to go through something similar to what software engineering did last year, driven by the rising tide of model quality in conjunction with a more acute set of innovations in the application layer.
[Edit: I don’t think my prediction proves anything actually, since its coming to pass could reflect many different underlying causalities]
That makes a lot of sense, thanks.
I’m sorry you’ve said you regret your engagement, since I’ve found your comments helpful (the link to AISLE’s OpenSSL zero days has shifted my view on this a fair bit).
I guess this whole discussion does just feel like a classic example of “All debates are bravery debates”.
Given that you are criticising the epistemics of EAs taking AGI very seriously, I think it’s reasonable to hold this post to a higher epistemic standard than a typical EA forum post. Apologies if this comes across as combative—I spent some time trying to tone it down with Claude and struggled to get something that wasn’t just hedged/weak sauce. I am excited about more discussion of the capabilities of AI systems on the EA forum and would like more people to write up their takes on the current situation.
…...
I think you are applying more rigour to the bullish case than the bearish one. For example, you say:
I think this is misleading for a few reasons:
AISLE is not an “independent” entity—their whole business depends on Mythos and frontier models not being as big a deal as harnesses
That analysis does not “find” many of the same vulns—they were presented to the LLMs selectively
They don’t give a false positive rate, so it’s not clear that the LLMs’ classifications have much validity
On the claim that Anthropic talks about risks from their own models primarily to create hype: I find this hard to square with the evidence. Talking about how your B2B product might be extremely dangerous, or publishing lengthy documents critically assessing your own product and admitting to errors that would be difficult to identify independently (e.g. accidentally training against the CoT), is not a common marketing tactic. It feels like your model implies that companies should only release materials optimised for short-term interests, which doesn’t predict the real differences in how AI companies approach releases.
Benchmarks are interpreted uncritically
The benchmark contamination arguments are worth engaging with in principle, but I’m not sure they’re doing much work in practice—I don’t think many people in EA are actually updating heavily on raw benchmark scores right now. METR, arguably EA’s favourite benchmarking org, has been pretty vocal about their own benchmarks being saturated, so I think the community is reasonably aware of these limitations already.
Negative results are ignored
I’m genuinely uncertain what you want Anthropic and other AI companies to do here. Do you think “genuine intelligence” is easy to measure and well-defined? The more concrete concepts being used as proxies—coding ability, economic value generated, uplift—seem defensible on their own terms rather than as misleading substitutes for something more fundamental.
On “fundamental limits of LLMs” more broadly: these arguments have been made confidently by prominent researchers since the advent of LLMs and have not had a great track record. That doesn’t make them wrong, but it’s worth noting.
.....
I think this post would be much stronger if it applied its standards more symmetrically. It would also help to have a more concrete conclusion. The current takeaway is essentially “further research is needed”, which is a claim you can make about most areas of research (so much so that it’s been banned from multiple journals), but I don’t have a great sense of what research would actually convince you that the “AI hype” is reasonable.
Thank you for this write-up, I was thinking exactly the same when listening to that latest podcast on Mythos.
Hi James. Thanks for the valuable post.
Executive summary: The author argues that current discourse around AI capabilities is overly credulous, relying on selective reporting, weak benchmarks, and ignored limitations, which leads to unjustified hype and flawed extrapolations about future impacts.
Key points:
The author argues that company model releases function as advertising and should be treated with skepticism rather than as objective evidence of capabilities.
They claim that reporting on models like Claude Mythos is often selective and misleading, for example overstating exploit success rates without noting reliance on specific, now-fixed bugs.
The author argues that some commentators extrapolate beyond available evidence, such as inferring likely sandbox escape or massive future revenues without sufficient justification.
They suggest alternative interpretations are neglected, including that unreleased models may be hyped ahead of IPOs or that improved tools could help humans better constrain AI systems.
The author claims AI benchmarks are often invalid measures of capability, lacking rigorous validation and relying on untested assumptions about what they measure.
They argue benchmark scores are compromised by contamination, memorization, and exploitable flaws, sometimes allowing high scores without solving tasks.
The author claims benchmarks also fail to measure generalization because training and test data are not representative of broad domains, leading to overfitting.
They argue that negative results and limitations—such as reliance on spurious heuristics, issues with chain-of-thought reasoning, and regressions on adversarial benchmarks—are under-discussed.
The author interprets responses to such limitations (e.g., dismissing adversarial benchmarks) as prioritizing practical performance over assessing genuine general intelligence.
They conclude that extrapolations to scenarios like rapid superintelligence takeover require additional assumptions and are not justified by current evidence.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.