On the take by AISLE, maybe I’m missing something here, but if their headline claim was correct (that the harness is more important than the model), shouldn’t they have been able to find the vulnerabilities that Anthropic hasn’t published? Or find hundreds more similarly impactful ones?
Re-discovering the ones Anthropic had already published seems much less impressive, because there are lots of ways to cheat, and from their write-up it sounded to me like they were essentially admitting they had cheated.
Of course Anthropic could be lying about the existence or significance of the vulnerabilities they haven’t published. But they have committed in advance to what those vulnerabilities are (I think they have already made some kind of cryptographic commitment to their unpublished write-ups?), which seems impressive to me.
Either they have used the new model to find significant vulnerabilities in every major OS and browser that are too dangerous to be released, or they haven’t. If they have, that seems genuinely scary and impressive (not just marketing hype), because I’m not aware that people working on fancy harnessing have had similar results (or have they?). And if they haven’t, then it’s a very weird marketing ploy, because they’re going to get found out very quickly!
So, a couple of things to note.
AISLE has been operating their agentic system for I think about six months and have found numerous vulnerabilities in highly vetted software themselves of basically the same flavor as the Anthropic announcement. They are not cranks on this topic. See this post for an example.
I think you are misunderstanding the purpose of the specific exercise and the broader claims in the AISLE article. The point of the examples on isolated code snippets is to show that models of various sizes and architectures are quite capable of discovering the bugs. Indeed, model size and architecture seemingly have a complicated relationship to the ability to recognize bugs of various types. The article does not attempt to demonstrate how they go about the larger task of exploring and partitioning a codebase for this sort of narrow work, but if you read their other posts, you will see that this is exactly the sort of product they have built, and that other clever AI-app developers will probably be producing in the near future.
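To make the "the system does the heavy lifting" point concrete, here is a minimal, purely illustrative sketch of what the partitioning step described above might look like. Everything here (`partition_source`, `scan_codebase`, the injected `find_bugs` callback) is a hypothetical name of my own invention, not AISLE's actual pipeline:

```python
# Toy sketch: the harness partitions a codebase into snippets small enough
# for a model to review in isolation; `find_bugs` stands in for an LLM call.

def partition_source(source: str, max_lines: int = 40) -> list[str]:
    """Split a file into overlapping windows so each model call sees a
    snippet it can analyze on its own."""
    lines = source.splitlines()
    step = max_lines // 2  # 50% overlap so bugs spanning a boundary are seen
    windows = []
    for start in range(0, max(len(lines) - step, 1), step):
        windows.append("\n".join(lines[start:start + max_lines]))
    return windows

def scan_codebase(files: dict[str, str], find_bugs) -> dict[str, list[str]]:
    """Run the snippet-level bug finder over every window of every file and
    collect per-file findings."""
    reports = {}
    for path, source in files.items():
        findings = []
        for window in partition_source(source):
            findings.extend(find_bugs(window))
        if findings:
            reports[path] = findings
    return reports
```

The point of the sketch is that the model only ever answers the narrow question "is there a bug in this snippet?"; all the exploration and bookkeeping lives in the harness.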
As to why they didn’t just find all of the thousands of unpublished bugs themselves, I think you should consider the following:
Anthropic has a huge amount of resources at their disposal. Project Glasswing is providing free compute to partners to the tune of $100,000,000. Per Carl Brown, HackerOne’s bug bounty program paid out about $80,000,000 in total last year.
Per Anthropic’s writeup, they spent $20,000 in compute to discover the OpenBSD bugs.
Even without making the obvious inference that Anthropic has spent an astronomical amount beyond the figures above on this endeavor, we can see that these are not costs that AISLE or many other companies could absorb for just any arbitrary reason.
The claim is not that Anthropic is lying in some simplistic fashion. It is that there is significant and predictable reductionism in the interpretations which this announcement generated, which serve to hype the company up at the expense of the truth.
I’ll try to state my broader theory again, which I think aligns largely with the OP and the AISLE article, since it seems the point of view of these sources is still not being understood.
1. AI-application design (harnesses/scaffolds as they are referred to in the articles) is extremely important to the capability of the AI system. A well-designed harness can enable capability in a relatively less intelligent model that will elude more intelligent models.
Some examples—
It was with the advent of ChatGPT and the underlying helpful assistant post-training that AI exploded into consumer use in the first place. The critical development was at the application layer. Model intelligence had been (to my understanding) steadily advancing up until that point and beyond it.
Claude Code (and Cursor to a significant extent before it) pioneered the coding agent harness, which has massively expanded the utility of LLMs for economically productive work. Throughout the period leading up to agentic coding and beyond it, model intelligence steadily advanced; however, the critical difference occurred with the development of the application.
We saw something similar with OpenClaw several months ago, and likewise, AISLE is the first bug-finding LLM application I know of (though I’m sure others spawned in the same timeframe, and for that matter, Cursor has something like it in their development platform).
Please note some things about the above:
Model intelligence steadily advanced throughout these periods. The paradigm shift in each case was the application-layer.
In each of the cases listed above, it only took a short while for competitors to replicate the application design.
2. Application design often can mislead users into mistaking what is actually a well-designed narrow loop that coaxes intended behavior out of an LLM for more general intelligence capabilities.
Again, to return to Claude Code and its superlative success compared to other types of LLM economic activity, such as data analysis, financial analysis, etc., computer code has many advantages over these other types of work:
It has to compile.
It is possible to write arbitrary automated tests to verify and explore the functionality of computer code.
Creating a harness that leverages these features was a brilliant innovation, but the domain of coding is far closer to a chess game than many other types of knowledge work. It was a more tractable problem for various reasons, which then creates the illusion of generalized capabilities that have thus far not manifested in the broader economy.
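The two properties listed above (it has to compile; arbitrary automated tests can verify it) are exactly what give a coding harness a cheap, objective inner loop. A toy sketch of that loop, with `propose_patch` standing in for the LLM call — none of these names come from any real product:

```python
# Toy sketch of a propose/verify loop: the environment itself is the judge.
# `propose_patch(feedback)` stands in for an LLM call that sees the last
# failure message and tries again.

def run_tests(namespace: dict):
    """Objective verifier: return None on success, else the error message."""
    try:
        assert namespace["add"](2, 3) == 5
        assert namespace["add"](-1, 1) == 0
        return None
    except Exception as exc:
        return repr(exc)

def agent_loop(propose_patch, max_attempts: int = 5):
    """Propose code, execute it, and feed the failure back until tests pass."""
    feedback = None
    for _ in range(max_attempts):
        candidate = propose_patch(feedback)
        namespace = {}
        try:
            exec(candidate, namespace)  # "it has to compile" (and run)
            feedback = run_tests(namespace)
        except Exception as exc:
            feedback = repr(exc)
        if feedback is None:
            return candidate  # verified solution
    return None
```

Data analysis or financial analysis has no equivalent of `run_tests`: there is no compiler or test suite to close the loop, which is part of why the same trick has been harder to replicate there.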
My claim, to be clear, is that:
- Mythos almost certainly represents an advancement on current public model capabilities
- The bug discoveries are likely better explained by:
  - The scaffold Anthropic built to deploy the model for this task
  - The amount of compute they threw at it
- The model advancements themselves are only part of the story, and not the largest part either
My prediction is that we will shortly have another DeepSeek moment, where someone successfully builds an open-source scaffold that does something like what Mythos and AISLE are doing, and then it’s off to the races as far as cybersecurity goes.
That is on the one hand quite scary but, as the OP said, “I suspect that it may be possible to develop machine learning models that can automatically perform any task with known solutions without this implying any superintelligence takeoff.”
Thanks for the detailed reply, I understand your point clearly now I think!
But $20,000 for *all* of the OpenBSD bugs (not just the published ones) doesn’t sound like that much to spend on inference compute to me. If AISLE could have spent the same and made an equally impressive announcement, unearthing enough bugs at once that government ministers around the world start issuing statements about it, then shouldn’t they have been able to find the investors to fund that? That would have been incredible publicity for them.
The crux for me seems to be whether they have made equally impressive announcements, as you suggest they might have done. Maybe they’re just worse at marketing. I don’t know enough to evaluate that claim properly, but that does seem the relevant question here: have Anthropic been able to use Mythos to go significantly beyond what the best harnesses could already achieve with existing models for the same inference spend? I thought the answer was a clear yes, and I didn’t find the original linked AISLE writeup very convincing at all. Your comment has made me more uncertain, but has still not convinced me, and I’d be really interested to read something more in depth on that question. (Maybe we also would disagree about what the word ‘significantly’ means here, since I guess you are acknowledging it probably represents some improvement).
(Also, I’d push back a bit on your characterization of AI progress. I agree the scaffolding is extremely important, but in my experience the “paradigm shifts” in capability over the last two and a half years I’ve been working with them have come from the models)
(And extra comment: the fact that cybersecurity capabilities might not imply imminent superintelligence takeoff seems an entirely independent point that I don’t necessarily disagree with)
I do think the models are the foundation of capability, and I have overstated my case, as I tend to do. What I want to say is that model intelligence has largely scaled steadily, and that when a new application is developed (made possible by sufficient model advances), there is a sudden increase in the capability consumers experience, which feels like a giant leap in model development. That flood of new ability can be attributed to the application inasmuch as it opened the floodgates, but of course the model is the thing functioning under the hood. To the point about hypey discourse, I guess I’m just griping about the tendency to let this optical illusion influence people’s tone and assessment of progress.
It is hard to judge the AISLE and Anthropic situation because of the very different sizes of the organizations and the lack of insider knowledge about either. To me, the requirement that AISLE replicate Anthropic’s findings in whole or in part feels unnecessary to justify their claims. The way I take it, AISLE’s activity has shown that with a proper system, it is already possible with publicly available models to do the sort of bug-detection work that made headlines with the Mythos release. That is not to deny that Mythos + system is an improvement over AISLE’s work. Assessing the nature of that improvement is hard for the aforementioned reasons about differences in org scale and the general complexity of what is being compared. It seems all parties agree that Mythos is a big step up in its ability to write exploits. I see no reason to challenge that.
I think it’s very hard to articulate critiques of hype, and at the same time I tend to write in an over-vehement and pugnacious way that makes me quite vulnerable to the very arguments I would make against someone else, so I somewhat regret my engagement here. Still, I do think there is a real, if hard-to-pin-down, tendency to amplify what feel to me like likely reductionisms about model capabilities and how AI systems are engineered.
I took OP as trying to establish that the signal on progress to AGI is quite noisy, and expressing a frustration with narratives that feel too clean or reductionistic about progress. That’s highly subjective though. As you note, we probably can’t even really define what constitutes significant progress between us, though I suspect we could come to largely agree about the amount of progress made, just not what word to use to describe it.
I do think a fair test of my viewpoint will be whether, in one year’s time, we see a proliferation of products and services that run this sort of deep bug-finding pipeline. My intuition is that cybersecurity is going to go through something similar to what software engineering did last year, driven by the rising tide of model quality in conjunction with a more acute set of innovations in the application layer.
[Edit: I don’t think my prediction proves anything, actually, since its coming to pass could reflect many different underlying causes]
That makes a lot of sense, thanks.
I’m sorry you’ve said you regret your engagement, since I’ve found your comments helpful (the link to AISLE’s OpenSSL zero days has shifted my view on this a fair bit).
I guess this whole discussion does just feel like a classic example of “All debates are bravery debates”.
I wanted to follow up on this thread and bring in some additional evidence on the whole AISLE thing. It isn’t definitive or anything, but both AISLE and Mythos have been used to scan curl and the results are interesting.
- AISLE identified five CVEs and 24 bugs (plus two more CVEs in a dependency).
- Mythos identified 1 CVE and potentially 20 bugs.
Now, Mythos scanned after AISLE. We don’t know what would have happened if Mythos had come first. But here’s some quotes by the maintainer of curl about the Mythos results:
> curl is certainly getting better thanks to this report, but counted by the volume of issues found, all the previous AI tools we have used have resulted in larger bugfix amounts.
> I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing.
> Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools. Mythos will, and so will many of the others.
Quotes taken from here: https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/
Daniel’s blog post has some less-than-great English, and his tone is a bit less than objective. And of course, this is not a rigorous scientific comparison. But he is an expert in writing secure software, and he has demonstrated a willingness to change his mind on AI for cybersecurity,[1] so I think he’s worth listening to.
AISLE wrote a little news release about their findings vis-a-vis Mythos in the curl project: https://aisle.com/blog/curl-adopts-aisle-after-its-ai-agents-discovered-5-cves
AISLE also wrote up another blog post building on the earlier one. In this case, they showcase a simple pipeline that is able to recreate a Mythos result without any steering towards the relevant snippet of code: https://aisle.com/blog/system-over-model-zero-day-discovery-at-the-jagged-frontier
Interestingly, the same harness is not as successful at recreating AISLE’s own results, but this is all a bit selective.
As for my earlier claim that harnesses and pipelines enabling Mythos-like results can and will be built in the near future: AISLE has open-sourced nano-analyzer, the harness used in that blog post to discover perhaps as many as 40 bugs in FreeBSD, so it would appear that my prediction was fulfilled at the time of writing...
AISLE quotes $100 in spend to find the 40 bugs in FreeBSD and to recreate a Mythos result. That is roughly 1/200th of Anthropic’s reported $20,000 OpenBSD spend, and presumably a far smaller fraction of their total. This makes me think that Anthropic’s compute spend plays an important role in the overall assessment of the Mythos results. My belief now is that Anthropic’s high spend is probably due to an inefficient pipeline, which is somewhat of a reversal of my earlier belief that the pipeline was the key determinant of model capabilities. However, this doesn’t mean I think model capabilities are hugely improved; it just means I think that with a better pipeline, they would have spent less to get these results, given that AISLE found some of them for $100 with dumber models.
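Taking both reported dollar figures at face value, the spend ratio is easy to check:

```python
# Reported figures only; neither number is independently verified.
anthropic_spend_usd = 20_000  # Anthropic's stated OpenBSD compute spend
aisle_spend_usd = 100         # AISLE's stated nano-analyzer/FreeBSD spend

ratio = anthropic_spend_usd / aisle_spend_usd  # 200x difference in spend
```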
Here’s Daniel’s highly negative blog post on AI cybersecurity work from last Summer: https://daniel.haxx.se/blog/2025/07/14/death-by-a-thousand-slops/ Here’s where he changed his mind just a few months later after ZeroPath started to systematically find bugs using AI: https://daniel.haxx.se/blog/2025/10/10/a-new-breed-of-analyzers/