I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I’m not at all surprised they perform poorly. Among other illustrations of this impression, occasionally I’ve found they struggle to properly describe what is happening in an image beyond a relatively general level.
Marcel D
Again, I’d be interested to actually see humans attempt the test by viewing the raw JSON, without being allowed to see/generate any kind of visualization of the JSON. I suspect that most people will solve it by visualizing and manipulating it in their head, as one typically does with these kinds of problems. Perhaps you (a person with syntax in their username) would find this challenge quite easy! Personally, I don’t think I could reliably do it without substantial practice, especially if I’m prohibited from visualizing it.
Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually “visualize” the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what the function’s shape will take, even if the resulting graph is shown to everyone else.
I wouldn’t be surprised if that’s correct (though I haven’t seen the tests), but that wasn’t my complaint. A moderately smart/trained human can also probably convert from JSON to a description of the grid, but there’s a substantial difference in experience between seeing even a list of grid square-color labels and actually visualizing it and identifying the patterns. I would hazard a guess that humans who are only given a list of square color labels (not just the raw JSON) would perform significantly worse if they are not allowed to then draw out the grids.
And I would guess that even if some people do it well, they are doing it well because they convert from text to visualization.
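To make the distinction concrete, here is a minimal sketch of the three representations I have in mind: raw JSON, a list of square-color labels, and a drawn-out grid. The 3x3 grid and the color names are made up for illustration (they are not taken from an actual ARC task), and `to_labels`/`to_picture` are hypothetical helper names.

```python
import json

# Hypothetical 3x3 grid in the nested-list-of-integers style; not a real ARC task.
RAW_JSON = "[[0, 1, 0], [1, 1, 1], [0, 1, 0]]"

# Illustrative color names only (an assumption, not an official palette).
COLOR_NAMES = {0: "black", 1: "blue"}


def to_labels(grid):
    """Flatten the grid into a list of '(row,col) color' labels."""
    return [
        f"({r},{c}) {COLOR_NAMES[value]}"
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    ]


def to_picture(grid):
    """Draw the grid as text: '#' for any non-zero cell, '.' for zero."""
    return "\n".join("".join("#" if value else "." for value in row) for row in grid)


grid = json.loads(RAW_JSON)
print(RAW_JSON)           # what a text-only reader gets
print(to_labels(grid))    # the "list of square color labels" version
print(to_picture(grid))   # the drawn-out version, where the plus shape is obvious
```

All three carry the same information, but only the last one lets you see the shape at a glance, which is the gap I’m pointing at.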
Can anyone point me to a good analysis of the ARC test’s legitimacy/value? I was a bit surprised when I listened to the podcast, as they made it seem like a high-quality, general-purpose test, but then I was very disappointed to see it’s just a glorified visual pattern abstraction test. Maybe I missed some discussion of it in the podcasts I listened to, but it just doesn’t seem like people pushed back hard enough on the legitimacy of comparing “language model that is trying to identify abstract geometric patterns through a JSON file” vs. “humans that are just visually observing/predicting the patterns.”
Like, is it wrong to demand that humans should have to do this test purely by interpreting the JSON (with no visual aid)?
I spent way too much time organizing my thoughts on AI loss-of-control (“x-risk”) debates without any feedback today, so I’m publishing perhaps one of my favorite snippets/threads:
A lot of debates seem to boil down to under-acknowledged and poorly-framed disagreements about questions like “who bears the burden of proof.” For example, some skeptics say “extraordinary claims require extraordinary evidence” when dismissing claims that the risk is merely “above 1%”, whereas safetyists argue that having >99% confidence that things won’t go wrong is the “extraordinary claim that requires extraordinary evidence.”
I think that talking about “burdens” might be unproductive. Instead, it may be better to frame the question more like “what should we assume by default, in the absence of definitive ‘evidence’ or arguments, and why?” “Burden” language is super fuzzy (and seems a bit morally charged), whereas this framing at least forces people to acknowledge that some default assumptions are being made and consider why.
To address that framing, I think it’s better to ask/answer questions like “What reference class does ‘building AGI’ belong to, and what are the base rates of danger for that reference class?” This framing at least pushes people to make explicit claims about what reference class building AGI belongs to, which should make it clearer that it doesn’t belong in your “all technologies ever” reference class.
In my view, the “default” estimate should not be “roughly zero until proven otherwise,” especially given that there isn’t consensus among experts and the overarching narrative of “intelligence proved really powerful in humans, misalignment even among humans is quite common (and is already often observed in existing models), and we often don’t get technologies right on the first few tries.”
I definitely think “beware” is too strong. I would recommend “discount” or “be skeptical” or something similar.
Venus is an extreme example of an Earth-like planet with a very different climate. There is nothing in physics or chemistry that says Earth’s temperature could not one day exceed 100 C.
[...]
[Regarding ice melting -- ] That will take time, but very little time on a cosmic scale, maybe a couple of thousand years.
I’ll be blunt, remarks like these undermine your credibility. But regardless, I just don’t have any experience or contributions to make on climate change, other than re-emphasizing my general impression that, as a person who cares a lot about existential risk and has talked to various other people who also care a lot about existential risk, there seems to be very strong scientific evidence suggesting that extinction is unlikely.
Everything is going more or less as the scientists predicted, if anything, it’s worse.
I’m not that focused on climate science, but my understanding is that this is a bit misleading in your context: there were some scientists (in the 90s/2000s?) who forecasted doom or at least major disaster within a few decades due to feedback loops or other dynamics which never materialized. More broadly, my understanding is that forecasting climate has proven very difficult, even if some broad conclusions (e.g., “the climate is changing,” “humans contribute to climate change”) have held up. Additionally, it seems that many engineers/scientists underestimated the pace of progress in alternative energy technology (e.g., solar).
That aside, I would be excited to see someone work on this project, and I still have not discovered any such database.
Forecasting With LLMs—An Open and Promising Research Direction
I don’t find this response to be a compelling defense of what you actually wrote:
since AIs would “get old” too [...] they could also have reason to not expropriate the wealth of vulnerable old agents because they too will be in such a vulnerable position one day
It’s one thing if the argument is “there will be effective enforcement mechanisms which prevent theft,” but the original statement still just seems to imagine that norms will be a non-trivial reason to avoid theft, which seems quite unlikely for a moderately rational agent.
Ultimately, perhaps much of your scenario was trying to convey a different idea from what I see as the straightforward interpretation, but I think it makes it hard for me to productively engage with it, as it feels like engaging with a motte-and-bailey.
Apologies for being blunt, but the scenario you lay out is full of claims that just seem to completely ignore very facially obvious rebuttals. This would be less bad if you didn’t seem so confident, but as written the perspective strikes me as naive and I would really like an explanation/defense.
Take for example:
Furthermore, since AIs would “get old” too, in the sense of becoming obsolete in the face of new generations of improved AIs, they could also have reason to not expropriate the wealth of vulnerable old agents because they too will be in such a vulnerable position one day, and thus would prefer not to establish a norm of expropriating the type of agent they may one day become.
Setting aside the debatable assumptions about AIs getting “old,” this just seems to completely ignore the literature on collective action problems. If the scenario were such that any one AI agent can expect to get away with defecting (expropriation from older agents) and the norm-breaking requires passing a non-small threshold of such actions, a rational agent will recognize that their defection has minimal impact on what the collective will do, so they may as well do it before others do.
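As a rough illustration of why the “don’t establish the norm” motive is weak, here is a minimal sketch of a threshold model. Everything in it is my own toy assumption rather than anything from the post: the norm collapses only once at least `threshold` agents defect, each of the other agents defects independently with probability `p_defect`, and the numbers are arbitrary. Under those assumptions, one agent’s defection matters for the norm only if exactly `threshold - 1` others defect.

```python
from math import comb


def pivotal_probability(n_others: int, threshold: int, p_defect: float) -> float:
    """Probability that exactly threshold - 1 of the other agents defect,
    i.e. that this agent's own defection is what tips the norm over."""
    k = threshold - 1
    if k < 0 or k > n_others:
        return 0.0
    return comb(n_others, k) * p_defect**k * (1 - p_defect) ** (n_others - k)


def net_gain_from_defecting(gain: float, norm_loss: float,
                            n_others: int, threshold: int, p_defect: float) -> float:
    """Private gain from defecting minus the expected cost of being the
    agent who breaks the norm (toy expected-value calculation)."""
    return gain - pivotal_probability(n_others, threshold, p_defect) * norm_loss


# Toy numbers: 100 other agents, norm breaks at 30 defections, each defects w.p. 0.1.
print(pivotal_probability(100, 30, 0.1))
print(net_gain_from_defecting(gain=1.0, norm_loss=1000.0,
                              n_others=100, threshold=30, p_defect=0.1))
```

With these toy numbers the chance of being pivotal is tiny, so even a norm collapse that costs the agent 1000x the private gain barely dents the expected payoff from defecting.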
There are multiple other problems in your post, but I don’t think it’s worth the time going through them all. I just felt compelled to comment because I was baffled by the karma on this post, unless it was just people liking it because they agreed with the beginning portion…?
Sure! (I just realized the point about the MNIST dataset problems wasn’t fully explained in my shared memo, but I’ve fixed that now)
Per the assessment section, some of the problems with assuming that FRVT demonstrates NIST’s capabilities for evaluation of LLMs/etc. include:
Facial recognition is a relatively “objective” test—i.e., the answers can be linked to some form of “definitive” answer or correctness metric (e.g., name/identity labels). In contrast, many of the potential metrics of interest with language models (e.g., persuasiveness, knowledge about dangerous capabilities) may not have a “definitive” evaluation method, where following X procedure reliably evaluates a response (and does so in a way that onlookers would look silly to dispute).
The government arguably had some comparative advantage in specific types of facial image data, due to collecting millions of these images with labels. The government doesn’t have a comparative advantage in, e.g., text data.
The government has not at all kept pace with private/academic benchmarks for most other ML capabilities, such as non-face image recognition (e.g., Common Objects in Context) and LLMs (e.g., SuperGLUE).
It’s honestly not even clear to me whether FRVT’s technical quality truly is the “gold standard” in comparison to the other public training/test datasets for facial recognition (e.g., MegaFace); it seems plausible that the value of FRVT is largely just that people can’t easily cheat on it (unlike datasets where the test set is publicly available) because of how the government administers it.
For the MNIST case, I now have the following in my memo:
Even NIST’s efforts with handwriting recognition were of debatable quality: Yann LeCun’s widely-used MNIST is a modification of NIST’s datasets, in part because NIST’s approach used census bureau employees’ handwriting for the training set and high school students’ handwriting for the test set.[1]
[1] Some may argue this assumption was justified at the time because it required that models could “generalize” beyond the training set. However, popular usage appears to have favored MNIST’s approach. Additionally, it is externally unclear that one could effectively generalize from the handwriting of a narrow and potentially unrepresentative segment of society (professional bureaucrats) to high schoolers’, and the assumption that this would be necessary (e.g., due to the inability to get more representative data) seems unrealistic.
Seeing the drama with the NIST AI Safety Institute and Paul Christiano’s appointment and this article about the difficulty of rigorously/objectively measuring characteristics of generative AI, I figured I’d post my class memo from last October/November.
The main point I make is that NIST may not be well suited to creating measurements for complex, multi-dimensional characteristics of language models, and that some people may be overestimating NIST’s capabilities because they don’t recognize how incomparable the Facial Recognition Vendor Test (FRVT) is to this situation of subjective metrics for GenAI, and they don’t realize NIST arguably even botched its handwriting datasets (MNIST was actually produced by Yann LeCun by recompiling NIST’s datasets). Moreover, government is slow, while AI is fast. Instead, I argue we should consider an alternative model such as federal funding for private/academic benchmark development (e.g., prize competitions).
I wasn’t sure if this warranted a full post, especially since it feels a bit late; LMK if you think otherwise!
From Laboratories to Language Models: Can AI Support Rigor in the Jungle of Policy Analysis? (Linkpost)
I probably should have been more clear: my true “final” paper actually didn’t focus on this aspect of the model. The offense-defense balance was the original motivation/purpose of my cyber model, but I eventually became far more interested in using the model to test how large language models could improve agent-based modeling by controlling actors in the simulation. I have a final model writeup which explains some of the modeling choices and the original offense/defense purpose in more detail.
(I could also provide the model code which is written in Python and, last I checked, runs fine, but I don’t expect people would find it to be that valuable unless they really want to dig into this further, especially given that it might have bugs.)
If offence and defence both get faster, but all the relative speeds stay the same, I don’t see how that in itself favours offence
Funny you should say this: it so happens that I just submitted a final paper last night for an agent-based model which was meant to test exactly this kind of claim for the impacts of improving “technology” (AI) in cybersecurity. Granted, the model was extremely simple + incomplete, but the theoretical results explain how this could be possible.
In short, when assuming a fixed number of vulnerabilities in an attack surface, while attackers’ and defenders’ budgets are very small there may be many more vulnerabilities that go unnoticed. For example, suppose they together can only explore 10% of the attack surface, but vulnerabilities are only in 1% of the surface. Thus, if atk/def budgets increase by the same factor (e.g., 10x), more of the surface gets explored, which increases the likelihood that vulnerabilities are found by either the attacker or the defender.
The following results are admittedly not very reliable (I didn’t do any formal verification/validation beyond spot checks), but the point of showing these graphs is not “here are the definitive numbers” but more an illustrative “here is what the pattern of relationships between attack surface, atk/def budgets, and theft rate could look like”.
Notice how, as the attack surface increases, multiplying the attackers’ and defenders’ budgets causes more convergence. With a hypothetical 1x1 attack surface (grid) for each actor, the budget multiplication should have no effect on loss rates, because all vulnerabilities are found and it’s just a matter of who found them first, which is not affected by budget multiplication. However, with a hypothetical infinite by infinite grid, the multiplication of budgets strictly benefits the attacker, because the defenders will ~never check the same squares that the attacker checks.
(ultimately my model makes many unrealistic assumptions and may have had bugs, but this seemed like a decent intuition seed—not a true “conclusion” which can be carelessly applied elsewhere.)
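For anyone who wants to poke at the intuition without my actual model, here is a minimal closed-form sketch. To be clear, this is not the code from my paper; it is a stripped-down stand-in with assumptions I’m making just for illustration: one vulnerability sits in a uniformly random cell, the attacker and defender each check a random set of cells of size equal to their budget, and if both cover the vulnerable cell the attacker gets there first with probability `attacker_budget / (attacker_budget + defender_budget)`. The names `loss_probability` and `effect_of_scaling` are hypothetical.

```python
def loss_probability(surface_cells: int, attacker_budget: int, defender_budget: int) -> float:
    """Chance that a single vulnerability in a uniformly random cell gets exploited."""
    if attacker_budget <= 0:
        return 0.0
    # Fraction of the surface each side can cover (capped at the whole surface).
    attacker_cover = min(attacker_budget, surface_cells) / surface_cells
    defender_cover = min(defender_budget, surface_cells) / surface_cells
    # Assumed tie-break when both sides find the vulnerability: whoever is faster.
    attacker_first = attacker_budget / (attacker_budget + defender_budget)
    only_attacker_finds = attacker_cover * (1 - defender_cover)
    both_find_attacker_first = attacker_cover * defender_cover * attacker_first
    return only_attacker_finds + both_find_attacker_first


def effect_of_scaling(surface_cells: int, budget: int, factor: int = 10) -> float:
    """Ratio of loss probability after vs. before multiplying both budgets by `factor`."""
    before = loss_probability(surface_cells, budget, budget)
    after = loss_probability(surface_cells, budget * factor, budget * factor)
    return after / before


print(effect_of_scaling(surface_cells=1, budget=1))        # tiny surface
print(effect_of_scaling(surface_cells=10_000, budget=10))  # large surface
```

On the 1-cell surface the ratio is 1 (scaling both budgets changes nothing, since it only comes down to who gets there first), while on the large surface the ratio is close to the scaling factor itself, because the defender almost never checks the same cell as the attacker.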
Thank you so much for articulating a bunch of the points I was going to make!
I would probably just further drive home the last paragraph: it’s really obvious that the “number of people a lone maniac can kill in given time” (in America) has skyrocketed with the development of high fire-rate weapons (let alone knowledge of explosives). It could be true that the O/D balance for states doesn’t change (I disagree) while the O/D balance for individuals skyrockets.
Has anyone thought about trying to convince anti-regulatory figures (e.g., Marc Andreessen) in the new admin’s orbit to speak out against the regulatory capture of banning cultivated meat? Has anyone tried painting cultivated meat as “Little Tech”?