A non-anthropomorphized view of LLMs

Link post

When I talk about AI safety, especially to people unfamiliar with the area, I sometimes feel uncomfortably like I’m evoking Terminator/sci-fi. This post was useful grounding for those feelings, and seems more reflective of how technical AI safety people talk about AI threats.

I also think this framing is useful for encouraging us to be specific about whether we’re concerned about current LLM architectures, or whether we believe the risks will only arrive with algorithmic improvements beyond transformers, neural nets, etc.

Alignment and safety for LLMs mean that we should be able to quantify and bound the probability with which certain undesirable sequences are generated. The trouble is that we largely fail at describing “undesirable” except by example, which makes calculating bounds difficult.
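One way to state this claim (my notation, not the original post’s): if $U$ is the set of undesirable sequences and $P_\theta(s)$ is the probability the model assigns to a sequence $s$, then an alignment guarantee would amount to certifying a bound like

$$\sum_{s \in U} P_\theta(s) \le \epsilon,$$

and the difficulty is that $U$ is only characterised by examples, not by a membership test we could actually sum over.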

For a given LLM (without random seed) and a given sequence, it is trivial to calculate the probability of that sequence being generated. So if we had a way of somehow summing or integrating over these probabilities, we could say with certainty “this model will generate an undesirable sequence once every N model evaluations”. We can’t, currently, and that sucks, but at heart this is the mathematical and computational problem we’d need to solve.
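To make the first part concrete, here is a minimal sketch (mine, not from the original post) of computing the probability a causal LM assigns to a fixed sequence, as the sum of per-token log-probabilities. It assumes the Hugging Face `transformers` and PyTorch libraries are available; `"gpt2"` is just an illustrative model choice.

```python
# Sketch: the probability an autoregressive LM assigns to a fixed sequence
# is the product of per-token conditional probabilities (here computed as a
# sum of log-probabilities for numerical stability).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice, any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sequence_log_prob(text: str) -> float:
    """Return log P(text) under the model, summed over tokens.

    Note: the first token's probability is not included, since the model
    only defines conditionals given at least one preceding token.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, T)
    with torch.no_grad():
        logits = model(ids).logits  # shape (1, T, vocab_size)
    # The token at position t is predicted from positions < t, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

print(sequence_log_prob("The model generated an undesirable sequence."))
```

The hard part the post points at is the second step: summing this quantity over every sequence in the undesirable set, which we have no way to enumerate or characterise.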

If you talk to AI-safety skeptics who say AI is not an existential threat because it’s not conscious, not capable of reasoning, etc., I think this might be useful to send to them.

Estimated reading time: 8 minutes. I don’t think it’s essential to understand the maths at the beginning.

Interesting HN discussion about the post here.